benton@BIO.NLM.NIH.GOV (02/12/91)
This message contains the summary report of the second meeting of the
Joint Informatics Task Force. The next message will contain the detailed
minutes of that meeting.
Summary Report
The Joint Informatics Task Force met for the second time on
November 30 and December 1. This meeting was specifically
addressed to the needs of the Human Genome Project for public
databases containing mapping and sequence data. The discussions
were, therefore, organized around presentations from extant map
and sequence databases:
Dr. Peter Pearson and Mr. Richard Lucier presented the
current status and future development plans of the Genome
Data Base, located in the Welch Medical Library at the Johns
Hopkins University and currently supported by the Howard
Hughes Medical Institute. They emphasized the importance of
building the database content through the active involvement
of the scientific community in the validation process.
Dr. Elbert Branscomb presented a description of the physical
mapping database developed in the Lawrence Livermore
National Laboratory Genome Center and the access tools
developed there. He demonstrated a relatively low-cost
general approach to providing access to multiple remote
databases and described the design and implementation
requirements this approach imposes on the databases.
Dr. James Fickett described the data analysis and database
systems implemented by the Genome Center at Los Alamos
National Laboratory and provided the Task Force with his
observations on both the immediate and longer-term database
needs of the HGP and the appropriate architectures to meet
those needs.
Dr. Scott Tingey of DuPont Corporation described the
database needs and approaches of the Molecular Breeding
Program (Plants) and presented some solutions to the
problems of storing vast quantities of laboratory data in
accessible forms.
Dr. Brian Hauge of Massachusetts General Hospital described
the Arabidopsis genome mapping program's progress and
discussed its database needs.
Dr. David Lipman described the evolution of the requirements
analysis and design of the GenInfo Backbone sequence
database, to be operated by the National Center for
Biotechnology Information.
Dr. Paul Gilna and Mr. Michael Cinkosky of LANL presented
the progress of the GenBank project toward implementing the
"Electronic Data Publishing" model they described. They
emphasized the future importance of this model, predicting
that in the near future most sequence data will not be
published in scientific journals, and pointed out that the
central databases have an important role to play in
presenting an integrated (or "consensus") view of
disparately published (and unpublished) data.
Dr. Minoru Kanehisa of Kyoto University presented a general
overview of the Japanese Human Genome Project and the plans
for its Informatics Project. Dr. Kanehisa will head the
Priority Area Research project on Genome Informatics, funded
by the Ministry of Education, Science, and Culture. While
the informatics project includes some research on using new
data models, the intention is to focus on data analysis and
building knowledge bases rather than on building large
databases.
Mr. Graham Cameron of the EMBL Data Library reported very
briefly on the current status of and services offered by the
Data Library and raised several issues of importance for
sequence databases in the future. For example, the change in
scale of the databases also introduces additional
representational complexity. Mr. Cameron emphasized that
public sequence databases must be capable of providing (or
supporting) different views of the data, since a single view
(or level of integration) is not appropriate, or even
valid, for all applications.
In discussion, the Task Force arrived at consensus observations
that it considers important guidelines for establishing
genome data resources. Some of these were:
Mapping databases are most naturally organized as
organism-specific consensus map databases, containing all
genetic and physical mapping data that are significantly
useful to the biomedical community.
Any centralized consensus database should provide access,
either direct or indirect, to the supporting data.
Unless there is good justification for doing otherwise, both
central databases and any supporting project databases
should be implemented using software and hardware systems
that adhere to industry standards. Currently, that means
commercial client-server architecture relational database
management systems running on Posix-compliant computers
connected to the research Internet and supporting
communications with the TCP/IP protocol.
Public-use databases must provide a stable, documented
Application Program Interface (API) so that third parties
may develop interface software to the data.
Public-use databases should use a standard system for
representing typographic information (e.g., italics and
superscripts) where these have important scientific
meaning. SGML is one such standard.
Public-use databases must be designed to support different
levels of data accessibility for different authorized
users.
Data suppliers should be encouraged to estimate confidence
limits for each data or consensus element, and these should
be represented in the database.
The databases should maintain a history of changes to the
database (an audit trail or set of editorial citations).
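
Several of the guidelines above (confidence limits stored with each
element, differential access among users, and an audit trail of
changes) can be pictured with a small sketch. This is only an
illustration under stated assumptions, not a design the Task Force
proposed: the names MapElement, AuditEntry, and MapDatabase, the
two-level access scheme, and the centimorgan unit are all
hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical two-level access scheme; the report does not define one.
PUBLIC, RESTRICTED = 0, 1


@dataclass
class AuditEntry:
    """One editorial citation in an element's change history."""
    editor: str
    note: str


@dataclass
class MapElement:
    """A consensus map element (all field names are assumptions)."""
    locus: str
    position_cm: float                 # assumed unit: centimorgans
    conf_limits: Tuple[float, float]   # lower and upper confidence limits
    access_level: int = PUBLIC
    history: List[AuditEntry] = field(default_factory=list)


class MapDatabase:
    """Toy in-memory store with access checks and an audit trail."""

    def __init__(self) -> None:
        self._elements: Dict[str, MapElement] = {}

    def put(self, element: MapElement, editor: str, note: str) -> None:
        # Carry the earlier history forward so no change record is lost.
        old = self._elements.get(element.locus)
        if old is not None:
            element.history = old.history + element.history
        element.history.append(AuditEntry(editor, note))
        self._elements[element.locus] = element

    def get(self, locus: str, clearance: int = PUBLIC) -> Optional[MapElement]:
        # Return the element only if the caller's clearance permits it.
        element = self._elements.get(locus)
        if element is None or element.access_level > clearance:
            return None
        return element
```

In this sketch the put and get methods stand in for the stable,
documented interface that third-party software would call; a real
system would place such an interface in front of a relational
database rather than an in-memory dictionary.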
The four working groups of the JITF reported briefly on their
work. The following action items were included in the reports or
resulted from them:
The Data Requirements Working Group reported that it had
decided that, since the data requirements for sequence
databases are under investigation by other groups, this
working group would concentrate on mapping data. Dr. Lipman
recommended that the working group extend its inquiries to
the groups mapping the model organism genomes. HUGO has
established a committee on physical mapping data and Dr.
Branscomb was appointed as liaison to that committee. The
working group recommended that JITF meet with the
well-developed mapping groups in the near future.
The Connectivity and Infrastructure Working Group reported
that its aim is to foster capability and not to mandate what
is done or how. It recognizes, however, that currently the
Internet TCP/IP protocol suite is the standard for
connectivity in the U.S. The working group recommended that
all genome centers and genome data resources should be
Internet accessible and that the funding agencies should
provide guidance and support for these connections. The
group pointed out that the availability of such resources
on the network would create a second cycle of demand from
individual researchers who will need access to the
resources. The NIH and DOE should be aware of this expected
increase in demand and be prepared to provide for it. The
JITF asked that the topic of connectivity be put on the
agenda for the next meeting.
The Training Working Group has narrowed its focus to two
short-term areas of interest: (1) development of a visible,
high-level summer course in genome informatics for those
whose training is primarily in biology; and (2) the
institution of fellowships in genome informatics. Although
DOE, NIH, and NSF all have genome and/or computation
fellowships, the working group argued that the Genome Project has
an opportunity to have a real impact on interdisciplinary
training in computation and biology by specifically
designing a fellowship that would be available at a number
of levels: predoctoral, postdoctoral, and mid-career.
The Long-Term Needs Working Group reported that particular
needs were seen for development of analytical tools and for
training in genome informatics. It was noted that the NSF
has taken the lead in training in biocomputing. Mr. Olken
noted that fundamental advances in database theory and
practice are required and that the HGP should support basic
research in database theory. Dr. Lipman suggested that the
amount of the HGP budget devoted to basic research on
database theory and technology should be less than the cost
of one genome center, but that the HGP needs to be aware of
the research being done and needs to find a way to get
computer scientists to devote themselves to solving the
database problems posed by the HGP.
The next JITF meeting is tentatively scheduled for March 14-15,
1991.
submitted by: David Benton
National Center for Human Genome Research
National Institutes of Health
Robert Robbins
Database Activities Program
National Science Foundation