[bionet.molbio.genome-program] Joint Informatics Task Force Meeting Report

benton@BIO.NLM.NIH.GOV (02/12/91)

This message contains the summary report of the second meeting of the
Joint Informatics Task Force.  The next message will contain the detailed
minutes of that meeting.


Summary Report

The Joint Informatics Task Force met for the second time on
November 30 and December 1.  This meeting specifically addressed
the needs of the Human Genome Project for public databases
containing mapping and sequence data.  The discussions were,
therefore, organized around presentations from extant map and
sequence databases:

     Dr. Peter Pearson and Mr. Richard Lucier presented the
     current status and future development plans of the Genome
     Data Base, located in the Welch Medical Library at the Johns
     Hopkins University and currently supported by the Howard
     Hughes Medical Institute.  They emphasized the importance of
     building the database content through the active involvement
     of the scientific community in the validation process.

     Dr. Elbert Branscomb presented a description of the physical
     mapping database developed in the Lawrence Livermore
     National Laboratory Genome Center and the access tools
     developed there.  He demonstrated a relatively low-cost
     general approach to providing access to multiple remote
     databases and described the design and implementation
     requirements this approach imposes on the databases.

     Dr. James Fickett described the data analysis and database
     systems implemented by the Genome Center at Los Alamos
     National Laboratory and provided the Task Force with his
     observations on both the immediate and longer-term database
     needs of the HGP and the appropriate architectures to meet
     those needs.

     Dr. Scott Tingey of DuPont Corporation described the
     database needs and approaches of the Molecular Breeding
     Program (Plants) and described some solutions to the
     problems of storing vast quantities of laboratory data in
     accessible forms.

     Dr. Brian Hauge of Massachusetts General Hospital described
     the Arabidopsis genome mapping program's progress and
     discussed its database needs.

     Dr. David Lipman described the evolution of the requirements
     analysis and design of the GenInfo Backbone sequence
     database, to be operated by the National Center for
     Biotechnology Information.

     Dr. Paul Gilna and Mr. Michael Cinkosky of LANL presented
     the progress of the GenBank project toward implementing the
     "Electronic Data Publishing" model they described.  They
     emphasized the future importance of this model, predicting
     that in the near future most sequence data will not be
     published in scientific journals, and pointed out that the
     central databases have an important role to play in
     presenting an integrated (or "consensus") view of
     disparately published (and unpublished) data.

     Dr. Minoru Kanehisa of Kyoto University presented a general
     overview of the Japanese Human Genome Project and the plans
     for its Informatics Project.  Dr. Kanehisa will head the
     Priority Area Research project on Genome Informatics, funded
     by the Ministry of Education, Science, and Culture.  While
     the informatics project includes some research on using new
     data models, the intention is to focus on data analysis and
     building knowledge bases rather than on building large
     databases.

     Mr. Graham Cameron of the EMBL Data Library reported very
     briefly on the current status of the Data Library and the
     services it offers, and raised several issues of importance
     for sequence databases in the future.  For example, the
     change in scale of the databases also introduces additional
     representational complexity.  Mr. Cameron emphasized that
     public sequence databases must be capable of providing (or
     supporting) different views of the data, since a single view
     (or level of integration) is not appropriate, or even
     valid, for all applications.

In discussion, the Task Force arrived at consensus observations
that it considers important guidelines for establishing genome
data resources.  Some of these were:

     Mapping databases are most naturally organized as organism-
     specific consensus map databases, containing all genetic and
     physical mapping data which is significantly useful to the
     biomedical community.

     Any centralized consensus database should provide access,
     either direct or indirect, to the supporting data.

     Unless there is good justification for doing otherwise, both
     central databases and any supporting project databases
     should be implemented using software and hardware systems
     that adhere to industry standards.  Currently, that means
     commercial client-server architecture relational database
     management systems running on Posix-compliant computers
     connected to the research Internet and supporting
     communications with the TCP/IP protocol.

     Public-use databases must provide a stable, documented
     Application Program Interface (API) so that third parties
     may develop interface software to the data.

     Public-use databases should use a standard system for
     representing typographic information (e.g., italics and
     superscripts) where it has important scientific meaning.
     SGML is one such standard.

     Public-use databases must be designed to support
     differential access to data among authorized users.

     Data suppliers should be encouraged to estimate confidence
     limits for each datum or consensus element, and these
     limits should be represented in the database.

     The databases should maintain a history of changes to the
     database (an audit trail or set of editorial citations).
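
The last two guidelines (confidence limits and a history of
changes) might be pictured with a minimal sketch such as the one
below.  It is not drawn from any of the databases presented at
the meeting; the class names, field names, and the choice of
centimorgans as the position unit are all assumptions made for
illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Change:
    """One audit-trail entry: who changed which field, when, and why."""
    timestamp: datetime
    editor: str
    field_name: str
    old_value: object
    new_value: object
    citation: str  # editorial citation supporting the change


@dataclass
class MapElement:
    """Hypothetical consensus map element (names are illustrative)."""
    locus: str
    position_cm: float  # assumed unit: centimorgans
    confidence_limits: Optional[Tuple[float, float]] = None
    history: List[Change] = field(default_factory=list)

    def update_position(self, new_position, limits, editor, citation):
        """Record the old value in the history, then apply the new one."""
        low, high = limits
        if not (low <= new_position <= high):
            raise ValueError("position must lie within its confidence limits")
        self.history.append(
            Change(
                timestamp=datetime.now(timezone.utc),
                editor=editor,
                field_name="position_cm",
                old_value=self.position_cm,
                new_value=new_position,
                citation=citation,
            )
        )
        self.position_cm = new_position
        self.confidence_limits = limits


# Usage: one revision leaves one entry in the element's history.
element = MapElement(locus="LOCUS_A", position_cm=10.0)
element.update_position(12.5, (11.0, 14.0), "curator1", "hypothetical ref")
```

Keeping the history as append-only entries alongside the current
value is one simple way to let a consensus view and its
supporting revisions coexist; a production system would more
likely store such entries in separate relational tables.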


The four working groups of the JITF reported briefly on their
work.  The following action items were included in the reports or
resulted from them:

     The Data Requirements Working Group reported that it had
     decided that, since the data requirements for sequence
     databases are under investigation by other groups, this
     working group would concentrate on mapping data.  Dr. Lipman
     recommended that the working group extend its inquiries to
     the groups mapping the model organism genomes.  HUGO has
     established a committee on physical mapping data and Dr.
     Branscomb was appointed as liaison to that committee.  The
     working group recommended that JITF meet with the
     well-developed mapping groups in the near future.

     The Connectivity and Infrastructure Working Group reported
     that its aim is to foster capability and not to mandate what
     is done or how.  It recognizes, however, that currently the 
     Internet TCP/IP protocol suite is the standard for
     connectivity in the U.S.  The working group recommended that
     all genome centers and genome data resources should be
     Internet accessible and that the funding agencies should
     provide guidance and support for these connections.  The
     group pointed out that the availability of such resources
     on the network would create a second cycle of demand from
     individual researchers who will need access to the
     resources.  The NIH and DOE should be aware of this expected
     increase in demand and be prepared to provide for it.  The
     JITF asked that the topic of connectivity be put on the
     agenda for the next meeting.

     The Training Working Group has narrowed its focus to two
     short-term areas of interest: (1) development of a visible,
     high-level summer course in genome informatics for those
     whose training is primarily in biology; and (2) the
     institution of fellowships in genome informatics.  Although
     DOE, NIH, and NSF all have genome and/or computation
     fellowships, the working group argued that the Genome
     Project has an opportunity to have a real impact on
     interdisciplinary training in computation and biology by
     specifically designing a fellowship that would be available
     at a number of levels: predoctoral, postdoctoral, and
     mid-career.

     The Long-Term Needs Working Group reported that particular
     needs were seen for development of analytical tools and for
     training in genome informatics.  It was noted that the NSF
     has taken the lead in training in biocomputing.  Mr. Olken
     noted that fundamental advances in database theory and
     practice are required and that the HGP should support basic
     research in database theory.  Dr. Lipman suggested that the
     amount of the HGP budget devoted to basic research on
     database theory and technology should be less than the cost
     of one genome center, but that the HGP needs to be aware of
     the research being done and needs to find a way to get
     computer scientists to devote themselves to solving the
     database problems posed by the HGP.


The next JITF meeting is tentatively scheduled for March 14-15,
1991.

submitted by:       David Benton
                    National Center for Human Genome Research
                    National Institutes of Health

                    Robert Robbins
                    Database Activities Program
                    National Science Foundation