benton@BIO.NLM.NIH.GOV (02/12/91)
This message contains the summary report of the second meeting of the Joint Informatics Task Force. The next message will contain the detailed minutes of that meeting.

Summary Report

The Joint Informatics Task Force met for the second time on November 30 and December 1. This meeting specifically addressed the needs of the Human Genome Project for public databases containing mapping and sequence data. The discussions were, therefore, organized around presentations from extant map and sequence databases:

Dr. Peter Pearson and Mr. Richard Lucier presented the current status and future development plans of the Genome Data Base, located in the Welch Medical Library at the Johns Hopkins University and currently supported by the Howard Hughes Medical Institute. They emphasized the importance of building the database content through the active involvement of the scientific community in the validation process.

Dr. Elbert Branscomb presented a description of the physical mapping database developed in the Lawrence Livermore National Laboratory Genome Center and the access tools developed there. He demonstrated a relatively low-cost general approach to providing access to multiple remote databases and described the design and implementation requirements this approach imposes on the databases.

Dr. James Fickett described the data analysis and database systems implemented by the Genome Center at Los Alamos National Laboratory and provided the Task Force with his observations on both the immediate and longer-term database needs of the HGP and the appropriate architectures to meet those needs.

Dr. Scott Tingey of DuPont Corporation described the database needs and approaches of the Molecular Breeding Program (Plants) and described some solutions to the problems of storing vast quantities of laboratory data in accessible forms.

Dr. Brian Hauge of Massachusetts General Hospital described the Arabidopsis genome mapping program's progress and discussed its database needs.

Dr. David Lipman described the evolution of the requirements analysis and design of the GenInfo Backbone sequence database, to be operated by the National Center for Biotechnology Information.

Dr. Paul Gilna and Mr. Michael Cinkosky of LANL presented the progress of the GenBank project toward implementing the "Electronic Data Publishing" model they described. They emphasized the future importance of this model, predicting that in the near future most sequence data will not be published in scientific journals, and pointed out that the central databases have an important role to play in presenting an integrated (or "consensus") view of disparately published (and unpublished) data.

Dr. Minoru Kanehisa of Kyoto University presented a general overview of the Japanese Human Genome Project and the plans for its Informatics Project. Dr. Kanehisa will head the Priority Area Research project on Genome Informatics, funded by the Ministry of Education, Science, and Culture. While the informatics project includes some research on using new data models, the intention is to focus on data analysis and building knowledge bases rather than on building large databases.

Mr. Graham Cameron of the EMBL Data Library reported very briefly on the current status and services offered by the Data Library and raised several issues of importance for sequence databases in the future; for example, the change in scale of the databases also introduces additional representational complexity. Mr. Cameron emphasized that public sequence databases must be capable of providing (or supporting) different views of the data, since a single view (or level of integration) is not appropriate, or even valid, for all applications.

In discussion, the Task Force arrived at consensus observations that it considers important guidelines for establishing genome data resources.
Some of these were:

Mapping databases are most naturally organized as organism-specific consensus map databases, containing all genetic and physical mapping data which is significantly useful to the biomedical community.

Any centralized, consensus databases should provide access, either direct or indirect, to the supporting data.

Unless there is good justification for doing otherwise, both central databases and any supporting project databases should be implemented using software and hardware systems that adhere to industry standards. Currently, that means commercial relational database management systems with a client-server architecture, running on POSIX-compliant computers connected to the research Internet and supporting communications with the TCP/IP protocol.

Public-use databases must provide a stable, documented Application Program Interface (API) so that third parties may develop interface software to the data.

Public-use databases should use a standard system for representing typographic information (e.g., italics, superscripts, etc.) where these have important scientific meaning. SGML is one such standard.

Public-use databases must be designed to support differential accessibility of data among authorized users.

Data suppliers should be encouraged to estimate confidence limits of data or consensus elements, and these estimates should be represented in the database.

The databases should maintain a history of changes to the database (an audit trail or set of editorial citations).

The four working groups of the JITF reported briefly on their work. The following action items were included in the reports or resulted from them:

The Data Requirements Working Group reported that it had decided that, since the data requirements for sequence databases are under investigation by other groups, this working group would concentrate on mapping data. Dr. Lipman recommended that the working group extend its inquiries to the groups mapping the model organism genomes.
HUGO has established a committee on physical mapping data and Dr. Branscomb was appointed as liaison to that committee. The working group recommended that the JITF meet with the well-developed mapping groups in the near future.

The Connectivity and Infrastructure Working Group reported that its aim is to foster capability and not to mandate what is done or how. It recognizes, however, that currently the Internet TCP/IP protocol suite is the standard for connectivity in the U.S. The working group recommended that all genome centers and genome data resources should be Internet accessible and that the funding agencies should provide guidance and support for these connections. The group pointed out that the availability of such resources on the network would create a second cycle of demand from individual researchers who will need access to the resources. The NIH and DOE should be aware of this expected increase in demand and be prepared to provide for it. The JITF asked that the topic of connectivity be put on the agenda for the next meeting.

The Training Working Group has narrowed its focus to two short-term areas of interest: (1) development of a visible, high-level summer course in genome informatics for those whose training is primarily in biology; and (2) the institution of fellowships in genome informatics. Although DOE, NIH, and NSF all have genome and/or computation fellowships, the working group argued that the Genome Project has an opportunity to have a real impact on interdisciplinary training in computation and biology by specifically designing a fellowship that would be available at a number of levels: predoctoral, postdoctoral, and mid-career.

The Long-Term Needs Working Group reported that particular needs were seen for development of analytical tools and for training in genome informatics. It was noted that the NSF has taken the lead in training in biocomputing. Mr. Olken noted that fundamental advances in database theory and practice are required and that the HGP should support basic research in database theory. Dr. Lipman suggested that the amount of the HGP budget devoted to basic research on database theory and technology should be less than the cost of one genome center, but that the HGP needs to be aware of the research being done and needs to find a way to get computer scientists to devote themselves to solving the database problems posed by the HGP.

The next JITF meeting is tentatively scheduled for March 14-15, 1991.

submitted by:

David Benton
National Center for Human Genome Research
National Institutes of Health

Robert Robbins
Database Activities Program
National Science Foundation