benton@BIO.NLM.NIH.GOV (02/12/91)
JITF Second Meeting
Nov 30 - Dec 1, 1990
Crystal City Marriott Hotel
Arlington, VA

MINUTES

The Joint Informatics Task Force met for the second time on Nov 30 and Dec 1. The list of attendees is attached. Abstracts and other materials provided by the speakers will be included with the minutes in hardcopy form.

This meeting specifically addressed the needs of the Human Genome Project for public databases containing mapping and sequence data. The discussions were, therefore, organized around presentations from extant map and sequence databases.

EXECUTIVE SESSION

During the opening executive session, the following points and questions were raised as issues to be discussed in the course of the meeting:

- What kinds of data should be in a genetic map database?
- What data models are appropriate to these data?
- How much supporting information should be in mapping databases? Can supporting data be provided through distributed databases?
- Since the cost of the database operation is related to its size and complexity, how much complexity is cost-effective?
- To what extent should each genome center develop its own code to meet generic data management requirements? Since it costs more and takes longer, how much portability should be required in local database designs and software implementations?
- Centers should have their own local databases, but should not get into the database distribution business; but ...
- Centers can serve a broader community just by giving access to users outside the center. User support may be a problem for many centers.
- Since it is not cost-effective for each center to develop a separate database design, grant proposals which include new database design efforts should discuss existing efforts and give the rationale for creating a new one.
- Inter-database communication is not viewed as a serious load on the networks, since the amount of data that should usefully be transferred is small (although large amounts of data could be usefully cross-accessed).
- One approach to coordinating multiple databases (for communications) is to use vendor-supplied tools. This requires standardizing on hardware/DBMS/OS/etc. The alternative is to concentrate on developing platform-independent transaction protocols.
- A final issue (which reappeared in multiple guises multiple times) was the question of data release: how long should the originator be allowed to hold on to his data?

The presentations given in the two open sessions are summarized very briefly below. Consult the speakers' handouts for more information.

MAPPING DATABASES

Dr. Peter Pearson and Mr. Richard Lucier presented the current status and future development plans of the Genome Data Base (GDB), located in the Welch Medical Library at the Johns Hopkins University and currently supported by the Howard Hughes Medical Institute. The database currently contains information on 2190 genes, 4928 DNA segments, 113 fragile sites, and ca. 15,000 probes. The on-line system (currently the only public access mode) has 2878 registered users and has averaged 1779 accesses per month since beginning operation in September, 1990. International distribution through several national nodes is planned.

Dr. Pearson emphasized the importance of building the database content through the active involvement of the scientific community in the validation process. GDB uses scientific editors for curation of the database: currently 5 Johns Hopkins editors and 7 off-site editors are affiliated with the project. The editors are assisted by a total of 11 FTEs of editorial support staff.
The database required (in Sybase) about 100MB when installed in August, 1990 and now requires 135MB, although this is probably not a good measure of database growth, due to reorganization of the 240 Sybase tables, etc. Several expansions of scope of the database are planned for 1991, inclusion of the CEPH raw segregation data, for example.

Dr. Elbert Branscomb presented a description of the physical mapping database developed in the Lawrence Livermore National Laboratory Genome Center and the access tools developed there. He demonstrated a relatively low-cost general approach to providing access to multiple remote databases and described the design and implementation requirements this approach imposes on the databases. The "tin standard", advocated by Dr. Branscomb, is outlined below.

Dr. James Fickett described the data analysis and database systems implemented by the Genome Center at Los Alamos National Laboratory and provided the Task Force with his observations on both the immediate and longer-term database needs of the HGP and the appropriate architectures to meet those needs. Dr. Fickett proposed that, in the long run, the proper overall design is of a "Commonwealth Information Management" in which many small collections are operated by local experts, providing high quality, curated data and tied together by networks and sophisticated access software. In the near-term, however, what is required is a central database for consensus maps of all types. At a minimum it should contain all published maps. The central database should provide a uniform interface to all the data; later multi-database access can be provided by distributed user-interface software.

Dr. Fickett listed four community needs the JITF could meet:

1. Encourage development and distribution of general-purpose tools or services.
2. Establish a policy which requires all mapping efforts to include a plan for data deposition in their charter.
3. Establish a central mapping database.
4. Recognize and recommend that, while tool development for laboratory data management should be in the context of a local sequencing or mapping project, the user interface tools (for general use) should not be part of a local lab project, nor should running the central mapping database.

Dr. Scott Tingey of DuPont Corporation described the database needs and approaches of the Molecular Breeding Program (Plants) and described some solutions to the problems of storing vast quantities of laboratory data in accessible forms.

Dr. Brian Hauge of Massachusetts General Hospital described the Arabidopsis genome mapping program's progress and problems and its database needs.

DISCUSSION - MAPPING DATABASES

Should the central database contain model organism data? The human map database should contain mouse synteny information, but separate databases for C. elegans, E. coli, Drosophila, yeast, etc. are appropriate, since the only connection among them is the sequence -- not the map.

Should the design of the human map database be capable of representing maps of other organisms? Yes, to the extent that generalizing the design does not impose unreasonable costs. The agencies (or JITF) should facilitate communication (e.g., through a workshop) between builders of databases for the various species. For the reasons given above, the JITF did not recommend that a single mapping database for all species be included in the Human Genome Project.

No class of data should be excluded as a matter of principle from the mapping database: any physical characterization that can be mapped to a chromosome can be used as a marker. Consensus data on genetic maps and ordered clones should be included. This database should be regarded as a (the) means of reaching consensus on maps.

The central database should (somehow) provide program access to the primary data.
The key to being able to satisfy this requirement is that each database of supporting (primary) data have a well-documented, clearly specified Application Program Interface (API). Priority should be given first to the database, second to the API and network accessibility, and last to the user interface. Network accessibility should be provided through the Internet, using well-recognized standards (at the moment, this means TCP/IP).

Any central map database must be able to represent partially-specified maps and its user interface needs to represent the partial state of the data to the user.

The following consensus observations were recommended as important guidelines for establishing genome data resources.

1. Mapping databases are most naturally organized as organism-specific consensus map databases, containing all the genetic and physical mapping data which is significantly useful to the biomedical community.
2. The centralized consensus databases should provide access to the supporting data.
3. Unless there is good justification for doing otherwise, both the central databases and any supporting project databases should be implemented using commercial client-server architecture relational database management systems running on Posix-compliant computers connected to the research Internet and supporting communications with the TCP/IP protocol.
4. The databases must provide a well-specified Application Program Interface (API) supporting query/retrieval, at a minimum.
5. The databases must use a consistent, well-recognized standard for typographical representation (e.g., SGML).
6. The databases must support the capability of differential (among authorized users) accessibility of data.
7. Data suppliers should be encouraged to estimate confidence limits of data or consensus elements and these should be represented in the database.
8. The databases should maintain a history of changes to the database (an audit trail or set of editorial citations).
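
[Illustration, not part of the recorded discussion.] Three of the points above (partially-specified maps, confidence limits, and a history of changes) can be made concrete with a small sketch. It is only one possible reading, written in Python for brevity; the class names, method names, and marker names (Marker, ConsensusMap, relative_order, M1, M2, ...) are hypothetical and do not describe GDB or any other database presented at the meeting. A partially-specified map is modeled here as a set of pairwise "before/after" assertions, so that the query interface can report "unknown" when the order of two markers has not been established.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Marker:
    """One mapped object; position and confidence limits may be unknown."""
    name: str
    position: Optional[float] = None                   # None: not yet placed
    confidence: Optional[Tuple[float, float]] = None   # (low, high) limits


@dataclass
class ConsensusMap:
    """Hypothetical consensus map holding a partial order of markers."""
    organism: str
    chromosome: str
    markers: Dict[str, Marker] = field(default_factory=dict)
    order: Set[Tuple[str, str]] = field(default_factory=set)  # (before, after)
    history: List[str] = field(default_factory=list)          # audit trail

    def add_marker(self, marker: Marker, editor: str) -> None:
        self.markers[marker.name] = marker
        self.history.append(f"{editor}: added {marker.name}")

    def assert_order(self, before: str, after: str, editor: str) -> None:
        if before not in self.markers or after not in self.markers:
            raise KeyError("both markers must already be in the map")
        self.order.add((before, after))
        self.history.append(f"{editor}: placed {before} before {after}")

    def _precedes(self, start: str, goal: str) -> bool:
        # Depth-first search over the recorded order assertions.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for a, b in self.order:
                if a == node:
                    if b == goal:
                        return True
                    stack.append(b)
        return False

    def relative_order(self, a: str, b: str) -> str:
        """Return 'before', 'after', or 'unknown' for marker a relative to b."""
        if self._precedes(a, b):
            return "before"
        if self._precedes(b, a):
            return "after"
        return "unknown"
```

In this sketch every change is appended to the history list with the name of the editor who made it, and a query on two markers that no assertion connects returns "unknown" rather than a guessed order, which is one way the partial state of the data could be shown to the user.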
Concerning laboratory support databases the following points were made:

1. Before its next meeting, the JITF will organize a workshop on laboratory support databases and associated software. The goal of this workshop will be to develop a requirements specification for a general laboratory support tool and a description of the current state of the art. A suggested approach to the problem is to evaluate extant "electronic laboratory notebooks", identify essential functionality and commonality among these, and attempt to describe experimental protocols and data manipulation methods at a sufficient level of abstraction to allow specification of general tools.
2. Existing grants should be supplemented to support further development, documentation, distribution, and support of laboratory data management software already being developed in the projects.
3. The HGP should support research in areas of unresolved problems (e.g., error representation).

SEQUENCE DATABASES

Dr. David Lipman presented the plans of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) for "transitioning GenBank" and the evolution of the requirements analysis and design of the next generation GenBank sequence database, to be managed by the NCBI. The NCBI is now implementing the GenInfo Backbone database to serve as a stable archive of published sequence data, both nucleic acid and protein. The GenInfo Backbone will provide the literature scanning component of the next generation GenBank, while a group external to the NLM will provide the direct submission component. NLM will handle online access and distribution.

The bulk of the work in populating the GenInfo Backbone is now done by six indexers, who are coping with about 80% of the available data -- about 20 journals. Four additional indexers will be hired in the near future. This should permit all the published data to be entered with some margin of safety.
In addition to "full mark-up" sequences, the database will contain "index sequences": short sequences which may not be appropriate for "full markup" but, through sequence similarity, may be useful for retrieval of relevant literature citations. (Pilot tests have indicated that 17 bp was the useful lower end for retrieval; all the PCR primers looked at in the pilot study were more than sufficient for accurate retrieval.)

One relational database has been implemented whose schema is optimized for database building and tracking; another is being designed to optimize retrieval speed. The ASN.1 distribution is the important public view of the database, not the relational schema. NCBI will make the API to the on-line system accessible to a restricted set of users. GenBank and PIR data will be available in the ASN.1 (ASCII-encoding) format soon and the output of the six indexers (the NLM journal scans and data entry) will be available in the spring of 1991.

In later discussion, several questions about the NCBI database plans and progress were raised. In discussing the reasons the GenInfo sequence database chose not to use the GenBank relational schema, Dr. Lipman pointed out that the GenBank RDBMS was not ready when the GenInfo project began, and the GenBank schema did not cover some aspects of GenInfo, e.g., some DNA/Protein relationships, amino acid sequences from peptide sequencing, author's use of "organism" and "gene name" terminology.

Questions were raised as to the relative coverage of sequences by GenBank and GenInfo. Dr. Lipman stated that GenInfo would be a superset of GenBank in that it would contain all sequences from the literature that GenBank considers "Full Markup" as well as amino acid sequences (e.g. from N-terminal sequencing), and "Index" sequences.

Further questions arose regarding comparisons between GenBank and GenInfo. Dr.
Lipman stated that the GenBank Advisory Board has advised NCBI and GenBank to coordinate their activities on this and other issues and they will meet soon to initiate this. With this incipient collaboration in mind, the JITF suggested that Dr. Cassatt ask the GenBank Advisors to include Dr. Lipman as an author of the article they are writing on the future of GenBank and sequence databases.

Dr. Lipman commented that quality control is the only component of GIBBDB which is not yet in production mode. The current system involves an NCBI staff biologist reviewing every entry and Dr. Lipman reviewing every review, which is impractical for a production system. In the next few weeks, decisions will be made on production-oriented quality control procedures. Once implemented, output of the database will be made available for beta users.

Dr. Paul Gilna and Mr. Michael Cinkosky of LANL presented the progress of the GenBank project toward implementing the "Electronic Data Publishing" model they described. They emphasized the future importance of this model, predicting that most sequence data will not be published in scientific journals in the near future, and pointed out that the central databases have an important role to play in presenting an integrated (or "consensus") view of disparately published (and unpublished) data. The central database can promote electronic data publishing by working with researchers, performing quality checks on the data, and providing efficient distribution and data submission software.

Dr. Minoru Kanehisa of Kyoto University presented a general overview of the Japanese Human Genome Project and the plans for its Informatics Project. Dr. Kanehisa will head the Priority Area Research project on Genome Informatics, funded by the Ministry of Education, Science, and Culture. While the informatics project includes some research on using new data models, since the Japanese HGP will not be generating large quantities of data (by comparison with the U.S.
program), the intention is to focus on data analysis and building knowledge bases rather than on building large databases. The DDBJ will continue with funding from the MESC and a human genome map database will be established at the Human Genome Analysis Center at Tokyo University.

Mr. Graham Cameron of the EMBL Data Library reported very briefly on the current status and services offered by the Data Library and raised several issues of importance for sequence databases in the future; for example, the change in scale of the databases also introduces additional representational complexity. Mr. Cameron observed that the databases must be capable of providing (or supporting) different views of the data, since a single view (or level of integration) is not appropriate (or even valid) for all applications. EMBL is building links with the Drosophila and C. elegans mapping databases (both at Cambridge University) and numerous other Affiliated Data Units. EMBL's experience is that the research community has shown real willingness to take responsibility for the quality of the data and higher-level views represented in the database. Several other projects of the EMBL Data Library were described (see handout).

Mr. Cameron mentioned that the CEFIC study ("Bio-informatics in Europe 2. Strategy for a European biotechnology information infrastructure") was generally favorable to the establishment of a European Bioinformatics Institute. Dr. Lipman then expressed concern that because the CEFIC study consistently pointed to the inexpensive and "subsidized" nature of Medline as a negative feature, and because of the general approach of the EEC, there is a risk that the Europeans will have more restrictive policies of data distribution than the US and that this could complicate the US-European cooperation on the sequence databases. Dr.
Lipman thus urged the JITF to strongly endorse the current US policy of making the data freely available to anyone (either commercial or academic) to use in any way they see fit.

DISCUSSION

Dr. Lipman requested that the JITF appoint a member to attend the meetings of the GenBank Advisory Committee and the NCBI Board of Scientific Counselors (BOSC).

There was some discussion as to whether the JITF should take a role in establishing transaction protocol standards (through a working group for transaction protocols, for example). For the sequence databases, NCBI and GenBank will be working together to come up with a definition of what the sequence database should contain and the mechanism of data exchange. It was recommended that NCBI and GenBank present a description of and timeline for this project at the next JITF meeting.

The JITF recommended that HGP policy should be that all data developed in the HGP should be in the public domain and freely distributed. This point should be made to the EC, EMBL, etc.

The four working groups of the JITF reported briefly on their work.

The Long-Term Needs Working Group reported that particular needs were seen for development of analytical tools and for training in genome informatics. It was noted that the NSF has taken the lead in training in biocomputing. Mr. Olken noted that fundamental advances in database theory and practice are required and that the HGP should support basic research in database theory. Dr. Lipman suggested that the amount of the HGP budget devoted to basic research on database theory and technology should be less than the cost of one genome center, but that the HGP needs to be aware of the research being done and needs to find a way to get computer scientists to devote themselves to solving the database problems posed by the HGP.
The Data Requirements Working Group reported that it had decided that, since the data requirements for sequence databases are under investigation by other groups, this working group would concentrate on mapping data. Dr. Lipman recommended that the working group extend its inquiries to the groups mapping the C. elegans, Drosophila, etc. genomes. HUGO has established a committee on physical mapping data and Dr. Branscomb is appointed as liaison to that committee. The working group recommended that JITF meet with the well-developed mapping groups in the near future.

The Training Working Group has met informally at a number of meeting sites in the last six months. The group has narrowed its focus to two short-term areas of interest: (1) development of a visible, high-level summer course in genome informatics for those whose training is primarily in biology; and (2) the institution of fellowships in genome informatics.

For the course, there is no attempt to reinvent the wheel. Several academic genome informatics experts (Lake at UCLA, Brutlag at Stanford, and Bishop at the MRC) have agreed to send course materials, reading lists, etc. which they have used in recent courses. The decision to limit course attendance to those with training primarily in biology was the result of a perception that the two types of students (biologists and computer scientists) would have a minimum amount of overlap. This is, of course, open to further discussion by the JITF.

For the course, the working group would want to have financial support for the course faculty and for some student support. In this area, the working group would appreciate additional input from the JITF on (a) appropriate faculty; and (b) appropriate level of budget support. The site for this course would require hardware and network connections most likely found at national labs, some industry training sites, academic computing centers, and the supercomputing centers. Dr.
Lipman pointed out that the NCBI is doing such a course now and will be offering it again next year. Its current format is approximately fifteen 2.5-hour lectures and is offered in the Lister Hill auditorium. He stated that the laboratory part of such a course needs to be separate from the lecture.

The DOE, NIH, and NSF all have genome and/or computation fellowships. However, the working group believes the Genome Project has an opportunity to have a real impact on interdisciplinary training in computation and biology by specifically designing a fellowship that would be available at a number of levels: predoctoral, postdoctoral, and mid-career. In addition, the expertise and diversity of members of the JITF offer a pool of possible reviewers for the proposals that would result from such an offering. The working group asked the JITF to consider this suggestion, particularly whether it would be willing to serve as reviewers for such fellowships. The working group also asked for discussion of the number of fellows who could be supported, since this year's budget discussions indicate that Genome monies for such purposes are limited.

The Connectivity and Infrastructure Working Group reported that its aim is to foster capability and not to mandate what is done or how. It recognizes, however, that the Internet is currently the standard for connectivity in the U.S. The working group recommended that all genome centers and genome data resources should be Internet accessible and that the funding agencies should provide guidance and support for these connections. The group pointed out that the availability of these resources on the network would create a second cycle of demand from individual researchers (not resources or centers) who need access to the resources. The NIH and DOE should be aware of this demand and prepared to provide for it. Finally, the group urged the NIH to join the Internet consortium.
There is anecdotal evidence that NIH-funded researchers have been refused network connections on Internet-connected campuses and given the reason that their research is not funded by an Internet-member agency. [Editor's note: John Wooley, NSF, assures me that that is not a valid reason: the Internet agreement requires all connected campuses to "work with" all researchers on campus to provide them with Internet connections. The term "work with" has been subject to a variety of interpretations at various institutions.] Dr. Lipman pointed out that DHHS is "one of the players" in the High Performance Computing Initiative and that the NLM is the DHHS representative at the HPCI discussions. The JITF asked that the topic of connectivity be put on the agenda for the next meeting.

The next JITF meeting is tentatively scheduled for March 14-15, 1991. The JITF members were urged to use e-mail to discuss the issues facing the task force prior to the next meeting so that the issues are clearly defined when the meeting starts.

JITF Membership in attendance at second meeting, Nov 30 - Dec 1.

CHAIRMAN

Dr. Dieter Soll
Dept. of Molecular Biophysics and Biochemistry
Yale University
P. O. Box 6666
260 Whitney Avenue
New Haven, CT 06511
Tel: 203-432-6200
Fax: 203-432-6202
E-Mail: soll@yalemed.bitnet

TASK FORCE MEMBERS

Dr. George Bell
Los Alamos National Laboratory
Group T-10
MS K-710
Los Alamos, NM 87545
Tel: 505-665-3805
Fax: 505-665-3493
E-Mail: gib%life@lanl.gov

Dr. Elbert Branscomb
Lawrence Livermore National Lab
Biomedical Science Division
P. O. Box 5507, L-452
Livermore, CA 94550
Tel: 415-422-5681
Fax: 415-422-2282
E-Mail: elbert@elbert.llnl.gov

Dr. John Devereux
Genetics Computer Group
Suite B
575 Science Drive
Madison, WI 53711
Tel: 608-231-5200
Fax: 608-241-5202
E-Mail: devereux@gcg.com

Mr. Gregory Hamm
Molecular Biology Computing Lab
Waksman Institute
Rutgers University
P. O. Box 759
Piscataway, NJ 08855
Tel: 201-932-4864
Fax: 201-932-5735
E-Mail: hamm@mbcl.rutgers.edu

Dr. Thomas Marr
Cold Spring Harbor Laboratory
P. O. Box 100
Cold Spring Harbor, NY 11724
Tel: 516-367-8393
Fax: 516-367-8389
E-Mail: marr@cshlab.bitnet

Mr. Frank Olken
Lawrence Berkeley Laboratory
1 Cyclotron Road
M/S 50B-3238
Berkeley, CA 94720
Tel: 415-486-5891
Fax: 415-486-6363
E-Mail: olken@lbl.gov

Dr. Mark Pearson
E.I. DuPont de Nemours & Co.
Central Research & Development
Experimental Station
Building 328, Room 251
P. O. Box 80328
Wilmington, DE 19880-0328
Tel: 302-695-2140
Fax: 302-695-4162
E-Mail: pearson%esvax%dupont.com@relay.cs.net

Dr. Sylvia Spengler
Human Genome Center
Lawrence Berkeley Laboratory
459 Donner
Berkeley, CA 94720
Tel: 415-486-5874
Fax: 415-486-5717
E-Mail: sylviaj@violet.berkeley.edu

Dr. Mike Waterman
Department of Mathematics
University of Southern California
University Park
Los Angeles, CA 90089-1113
Tel: 213-740-2408
Fax: 213-740-2437
E-Mail: msw@msw.usc.edu

LIAISON MEMBERS

Dr. David Benton
National Center for Human Genome Research
NIH
Building 38A, Room 610
Bethesda, MD 20892
Tel: 301-496-7531
Fax: 301-480-2770
E-Mail: benton@bio.nlm.nih.gov

Dr. James Cassatt
National Institute of General Medical Sciences
NIH
Westwood Building
Room 907
Bethesda, MD 20892
Tel: 301-496-7253
Fax: 301-402-0019
E-Mail: czj@nihcu.bitnet

Ms. Diane Hinton
Genome Project
Howard Hughes Medical Institute
6701 Rockledge Drive
Bethesda, MD 20817
Tel: 301-571-0282
Fax: 302-571-0573

Dr. Elke Jordan
National Center for Human Genome Research
NIH
Building 38A, Room 605
Bethesda, MD 20892
Tel: 301-496-0844
Fax: 301-402-0837
E-Mail: elj@cu.nih.gov

Dr. David Lipman
National Center for Biotechnology Information
National Library of Medicine
Building 38A, Room 8S806
Bethesda, MD 20894
Tel: 301-496-2475
Fax: 301-480-9241
E-Mail: lipman@ncbi.nlm.nih.gov

Dr. Robert Robbins
National Science Foundation
1800 G Street, N.W.
Washington, D. C. 20550
Tel: 202-357-9880
Fax: 202-357-7568
E-Mail: rrobbins@note.nsf.gov

Mr. Keith Russell
National Agricultural Library
Rm. 100
Beltsville, MD 20705
Tel: 301-344-3834
Fax: 301-344-5472
E-Mail: krussell@umdars.bitnet