benton@BIO.NLM.NIH.GOV (02/09/91)
REPORT OF NIH-DOE AD HOC COMMITTEE ON FUTURE HUMAN GENOME MAP DATABASE

On 17 December, a group of scientists (David Ledbetter, Robert Sparkes, Sue Naylor, Elbert Branscomb, Martin Bishop, and Philip Green) met to provide advice to the government with regard to the development of a central database to hold all major human genome data, except for bulk sequences. During the all-day meeting, several themes were stressed repeatedly:

First, the group stated most forcefully that the work will involve a major service component. The genome and human gene mapping communities will be relying upon such a database to provide access to much crucial information. Therefore, the group noted that it is imperative that the project be implemented and managed in such a way that responsiveness to present and future community needs will be ensured.

Second, the group stated that it must be recognized that the work will involve an essential research component. Shorter term research will be required to determine the best way to implement many of the current needs for such a database, and longer term research will be required to ensure that the database can migrate to different and improved hardware and software platforms over time. Therefore, the group noted that it is imperative that the project be implemented and managed in such a way that high quality research will be supported and facilitated.

Third, the group felt that the database developer should provide, in some real sense, an intellectual focus for the interpretation of genomic data. In particular, they noted that the project should act as a center for the synthesis, integration, and editorial control of mapping information of all kinds.

Fourth, the group noted that all of the data and knowledge residing in such a database will form an irreplaceable intellectual resource for the scientific community.
Therefore, the group noted that federal support for such a database should be contingent upon guarantees that all of the information in the database will be freely available to the scientific community, now and in the future.

Fifth, the group urged that the government support such a project using a funding vehicle and management plan that would be designed to provide enough freedom to ensure the success of the research component, while also providing enough oversight and control to ensure the adequacy of the service component.

In addition to these general themes, the group offered specific advice pertinent to both the research and the service components of the project.

Responsiveness to Community Needs

The developer should devise means for maintaining contact with and responding to the changing needs of the human genome research community. This might well be accomplished by establishing one or more advisory committees that could serve as a documented conduit for community advice to the database developers.

An oversight mechanism should be established to monitor the degree to which the database developers respond to community needs as presented by the advisory committee(s).

A method should be developed for ensuring close interactions among representatives of the supporting agencies and between the agencies and the developer. This might be accomplished by establishing a steering committee, composed of representatives of the cooperating funding agencies, who could interact to give overall administrative direction to the project.

To the extent that the database contains data derived from or references data in external databases, the developer should propose plans for establishing and maintaining collaborations with those databases as required to maintain the validity and integrity of the data in the Human Genome Map Database.
The developer should propose plans for establishing and maintaining contact with organizations and institutions responsible for or recognized as authoritative for nomenclature of the various data items within the database.

The developer should provide community education and training. Examples of appropriate activities in this plan include:

- Formal short courses on the database structure and its use
- Workshop sessions, demonstrations, and presentations at scientific meetings
- Seminars

Present Community Needs, Implementation

1. Contents. The database should contain human genomic map data describing the following genetic markers (at a minimum):

- Sequence Tagged Sites (STSs)
- Arbitrary DNA segments (AKA, "anonymous DNA markers")
- Chromosome breakpoints (including translocations, deletions, etc.)
- Fragile sites
- Sites of known crossovers

For each marker, the database should contain the useful and appropriate attributes, for example:

- Approved locus name and gene symbol
- Chromosomal and sub-chromosomal band localization
- Chromosomal location relative to chromosome length
- Marker identification
- Marker order
- Recombination frequency, linkage distance, and likelihood support
- Polymorphism data
- STS sequence
- Clone and probe identification
- Bibliographic information on sources of factual data

In addition, the database should contain higher order derivations from the base data, for example:

- Consensus cytogenetic map
- CEPH consortium maps
- Consensus linkage map
- Contig maps
- Long-range restriction maps

The database should also contain high-priority supporting data not otherwise available in public databases, for example:

- Genotypic information on CEPH families (required)
- Genotypic information on other defined families (e.g., Venezuelan and disease families)

2. Cross references.
The database should contain cross-references to the following databases (where available):

- GenBank Nucleotide Sequence Database
- Mendelian Inheritance in Man (MIM)
- Cell and/or probe repository catalog number(s)
- Genetic map databases (e.g., Gbase) for species showing significant synteny with the human
- Chromosome breakpoint database

3. Database structure. To ensure the development of a robust and stable production quality system, the database should be developed using well known, commercially available, proven software. At the present time, this implies that the database should be designed as a normalized relational database (RDB). Primary entities ("unit records") should be identified by public, unchanging unique identifiers (UIDs).

4. Data dictionary. The database structure should be described in a data dictionary or repository which would be available to database users. In addition to documenting the database schema and providing prose descriptions of the database tables and data elements (fields), the data dictionary should document integrity constraints implemented in the database (or in layered software) and the rationale for the database design. Copies of the data dictionary should be available to users at a nominal cost.

5. Software. All software developed under this contract should be designed and implemented for maximal portability consistent with timely and cost-effective delivery of service. To this end, the developers should maintain the database using a commercially available relational database management system (RDBMS). All supporting systems and application software should be written using commercially available programming language compilers and libraries.

6. Computing facilities. To facilitate user and developer access, the database should be maintained on a computer (or network of computers) which uses the Unix operating system and which is connected to the U.S.
Research Internet by communications lines which operate at a minimum of 1.54 Mbits/sec.

7. Performance. The database should respond to user queries in a reasonable time, and the database developers should regularly monitor the system efficiency and response times and should take corrective steps whenever response degradation becomes significant.

8. Documentation. The developer should maintain complete documentation and source code for all software developed under this contract and complete documentation for all database design and implementation. This information should be made available to the government and to others to ensure the long term functionality of the system and to ensure long term access by the scientific community, even beyond the termination of the developer's involvement in the project.

9. Security. The database will represent an irreplaceable resource for the community. Therefore, the developer must take care to ensure that the data and programs are protected against loss.

Present Community Needs, Data Collection and Database Maintenance

1. Scope of data. The database should represent scientific reports of the specific data items listed above. "Scientific reports" includes both journal articles identified in Medline as containing the required data and direct submissions to the database which contain the required data. New data should be incorporated into the database and made available on a timely basis. The developer should devise methods to encourage the submission of data which might otherwise remain unpublished and uncommunicated.

2. Data collection. The developer should promote direct investigator submission of data and should incorporate investigator-submitted data into the database in a timely fashion. The developer should propose and construct software tools to facilitate direct data submission. The developer should also promote collaborations with other databases in order to incorporate relevant data collected by those databases.
The developer may also wish to propose use of other organizations (e.g., Human Gene Mapping Workshop committees and the single-chromosome workshop committees) for data collection and validation.

The developer should provide plans for ensuring the accuracy of the data contained in the database. This might well be done by identifying curators for various subcomponents of the database.

Off-line copies of the database should be distributed regularly, perhaps quarterly, in standard formats such as nine-track tapes and/or CD-ROM. Adequate user documentation, including descriptions of the distribution files and formats, should be provided for recipients of the off-line distributions.

The developer should design and implement a cost-effective interactive, on-line user interface which provides an integrated and comprehensive user view of the database. Data records should be accessible by, at a minimum, locus name, gene symbol, author name, journal citation, chromosome, band localization, and map position (as applicable). The on-line accessible database should be updated daily.

Access to the on-line database should be provided through the Internet and through either (or both) a Public Data Network (PDN) or direct dial-up phone lines. Initially, PDN or dial-up connections should be provided for at least three simultaneous users. The developer should monitor the usage of these lines and should add lines as demand requires. The connection to the Internet should operate at a minimum of 1.54 Mbits/sec. Adequate user documentation should be provided for users of the on-line interactive system.

The developer should provide an electronic mail server which could automatically provide unit records from the on-line database in response to electronic mail queries containing the record UID. The developer should also provide text-file versions of the database and its documentation (Data Dictionary and system documentation) for anonymous FTP (File Transfer Protocol) transfer.
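
As an illustration only, the electronic mail server described above could be sketched roughly as follows. The report does not specify a UID format, a record layout, or a server address; the "HGM" plus seven digits pattern, the RECORDS dictionary, and mapserver@example.org are all hypothetical stand-ins. The sketch also assumes plain-text, single-part queries.

```python
import re
from email import message_from_string
from email.message import EmailMessage

# Hypothetical in-memory stand-in for the on-line database, keyed by record UID.
RECORDS = {
    "HGM0000001": "UID: HGM0000001\nType: STS\nChromosome: 7\n",
}

# Assumed UID format (illustrative only): "HGM" followed by seven digits.
UID_PATTERN = re.compile(r"\bHGM\d{7}\b")


def answer_query(raw_mail: str, server_address: str = "mapserver@example.org") -> EmailMessage:
    """Build a reply holding the unit record for each UID found in the query body."""
    query = message_from_string(raw_mail)
    body = query.get_payload()  # a str for single-part, plain-text mail
    uids = UID_PATTERN.findall(body)

    parts = []
    for uid in dict.fromkeys(uids):  # drop repeated UIDs, keep first-seen order
        record = RECORDS.get(uid)
        if record is None:
            record = f"UID: {uid}\nError: no such record\n"
        parts.append(record)
    if not parts:
        parts.append("Error: no record UID found in query\n")

    reply = EmailMessage()
    reply["From"] = server_address
    reply["To"] = query["From"]
    reply["Subject"] = "Re: " + (query["Subject"] or "")
    reply.set_content("\n".join(parts))
    return reply
```

A query whose body mentions one known and one unknown UID would get back the stored record for the first and an error stanza for the second. Actual delivery (for instance through an SMTP relay) is left out of the sketch.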
The developer should cooperate with institutions which wish to establish remote (read-only) copies of the master public database. The developer should supply cooperating institutions with adequate documentation covering database structure and installation and transaction format and processing.

The developer should collaborate with qualified investigators and developers whose research and development interest is in the further development of data models and presentation methods for genomic map data.

Research, Short Term

The following were considered appropriate areas of research to be conducted in the context of development of this database:

- add capability of representing and storing data for nonhuman species
- investigate the possibility of representing mammalian gene order and synteny
- explore the abstract and generic nature of mapping data and develop generic representation systems in anticipation of adding mapping data generated by as yet undiscovered mapping techniques
- develop methods for defining and controlling differential access to the data
- develop a means for providing an audit trail, or other historical record, of all changes to the database
- investigate methods for facilitating interdatabase interactions and connections
- develop a stable, documented application program interface (API) to the database
- develop a method for representing variations in data quality and for recording uncertainty
- develop means for integration of physical mapping data with genetic and cytogenetic maps
- develop means for providing ready user access to underlying supporting data (maintained in remote laboratory databases) through the database on-line user interface
- develop improvements in data presentation, including graphical representation of maps.
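
One of the items above, the audit trail, lends itself to a small sketch. The following is a minimal illustration, assuming a relational store with trigger support (SQLite is used here purely for convenience; the report names no product). The table names, column names, and UID value are hypothetical, and only changes to a single attribute (band localization) are tracked.

```python
import sqlite3

# Hypothetical schema: one unit-record table keyed by a public UID, plus an
# audit table that triggers fill in whenever a record is added or its band changes.
SCHEMA = """
CREATE TABLE locus (
    uid         TEXT PRIMARY KEY,
    symbol      TEXT NOT NULL,
    chromosome  TEXT NOT NULL,
    band        TEXT
);
CREATE TABLE locus_audit (
    audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    uid         TEXT NOT NULL,
    action      TEXT NOT NULL,
    old_band    TEXT,
    new_band    TEXT,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER locus_insert AFTER INSERT ON locus
BEGIN
    INSERT INTO locus_audit (uid, action, new_band)
    VALUES (NEW.uid, 'INSERT', NEW.band);
END;
CREATE TRIGGER locus_update AFTER UPDATE OF band ON locus
BEGIN
    INSERT INTO locus_audit (uid, action, old_band, new_band)
    VALUES (NEW.uid, 'UPDATE', OLD.band, NEW.band);
END;
"""


def open_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables and triggers in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def history(conn: sqlite3.Connection, uid: str) -> list:
    """Return (action, old_band, new_band) rows for one UID, oldest first."""
    return conn.execute(
        "SELECT action, old_band, new_band FROM locus_audit "
        "WHERE uid = ? ORDER BY audit_id",
        (uid,),
    ).fetchall()
```

Inserting a record and later refining its band would leave two rows in locus_audit for that UID, one per change, so earlier states can be recovered. Recording who made the change and why would need extra columns that this sketch omits.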
Research, Long Term

- investigate new, nonrelational database systems and new data models
- monitor advances in hardware improvement and develop plans for using new hardware to improve the quality of the database

All research projects should include specific plans for production of prototype systems and for their acceptance testing by the appropriate user communities.