[bionet.molbio.genome-program] Report on Genome Map Database

benton@BIO.NLM.NIH.GOV (02/09/91)

REPORT OF 
NIH-DOE AD HOC COMMITTEE ON FUTURE HUMAN GENOME MAP DATABASE


On 17 December, a group of scientists (David Ledbetter, Robert
Sparkes, Sue Naylor, Elbert Branscomb, Martin Bishop, and Philip
Green) met to provide advice to the government with regard to the
development of a central database to hold all major human genome
data, except for bulk sequences.  During the all-day meeting,
several themes were stressed repeatedly:

   First, the group stated most forcefully that the work will
   involve a major service component.  The genome and human gene
   mapping communities will be relying upon such a database to
   provide access to much crucial information.  Therefore, the
   group noted that it is imperative that the project be
   implemented and managed in such a way that responsiveness to
   present and future community needs will be ensured.

   Second, the group stated that it must be recognized that the
   work will involve an essential research component.  Shorter term
   research will be required to determine the best way to implement
   many of the current needs for such a database and longer term
   research will be required to ensure that the database can
   migrate to different and improved hardware and software
   platforms over time.  Therefore, the group noted that it is
   imperative that the project be implemented and managed in such
   a way that high quality research will be supported and
   facilitated.

   Third, the group felt that the database developer should provide,
   in some real sense, an intellectual focus for the interpretation
   of genomic data.  In particular, they noted that the project
   should act as a center for the synthesis, integration, and
   editorial control for mapping information of all kinds.

   Fourth, the group noted that all of the data and knowledge
   residing in such a database will form an irreplaceable
   intellectual resource for the scientific community.  Therefore,
   the group noted that federal support for such a database should
   be contingent upon guarantees that all of the information in the
   database will be freely available to the scientific community,
   now and in the future.

   Fifth, the group urged that the government support such a
   project using a funding vehicle and management plan that would
   be designed to provide enough freedom to ensure the success of
   the research component, while also providing enough oversight
   and control to ensure the adequacy of the service component.

In addition to these general themes, the group offered specific
advice pertinent to both the research and the service components
of the project.

   Responsiveness to Community Needs

      The Developer should devise means for maintaining contact
      with and responding to the changing needs of the human genome
      research community.  This might well be accomplished by
      establishing one or more advisory committees that could serve
      as a documented conduit for community advice to the database
      developers.  An oversight mechanism should be established to
      monitor the degree to which the database developers respond
      to community needs as presented by the advisory committee(s).

      A method should be developed for ensuring close interactions
      among representatives of the supporting agencies and between
      the agencies and the developer.  This might be accomplished
      by establishing a steering committee, composed of
      representatives of the cooperating funding agencies, who
      could interact to give overall administrative direction to
      the project. 

      To the extent that the database contains data derived from
      or references data in external databases, the developer
      should propose plans for establishing and maintaining
      collaborations with those databases as required to maintain
      the validity and integrity of the data in the Human Genome
      Map Database.

      The developer should propose plans for establishing and
      maintaining contact with organizations and institutions
      responsible for or recognized as authoritative for
      nomenclature of the various data items within the database.

      The developer should provide community education and
      training.  Examples of appropriate activities in this plan
      include:

         -  Formal short courses on the database structure and its
            use
         -  Workshop sessions, demonstrations, and presentations
            at scientific meetings
         -  Seminars


   Present Community Needs, Implementation

      1.  Contents.  The database should contain human genomic map
      data describing the following genetic markers (at a minimum):

         -  Sequence Tagged Sites (STSs)
         -  Arbitrary DNA segments (AKA, "anonymous DNA markers")
         -  Chromosome breakpoints (including translocations,
            deletions, etc.)
         -  Fragile sites
         -  Sites of known crossovers

      For each marker, the database should contain the useful and
      appropriate attributes, for example:
         -  Approved locus name and gene symbol
         -  Chromosomal and sub-chromosomal band localization
         -  Chromosomal location relative to chromosome length
         -  Marker identification
         -  Marker order
         -  Recombination frequency, linkage distance, and
            likelihood support
         -  Polymorphism data
         -  STS sequence
         -  Clone and probe identification
         -  Bibliographic information on sources of factual data

      In addition, the database should contain higher order
      derivations from the base data, for example:
         -  Consensus cytogenetic map
         -  CEPH consortium maps
         -  Consensus linkage map
         -  Contig maps
         -  Long-range restriction maps

      The database should also contain high-priority supporting
      data not otherwise available in public databases, for
      example:
         -  Genotypic information on CEPH families (required)
         -  Genotypic information on other defined families (e.g.,
            Venezuelan and disease families)

      2.  Cross references.  The database should contain
      cross-references to the following databases (where
      available):
         -  GenBank Nucleotide Sequence Database
         -  Mendelian Inheritance in Man (MIM)
         -  Cell and/or probe repository catalog number(s)
         -  Genetic map databases (e.g., Gbase) for species showing
            significant synteny with the human
         -  Chromosome breakpoint database

      3.  Database structure.  To ensure the development of a
      robust and stable production quality system, the database
      should be developed using well known, commercially available,
      proven software.  At the present time, this implies that the
      database should be designed as a normalized relational
      database (RDB).  Primary entities ("unit records") should be
      identified by public, unchanging unique identifiers (UIDs). 
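
      As an illustration only, the structure described in item 3
      might be sketched as follows, using Python's built-in sqlite3
      module as a stand-in for a commercial RDBMS.  The table names,
      column names, and UID values are hypothetical and are not
      taken from any existing database.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE marker (
    uid         TEXT PRIMARY KEY,   -- public, unchanging identifier
    marker_type TEXT NOT NULL,      -- e.g. 'STS' or 'fragile site'
    locus_name  TEXT,
    chromosome  TEXT NOT NULL,
    band        TEXT
);
CREATE TABLE citation (
    uid     TEXT PRIMARY KEY,
    authors TEXT NOT NULL,
    journal TEXT NOT NULL,
    year    INTEGER NOT NULL
);
-- The many-to-many relation lives in its own link table, so neither
-- entity table repeats data belonging to the other.
CREATE TABLE marker_citation (
    marker_uid   TEXT NOT NULL REFERENCES marker(uid),
    citation_uid TEXT NOT NULL REFERENCES citation(uid),
    PRIMARY KEY (marker_uid, citation_uid)
);
"""


def build_database():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the references
    conn.executescript(SCHEMA)
    return conn


def citations_for(conn, marker_uid):
    """Return the UIDs of all citations linked to one marker."""
    rows = conn.execute(
        "SELECT c.uid FROM citation c "
        "JOIN marker_citation mc ON mc.citation_uid = c.uid "
        "WHERE mc.marker_uid = ? ORDER BY c.uid",
        (marker_uid,),
    )
    return [row[0] for row in rows]


# Placeholder records, invented purely to exercise the schema.
conn = build_database()
conn.execute(
    "INSERT INTO marker VALUES "
    "('M0000001', 'STS', 'EXAMPLE-LOCUS', '1', 'p36')"
)
conn.execute(
    "INSERT INTO citation VALUES "
    "('C0000001', 'Example A', 'Example Journal', 1990)"
)
conn.execute("INSERT INTO marker_citation VALUES ('M0000001', 'C0000001')")
print(citations_for(conn, "M0000001"))
```

      Because every unit record carries its own UID, references
      between tables (and from outside the database) stay valid
      even when descriptive fields such as the locus name change.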
      

      4.  Data dictionary.  The database structure should be
      described in a data dictionary or repository which would be
      available to database users.  In addition to documenting the
      database schema and providing prose descriptions of the
      database tables and data elements (fields), the data
      dictionary should document integrity constraints implemented
      in the database (or in layered software) and the rationale
      for the database design.  Copies of the data dictionary
      should be available to users at a nominal cost.
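
      One possible way to keep such a data dictionary in step with
      the database is to derive the structural part from the live
      schema and to merge in the prose descriptions.  The sketch
      below assumes SQLite (via Python's sqlite3 module) in place
      of a commercial RDBMS; the table, the columns, and the
      description texts are hypothetical.

```python
import sqlite3

# Hypothetical prose descriptions, keyed by (table, column).
DESCRIPTIONS = {
    ("marker", "uid"): "Public, unchanging unique identifier.",
    ("marker", "chromosome"): "Chromosome on which the marker lies.",
}


def data_dictionary(conn):
    """Return one dictionary entry per column of every table."""
    entries = []
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        # PRAGMA table_info yields
        # (cid, name, type, notnull, dflt_value, pk) for each column.
        for _cid, name, ctype, notnull, _default, pk in conn.execute(
            f"PRAGMA table_info({table})"
        ):
            entries.append(
                {
                    "table": table,
                    "column": name,
                    "type": ctype,
                    "not_null": bool(notnull),
                    "primary_key": bool(pk),
                    "description": DESCRIPTIONS.get(
                        (table, name), "(undocumented)"
                    ),
                }
            )
    return entries


def format_entry(entry):
    """Render one entry as a single line of plain text."""
    flags = []
    if entry["primary_key"]:
        flags.append("PRIMARY KEY")
    if entry["not_null"]:
        flags.append("NOT NULL")
    return "{table}.{column}  {type}  [{flags}]  {description}".format(
        flags=", ".join(flags), **entry
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE marker ("
    "uid TEXT PRIMARY KEY, chromosome TEXT NOT NULL, band TEXT)"
)
for entry in data_dictionary(conn):
    print(format_entry(entry))
```

      Columns that lack a prose description are reported as
      "(undocumented)", which gives the developer a simple check
      that the dictionary is complete.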

      5.  Software.  All software developed under this contract
      should be designed and implemented for maximal portability
      consistent with timely and cost-effective delivery of
      service.  To this end, the developers should maintain the
      database using a commercially available relational database
      management system (RDBMS).  All supporting systems and
      application software should be written using commercially
      available programming language compilers and libraries.  

      6.  Computing facilities.  To facilitate user and developer
      access, the database should be maintained on a computer (or
      network of computers) which uses the Unix operating system
      and which is connected to the U.S. Research Internet by
      communications lines which operate at a minimum of 1.54
      Mbits/sec.

      7.  Performance.  The database should respond to user queries
      in a reasonable time, and the database developers should
      regularly monitor the system efficiency and response times
      and should take corrective steps whenever response
      degradation becomes significant.
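
      The monitoring called for in item 7 could take many forms.
      The following is a minimal sketch in Python; the class name,
      the baseline, and the rule that "significant" means a median
      response time above twice the baseline are all assumptions
      made for the example, not requirements of this report.

```python
import time
from collections import deque
from statistics import median


class ResponseMonitor:
    """Track recent query response times and flag degradation."""

    def __init__(self, baseline_seconds, factor=2.0, window=100):
        self.baseline = baseline_seconds   # expected response time
        self.factor = factor               # assumed tolerance multiplier
        self.samples = deque(maxlen=window)  # most recent timings only

    def record(self, seconds):
        """Store one observed response time."""
        self.samples.append(seconds)

    def timed(self, func, *args, **kwargs):
        """Run func, record how long it took, and return its result."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.record(time.perf_counter() - start)

    def degraded(self):
        """True when the median recent time exceeds the tolerance."""
        if not self.samples:
            return False
        return median(self.samples) > self.factor * self.baseline


monitor = ResponseMonitor(baseline_seconds=0.1)
for seconds in (0.05, 0.08, 0.09):
    monitor.record(seconds)
print(monitor.degraded())
```

      The median is used rather than the mean so that a single
      unusually slow query does not by itself trigger corrective
      action.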

      8.  Documentation.  The developer should maintain complete
      documentation and source code for all software developed
      under this contract and complete documentation for all
      database design and implementation.  This information
      should be made available to the government and to others
      to ensure the long term functionality of the system and
      to ensure long term access by the scientific community,
      even beyond the termination of the developer's involvement
      in the project.

      9.  Security.  The database will represent an irreplaceable
      resource for the community.  Therefore, the developer must
      take care to ensure that the data and programs are protected
      against loss.
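
      One element of protection against loss is a backup that is
      checked after it is written.  The sketch below assumes a
      SQLite database and Python's sqlite3 module; the function
      names and the file name are hypothetical, and a production
      system would rely on the backup facilities of its own RDBMS.

```python
import hashlib
import os
import sqlite3
import tempfile


def content_digest(conn):
    """Hash the full SQL dump of the database behind a connection."""
    digest = hashlib.sha256()
    for statement in conn.iterdump():
        digest.update(statement.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def backup_and_verify(source, path):
    """Copy source to the file at path; raise if the copy differs."""
    copy = sqlite3.connect(path)
    try:
        source.backup(copy)  # sqlite3 online backup (Python 3.7+)
        if content_digest(copy) != content_digest(source):
            raise RuntimeError("backup does not match the source")
    finally:
        copy.close()
    return path


# Usage with a throwaway database and a temporary backup file.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE marker (uid TEXT PRIMARY KEY, band TEXT)")
source.execute("INSERT INTO marker VALUES ('M0000001', 'p36')")
source.commit()
backup_path = os.path.join(tempfile.mkdtemp(), "genome_map_backup.db")
backup_and_verify(source, backup_path)
```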

   Present Community Needs, Data Collection and Database
   Maintenance

      1.  Scope of data.  The database should represent scientific
      reports of the specific data items listed above.  "Scientific
      reports" includes both journal articles identified in Medline
      as containing the required data and direct submissions to the
      database which contain the required data.

      New data should be incorporated into the database and made
      available on a timely basis.  The developer should devise
      methods to encourage the submission of data which might
      otherwise remain unpublished and uncommunicated.

      2.  Data collection.  The developer should promote direct
      investigator submission of data and should incorporate
      investigator-submitted data into the database in a timely
      fashion.  The developer should propose and construct software
      tools to facilitate direct data submission.  The developer
      should also promote collaborations with other databases in
      order to incorporate relevant data collected by those
      databases.  The developer may also wish to propose use of
      other organizations (e.g., Human Gene Mapping Workshop
      committees and the single-chromosome workshop committees) for
      data collection and validation.

      The developer should provide plans for ensuring the accuracy
      of the data contained in the database.  This might well be
      done by identifying curators for various subcomponents of the
      database.

      Off-line copies of the database should be distributed
      regularly, perhaps quarterly, in standard formats such as
      nine-track tapes and/or CD-ROM.  Adequate user documentation,
      including descriptions of the distribution files and formats,
      should be provided for recipients of the off-line
      distributions. 

      The developer should design and implement a cost-effective,
      interactive, on-line user interface which provides an
      integrated and comprehensive user view of the database.  Data
      records should be accessible by, at a minimum, locus name,
      gene symbol, author name, journal citation, chromosome, band
      localization, and map position (as applicable).  The on-line
      accessible database should be updated daily.  Access to the
      on-line database should be provided through the Internet and
      through either (or both) a Public Data Network (PDN) or
      direct dial-up phone lines.  Initially, PDN or dial-up
      connections should be provided for at least three
      simultaneous users.  The developer should monitor the usage
      of these lines and should add lines as demand requires.  The
      connection to the Internet should operate at a minimum of
      1.54 Mbits/sec.  Adequate user documentation should be
      provided for users of the on-line interactive system. 

      The developer should provide an electronic mail server which
      could automatically provide unit records from the on-line
      database in response to electronic mail queries containing
      the record UID.  The developer should also provide text-file
      versions of the database and its documentation (Data
      Dictionary and system documentation) for anonymous FTP (File
      Transfer Protocol) transfer. 
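
      The electronic mail server described above might work along
      the following lines.  This is a sketch only: the UID format
      (the letter M followed by seven digits), the record text,
      the server address, and the function name are hypothetical,
      and the in-memory table stands in for the on-line database.

```python
import re
from email import message_from_string
from email.message import EmailMessage

# Hypothetical UID format: 'M' followed by seven digits.
UID_PATTERN = re.compile(r"\bM\d{7}\b")

# Stand-in for the on-line database; the record is a placeholder.
RECORDS = {
    "M0000001": "UID: M0000001\nTYPE: STS\nCHROMOSOME: 1\n",
}


def build_reply(raw_query, server_address="server@example.org"):
    """Build a reply holding every unit record named in the query."""
    query = message_from_string(raw_query)
    if not query["From"]:
        raise ValueError("query has no From header to reply to")

    # Only plain-text (non-multipart) queries are searched for UIDs.
    text = "" if query.is_multipart() else query.get_payload()
    uids = list(dict.fromkeys(UID_PATTERN.findall(text)))  # de-duplicate

    parts = [
        RECORDS.get(uid, f"UID: {uid}\nERROR: no such record\n")
        for uid in uids
    ]
    if not parts:
        parts.append("ERROR: no record UID found in query\n")

    reply = EmailMessage()
    reply["From"] = server_address
    reply["To"] = query["From"]
    reply["Subject"] = "Re: " + (query["Subject"] or "database query")
    reply.set_content("\n".join(parts))
    return reply


raw = (
    "From: user@example.org\n"
    "Subject: record request\n"
    "\n"
    "Please send M0000001 and M9999999.\n"
)
print(build_reply(raw).get_content())
```

      Unknown UIDs produce an explicit error entry in the reply
      rather than silence, so that a user can tell a mistyped
      identifier from a lost message.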

      The developer should cooperate with institutions which wish
      to establish remote (read-only) copies of the master public
      database.  The developer should supply cooperating
      institutions with adequate documentation covering database
      structure and installation and transaction format and
      processing.

      The developer should collaborate with qualified investigators
      and developers whose research and development interest is in
      the further development of data models and presentation
      methods for genomic map data.

   Research, Short Term

      The following were considered appropriate areas of research to
      be conducted in the context of development of this database:

         -  Add capability of representing and storing data for
            nonhuman species
         -  Investigate the possibility of representing mammalian
            gene order and synteny
         -  Explore the abstract and generic nature of mapping data
            and develop generic representation systems in
            anticipation of adding mapping data generated by as yet
            undiscovered mapping techniques
         -  Develop methods for defining and controlling differential
            access to the data
         -  Develop a means for providing an audit trail, or other
            historical record, of all changes to the database
         -  Investigate methods for facilitating interdatabase
            interactions and connections
         -  Develop a stable, documented application program
            interface (API) to the database
         -  Develop a method for representing variations in data
            quality and for recording uncertainty
         -  Develop means for integration of physical mapping data
            with genetic and cytogenetic maps
         -  Develop means for providing ready user access to
            underlying supporting data (maintained in remote
            laboratory databases) through the database on-line user
            interface
         -  Develop improvements in data presentation, including
            graphical representation of maps
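
      To make one of these research areas concrete, the audit
      trail mentioned above could be prototyped with database
      triggers.  The sketch below assumes SQLite through Python's
      sqlite3 module; the table names, column names, and sample
      values are hypothetical, and other designs are equally
      possible.

```python
import sqlite3

# Hypothetical tables: every change to marker is copied into
# marker_audit by a trigger, so application code cannot skip it.
SCHEMA = """
CREATE TABLE marker (uid TEXT PRIMARY KEY, band TEXT);
CREATE TABLE marker_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    marker_uid TEXT NOT NULL,
    action     TEXT NOT NULL,
    old_band   TEXT,
    new_band   TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER marker_insert AFTER INSERT ON marker BEGIN
    INSERT INTO marker_audit (marker_uid, action, new_band)
    VALUES (NEW.uid, 'INSERT', NEW.band);
END;
CREATE TRIGGER marker_update AFTER UPDATE ON marker BEGIN
    INSERT INTO marker_audit (marker_uid, action, old_band, new_band)
    VALUES (OLD.uid, 'UPDATE', OLD.band, NEW.band);
END;
CREATE TRIGGER marker_delete AFTER DELETE ON marker BEGIN
    INSERT INTO marker_audit (marker_uid, action, old_band)
    VALUES (OLD.uid, 'DELETE', OLD.band);
END;
"""


def open_audited():
    """Create an in-memory database whose marker table is audited."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def history(conn, uid):
    """Return (action, old_band, new_band) rows in the order made."""
    rows = conn.execute(
        "SELECT action, old_band, new_band FROM marker_audit "
        "WHERE marker_uid = ? ORDER BY audit_id",
        (uid,),
    )
    return list(rows)


conn = open_audited()
conn.execute("INSERT INTO marker VALUES ('M0000001', 'p36')")
conn.execute("UPDATE marker SET band = 'p35' WHERE uid = 'M0000001'")
for row in history(conn, "M0000001"):
    print(row)
```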


   Research, Long Term

         -  Investigate new, nonrelational database systems and new
            data models
         -  Monitor advances in hardware improvement and develop
            plans for using new hardware to improve the quality of
            the database


   All research projects should include specific plans for
   production of prototype systems and for their acceptance
   testing by the appropriate user communities.