benton@BIO.NLM.NIH.GOV (02/09/91)
REPORT OF
NIH-DOE AD HOC COMMITTEE ON FUTURE HUMAN GENOME MAP DATABASE
On 17 December, a group of scientists (David Ledbetter, Robert
Sparkes, Sue Naylor, Elbert Branscomb, Martin Bishop, and Philip
Green) met to provide advice to the government with regard to the
development of a central database to hold all major human genome
data, except for bulk sequences. During the all-day meeting,
several themes were stressed repeatedly:
First, the group stated most forcefully that the work will
involve a major service component. The genome and human gene
mapping communities will be relying upon such a database to
provide access to much crucial information. Therefore, the
group noted that it is imperative that the project be implemented
and managed in such a way that responsiveness to present and
future community needs will be ensured.
Second, the group stated that it must be recognized that the
work will involve an essential research component. Shorter term
research will be required to determine the best way to implement
many of the current needs for such a database and longer term
research will be required to ensure that the database can
migrate to different and improved hardware and software
platforms over time. Therefore, the group noted that it is
imperative that the project be implemented and managed in such
a way that high quality research will be supported and
facilitated.
Third, the group felt that the database developer should provide,
in some real sense, an intellectual focus for the interpretation
of genomic data. In particular, they noted that the project
should act as a center for the synthesis, integration, and
editorial control of mapping information of all kinds.
Fourth, the group noted that all of the data and knowledge
residing in such a database will form an irreplaceable
intellectual resource for the scientific community. Therefore,
the group noted that federal support for such a database should
be contingent upon guarantees that all of the information in the
database will be freely available to the scientific community,
now and in the future.
Fifth, the group urged that the government support such a
project using a funding vehicle and management plan that would
be designed to provide enough freedom to ensure the success of
the research component, while also providing enough oversight
and control to ensure the adequacy of the service component.
In addition to these general themes, the group offered specific
advice pertinent to both the research and the service components
of the project.
Responsiveness to Community Needs
The Developer should devise means for maintaining contact
with and responding to the changing needs of the human genome
research community. This might well be accomplished by
establishing one or more advisory committees that could serve
as a documented conduit for community advice to the database
developers. An oversight mechanism should be established to
monitor the degree to which the database developers respond
to community needs as presented by the advisory committee(s).
A method should be developed for ensuring close interactions
among representatives of the supporting agencies and between
the agencies and the developer. This might be accomplished
by establishing a steering committee, composed of
representatives of the cooperating funding agencies, who
could interact to give overall administrative direction to
the project.
To the extent that the database contains data derived from,
or references data in, external databases, the developer
should propose plans for establishing and maintaining
collaborations with those databases as required to maintain
the validity and integrity of the data in the Human Genome
Map Database.
The developer should propose plans for establishing and
maintaining contact with organizations and institutions
responsible for or recognized as authoritative for
nomenclature of the various data items within the database.
The developer should provide community education and
training. Examples of appropriate activities in this plan
include:
- Formal short courses on the database structure and its
use
- Workshop sessions, demonstrations, and presentations
at scientific meetings
- Seminars
Present Community Needs, Implementation
1. Contents. The database should contain human genomic map
data describing the following genetic markers (at a minimum):
- Sequence Tagged Sites (STSs)
- Arbitrary DNA segments (also known as "anonymous DNA markers")
- Chromosome breakpoints (including translocations,
deletions, etc.)
- Fragile sites
- Sites of known crossovers
For each marker, the database should contain the useful and
appropriate attributes, for example:
- Approved locus name and gene symbol
- Chromosomal and sub-chromosomal band localization
- Chromosomal location relative to chromosome length
- Marker identification
- Marker order
- Recombination frequency, linkage distance, and
likelihood support
- Polymorphism data
- STS sequence
- Clone and probe identification
- Bibliographic information on sources of factual data
In addition, the database should contain higher order
derivations from the base data, for example:
- Consensus cytogenetic map
- CEPH consortium maps
- Consensus linkage map
- Contig maps
- Long-range restriction maps
The database should also contain high-priority supporting
data not otherwise available in public databases, for
example:
- Genotypic information on CEPH families (required)
- Genotypic information on other defined families (e.g.,
Venezuelan and disease families)
2. Cross references. The database should contain
cross-references to the following databases (where
available):
- GenBank Nucleotide Sequence Database
- Mendelian Inheritance in Man (MIM)
- Cell and/or probe repository catalog number(s)
- Genetic map databases (e.g., Gbase) for species showing
significant synteny with the human
- Chromosome breakpoint database
3. Database structure. To ensure the development of a
robust and stable production quality system, the database
should be developed using well known, commercially available,
proven software. At the present time, this implies that the
database should be designed as a normalized relational
database (RDB). Primary entities ("unit records") should be
identified by public, unchanging unique identifiers (UIDs).
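As an illustration of the point about normalized tables and
unchanging UIDs, here is a minimal sketch using Python's built-in
sqlite3 module. The table names, column names, and UID format are
hypothetical and are not taken from the committee's report. The
sketch shows only that a record's UID can stay fixed while a
mutable attribute, such as the gene symbol, is revised.

```python
import sqlite3

# Hypothetical schema for illustration only; names and UID format are assumptions.
SCHEMA = """
CREATE TABLE locus (
    uid        TEXT PRIMARY KEY,   -- public, unchanging identifier
    symbol     TEXT NOT NULL,      -- approved gene symbol (may be revised)
    chromosome TEXT NOT NULL
);
CREATE TABLE marker (
    uid       TEXT PRIMARY KEY,
    locus_uid TEXT NOT NULL REFERENCES locus(uid),
    kind      TEXT NOT NULL CHECK (kind IN
        ('STS', 'anonymous', 'breakpoint', 'fragile_site', 'crossover'))
);
"""


def build_database():
    """Create an in-memory database with foreign-key checks enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def rename_symbol(conn, locus_uid, new_symbol):
    """Change a locus symbol; the UID, and every reference to it, is untouched."""
    conn.execute("UPDATE locus SET symbol = ? WHERE uid = ?",
                 (new_symbol, locus_uid))


def symbol_for_marker(conn, marker_uid):
    """Look up the current symbol of the locus a marker belongs to."""
    row = conn.execute(
        "SELECT l.symbol FROM marker m JOIN locus l ON l.uid = m.locus_uid "
        "WHERE m.uid = ?", (marker_uid,)).fetchone()
    return row[0] if row else None


conn = build_database()
conn.execute("INSERT INTO locus VALUES ('L0000001', 'OLDSYM', '7')")
conn.execute("INSERT INTO marker VALUES ('M0000001', 'L0000001', 'STS')")
rename_symbol(conn, "L0000001", "NEWSYM")
current_symbol = symbol_for_marker(conn, "M0000001")
```

Because the marker row refers to the locus by UID rather than by
symbol, the rename needs no change to the marker table.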
4. Data dictionary. The database structure should be
described in a data dictionary or repository which would be
available to database users. In addition to documenting the
database schema and providing prose descriptions of the
database tables and data elements (fields), the data
dictionary should document integrity constraints implemented
in the database (or in layered software) and the rationale
for the database design. Copies of the data dictionary
should be available to users at a nominal cost.
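The schema portion of such a data dictionary could, in principle,
be generated from the database itself. The sketch below assumes a
SQLite database and a hypothetical one-table schema. It covers only
the machine-readable part (tables, columns, types, constraints); the
prose descriptions and design rationale the committee asks for
would still have to be written by hand.

```python
import sqlite3


def data_dictionary(conn):
    """Return {table: [(column, declared_type, not_null, is_primary_key), ...]}."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    result = {}
    for table in tables:
        # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        result[table] = [(c[1], c[2], bool(c[3]), bool(c[5])) for c in columns]
    return result


# Hypothetical example table, used only to exercise the function.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE locus (uid TEXT PRIMARY KEY, symbol TEXT NOT NULL)")
dictionary = data_dictionary(demo)
for table, columns in dictionary.items():
    for name, decl_type, not_null, is_pk in columns:
        print(table, name, decl_type, "NOT NULL" if not_null else "", "PK" if is_pk else "")
```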
5. Software. All software developed under this contract
should be designed and implemented for maximal portability
consistent with timely and cost-effective delivery of
service. To this end, the developers should maintain the
database using a commercially available relational database
management system (RDBMS). All supporting systems and
application software should be written using commercially
available programming language compilers and libraries.
6. Computing facilities. To facilitate user and developer
access, the database should be maintained on a computer (or
network of computers) which uses the Unix operating system
and which is connected to the U.S. Research Internet by
communications lines which operate at a minimum of 1.54
Mbits/sec.
7. Performance. The database should respond to user queries
in a reasonable time, and the database developers should
regularly monitor the system efficiency and response times
and should take corrective steps whenever response
degradation becomes significant.
8. Documentation. The developer should maintain complete
documentation and source code for all software developed
under this contract and complete documentation for all
database design and implementation. This information
should be made available to the government and to others
to ensure the long term functionality of the system and
to ensure long term access by the scientific community,
even beyond the termination of the developer's involvement
in the project.
9. Security. The database will represent an irreplaceable
resource for the community. Therefore, the developer must
take care to ensure that the data and programs are protected
against loss.
Present Community Needs, Data Collection and Database
Maintenance
1. Scope of data. The database should represent scientific
reports of the specific data items listed above. "Scientific
reports" include journal articles identified in Medline as
containing the required data and direct submissions to the
database that contain the required data.
New data should be incorporated into the database and made
available on a timely basis. The developer should devise
methods to encourage the submission of data which might
otherwise remain unpublished and uncommunicated.
2. Data collection. The developer should promote direct
investigator submission of data and should incorporate
investigator-submitted data into the database in a timely
fashion. The developer should propose and construct software
tools to facilitate direct data submission. The developer
should also promote collaborations with other databases in
order to incorporate relevant data collected by those
databases. The developer may also wish to propose use of
other organizations (e.g., Human Gene Mapping Workshop
committees and the single-chromosome workshop committees) for
data collection and validation.
The developer should provide plans for ensuring the accuracy
of the data contained in the database. This might well be
done by identifying curators for various subcomponents of the
database.
Off-line copies of the database should be distributed
regularly, perhaps quarterly, in standard formats such as
nine-track tapes and/or CD-ROM. Adequate user documentation,
including descriptions of the distribution files and formats,
should be provided for recipients of the off-line
distributions.
The developer should design and implement a cost-effective,
interactive, on-line user interface which provides an
integrated and comprehensive user view of the database. Data
records should be accessible by, at a minimum, locus name,
gene symbol, author name, journal citation, chromosome, band
localization, and map position (as applicable). The on-line
accessible database should be updated daily. Access to the
on-line database should be provided through the Internet and
through either (or both) a Public Data Network (PDN) or
direct dial-up phone lines. Initially, PDN or dial-up
connections should be provided for at least three
simultaneous users. The developer should monitor the usage
of these lines and should add lines as demand requires. The
connection to the Internet should operate at a minimum of
1.54 Mbits/sec. Adequate user documentation should be
provided for users of the on-line interactive system.
The developer should provide an electronic mail server which
could automatically provide unit records from the on-line
database in response to electronic mail queries containing
the record UID. The developer should also provide text-file
versions of the database and its documentation (Data
Dictionary and system documentation) for anonymous FTP (File
Transfer Protocol) transfer.
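A mail server of the kind described here might work roughly as
sketched below. This is an illustration under stated assumptions,
not a specification: the UID format (one capital letter followed by
seven digits), the in-memory record store, and the reply layout are
all hypothetical, and the query is assumed to be a plain-text,
non-multipart message. Receiving and sending mail are left out.

```python
import re
from email import message_from_string
from email.message import EmailMessage

# Hypothetical record store and UID format, for illustration only.
RECORDS = {"L0000001": "uid: L0000001\nsymbol: EXAMPLE\nchromosome: 7"}
UID_PATTERN = re.compile(r"\b[A-Z]\d{7}\b")


def answer_query(raw_message, records=RECORDS):
    """Build a reply containing the unit record for each UID in the query body."""
    query = message_from_string(raw_message)
    body = query.get_payload()  # a str for plain, non-multipart messages

    reply = EmailMessage()
    if query["From"]:
        reply["To"] = query["From"]
    reply["Subject"] = "Re: " + (query["Subject"] or "")

    uids = UID_PATTERN.findall(body)
    if uids:
        parts = [records.get(uid, f"uid: {uid}\nerror: no such record")
                 for uid in uids]
        reply.set_content("\n\n".join(parts))
    else:
        reply.set_content("error: no UID found in query")
    return reply


incoming = "From: user@example.org\nSubject: query\n\nL0000001\n"
response = answer_query(incoming)
print(response.get_content())
```

Keying the service on the UID alone keeps the query format trivial,
which is one reason stable public identifiers matter for this kind
of access.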
The developer should cooperate with institutions which wish
to establish remote (read-only) copies of the master public
database. The developer should supply cooperating
institutions with adequate documentation covering database
structure and installation and transaction format and
processing.
The developer should collaborate with qualified investigators
and developers whose research and development interest is in
the further development of data models and presentation
methods for genomic map data.
Research, Short Term
The following were considered appropriate areas of research to
be conducted in the context of development of this database:
- Add capability of representing and storing data for nonhuman
species
- Investigate the possibility of representing mammalian gene
order and synteny
- Explore the abstract and generic nature of mapping data and
develop generic representation systems in anticipation of
adding mapping data generated by as yet undiscovered mapping
techniques
- Develop methods for defining and controlling differential
access to the data
- Develop a means for providing an audit trail, or other
historical record, of all changes to the database
- Investigate methods for facilitating interdatabase
interactions and connections
- Develop a stable, documented application program interface
(API) to the database
- Develop a method for representing variations in data quality
and for recording uncertainty
- Develop means for integration of physical mapping data with
genetic and cytogenetic maps
- Develop means for providing ready user access to underlying
supporting data (maintained in remote laboratory databases)
through the database on-line user interface
- Develop improvements in data presentation, including
graphical representation of maps.
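One of the items above, an audit trail of all changes to the
database, can be illustrated with a database trigger. The sketch
below is one possible approach, not a method proposed by the
committee. It uses SQLite, and the table and column names are
hypothetical. Every update to a locus symbol is copied, with a
timestamp, into a separate history table.

```python
import sqlite3

# Hypothetical tables for illustration; only symbol changes are audited here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locus (
    uid    TEXT PRIMARY KEY,
    symbol TEXT NOT NULL
);
CREATE TABLE locus_audit (
    audit_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    uid        TEXT NOT NULL,
    old_symbol TEXT,
    new_symbol TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER locus_symbol_audit
AFTER UPDATE OF symbol ON locus
BEGIN
    INSERT INTO locus_audit (uid, old_symbol, new_symbol)
    VALUES (OLD.uid, OLD.symbol, NEW.symbol);
END;
""")

conn.execute("INSERT INTO locus VALUES ('L0000001', 'OLDSYM')")
conn.execute("UPDATE locus SET symbol = 'NEWSYM' WHERE uid = 'L0000001'")

# The history table now holds one row describing the change.
history = conn.execute(
    "SELECT uid, old_symbol, new_symbol FROM locus_audit").fetchall()
print(history)
```

Recording changes inside the database, rather than in application
code, means the history is kept no matter which program performs
the update.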
Research, Long Term
- Investigate new, nonrelational database systems and new data
models
- Monitor advances in hardware improvement and develop plans
for using new hardware to improve the quality of the database
All research projects should include specific plans for
production of prototype systems and for their acceptance
testing by the appropriate user communities.