[bionet.molbio.genome-program] JITF II Minutes

benton@BIO.NLM.NIH.GOV (02/12/91)
JITF Second Meeting
Nov 30 - Dec 1, 1990
Crystal City Marriott Hotel
Arlington, VA

MINUTES

The Joint Informatics Task Force met for the second time on Nov 30 and
Dec 1.  The list of attendees is attached.  Abstracts and other
materials provided by the speakers will be included with the minutes in
hardcopy form. 

This meeting specifically addressed the needs of the Human Genome
Project for public databases containing mapping and sequence data.  The
discussions were, therefore, organized around presentations from extant
map and sequence databases. 


EXECUTIVE SESSION

During the opening executive session, the following points and questions
were raised as issues to be discussed in the course of the meeting:

- What kinds of data should be in a genetic map database?

- What data models are appropriate to these data?

- How much supporting information should be in mapping databases?  Can
supporting data be provided through distributed databases?

- Since the cost of the database operation is related to its size and
complexity, how much complexity is cost-effective?

- To what extent should each genome center develop its own code to meet
generic data management requirements?  Since it costs more and takes
longer, how much portability should be required in local database
designs and software implementations?

- Centers should have their own local databases, but should not get into
the database distribution business; but ...
- Centers can serve a broader community just by giving access to users
outside the center.  User support may be a problem for many centers.

- Since it is not cost-effective for each center to develop a separate
database design, grant proposals which include new database design
efforts should discuss existing efforts and give the rationale for
creating a new one.

- Inter-database communication is not viewed as a serious load on the
networks, since the amount of data that should usefully be transferred
is small (although large amounts of data could be usefully
cross-accessed).

- One approach to coordinating multiple databases (for communications)
is to use vendor-supplied tools.  This requires standardizing on
hardware/DBMS/OS/etc.  The alternative is to concentrate on developing
platform-independent transaction protocols.

- A final issue (which reappeared in multiple guises multiple times) was
the question of data release: how long should the originator be allowed
to hold on to his data?

The presentations given in the two open sessions are summarized very
briefly below.  Consult the speakers' handouts for more information.


MAPPING DATABASES

Dr. Peter Pearson and Mr. Richard Lucier presented the current status
and future development plans of the Genome Data Base (GDB), located in
the Welch Medical Library at the Johns Hopkins University and currently
supported by the Howard Hughes Medical Institute.  The database
currently contains information on 2190 genes, 4928 DNA segments, 113
fragile sites, and ca. 15,000 probes.  The on-line system (currently
the only public access mode) has 2878 registered users and has averaged
1779 accesses per month since beginning operation in September, 1990. 
International distribution through several national nodes is planned.
Dr. Pearson emphasized the importance of building the database content
through the active involvement of the scientific community in the
validation process.  GDB uses scientific editors for curation of the
database: currently 5 Johns Hopkins editors and 7 off-site editors are
affiliated with the project.  The editors are assisted by a total of 11
FTEs of editorial support staff.  The database required (in Sybase)
about 100MB when installed in August, 1990 and now requires 135MB,
although this is probably not a good measure of database growth, due to
reorganization of the 240 Sybase tables, etc.  Several expansions of the
scope of the database are planned for 1991; for example, inclusion of
the CEPH raw segregation data.

Dr. Elbert Branscomb presented a description of the physical mapping
database developed in the Lawrence Livermore National Laboratory
Genome Center and the access tools developed there.  He demonstrated a
relatively low-cost general approach to providing access to multiple
remote databases and described the design and implementation
requirements this approach imposes on the databases.  The "tin
standard", advocated by Dr. Branscomb, is outlined below.

Dr. James Fickett described the data analysis and database systems
implemented by the Genome Center at Los Alamos National Laboratory and
provided the Task Force with his observations on both the immediate and
longer-term database needs of the HGP and the appropriate architectures
to meet those needs.  Dr. Fickett proposed that, in the long run, the
proper overall design is that of a "Commonwealth Information Management" in
which many small collections are operated by local experts, providing
high quality, curated data and tied together by networks and
sophisticated access software.  In the near-term, however, what is
required is a central database for consensus maps of all types.  At a
minimum it should contain all published maps.  The central database
should provide a uniform interface to all the data; later multi-database
access can be provided by distributed user-interface software.  Dr.
Fickett listed four community needs the JITF could meet:

1. Encourage development and distribution of general-purpose tools or
services.

2. Establish a policy which requires all mapping efforts to include a
plan for data deposition in their charter.

3. Establish a central mapping database.

4.  Recognize and recommend that, while tool development for laboratory
data management should be in the context of a local sequencing or
mapping project, the user interface tools (for general use) should not
be part of a local lab project, nor should running the central mapping
database. 

Dr. Scott Tingey of DuPont Corporation described the database needs and
approaches of the Molecular Breeding Program (Plants) and described some
solutions to the problems of storing vast quantities of laboratory data
in accessible forms. 

Dr. Brian Hauge of Massachusetts General Hospital described the
Arabidopsis genome mapping program's progress and problems and its
database needs.


DISCUSSION - MAPPING DATABASES

Should the central database contain model organism data?  The human map
database should contain mouse synteny information, but separate
databases for C. elegans, E. coli, Drosophila, yeast, etc. are
appropriate, since the only connection among them is the sequence -- not
the map.

Should the design of the human map database be capable of representing
maps of other organisms?  Yes, to the extent that generalizing the
design does not impose unreasonable costs.

The agencies (or JITF) should facilitate communication (e.g., through
a workshop) between builders of databases for the various species.
For the reasons given above, the JITF did not recommend that a single
mapping database for all species be included in the Human Genome
Project.

No class of data should be excluded as a matter of principle from the
mapping database: any physical characterization that can be mapped to a
chromosome can be used as a marker.  Consensus data on genetic maps and
ordered clones should be included.  This database should be regarded as
a (the) means of reaching consensus on maps.

The central database should (somehow) provide program access to the
primary data.  The key to being able to satisfy this requirement is that
each database of supporting (primary) data have a well-documented,
clearly specified Application Program Interface (API).

Priority should be given first to the database, second to the API and
network accessibility, and last to the user interface.  Network
accessibility should be provided through the Internet, using
well-recognized standards (at the moment, this means TCP/IP).

Any central map database must be able to represent partially-specified
maps and its user interface needs to represent the partial state of the
data to the user.
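
As a minimal sketch of what "partially-specified" could mean in
practice, a map can be stored as a set of pairwise order facts rather
than as one complete ordering, so that the database can answer
"unknown" where the order of two markers has not been established.
The class name, method names, and marker names below are hypothetical
and are not taken from any database discussed at the meeting.

```python
from collections import defaultdict


class PartialMap:
    """A map whose marker order is only partly known.

    Each recorded fact is a pair (first, second) meaning
    "marker `first` is known to lie before marker `second`".
    """

    def __init__(self):
        self._after = defaultdict(set)  # marker -> markers known to follow it
        self.markers = set()

    def add_order(self, first, second):
        """Record that `first` is known to lie before `second`."""
        self.markers.update((first, second))
        self._after[first].add(second)

    def _reaches(self, start, goal):
        """True if `goal` follows `start` through a chain of recorded facts."""
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            for nxt in self._after[current]:
                if nxt == goal:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def relation(self, a, b):
        """Return "before", "after", or "unknown" for markers a and b."""
        if self._reaches(a, b):
            return "before"
        if self._reaches(b, a):
            return "after"
        return "unknown"


# Hypothetical markers: M1 < M2 < M4 is known, and M1 < M3 is known,
# but the position of M3 relative to M2 and M4 is not.
chromosome_map = PartialMap()
chromosome_map.add_order("M1", "M2")
chromosome_map.add_order("M2", "M4")
chromosome_map.add_order("M1", "M3")

print(chromosome_map.relation("M1", "M4"))  # before (follows from two facts)
print(chromosome_map.relation("M3", "M2"))  # unknown
```

A user interface built on such a structure can display the "unknown"
relations explicitly instead of implying an order that the data do not
support.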

The following consensus observations were recommended as important
guidelines for establishing genome data resources. 

1. Mapping databases are most naturally organized as organism-specific
consensus map databases, containing all the genetic and physical mapping
data which is significantly useful to the biomedical community.

2. The centralized consensus databases should provide access to the
supporting data.

3. Unless there is good justification for doing otherwise, both the
central databases and any supporting project databases should be
implemented using commercial client-server architecture relational
database management systems running on Posix-compliant computers
connected to the research Internet and supporting communications with the
TCP/IP protocol. 

4. The databases must provide a well-specified Application Program
Interface (API) supporting query/retrieval, at a minimum.

5. The databases must use a consistent, well-recognized standard for
typographical representation (e.g., SGML).

6. The databases must support the capability of differential (among
authorized users) accessibility of data.

7.  Data suppliers should be encouraged to estimate confidence limits for
data or consensus elements, and these should be represented in the
database. 

8. The databases should maintain a history of changes to the database
(an audit trail or set of editorial citations).
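
Guidelines 4, 6, 7, and 8 can be illustrated together with a small
in-memory sketch: a query call standing in for the API, a flag for
differential access, stored confidence limits, and a change history
per entry.  This is only an illustration under stated assumptions.
The names MapEntry and MapDatabase, the field names, and the position
units are hypothetical, and a real system would sit on a relational
DBMS as guideline 3 suggests.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MapEntry:
    marker: str
    position: float                  # hypothetical units along the map
    confidence: Tuple[float, float]  # lower and upper confidence limits
    public: bool = True              # False: visible to authorized users only
    history: List[str] = field(default_factory=list)  # audit trail


class MapDatabase:
    def __init__(self):
        self._entries = {}

    def put(self, entry, editor, note):
        """Store a new entry and record who entered it and why."""
        entry.history.append(f"{editor}: {note}")
        self._entries[entry.marker] = entry

    def revise(self, marker, position, confidence, editor, note):
        """Change an entry, keeping the previous values in its history."""
        entry = self._entries[marker]
        entry.history.append(
            f"{editor}: {note} (was {entry.position} {entry.confidence})"
        )
        entry.position = position
        entry.confidence = confidence

    def query(self, low, high, authorized=False):
        """Return entries positioned in [low, high], ordered by position.

        Non-public entries are returned only to authorized callers.
        """
        return sorted(
            (e for e in self._entries.values()
             if low <= e.position <= high and (e.public or authorized)),
            key=lambda e: e.position,
        )


db = MapDatabase()
db.put(MapEntry("M1", 10.0, (8.0, 12.0)), "editor-a", "initial entry")
db.put(MapEntry("M2", 20.0, (15.0, 25.0), public=False),
       "editor-b", "unpublished, held for the originator")
db.revise("M1", 11.0, (10.0, 12.0), "editor-a", "new data received")

print([e.marker for e in db.query(0, 30)])                   # ['M1']
print([e.marker for e in db.query(0, 30, authorized=True)])  # ['M1', 'M2']
```

Keeping the old values in the history, rather than overwriting them,
is one simple way to provide the editorial record that guideline 8
asks for.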

Concerning laboratory support databases the following points were made:

1.  Before its next meeting, the JITF will organize a workshop on
laboratory support databases and associated software.  The goal of this
workshop will be to develop a requirements specification for a general
laboratory support tool and a description of the current state of the
art.  A suggested approach to the problem is to evaluate extant
"electronic laboratory notebooks", identify essential functionality and
commonality among these, and attempt to describe experimental protocols
and data manipulation methods at a sufficient level of abstraction to
allow specification of general tools. 

2. Existing grants should be supplemented to support further
development, documentation, distribution, and support of laboratory
data management software already being developed in the projects.

3. The HGP should support research in areas of unresolved problems
(e.g., error representation).


SEQUENCE DATABASES

Dr. David Lipman presented the plans of the National Center for
Biotechnology Information (NCBI) of the National Library of Medicine
(NLM) for "transitioning GenBank" and the evolution of the
requirements analysis and design of the next generation GenBank
sequence database, to be managed by the NCBI.  The NCBI is now
implementing the GenInfo Backbone database to serve as a stable
archive of published sequence data, both nucleic acid and protein.
The GenInfo Backbone will provide the literature scanning component of
the next generation GenBank, while a group external to the NLM will
provide the direct submission component.  NLM will handle online access
and distribution.

The bulk of the work in populating the GenInfo Backbone is now done by
six indexers, who are coping with about 80% of the available data --
about 20 journals.  Four additional indexers will be hired in the near
future.  This should permit all the published data to be entered with
some margin of safety.  In addition to "full mark-up" sequences, the
database will contain "index sequences": short sequences which may not
be appropriate for "full markup" but, through sequence similarity, may
be useful for retrieval of relevant literature citations.  (Pilot 
tests have indicated that 17 bp was the useful lower end for retrieval; all
the PCR primers looked at in the pilot study were more than sufficient
for accurate retrieval.)  One relational database has been implemented
whose schema is optimized for database building and tracking; another
is being designed to optimize retrieval speed.  The ASN.1 distribution
is the important public view of the database, not the relational
schema.  NCBI will make the API to the on-line system accessible to a
restricted set of users.  GenBank and PIR data will be available in
the ASN.1 (ASCII-encoding) format soon and the output of the six
indexers (the NLM journal scans and data entry) will be available in
the spring of 1991.
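
The retrieval role of "index sequences" can be sketched as a lookup
table from short sequence words to literature citations.  The sketch
below is an assumption made for illustration, not a description of the
NCBI implementation: it uses exact matching of 17-base words in place
of whatever similarity search the pilot tests used, and the class
name, sequence, and citation label are hypothetical.

```python
from collections import defaultdict

K = 17  # shortest length the pilot tests found useful for retrieval


def kmers(sequence, k=K):
    """Return the set of all length-k words in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


class IndexSequenceTable:
    """Maps short sequence words to the citations they were reported in."""

    def __init__(self, k=K):
        self.k = k
        self._table = defaultdict(set)

    def add(self, sequence, citation):
        """Register an index sequence under a citation label."""
        if len(sequence) < self.k:
            raise ValueError(f"sequence shorter than {self.k} bases")
        for word in kmers(sequence, self.k):
            self._table[word].add(citation)

    def lookup(self, query):
        """Return citations sharing at least one k-base word with the query."""
        hits = set()
        for word in kmers(query, self.k):
            hits |= self._table.get(word, set())
        return hits


table = IndexSequenceTable()
primer = "ACGTTGCAAGCTTGCATGCC"      # hypothetical 20-base PCR primer
table.add(primer, "citation-1")

# A longer query containing the primer retrieves its citation.
print(table.lookup("TTT" + primer + "GGG"))  # {'citation-1'}
print(table.lookup("A" * 30))                # set()
```

Exact word matching will miss sequences that differ by even one base
within every word, so a production system would need a more tolerant
comparison.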

In later discussion, several questions about the NCBI database plans
and progress were raised.  In discussing the reasons the GenInfo
sequence database chose not to use the GenBank relational schema, Dr.
Lipman pointed out that the GenBank RDBMS was not ready when the
GenInfo project began, and the GenBank schema did not cover some
aspects of GenInfo, e.g., some DNA/Protein relationships, amino acid
sequences from peptide sequencing, authors' use of "organism" and
"gene name" terminology.  Questions were raised as to the relative
coverage of sequences by GenBank and GenInfo.  Dr. Lipman stated that
GenInfo would be a superset of GenBank in that it would contain all
sequences from the literature that GenBank considers "Full Markup" as
well as amino acid sequences (e.g. from N-terminal sequencing), and
"Index" sequences.  Further questions arose regarding comparisons
between GenBank and GenInfo.  Dr. Lipman stated that the GenBank
Advisory Board has advised NCBI and GenBank to coordinate their
activities on this and other issues and they will meet soon to
initiate this.  With this incipient collaboration in mind, the JITF
suggested that Dr. Cassatt ask the GenBank Advisors to include Dr.
Lipman as an author of the article they are writing on the future of
GenBank and sequence databases.  Dr. Lipman commented that quality
control is the only component of GIBBDB which is not yet in production
mode.  The current system involves an NCBI staff biologist reviewing
every entry and Dr. Lipman reviewing every review, which is
impractical for a production system.  In the next few weeks, decisions
will be made on production oriented quality control procedures.  Once
implemented, output of the database will be made available for beta
users.

Dr. Paul Gilna and Mr. Michael Cinkosky of LANL presented the progress
of the GenBank project toward implementing the "Electronic Data
Publishing" model they described.  They emphasized the future importance
of this model, predicting that most sequence data will not be published
in scientific journals in the near future, and pointed out that the
central databases have an important role to play in presenting an
integrated (or "consensus") view of disparately published (and
unpublished) data.  The central database can promote electronic data
publishing by working with researchers, performing quality checks on the
data, and providing efficient distribution and data submission software.

Dr. Minoru Kanehisa of Kyoto University presented a general overview
of the Japanese Human Genome Project and the plans for its Informatics
Project.  Dr. Kanehisa will head the Priority Area Research project on
Genome Informatics, funded by the Ministry of Education, Science, and
Culture.  The informatics project includes some research on using
new data models, but since the Japanese HGP will not be generating large
quantities of data (by comparison with the U.S. program), the
intention is to focus on data analysis and building knowledge bases
rather than on building large databases.  The DDBJ will continue with
funding from the MESC and a human genome map database will be
established at the Human Genome Analysis Center at Tokyo University.

Graham Cameron of the EMBL Data Library reported very briefly on the
current status and services offered by the Data Library and raised
several issues of importance for sequence databases in the future; for
example, the change in scale of the databases also introduces additional
representational complexity.  Mr. Cameron observed that the databases
must be capable of providing (or supporting) different views of the
data, since a single view (or level of integration) is not appropriate
(or even valid) for all applications.  EMBL is building links with the
Drosophila and C. elegans mapping databases (both at Cambridge
University) and numerous other Affiliated Data Units.  EMBL's experience
is that the research community has shown real willingness to take
responsibility for the quality of the data and higher-level views
represented in the database.  Several other projects of the EMBL Data
Library were described (see handout).

Mr. Cameron mentioned that the CEFIC study ("Bio-informatics in Europe
2. Strategy for a European biotechnology information infrastructure")
was generally favorable to the establishment of a European
Bioinformatics Institute.  Dr. Lipman then expressed concern that
because the CEFIC study consistently pointed to the inexpensive and
"subsidized" nature of Medline as a negative feature, and because of
the general approach of the EEC, there is a risk that the Europeans
will have more restrictive policies of data distribution than the US
and that this could complicate the US-European cooperation on the
sequence databases.  Dr. Lipman thus urged the JITF to strongly
endorse the current US policy of making the data freely available to
anyone (either commercial or academic) to use in any way they see fit.


DISCUSSION

Dr. Lipman requested that the JITF appoint a member to attend the
meetings of the GenBank Advisory Committee and the NCBI Board of
Scientific Counselors (BOSC).

There was some discussion as to whether the JITF should take a role in
establishing transaction protocol standards (through a working group
for transaction protocols, for example).  For the sequence databases,
NCBI and GenBank will be working together to come up with a definition
of what the sequence database should contain and the mechanism of data
exchange.  It was recommended that NCBI and GenBank present a
description of and timeline for this project at the next JITF meeting.

The JITF recommended that HGP policy should be that all data developed
in the HGP should be in the public domain and freely distributed.  This
point should be made to the EC, EMBL, etc.


The four working groups of the JITF reported briefly on their work. 

The Long-Term Needs Working Group reported that particular needs were
seen for development of analytical tools and for training in genome
informatics.  It was noted that the NSF has taken the lead in training
in biocomputing.  Mr. Olken noted that fundamental advances in database
theory and practice are required and that the HGP should support basic
research in database theory.  Dr. Lipman suggested that the amount of
the HGP budget devoted to basic research on database theory and
technology should be less than the cost of one genome center, but that
the HGP needs to be aware of the research being done and needs to find a
way to get computer scientists to devote themselves to solving the
database problems posed by the HGP.

The Data Requirements Working Group reported that it had decided that,
since the data requirements for sequence databases are under
investigation by other groups, this working group would concentrate on
mapping data.  Dr. Lipman recommended that the working group extend its
inquiries to the groups mapping the C. elegans, Drosophila, etc.
genomes.  HUGO has established a committee on physical mapping data and
Dr. Branscomb is appointed as liaison to that committee.  The working
group recommended that JITF meet with the well-developed mapping groups
in the near future.

The Training Working Group has met informally at a number of meeting
sites in the last six months.  The group has narrowed its focus to two
short-term areas of interest: (1) development of a visible, high-level
summer course in genome informatics for those whose training is
primarily in biology; and (2) the institution of fellowships in genome
informatics.

For the course, there is no attempt to reinvent the wheel.  Several
academic genome informatics experts (Lake at UCLA, Brutlag at
Stanford, and Bishop at the MRC) have agreed to send course materials,
reading lists, etc. which they have used in recent courses.  The
decision to limit course attendance to those with training primarily
in biology was the result of a perception that the two types of
students (biologists and computer scientists) would have a minimum
amount of overlap.  This is, of course, open to further discussion by
the JITF.  For the course, the working group would want to have
financial support for the course faculty and for some student support.
In this area, the working group would appreciate additional input from
the JITF on (a) appropriate faculty; and (b) appropriate level of
budget support.  The site for this course would require hardware and
network connections most likely found at national labs, some industry
training sites, academic computing centers, and the supercomputing
centers.  Dr. Lipman pointed out that the NCBI is doing such a course
now and will be offering it again next year.  Its current format is
approximately fifteen 2.5-hour lectures, offered in the Lister Hill
auditorium.  He stated that the laboratory part of such a course needs
to be separate from the lecture.

The DOE, NIH, and NSF all have genome and/or computation fellowships. 
However, the working group believes the Genome Project has an
opportunity to have a real impact on interdisciplinary training in
computation and biology by specifically designing a fellowship that
would be available at a number of levels: predoctoral, postdoctoral,
and mid-career.  In addition, the expertise and diversity of members of
the JITF offers a pool of possible reviewers for the proposals that
would result from such an offering.  The working group asked the JITF to
consider this suggestion, particularly whether it would be willing to
serve as reviewers for such fellowships.  The working group also asked
for discussion of the number of fellows who could be supported, since
this year's budget discussions indicate that Genome monies for such
purposes are limited.

The Connectivity and Infrastructure Working Group reported that its aim
is to foster capability and not to mandate what is done or how.  It
recognizes, however, that the Internet is currently the standard for
connectivity in the U.S.  The working group recommended that all
genome centers and genome data resources should be Internet accessible
and that the funding agencies should provide guidance and support for
these connections.  The group pointed out that the availability of
these resources on the network would create a second cycle of demand
from individual researchers (not resources or centers) who need access
to the resources.  The NIH and DOE should be aware of this demand and
prepared to provide for it.  Finally, the group urged the NIH to join
the Internet consortium.  There is anecdotal evidence that NIH-funded
researchers have been refused network connections on
Internet-connected campuses and given the reason that their research
is not funded by an Internet-member agency.  [Editor's note: John
Wooley, NSF, assures me that that is not a valid reason: the Internet
agreement requires all connected campuses to "work with" all
researchers on campus to provide them with Internet connections.  The
term "work with" has been subject to a variety of interpretations at
various institutions.]  Dr. Lipman pointed out that DHHS is "one of
the players" in the High Performance Computing Initiative and that the
NLM is the DHHS representative at the HPCI discussions.  The JITF
asked that the topic of connectivity be put on the agenda for the next
meeting.


The next JITF meeting is tentatively scheduled for March 14-15, 1991.

The JITF members were urged to use e-mail to discuss the issues
facing the task force prior to the next meeting so that the issues are
clearly defined when the meeting starts.


	
JITF Membership in attendance at second meeting, Nov 30 - Dec 1.

CHAIRMAN

Dr. Dieter Soll
Dept. of Molecular Biophysics and Biochemistry
Yale University
P. O. Box 6666
260 Whitney Avenue
New Haven, CT  06511
Tel:  203-432-6200
Fax:  203-432-6202
E-Mail:  soll@yalemed.bitnet


TASK FORCE MEMBERS

Dr. George Bell
Los Alamos National Laboratory
Group T-10
MS K-710
Los Alamos, NM  87545
Tel:  505-665-3805
Fax:  505-665-3493
E-Mail:  gib%life@lanl.gov

Dr. Elbert Branscomb
Lawrence Livermore National Lab
Biomedical Science Division
P. O. Box 5507, L-452
Livermore, CA  94550
Tel:  415-422-5681
Fax:  415-422-2282
E-Mail:  elbert@elbert.llnl.gov

Dr. John Devereux
Genetics Computer Group
Suite B
575 Science Drive
Madison, WI  53711
Tel:  608-231-5200
Fax:  608-241-5202
E-Mail:  devereux@gcg.com

Mr. Gregory Hamm
Molecular Biology Computing Lab
Waksman Institute
Rutgers University
P. O. Box 759
Piscataway, NJ  08855
Tel:  201-932-4864
Fax:  201-932-5735
E-Mail:  hamm@mbcl.rutgers.edu

Dr. Thomas Marr
Cold Spring Harbor Laboratory
P. O. Box 100
Cold Spring Harbor, NY  11724
Tel:  516-367-8393
Fax:  516-367-8389
E-Mail:  marr@cshlab.bitnet

Mr. Frank Olken
Lawrence Berkeley Laboratory
1 Cyclotron Road
M/S 50B-3238
Berkeley, CA  94720
Tel:  415-486-5891
Fax:  415-486-6363
E-Mail:  olken@lbl.gov

Dr. Mark Pearson
E.I. DuPont de Nemours & Co.
Central Research & Development
Experimental Station
Building 328, Room 251
P. O. Box 80328
Wilmington, DE  19880-0328
Tel:  302-695-2140
Fax:  302-695-4162
E-Mail:  pearson%esvax%dupont.com@relay.cs.net

Dr. Sylvia Spengler
Human Genome Center
Lawrence Berkeley Laboratory
459 Donner
Berkeley, CA  94720
Tel:  415-486-5874
Fax:  415-486-5717
E-Mail:  sylviaj@violet.berkeley.edu

Dr. Mike Waterman
Department of Mathematics
University of Southern California
University Park
Los Angeles, CA  90089-1113
Tel:  213-740-2408
Fax:  213-740-2437
E-Mail:  msw@msw.usc.edu


LIAISON MEMBERS

Dr. David Benton
National Center for Human Genome Research
NIH Building 38A, Room 610
Bethesda, MD  20892
Tel:  301-496-7531
Fax:  301-480-2770
E-Mail:  benton@bio.nlm.nih.gov

Dr. James Cassatt
National Institute of General Medical Sciences
NIH
Westwood Building
Room 907
Bethesda, MD  20892
Tel:  301-496-7253
Fax:  301-402-0019
E-Mail:  czj@nihcu.bitnet

Ms. Diane Hinton
Genome Project
Howard Hughes Medical Institute
6701 Rockledge Drive
Bethesda, MD  20817
Tel:  301-571-0282
Fax:  302-571-0573

Dr. Elke Jordan
National Center for Human Genome Research
NIH Building 38A, Room 605
Bethesda, MD  20892
Tel:  301-496-0844
Fax:  301-402-0837
E-Mail:  elj@cu.nih.gov

Dr. David Lipman
National Center for Biotechnology Information
National Library of Medicine
Building 38A, Room 8S806
Bethesda, MD  20894
Tel:  301-496-2475
Fax:  301-480-9241
E-Mail:  lipman@ncbi.nlm.nih.gov

Dr. Robert Robbins
National Science Foundation
1800 G Street, N.W.
Washington, D. C.  20550
Tel:  202-357-9880
Fax:  202-357-7568
E-Mail:  rrobbins@note.nsf.gov

Mr. Keith Russell
National Agricultural Library
Rm. 100
Beltsville, MD  20705
Tel:  301-344-3834
Fax:  301-344-5472
E-Mail:  krussell@umdars.bitnet