[bionet.molbio.genbank] Data Exchange

bailey@hmivax.humgen.upenn.edu (05/30/91)

In article <9105281658.AA02439@histone.lanl.gov> on May 28, Paul Gilna writes:

>All data processed by DDBJ (with the exception of confidential data)
>are passed to GenBank on a regular basis and incorporated
>immediately into the on-line flatfile servers and RDBMS satellites;
>These data are in turn propagated to EMBL and their distribution nodes.

How rapidly does this integration of data occur?  We have been updating local
copies of nucleic acid sequence databases by FTP, and I'm curious whether
there's a sufficient lag to justify maintaining incremental updates to GenBank
and EMBL separately (I gather from recent discussion that there's no access at
present to DDBJ until GenBank gets the data; correct me if I'm wrong).

If anyone is familiar with the exchange rates or has looked at information
overlap in the updates between releases, I'd be interested to know what one can
expect.

Finally, the concept of confidential data in a database is new to me.  How
precisely does this work?  Is this unique to DDBJ, or might there be data
quietly floating around in other databases as well?  How does one officially
'spot' references to confidential data?

Thanks to all.  Apologies if any of the above questions are common knowledge to
the rest of the world.

					Charles Bailey

!-------------------------------------------------------------------------------
!          Dept. of Human Genetics / Howard Hughes Medical Institute
! University of Pennsylvania School of Medicine  Rm. 430 Clinical Research Bldg.
!     422 Curie Blvd.  Philadelphia, PA 19104 USA      Tel. (215) 898-1699
!          Internet: bailey@hmivax.humgen.upenn.edu  (IN 128.91.200.37)
!-------------------------------------------------------------------------------

goldman@mbcl.rutgers.edu (05/31/91)

In article <1991May29.185538.1@hmivax.humgen.upenn.edu>, bailey@hmivax.humgen.upenn.edu writes:
> In article <9105281658.AA02439@histone.lanl.gov> on May 28, Paul Gilna writes:
> 
>>All data processed by DDBJ (with the exception of confidential data)
>>are passed to GenBank on a regular basis and incorporated
>>immediately into the on-line flatfile servers and RDBMS satellites;
>>These data are in turn propagated to EMBL and their distribution nodes.
> 
> How rapidly does this integration of data occur?  We have been updating local
> copies of nucleic acid sequence databases by FTP, and I'm curious whether
> there's a sufficient lag to justify maintaining incremental updates to GenBank
> and EMBL separately (I gather from recent discussion that there's no access at
> present to DDBJ until GenBank gets the data; correct me if I'm wrong).
> 
> If anyone is familiar with the exchange rates or has looked at information
> overlap in the updates between releases, I'd be interested to know what one can
> expect.
I don't have any hard information on this, but we also take both the GenBank
and the EMBL weekly updates.  We use the GCG suite of programs, and I run
AccessionNumbers on the GenBank data and use its output so that the "new" data
from GenBank and EMBL do not contain duplicate entries.  From this, I would
estimate that (per week) the number of entries that are in EMBL but not yet in
GenBank is about a fifth of the total number of entries in the GenBank "new"
data.  (I just looked at the sizes of the em_new.ref and gb_new.ref files.)
I hope this helps....
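
The comparison described above can be sketched roughly as follows, in Python
rather than GCG. This is only an illustration: it assumes each weekly update
has already been reduced to a plain list of accession numbers (the layout of
the em_new.ref and gb_new.ref files is not shown here), and the function name
and sample accession numbers are hypothetical.

```python
def embl_only_fraction(genbank_accessions, embl_accessions):
    """Return (EMBL-only accessions, their count relative to GenBank's new entries)."""
    # Normalize case and whitespace so the same accession compares equal.
    genbank = {a.strip().upper() for a in genbank_accessions if a.strip()}
    embl = {a.strip().upper() for a in embl_accessions if a.strip()}
    # Entries present in the EMBL update but absent from the GenBank update.
    embl_only = embl - genbank
    fraction = len(embl_only) / len(genbank) if genbank else 0.0
    return sorted(embl_only), fraction


if __name__ == "__main__":
    # Hypothetical accession numbers, for illustration only.
    gb_new = ["M00001", "M00002", "M00003", "M00004", "M00005"]
    em_new = ["M00002", "X00001", "m00005 "]
    only, frac = embl_only_fraction(gb_new, em_new)
    print(only, frac)
```

Using sets keeps the comparison independent of the order in which entries
appear in either update.
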
> 
> Finally, the concept of confidential data in a database is new to me.  How
> precisely does this work?  Is this unique to DDBJ, or might there be data
> quietly floating around in other databases as well?  How does one officially
> 'spot' references to confidential data?
> 
> Thanks to all.  Apologies if any of the above questions are common knowledge to
> the rest of the world.
> 

		Adrian
-- 
Adrian Goldman                         |  Internet:  Goldman@MBCL.Rutgers.Edu
Molecular Biology Computing Laboratory |  Bitnet:    Goldman@BioVAX
Waksman Institute,                     |  Phone:     (908) 932-4864
Rutgers University,                    |  Fax:       (908) 932-5735
Piscataway, NJ 08855 USA               |

pgil@HISTONE.LANL.GOV (Paul Gilna) (05/31/91)

I'm rather glad you asked these questions!

On your first one:

"How rapidly does this integration of data occur? We have been updating local
copies of nucleic acid sequence databases by FTP, and I'm curious whether
there's a sufficient lag to justify maintaining incremental updates to GenBank
and EMBL separately (I gather from recent discussion that there's no access at
present to DDBJ until GenBank gets the data; correct me if I'm wrong)."

It is the stated goal of the collaboration between GenBank, EMBL and
DDBJ that the databases be viewed as functionally equivalent.

What this essentially means is that for a common set of data elements
(e.g., sequence, reference, source, features) one should be able to view
one copy of the database (e.g., GenBank) and know that for these elements,
all data are equally represented in each database.

Now there are really a number of levels at which approximations of this
equivalence already occur, and one level at which it will absolutely
occur.

All new GenBank data are mounted on the EMBL servers within 24-48 hours
of their distribution from LANL, and all new data from EMBL are mounted
on the GOS within a similar timeframe, hence on both the GOS and the
EMBL servers, all new data from either databank should be available
within 24-48 hrs of their release.

All new GenBank data are translated by EMBL into their format and
mounted on both their servers and ours.

Until recently, this was not consistently the case for GenBank's
handling of EMBL data. This is being rectified as we speak, and the
next week or so should see us catching up on what we now estimate
to be a 5% difference. Hence, within the next few weeks, all new EMBL data
should be available on the GOS or from the USENET distribution
in GenBank format, to the point where one need only look in one
database for all of the data.

However, in our view, that is not enough. The fact remains that as long
as we continue to exchange data using the flatfile as our currency, there
is no way we can ensure that changes at one database are routinely or
robustly propagated to the other. For that one needs a way of saying, "take
this version of a data element (sequence) and replace it with the following
version".
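
The "replace this version with the following version" idea can be sketched
as below. This is not the exchange protocol itself, whose details are not
given here; the Entry record, the integer version counter and the
apply_update name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str   # key shared by the collaborating databases
    version: int     # assumed: a counter that increases with each revision
    sequence: str


def apply_update(database, update):
    """Store the update only if it is new or newer than what is held.

    Returns True when the database was changed.
    """
    current = database.get(update.accession)
    if current is None or update.version > current.version:
        database[update.accession] = update
        return True
    return False


if __name__ == "__main__":
    db = {}
    # "TEST01" is a made-up accession number.
    print(apply_update(db, Entry("TEST01", 1, "ACGT")))
    print(apply_update(db, Entry("TEST01", 1, "AAAA")))
    print(apply_update(db, Entry("TEST01", 2, "ACGA")))
```

Keying each element by accession number and comparing versions lets a
receiving database apply the same update twice without harm, which a plain
flatfile exchange cannot guarantee.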

To this end, the databanks have designed a data exchange protocol
which should allow us to meet these goals, and in addition have some
significant impact on some of our journal submission protocols, where authors
would essentially be free to submit to the databank of their choice
regardless of the journal's stated databank affiliations.

This protocol is now at the completion stages of implementation and will
be in test phase within the next month or so.

Finally, all DDBJ data are incorporated into GenBank, and from there onto
the servers, as soon as we receive them.

You also ask a second question:

"Finally, the concept of confidential data in a database is new to me.
How precisely does this work? Is this unique to DDBJ, or might there
be data quietly floating around in other databases as well? How does
one officially 'spot' references to confidential data?"

All the databanks currently handle confidential data. Submitting authors
(who now provide us with ~90% of our data) are offered the choice of having
their data held in confidence until they appear in a publication, or until
a supplied release date has passed.
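
The release rule just described can be written down as a small sketch. The
function and parameter names are hypothetical, and treating the release date
itself as the first releasable day is an assumption.

```python
from datetime import date


def is_releasable(confidential, published, release_date, today):
    """Decide whether a submission may be made public.

    confidential: the author asked for the data to be held
    published:    the data have appeared in a publication
    release_date: author-supplied date, or None if none was given
    """
    if not confidential:
        return True
    if published:
        return True
    # Assumption: the entry becomes public on the release date itself.
    return release_date is not None and today >= release_date


if __name__ == "__main__":
    print(is_releasable(True, False, date(1991, 7, 1), date(1991, 5, 31)))
```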

There appear to be two main reasons why authors would wish to avail of this.
First, since we are asking that data be submitted before their appearance
in a publication, authors with data which could be of use to competitors
request that the data be held until the publication reporting them appears.
Patent issues may also play a role here.

The second issue is that authors may be uncomfortable with having data
associated with their name released to the public before they have
received the "imprimatur" of the journal peer-review system.

We believe that both of these reasons are largely unjustified.
The issues seem to revolve around citability in the first case and
credibility in the second. In the case of citability, the existence
of sequence data in a public release of the database is in itself
a citable event; for example, one might cite a sequence in GenBank as

Lavorgna,G., Ueda,H., Clos,J. and Wu,C. (1991)
D.melanogaster FTZ-F1 mRNA, complete cds. GenBank, M63711

In the case of credibility, it is a myth that the data are in any sense
subjected to the same level of peer review that the rest of a paper receives,
and indeed they should not be: the purpose of the peer review system is
to evaluate the experimental methods and the scientific interpretation of
the data from those methods, but not to evaluate the data themselves.
However, at GenBank, we are already placing an increasing number of
checks on the data (correct translation, presence of vector sequence, etc.)
which are more capable of validating the data than is the peer review system.

For a more detailed view of these issues I recommend you read our article
in this week's (May 31) issue of Science.

Because we hold confidential data, we must often spot, in our routine
scanning of the journals, that the data have been published before we can
release them. Often, the accession number has
been presented in the journal (something we try to strongly encourage
with both the authors and the journals), and this is sufficient for us
to make the link between publication and submitted, confidential data,
thereby effecting their release.

Often, however, accession numbers do not appear, or we miss
the article because (as is increasingly the case) no sequence data
appear in the published article. Hence the data can remain
confidential until they are noticed elsewhere.

Our record is improving, however, as we are now making use of advance
copies of the tables of contents for a number of our major
sequence-reporting journals, including JBC and Science, allowing us to meet
our goals of having the data available on line at the time of publication
even in the case of confidential data.

In addition, we have recently installed a new procedure, which we have been
testing for some time, that allows the community to assist us in this
spotting process. If a GenBank user uses an accession number from a
journal to retrieve a sequence from the on-line servers or their local
USENET-updated copy of the database, and cannot find it, the chances are
good that we hold these data as confidential, as we are getting all other,
non-confidential submissions out before publication.

By communicating the citation details (authors, title, journal, volume, pages)
and the accession number to a specific email address:

update@genome.lanl.gov,

we can use this information to release the data, even if we have yet to
see the actual journal. Results from our trial period have been very
encouraging, and we are now ready to begin recommending that everyone use
this address.
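
For anyone who wants to script such a report, a minimal sketch follows. It
only builds the message and does not send it. No particular message layout
is prescribed above, so the subject line, the field labels and the function
name are assumptions, and the sample values are made up.

```python
from email.message import EmailMessage

UPDATE_ADDRESS = "update@genome.lanl.gov"  # address quoted in the post above


def build_release_report(sender, accession, authors, title, journal, volume, pages):
    """Build (but do not send) a report that a published accession is not retrievable."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = UPDATE_ADDRESS
    msg["Subject"] = f"Published accession number not found: {accession}"
    # Assumed layout: one labelled line per citation detail.
    msg.set_content(
        f"Accession number: {accession}\n"
        f"Authors: {authors}\n"
        f"Title: {title}\n"
        f"Journal: {journal}\n"
        f"Volume: {volume}\n"
        f"Pages: {pages}\n"
    )
    return msg


if __name__ == "__main__":
    # Hypothetical citation details, for illustration only.
    report = build_release_report(
        "user@example.org", "TEST01", "Doe,J. and Roe,R.",
        "An example title", "Example Journal", "1", "1-10",
    )
    print(report)
```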

As is often the case, the answer has become more complicated than the
question, yet I thought it worth making use of the opportunity to discuss
what are really some far-reaching and fundamental issues for the database
and the community it serves.

I hope this answers your questions.

Regards,

Paul Gilna,
Biology Domain Leader,
GenBank, Los Alamos.