[bionet.molbio.genbank] RESEND: Re: Data Exchange

pgil@HISTONE.LANL.GOV (Paul Gilna) (05/31/91)

I noticed some line-length problems, so here it is again.


--paul


----- Begin Included Message -----


I'm rather glad you asked these questions!

on your first one:

	"How rapidly does this integration of data occur?  We have been
	updating local copies of nucleic acid sequence databases by
	FTP, and I'm curious whether there's a sufficient lag to
	justify maintaining incremental updates to GenBank and EMBL
	separately (I gather from recent discussion that there's no
	access at present to DDBJ until GenBank gets the data; correct
	me if I'm wrong)."


It is the stated goal of the collaboration between GenBank, EMBL and
DDBJ that the databases be viewed as functionally equivalent.

What this essentially means is that for a common set of data elements
(e.g., sequence, reference, source, features) one should be able to view
one copy of the database (e.g., GenBank) and know that for these elements,
all data are equally represented in each database.

Now there are really a number of levels at which approximations of this
equivalence already occur, and one level at which it will absolutely
occur.

All new GenBank data are mounted on the EMBL servers within 24-48 hours
of their distribution from LANL, and all new data from EMBL are mounted
on the GOS within a similar timeframe, hence on both the GOS and the
EMBL servers, all new data from either databank should be available
within 24-48 hrs of their release.

All new GenBank data are translated by EMBL into their format and
mounted on their and our servers.

Until recently, this was not consistently the case for GenBank's
handling of EMBL data. This is being rectified as we speak, and
the next week or so should see us catching up with what we now estimate
to be a 5% difference. Hence, within the next few weeks, all new EMBL
data should be available on the GOS or from the USENET distribution in
GenBank format, to the point where one need only look in one database
for all of the data.

However, in our view, that is not enough. The fact remains that as
long as we continue to exchange data using the flatfile as our
currency, there is no way we can ensure that changes at one database
are routinely or robustly propagated to the other. For that one needs a
way of saying, "take this version of a data element (sequence) and
replace it with the following version".
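
As a minimal sketch of what "replace this version with the following
version" could look like in practice (this is NOT the protocol the
databanks designed; the names Entry, StaleUpdate and apply_replacement,
and the use of an integer version counter, are assumptions made purely
for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str
    version: int      # hypothetical counter; 1 for the first version
    sequence: str


class StaleUpdate(Exception):
    """Raised when an update names a version the receiver does not hold."""


def apply_replacement(db, accession, replaces_version, new_sequence):
    """Replace version `replaces_version` of `accession` with a new one.

    `db` is a plain dict mapping accession -> Entry. A brand-new entry is
    expressed as a replacement of version 0 (an assumption of this sketch).
    """
    current = db.get(accession)
    current_version = current.version if current is not None else 0
    if current_version != replaces_version:
        # The sender and receiver disagree about what is being replaced,
        # so refuse rather than silently overwrite.
        raise StaleUpdate(
            f"{accession}: holding version {current_version}, "
            f"update replaces version {replaces_version}"
        )
    updated = Entry(accession, replaces_version + 1, new_sequence)
    db[accession] = updated
    return updated
```

The point of naming the version being replaced is that a receiving
database can detect a missed or out-of-order change, which a bare
flatfile exchange cannot tell it.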

To these ends, the databanks have designed a data exchange protocol
which should allow us to meet these goals, and in addition have some
significant impact on some of our journal submission protocols, where
authors would essentially be free to submit to the databank of their
choice regardless of the Journal's stated databank affiliations.

This protocol is now at the completion stages of implementation and
will be in test phase within the next month or so.

Finally, all DDBJ data are incorporated into GenBank, and from there
onto the servers, as soon as we receive them.


You also ask a second question:


	Finally, the concept of confidential data in a database is new
	to me.  How precisely does this work?  Is this unique to DDBJ,
	or might there be data quietly floating around in other
	databases as well?  How does one officially 'spot' references
	to confidential data?

All the databanks currently handle confidential data. Submitting
authors (who now provide us with ~90% of our data) are offered the
choice of having their data held in confidence until they appear in a
publication, or until a supplied release date has expired.
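
The release rule just described can be sketched as follows. This is
only an illustration: the function name and parameters are
hypothetical, and treating the data as releasable on the release date
itself (rather than the day after) is an assumption.

```python
from datetime import date
from typing import Optional


def is_releasable(published: bool,
                  release_date: Optional[date],
                  today: date) -> bool:
    """Return True if confidential data may now be made public.

    Data are released once they have appeared in a publication, or once
    an author-supplied release date (if any) has been reached.
    """
    if published:
        return True
    # No release date supplied: hold the data until publication is noted.
    return release_date is not None and today >= release_date
```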

There appear to be two main reasons why authors would wish to avail of
this:  since we are asking that data be submitted before their
appearance in a publication, authors with data which could be of use to
competitors request that the data be held until the publication
reporting them appears.  Patent issues may also play a role here.

The second issue is that authors may be uncomfortable with having data
associated with their name released to the public before they have
received the "imprimatur" of the journal peer-review system.

We believe that both of these reasons are largely unjustified.  The
issues seem to revolve around citability in the first case and
credibility in the second. In the case of citability, the existence of
sequence data in a public release of the database is in itself a
citable event; for example, one might cite a sequence in GenBank as

	Lavorgna,G., Ueda,H., Clos,J. and Wu,C. (1991) D.melanogaster
	FTZ-F1 mRNA, complete cds. GenBank, M63711

In the case of credibility, it is a myth that the data are in any sense
subjected to the same level of peer review that the rest of a paper
receives, and indeed they should not: the purpose of the peer review
system is to evaluate the experimental methods and the scientific
interpretation of the data from those methods, but not to evaluate the
data themselves.  However at GenBank, we are already placing an
increasing number of checks on the data (correct translation, presence
of vector sequence, etc.) which are more capable of validating the data
than the peer review system is.

For a more detailed view of these issues I recommend you read our
article in this week's (May 31) issue of Science.

Often, the fact that we maintain confidential data requires that we
must spot that the data have been published in our routine scanning of
the journals in order to release the data. Often, the accession number
has been presented in the journal (something we try to strongly
encourage with both the authors and the journals), and this is
sufficient for us to make the link between publication and submitted,
confidential data, thereby effecting their release.

Often, however, accession numbers do not appear, or we miss the article
because (as is increasingly the case) no sequence data appear in the
published article. Hence the data can remain confidential until they
are noticed elsewhere.

Our record is improving, however, as we are now making use of advance
copies of the tables of contents for a number of our major
sequence-reporting journals, including JBC and Science, allowing us to
meet our goals of having the data available on line at the time of
publication even in the case of confidential data.

In addition we have recently installed a new procedure which we have
been testing for some time which allows the community to assist us in
this spotting process; if a GenBank user uses an accession number from
a journal to retrieve a sequence from the on-line servers or their
local USENET-updated copy of the database, and cannot find it, the
chances are good that we have these data as confidential, as we are
getting all other non-confidential submissions out before publication.

By communicating the citation details (authors, title, journal, volume,
pages) and the accession number to a specific email address:

		update@genome.lanl.gov,

we can use this information to release the data, even if we have yet to
see the actual journal. Results from our trial period have been very
encouraging, and we are now ready to begin recommending that everyone use
this address.
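
For anyone who wants to automate the check described above, here is a
rough sketch. Only the email address comes from this message; the
function name, the idea of holding local accession numbers in a set,
and the layout of the report text are all assumptions, since no
particular message format is required.

```python
UPDATE_ADDRESS = "update@genome.lanl.gov"  # address given in this message


def draft_release_report(accession, local_accessions,
                         authors, title, journal, volume, pages):
    """Return report text if `accession` is missing locally, else None.

    `local_accessions` is any container of accession numbers present in
    the user's local copy of the database (hypothetical representation).
    The returned text is only drafted here; sending it is left to the user.
    """
    if accession in local_accessions:
        return None  # entry is already public, nothing to report
    lines = [
        f"To: {UPDATE_ADDRESS}",
        f"Subject: published accession number not found: {accession}",
        "",
        f"Accession: {accession}",
        f"Authors:   {authors}",
        f"Title:     {title}",
        f"Journal:   {journal}",
        f"Volume:    {volume}",
        f"Pages:     {pages}",
    ]
    return "\n".join(lines)
```

A missing entry is only a hint that the data may be held as
confidential; a local copy that has not been updated recently would
produce the same result.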

As is often the case, the answer has become more complicated than the
question, yet I thought it worth making use of the opportunity to
discuss what are really some far-reaching and fundamental issues for
the database and the community it serves.

I hope this answers your questions.

Regards,

Paul Gilna,
Biology Domain Leader,
GenBank, Los Alamos. 


----- End Included Message -----