bailey@hmivax.humgen.upenn.edu (05/30/91)
In article <9105281658.AA02439@histone.lanl.gov> on May 28, Paul Gilna writes:

>All data processed by DDBJ (with the exception of confidential data)
>are passed to GenBank on a regular basis and incorporated
>immediately into the on-line flatfile servers and RDBMS satellites;
>These data are in turn propagated to EMBL and their distribution nodes.

How rapidly does this integration of data occur? We have been updating local
copies of nucleic acid sequence databases by FTP, and I'm curious whether
there's a sufficient lag to justify maintaining incremental updates to GenBank
and EMBL separately (I gather from recent discussion that there's no access at
present to DDBJ until GenBank gets the data; correct me if I'm wrong).

If anyone is familiar with the exchange rates or has looked at information
overlap in the updates between releases, I'd be interested to know what one can
expect.

Finally, the concept of confidential data in a database is new to me. How
precisely does this work? Is this unique to DDBJ, or might there be data
quietly floating around in other databases as well? How does one officially
'spot' references to confidential data?

Thanks to all. Apologies if any of the above questions are common knowledge to
the rest of the world.

Charles Bailey
!-------------------------------------------------------------------------------
! Dept. of Human Genetics / Howard Hughes Medical Institute
! University of Pennsylvania School of Medicine Rm. 430 Clinical Research Bldg.
! 422 Curie Blvd. Philadelphia, PA 19104 USA Tel. (215) 898-1699
! Internet: bailey@hmivax.humgen.upenn.edu (IN 128.91.200.37)
!-------------------------------------------------------------------------------
goldman@mbcl.rutgers.edu (05/31/91)
In article <1991May29.185538.1@hmivax.humgen.upenn.edu>, bailey@hmivax.humgen.upenn.edu writes:
> In article <9105281658.AA02439@histone.lanl.gov> on May 28, Paul Gilna writes:
>
>>All data processed by DDBJ (with the exception of confidential data)
>>are passed to GenBank on a regular basis and incorporated
>>immediately into the on-line flatfile servers and RDBMS satellites;
>>These data are in turn propagated to EMBL and their distribution nodes.
>
> How rapidly does this integration of data occur? We have been updating local
> copies of nucleic acid sequence databases by FTP, and I'm curious whether
> there's a sufficient lag to justify maintaining incremental updates to GenBank
> and EMBL separately (I gather from recent discussion that there's no access at
> present to DDBJ until GenBank gets the data; correct me if I'm wrong).
>
> If anyone is familiar with the exchange rates or has looked at information
> overlap in the updates between releases, I'd be interested to know what one can
> expect.

I don't have any hard information on this, but we also take both the GenBank
and the EMBL weekly updates. We use the GCG suite of programs, and I run
AccessionNumbers on the GenBank data and use its output so that the "new" data
from GenBank and EMBL do not contain duplicate entries. From this, I would
estimate that (per week) the number of entries that are in EMBL but not yet in
GenBank is about a fifth of the total number of entries in the GenBank "new"
data. (I just looked at the sizes of the em_new.ref and gb_new.ref files.)

I hope this helps....

>
> Finally, the concept of confidential data in a database is new to me. How
> precisely does this work? Is this unique to DDBJ, or might there be data
> quietly floating around in other databases as well? How does one officially
> 'spot' references to confidential data?
>
> Thanks to all. Apologies if any of the above questions are common knowledge to
> the rest of the world.
>
> Charles Bailey
> !-------------------------------------------------------------------------------
> ! Dept. of Human Genetics / Howard Hughes Medical Institute
> ! University of Pennsylvania School of Medicine Rm. 430 Clinical Research Bldg.
> ! 422 Curie Blvd. Philadelphia, PA 19104 USA Tel. (215) 898-1699
> ! Internet: bailey@hmivax.humgen.upenn.edu (IN 128.91.200.37)
> !-------------------------------------------------------------------------------

Adrian
--
Adrian Goldman                          | Internet: Goldman@MBCL.Rutgers.Edu
Molecular Biology Computing Laboratory  | Bitnet:   Goldman@BioVAX
Waksman Institute,                      | Phone:    (908) 932-4864
Rutgers University,                     | Fax:      (908) 932-5735
Piscataway, NJ 08855 USA                |
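
[Editorial sketch] The accession-number filtering described in the post above could look roughly like the following in Python. This is not the GCG AccessionNumbers program, only an illustration of the idea. It assumes GenBank flatfile entries carry an "ACCESSION" line, EMBL entries carry "AC" lines with semicolon-separated accession numbers, and each entry ends with "//". The function names, the LOCUS/ID names, and the accession X00001 in the sample data are made up for illustration; M63711 is the accession cited elsewhere in this thread.

```python
def genbank_accessions(text):
    """Collect accession numbers from the ACCESSION lines of GenBank flatfile text."""
    found = set()
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            found.update(line.split()[1:])
    return found


def accessions_of(entry_lines):
    """Collect accession numbers from the AC lines of one EMBL entry."""
    accs = []
    for line in entry_lines:
        if line.startswith("AC "):
            accs.extend(a.strip() for a in line[2:].split(";") if a.strip())
    return accs


def embl_entries(text):
    """Yield (accessions, entry_text) for each '//'-terminated EMBL entry."""
    entry_lines = []
    for line in text.splitlines():
        entry_lines.append(line)
        if line.strip() == "//":
            yield accessions_of(entry_lines), "\n".join(entry_lines)
            entry_lines = []


def embl_only(embl_text, known):
    """Keep EMBL entries none of whose accession numbers are already known."""
    return [entry for accs, entry in embl_entries(embl_text)
            if not known.intersection(accs)]


# Made-up sample data; only the accession M63711 comes from the thread.
GB_NEW = "LOCUS       SAMPLE1\nACCESSION   M63711\n//\n"
EMBL_NEW = "ID   SAMPLE1\nAC   M63711;\n//\nID   SAMPLE2\nAC   X00001;\n//\n"

known = genbank_accessions(GB_NEW)
unique = embl_only(EMBL_NEW, known)
print(len(unique), "EMBL entry not yet in the GenBank update")
```

Comparing whole sets of accession numbers, rather than file sizes, gives an exact count of the EMBL-only entries in a given week.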
pgil@HISTONE.LANL.GOV (Paul Gilna) (05/31/91)
I'm rather glad you asked these questions! On your first one:

"How rapidly does this integration of data occur? We have been updating local copies of nucleic acid sequence databases by FTP, and I'm curious whether there's a sufficient lag to justify maintaining incremental updates to GenBank and EMBL separately (I gather from recent discussion that there's no access at present to DDBJ until GenBank gets the data; correct me if I'm wrong)."

It is the stated goal of the collaboration between GenBank, EMBL and DDBJ that the databases be viewed as functionally equivalent. What this essentially means is that for a common set of data elements (e.g., sequence, reference, source, features) one should be able to view one copy of the database (e.g., GenBank) and know that, for these elements, all data are equally represented in each database.

There are a number of levels at which approximations of this equivalence already occur, and one level at which it will absolutely occur.

All new GenBank data are mounted on the EMBL servers within 24-48 hours of their distribution from LANL, and all new data from EMBL are mounted on the GOS within a similar timeframe. Hence, on both the GOS and the EMBL servers, all new data from either databank should be available within 24-48 hours of their release.

All new GenBank data are translated by EMBL into their format and mounted on their and our servers. Until recently, this was not consistently the case for GenBank's handling of EMBL data. This is being rectified as we speak, and the next week or so should see us catching up on what we now estimate to be a 5% difference. Hence, within the next few weeks, all new EMBL data should be available on the GOS or from the USENET distribution in GenBank format, to the point where one need only look in one database for all of the data.

However, in our view, that is not enough.
The fact remains that as long as we continue to exchange data using the flatfile as our currency, there is no way we can ensure that changes at one database are routinely or robustly propagated to the other. For that one needs a way of saying, "take this version of a data element (sequence) and replace it with the following version". To this end, the databanks have designed a data exchange protocol which should allow us to meet these goals, and which should in addition have a significant impact on some of our journal submission protocols: authors would essentially be free to submit to the databank of their choice regardless of the journal's stated databank affiliations. This protocol is now at the completion stages of implementation and will be in test phase within the next month or so.

Finally, all DDBJ data are incorporated into GenBank, and thence onto the servers, as soon as we receive them.

You also ask a second question:

"Finally, the concept of confidential data in a database is new to me. How precisely does this work? Is this unique to DDBJ, or might there be data quietly floating around in other databases as well? How does one officially 'spot' references to confidential data?"

All the databanks currently handle confidential data. Submitting authors (who now provide us with ~90% of our data) are offered the choice of having their data held in confidence until they appear in a publication, or until a supplied release date has expired.

There appear to be two main reasons why authors would wish to avail of this. The first is that, since we are asking that data be submitted before their appearance in a publication, authors with data which could be of use to competitors request that the data be held until the publication reporting them appears. Patent issues may also play a role here.
The second issue is that authors may be uncomfortable with having data associated with their name released to the public before they have received the "imprimatur" of the journal peer-review system.

We believe that both of these reasons are largely unjustified. The issues seem to revolve around citability in the first case and credibility in the second.

In the case of citability, the existence of sequence data in a public release of the database is in itself a citable event; for example, one might cite a sequence in GenBank as

    Lavorgna,G., Ueda,H., Clos,J. and Wu,C. (1991) D.melanogaster FTZ-F1 mRNA, complete cds. GenBank, M63711

In the case of credibility, it is a myth that the data are in any sense subjected to the same level of peer review that the rest of a paper receives, and indeed they should not be: the purpose of the peer review system is to evaluate the experimental methods and the scientific interpretation of the data from those methods, but not to evaluate the data themselves. However, at GenBank we are already placing an increasing number of checks on the data (correct translation, presence of vector sequence, etc.), which are more capable of validating the data than the peer review system is. For a more detailed view of these issues I recommend you read our article in this week's (May 31) issue of Science.

Because we maintain confidential data, we must spot, in our routine scanning of the journals, that the data have been published before we can release them. Often, the accession number has been presented in the journal (something we strongly encourage with both the authors and the journals), and this is sufficient for us to make the link between the publication and the submitted, confidential data, thereby effecting their release. Often, however, accession numbers do not appear, or we miss the article because (as is increasingly the case) no sequence data appear in the published article.
Hence the data can remain confidential until they are noticed elsewhere. Our record is improving, however, as we are now making use of advance copies of the tables of contents for a number of our major sequence-reporting journals, including JBC and Science, allowing us to meet our goal of having the data available on line at the time of publication even in the case of confidential data.

In addition, we have recently installed a new procedure, which we have been testing for some time, that allows the community to assist us in this spotting process. If a GenBank user uses an accession number from a journal to retrieve a sequence from the on-line servers or their local USENET-updated copy of the database, and cannot find it, the chances are good that we hold these data as confidential, since we are getting all other, non-confidential submissions out before publication. By communicating the citation details (authors, title, journal, volume, pages) and the accession number to a specific email address, update@genome.lanl.gov, users give us the information we need to release the data, even if we have yet to see the actual journal. Results from our trial period have been very encouraging, and we are now ready to begin recommending that everyone use this address.

As is often the case, the answer has become more complicated than the question, yet I thought it worth making use of the opportunity to discuss what are really some far-reaching and fundamental issues for the database and the community it serves.

I hope this answers your questions.

Regards,

Paul Gilna,
Biology Domain Leader,
GenBank, Los Alamos.
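
[Editorial sketch] The reporting step described in the post above could be drafted along these lines in Python. The post names only the address and the details to include (authors, title, journal, volume, pages, accession number); the message layout, the subject line, the function name, and every citation value below are hypothetical placeholders. The code only builds a draft message and does not send anything.

```python
from email.message import EmailMessage

RELEASE_ADDRESS = "update@genome.lanl.gov"  # address given in the post


def release_request(accession, local_accessions, authors, title,
                    journal, volume, pages):
    """Return a draft report if a published accession is missing locally, else None."""
    if accession in local_accessions:
        return None  # the entry is already public; nothing to report
    msg = EmailMessage()
    msg["To"] = RELEASE_ADDRESS
    msg["Subject"] = f"Published accession number not found: {accession}"
    # Hypothetical layout: the post does not prescribe a message format.
    msg.set_content(
        f"Accession: {accession}\n"
        f"Authors: {authors}\n"
        f"Title: {title}\n"
        f"Journal: {journal}\n"
        f"Volume: {volume}\n"
        f"Pages: {pages}\n"
    )
    return msg


# Placeholder values, not a real citation.
local = {"M63711"}
draft = release_request("X00000", local, "Author,A. and Author,B.",
                        "Example title", "Example Journal", "1", "1-10")
already_public = release_request("M63711", local, "Author,A.",
                                 "Example title", "Example Journal", "1", "1-10")
print(draft)
```

Checking the local copy first keeps reports limited to accession numbers that have appeared in print but cannot yet be retrieved.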