cherry@frodo.mgh.harvard.edu (J. Michael Cherry) (12/06/90)
The number of new sequence entries in both the GenBank and EMBL nucleic acid databases has decreased, starting around the first of October. The table below shows the number of new entries by week for GenBank and EMBL. The GenBank numbers were obtained by checking the weekly update files on genbank.bio.net (directory ~ftp/pub/db/gb-newdata), and the EMBL numbers were obtained from the weekly listing of new entries available from NETSERVER@EMBL.BITNET (send the message: GET NUC:NEWENTRIES.NDX).

A couple of notable observations from these numbers. The total number of new sequences has decreased about threefold for the Oct/Nov period as compared to the Aug/Sep period. EMBL is now (Oct/Nov period) releasing about three times as many new sequences as GenBank, whereas in the Aug/Sep period the two databases released about the same number.

I would be interested in your thoughts as to what might have caused the decrease in new sequences. One possibility that comes to mind is that around the first of October GenBank switched to their RDBM system. [ If the decrease is simply a result of a decreased flow from GenBank, assuming that the rate at which EMBL is entering new sequences is constant, then this suggests that in the past GenBank had entered the majority of the new sequences. The number of new sequences dropped threefold from Aug/Sep to Oct/Nov for EMBL, and sevenfold for GenBank. ]

Another possibility is simply that the number of new sequences being published has decreased between the two time periods, but I do not believe it would have decreased threefold. Perhaps the weekly updates on genbank.bio.net are not complete, but this must then also apply to the mechanism used to transfer new sequences from GenBank to EMBL, because of the large decrease in the EMBL numbers. Perhaps EMBL has also had problems. I am not as familiar with the EMBL setup as I am with that of GenBank (which is not that familiar to start with), so I generally look to GenBank first.
Finally, keeping an open mind, I must suggest that my numbers may contain an error that I do not know about. If you think this final possibility is true, I would be very happy to learn where I am in error.

In any event, it would appear that anyone who is trusting GenBank to contain all the known new sequences should reconsider. EMBL appears to be currently adding more new sequences than GenBank.

Not listed below is the number of new entries in the embl-newdata files on genbank.bio.net. There were 677 sequences in the embl-newdata files for the Oct/Nov period, as compared with 464 sequences for GenBank.

Mike Cherry
cherry@frodo.mgh.harvard.edu

Week ending             GenBank         EMBL
                        (gb newdata)    (FileServer)
Aug-13-1990                  499             546
Aug-20-1990                  690             713
Aug-27-1990                  248              68
Sep-3-1990                   298             473
Sep-10-1990                  306             308
Sep-17-1990                  278             263
Sep-24-1990                  575             481
Oct-1-1990                   231             162
Oct-8-1990                    10             109
Oct-15-1990                   19              52
Oct-22-1990                   35             135
Oct-29-1990                   70             105
Nov-5-1990                    85             172
Nov-12-1990                   71             271
Nov-19-1990                   90             158
Nov-26-1990                   12              74
Dec-2-1990                    72             131

Aug & Sep total             3125            3014
Oct & Nov total              464            1207
Grand Total (Aug-Nov)       3589            4221
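The period totals and fold changes quoted in the post above can be re-derived from the weekly table. The sketch below is only an illustration, not part of the original post. The weekly counts are copied from the table; the names WEEKS, SPLIT, totals and summarize are made up for this example. It assumes that the "Aug & Sep" total covers the weeks ending Aug-13 through Oct-1 and that the "Oct & Nov" total covers the weeks ending Oct-8 through Dec-2, since that split reproduces the posted totals.

```python
# Weekly new-entry counts copied from the posted table:
# (week ending, GenBank gb-newdata, EMBL file server).
WEEKS = [
    ("Aug-13-1990", 499, 546),
    ("Aug-20-1990", 690, 713),
    ("Aug-27-1990", 248, 68),
    ("Sep-3-1990", 298, 473),
    ("Sep-10-1990", 306, 308),
    ("Sep-17-1990", 278, 263),
    ("Sep-24-1990", 575, 481),
    ("Oct-1-1990", 231, 162),
    ("Oct-8-1990", 10, 109),
    ("Oct-15-1990", 19, 52),
    ("Oct-22-1990", 35, 135),
    ("Oct-29-1990", 70, 105),
    ("Nov-5-1990", 85, 172),
    ("Nov-12-1990", 71, 271),
    ("Nov-19-1990", 90, 158),
    ("Nov-26-1990", 12, 74),
    ("Dec-2-1990", 72, 131),
]

# Assumption: the first 8 rows (through Oct-1) make up "Aug & Sep" and the
# remaining 9 rows make up "Oct & Nov"; this split matches the posted totals.
SPLIT = 8


def totals(rows):
    """Return (GenBank total, EMBL total) for a list of weekly rows."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


def summarize(weeks=WEEKS, split=SPLIT):
    """Compute period totals and the early/late ratios discussed in the post."""
    early_gb, early_embl = totals(weeks[:split])
    late_gb, late_embl = totals(weeks[split:])
    return {
        "aug_sep": (early_gb, early_embl),
        "oct_nov": (late_gb, late_embl),
        "grand": (early_gb + late_gb, early_embl + late_embl),
        "fold_genbank": early_gb / late_gb,
        "fold_embl": early_embl / late_embl,
        "fold_combined": (early_gb + early_embl) / (late_gb + late_embl),
        "embl_to_genbank_late": late_embl / late_gb,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

Under that split, the ratios come out at roughly 6.7 for GenBank, 2.5 for EMBL and 3.7 for the two databases combined, with EMBL at about 2.6 times GenBank in the later period. These are the figures the post rounds to "seven fold" and "three fold".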
pgil%histone@LANL.GOV (Paul Gilna) (12/07/90)
J. Michael Cherry writes with concern about the recent drop in the number of entries passed to the servers from the GenBank project. It is correct to assume that events surrounding the RDBMS conversion have led to an apparent drop in our output. However, this drop is about to reverse dramatically, and this is an appropriate point to place the events of the past few weeks in perspective.

Firstly, some definitions. When we here at GenBank speak of the "conversion" to the RDBMS, we are in fact speaking of a number of conversions that occurred in parallel:

1. Conversion to internal maintenance of data in RDBMS format; this occurred by translation of the conventional flatfile into the RDBMS tables.

2. Conversion to internal input of data into the RDBMS; while (1) could continue by simply passing flatfiles in and out (and indeed this is exactly how the previous two releases have been generated), we designed and implemented a new annotation interface to the RDBMS, the annotators' workbench.

3. Conversion to external output of the RDBMS. Releases 64 and 65 of GenBank were created by translating in the flatfile and then writing it out again in the new Feature Table format. Release 66 will consist only of flatfiles written from the database, where all data since release 65 have been entered only through the workbench.

4. Conversion to the new external flatfile format, the new Feature Table. In contrast to the old FT format, the new format exists only as a report on the database; we do not enter the features as they are written. Our annotators know about the features, but do not have to work with the syntax.

These four conversions are also presented in order of priority: it is important to note that we concentrated on input first, working on output second; the maxim "garbage in, garbage out" helps clarify that particular development philosophy!
Secondly, there are really two classes of "entry" passed to the servers (under which lie a number of sub-classes): new data, i.e., new sequences; and updated data, i.e., updates to existing, publicly available entries, error corrections, citation updates, etc.

In any conversion of this scale, a drop in productivity is inevitable. When we came towards the point of beta test and conversion, we really had two choices of approach. We could conduct a beta test in which we worked in parallel on the two databases (RDBMS and flatfile), repeating the work in both, where only the flatfile version continued to be released to the world and where we waited until we were absolutely confident that we could throw away the old tools before true conversion. Alternatively, we could conduct a live beta test, in which we worked only once on the data within the RDBMS, but had some failsafe mechanisms in place in case things went wrong. The former approach had the disadvantage of extreme redundancy of work and a significant effect on production, in addition to a prolonged learning curve for annotators. We chose the latter option, effectively throwing ourselves in at the deep end.

Our failsafe mechanism was a temporary flatfile report from the database, which could be both saved and used to continue distribution to the servers. It is the generation and handling of this flatfile that has created the primary effect on performance. To cut a long story short, the flatfiles had to be examined and tweaked to pass our flatfile integrity checking programs. To minimise the effect on performance we made the following call: only new data would be passed to the servers until we could distribute data directly to the servers from the RDBMS without recourse to the temporary flatfile; all modifications to existing data (e.g. citation updates to entries already available, etc.) would not be passed to the servers until this could be done automatically.
About 4 out of 5 of the entries which we have handled over the past few weeks have consisted of citation updates to submitted data that have already been released, and hence have not been passed to the servers.

That point of automation is about to occur: as of Monday of next week we will cease "tweaking" flatfiles. Throughout the next week, we will also release the 1000 or so entries that have not gone out to the servers as a result of the above actions. Freedom from this manual flatfile work should also result in a marked increase in output of all entries over the ensuing weeks. Distribution of flatfiles (in the new Feature Table format) will now happen automatically each night.

Dr. Cherry, you are correct to raise concern over your observations, and indeed we apologise for the events which have led to this concern. I hope, however, that the cause for such concern will have vanished over the course of the next few weeks.

Regards,

Paul Gilna, Ph.D.,
Biology Domain Leader
GenBank, Los Alamos.

----- End Included Message -----