[bionet.molbio.genbank] Fewer new sequences in Oct and Nov

cherry@frodo.mgh.harvard.edu (J. Michael Cherry) (12/06/90)

The number of new sequence entries to both the GenBank and EMBL nucleic
acid databases has decreased starting around the first of October.  The
table below shows the number of new entries by week for GenBank and
EMBL.  The GenBank numbers were obtained by checking the weekly update
files on genbank.bio.net (directory ~ftp/pub/db/gb-newdata) and the EMBL
numbers were obtained from the weekly listing of new entries available
from NETSERVER@EMBL.BITNET (send the message: GET NUC:NEWENTRIES.NDX). 

A couple of notable observations follow from these numbers.  The total
number of new sequences has decreased about three fold for the Oct/Nov
period as compared to the Aug/Sep period.  EMBL is now (Oct/Nov period)
releasing about three times the number of new sequences as GenBank. 
However, in the Aug/Sep period the two databases released about the same
number. 

I would be interested in your thoughts as to what might have caused the
decrease in new sequences.  One possibility that comes to mind is that
around the first of October GenBank switched to their RDBMS.  [ If the
decrease is simply a result of a decreased flow from GenBank, assuming
that the rate EMBL is entering new sequences is constant, then this
suggests that in the past GenBank had entered the majority of the new
sequences.  The number of new sequences dropped three fold from Aug/Sep
to Oct/Nov for EMBL, and seven fold for GenBank.  ] Another possibility
is simply that the number of new sequences being published has decreased
between the two time periods, but I do not believe it would have
decreased 3 fold.  Perhaps the weekly updates on genbank.bio.net are not
complete, but this must also apply to the mechanism used to transfer new
sequences from GenBank to EMBL because of the large decrease in the EMBL
numbers.  Perhaps EMBL has also had problems; I am not as familiar with
the EMBL setup as I am with that of GenBank (which is not that familiar
to start with), so I generally look to GenBank first.  Finally, keeping
an open mind, I must suggest that my numbers contain an error that I do
not know about.  If you think this final possibility is true I would be
very happy to learn where I am in error. 

In any event, it would appear that anyone who is trusting GenBank to
contain all the known new sequences should reconsider.  EMBL appears to
be currently adding more new sequences than GenBank.  Not listed below
are the number of new entries in the embl-newdata files on
genbank.bio.net.  There were 677 sequences in the embl-newdata files for
the Oct/Nov period, as compared with 464 sequences for GenBank. 

Mike Cherry
cherry@frodo.mgh.harvard.edu


Week ending              GenBank (gb newdata)   EMBL (FileServer)
Aug-13-1990              499                    546
Aug-20-1990              690                    713
Aug-27-1990              248                    68
Sep-3-1990               298                    473
Sep-10-1990              306                    308
Sep-17-1990              278                    263
Sep-24-1990              575                    481
Oct-1-1990               231                    162
Oct-8-1990               10                     109
Oct-15-1990              19                     52
Oct-22-1990              35                     135
Oct-29-1990              70                     105
Nov-5-1990               85                     172
Nov-12-1990              71                     271
Nov-19-1990              90                     158
Nov-26-1990              12                     74
Dec-2-1990               72                     131

Aug & Sep total          3125                   3014
Oct & Nov total          464                    1207

Grand Total (Aug-Nov)    3589                   4221
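
For anyone who wants to check the totals and fold changes quoted above,
here is a minimal Python sketch.  The weekly counts are copied from the
table.  The split point is an assumption: the "Aug & Sep" period is taken
to run through the week ending Oct-1, which is the only split that
reproduces the totals shown.  The names WEEKLY, SPLIT and totals are made
up for this illustration.

```python
# Weekly new-entry counts copied from the table: (week ending, GenBank, EMBL).
WEEKLY = [
    ("Aug-13-1990", 499, 546),
    ("Aug-20-1990", 690, 713),
    ("Aug-27-1990", 248, 68),
    ("Sep-3-1990", 298, 473),
    ("Sep-10-1990", 306, 308),
    ("Sep-17-1990", 278, 263),
    ("Sep-24-1990", 575, 481),
    ("Oct-1-1990", 231, 162),
    ("Oct-8-1990", 10, 109),
    ("Oct-15-1990", 19, 52),
    ("Oct-22-1990", 35, 135),
    ("Oct-29-1990", 70, 105),
    ("Nov-5-1990", 85, 172),
    ("Nov-12-1990", 71, 271),
    ("Nov-19-1990", 90, 158),
    ("Nov-26-1990", 12, 74),
    ("Dec-2-1990", 72, 131),
]

# Assumption: the first 8 rows (through the week ending Oct-1) form the
# "Aug & Sep" period; the remaining rows form the "Oct & Nov" period.
SPLIT = 8


def totals(rows):
    """Return (GenBank total, EMBL total) for the given rows."""
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


early_gb, early_embl = totals(WEEKLY[:SPLIT])
late_gb, late_embl = totals(WEEKLY[SPLIT:])

# Fold decrease from the early period to the late period.
gb_fold = early_gb / late_gb
embl_fold = early_embl / late_embl
combined_fold = (early_gb + early_embl) / (late_gb + late_embl)

# Ratio of EMBL to GenBank new entries in the late period.
late_ratio = late_embl / late_gb

print(f"Aug & Sep total: GenBank {early_gb}, EMBL {early_embl}")
print(f"Oct & Nov total: GenBank {late_gb}, EMBL {late_embl}")
print(f"Fold decrease: GenBank {gb_fold:.1f}, EMBL {embl_fold:.1f}, "
      f"combined {combined_fold:.1f}")
print(f"EMBL/GenBank in Oct & Nov: {late_ratio:.1f}")
```

With this split the sketch gives the same period totals as the table, and
the ratios come out at roughly 6.7 fold for GenBank, 2.5 fold for EMBL and
3.7 fold combined, in line with the rounded "seven fold" and "three fold"
figures in the text.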

pgil%histone@LANL.GOV (Paul Gilna) (12/07/90)

J. Michael Cherry writes with concern on the recent drop in number
of entries passed to the servers from the GenBank project.


It is correct to assume that events surrounding the RDBMS conversion
have led to an apparent drop in our output. However this drop is about
to reverse dramatically, and this is an appropriate point to place the
events of the past few weeks in perspective.

Firstly some definitions. 

When we here at GenBank speak of the "conversion" to the RDBMS, we in
fact are speaking of a number of conversions that occurred in
parallel:

1.  Conversion to internal maintenance of data in RDBMS format; this
occurred by translation of the conventional flat file into the RDBMS
tables.

2.  Conversion to internal input of data into the RDBMS; while (1)
could continue by simply passing in and out flat-files (and indeed that
is exactly how the previous two releases have been generated), we
designed and implemented a new annotation interface to the RDBMS, the
annotators workbench.

3.  Conversion to external output of the RDBMS. Release 64 and 65 of
GenBank were created by translating in the flatfile and then writing it
out again in the new Feature Table format. Release 66 will consist of
only flatfiles written from the database, where all data since release
65 have been entered only through the workbench.

4.  Conversion to new external flatfile format, the new Features
table.  In contrast to the old FT format, the new format exists only as
a report on the database; we do not enter the features as they are
written.  Our annotators know about the features, but do not have to
work with the syntax.

These four conversion factors are also presented in order of priority:
it is important to note that we concentrated on input first, working on
output second; the maxim, "garbage in, garbage out", helps clarify that
particular development philosophy!


Secondly, there are really two classes of "entry" passed to the servers
(under which lie a number of sub-classes): new data, i.e., new
sequences; and updated data, i.e., updates to existing, publicly
available entries, error corrections, citation updates, etc.



In any conversion of this scale, a drop in productivity is inevitable.
When we came towards the point of beta test and conversion, we really
had two choices of approach. We could conduct a beta test in which we
worked in parallel on the two databases (RDBMS and FLATFILE), repeating
the work in both, where only the flatfile version continued to be
released to the world and where we waited until we were absolutely
confident that we could throw away the old tools before true conversion. 

Alternatively, we could conduct a live beta test, in which we worked
only once on the data within the RDBMS, but had some failsafe
mechanisms in place in case things went wrong. 

The former mechanism had the disadvantage of extreme redundancy of work
and a significant effect on production in addition to a prolonged
learning curve for annotators.

We chose the latter option, effectively throwing ourselves in at the
deep end. Our failsafe mechanism was a temporary flatfile report from
the database which could be both saved and used to continue
distribution to the servers.

It is the generation and handling of this flatfile that has created the
primary effect on performance. To cut a long story short, the flatfiles
had to be examined and tweaked to pass our flatfile integrity checking
programs. To minimise the effect on performance we made the following
call: only new data would be passed to the servers until we could
distribute data directly to the servers from the RDBMS without recourse
to the temporary flatfile; all modifications to existing data (e.g.
citation updates to entries already available, etc.) would not be
passed to the servers until this could be done automatically.

About 4 out of 5 entries which we have handled over the past few weeks
have consisted of citation updates to existing released submitted
data, and hence have not been passed to the servers.

That point of automation is about to occur: as of Monday of next week
we will cease "tweaking" flatfiles. Throughout the next week, we will
also release the 1000 or so entries that have not gone out to the
server as a result of the above actions. Freedom from this manual
flatfile work should also result in a marked increase in output of all
entries over the ensuing weeks. Distribution of flatfiles (in the new
Feature Table format) will now happen automatically each night.

Dr. Cherry, you are correct to raise concern over your observations,
and indeed we apologise for the events which have led to this concern.
I hope, however, that the cause for such concern will have vanished
over the course of the next few weeks.

Regards,


Paul Gilna, Ph.D.,
Biology Domain Leader
GenBank,
Los Alamos.

