[bionet.molbio.genbank] Updates of GenBank via USENET: comparison with V65.0.

smith@mcclb0.med.nyu.edu (10/30/90)

      We have checked the UPDATE banks we have accumulated over the last 3 
months with the new GenBank release, V65.0.  We found that all but 33 of the 
sequences appearing in the UPDATE bank with posting dates before 15th August 
(the drop-dead date for V65) appeared in the distributed bank on tape, and 
were therefore eliminated from the UPDATE bank.  This reduced the size of the 
UPDATE bank by about 50%.  This was the good news.

      Significantly, we found >1,000 sequences new to the tape release, which 
did not appear in the UPDATE bank and which were therefore never posted to
UPDATE, were lost en route to us, or were lost through errors in our
software.  Right now we are not able to determine where these sequences might
have been lost, or the reason for the loss. 

      We therefore strongly recommend that a method of checking the update 
process be implemented: without such a scheme it will be impossible to be 
sure that the updates received are complete and accurate.
The apparent loss rate on the UPDATE feed is too high to ignore.

      Right now it seems that the best method for checking would be to 
distribute, with the UPDATE news feed, a list of the sequences posted to 
USENET in each week or two weeks.  This list should contain the accession
number, the locus name and the posting date and time.  We will try to write a
small program which will check the list of weekly deliveries against the list
of sequences on hand to determine what is missing, and then request the
missing sequences automatically from the server. 
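
The check described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the one-entry-per-line layout ("ACCESSION LOCUS DATE TIME"), the names `CheckEntry`, `parse_check_list` and `find_missing`, and the accession numbers in the usage note are all hypothetical, since no format for the check list has been agreed. Requesting the missing entries from the server is left out.

```python
from typing import Iterable, List, NamedTuple


class CheckEntry(NamedTuple):
    """One line of the proposed check list (hypothetical layout)."""
    accession: str
    locus: str
    posted: str  # posting date and time, kept as plain text


def parse_check_list(lines: Iterable[str]) -> List[CheckEntry]:
    """Parse lines assumed to look like 'ACCESSION LOCUS DATE TIME'."""
    entries = []
    for line in lines:
        # Skip comment lines; the '#' convention is an assumption.
        if line.lstrip().startswith("#"):
            continue
        fields = line.split()
        # Skip blank or incomplete lines.
        if len(fields) < 3:
            continue
        entries.append(
            CheckEntry(fields[0].upper(), fields[1], " ".join(fields[2:]))
        )
    return entries


def find_missing(check_list: Iterable[CheckEntry],
                 on_hand: Iterable[str]) -> List[CheckEntry]:
    """Return check-list entries whose accession number is not on hand."""
    have = {accession.upper() for accession in on_hand}
    return [entry for entry in check_list if entry.accession not in have]
```

For example, with a weekly list containing the made-up accessions X00001 and X00002 and a local bank holding only X00001, `find_missing` would return the single entry for X00002, which could then be requested from the server.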

      I think it is fair to say that the provision of these 'check' messages 
is not too much to ask for, and should be easy to provide.  It is hard to 
imagine that sequences are being posted to the net without any record being 
kept as to what was actually sent out, so that all that would need to be done 
would be to post these lists.
      
      We have a new version of the VMS software for processing the data feed.
It is available as an archive in two parts (NIGHTLY.1, NIGHTLY.2) on our
MAILSERVer, and also as a single file for anonymous/FTP.  The updated version 
fixes a problem associated with updating the bank with weekly batches FTPed 
from GenBank.

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology,  NYU Medical Center,  550 First Ave.,  NYC, 10016|
|Phone: (212) 340-5356: FAX: (212) 340-8139 (Alternate NYUMC) (212) 340-7190|
|E-Mail:  SMITH@NYUMED.BITNET (BITNET),  SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+

pgil%histone@LANL.GOV (Paul Gilna) (10/30/90)

GenBank's collaboration with EMBL has reached a point where we are
beginning to implement a data exchange syntax that will allow updating
of data between the two databases. The goal in this process is to bring
the two databases to the point where they are functionally equivalent,
i.e., there will be no data in one database that are not represented
equally in the other, and all updates will be propagated. Hence it is
hoped that eventually one need mount only one version of the database,
knowing that all of the data from the other are represented.

We would hope to have these mechanisms in place by early next year.

In the meantime we struggle with the current means of merging each
other's data. One aspect of this process at GenBank requires that we
take a tape release from EMBL, work out what is new or updated, convert
it to GenBank format and merge it with the GenBank database. In the
past, this was a time-consuming process that required some manual
intervention by annotators. We have restrained ourselves from making
much needed improvements in this procedure as we believe that the data
exchange mechanisms will solve the current problems inherent in this
process. However in an attempt to speed up the process of release
merging, we automated the conversion step to the point where the
necessary intervention could occur after the merge, rather than
before.  This change involved "parking" the converted EMBL entries in the
unannotated division (they are currently the only class of entry that
enters that division, since GenBank has for the most part ceased creating
unannotated entries), and most are removed from there by the following
release. We reasoned that there was not much point in mounting these
entries on the servers, as they are merely limited versions of existing
data already present on the EMBL and GenBank-On-line servers.

It is thus very likely that the discrepancy between the USENET 
distributions and the tape release is accounted for by these entries.

Regards,

--paul

smith@mcclb0.med.nyu.edu (10/31/90)

In article <9010301510.AA03412@histone.lanl.gov>, pgil%histone@LANL.GOV (Paul Gilna) writes:

> GenBank's collaboration with EMBL has reached a point where we are
> beginning to implement a data exchange syntax that will allow updating
> of data between the two databases. The goal in this process is to bring
> the two databases to the point where they are functionally equivalent,
> ....
> In the meantime we struggle with the current means of merging each
> other's data. One aspect of this process at GenBank requires that we
> take a tape release from EMBL, work out what is new or updated, convert
> it to GenBank format and merge it with the GenBank database. ....
> ... We reasoned that there was not much point in mounting these
> entries on the servers, as they are merely limited versions of existing
> data already present on the EMBL and GenBank-On-line servers.
> 
> It is thus very likely that the discrepancy between the USENET 
> distributions and the tape release is accounted for by these entries.

If this is indeed the reason for the significant discrepancy then part of the
'solution' to the problem would be to get these EMBL 'updates' posted to
USENET as well. Is this practical to do?  In the meantime we will look to see 
if these new sequences are part of the unannotated part of the tape 
distribution.  Thanks for the help!

I do not think, however, that this is a reason to delay a scheme for checking 
the feed already sent out.  

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology,  NYU Medical Center,  550 First Ave.,  NYC, 10016|
|E-Mail:  SMITH@NYUMED.BITNET (BITNET),  SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+

kristoff@GENBANK.BIO.NET (Dave Kristofferson) (11/01/90)

Ross,

	I have read your posting and Paul's reply.  I am out of the
office all this week, but please send us the list of missing sequences
anyway even if the reason is as Paul explains.  We want to be sure
that is the explanation.

	Paul's message did not imply that there was any need to delay
the implementation of the error-checking list on USENET.  Please be
aware of the fact that in the last month we had to finish up work on a
number of projects: implementing the RDBMS on GOS, revising the GOS
manual, finishing AuthorIn for the Mac, producing release 65, and
writing quarterly and year end reports for the NIH.  Work is still
continuing too on converting Fickett's code for the new feature table
format (although this got delayed by the above).  Because the USENET
distribution is basically an unfunded "spare time" activity, this had
to sit on the back burner.  It is still not completely on the front
burner yet, but we should be able to do something within the next
month.

				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

PJH@leicester.ac.uk (11/09/90)

In article <CMM.0.88.657390893.kristoff@genbank.bio.net> you write:
>Ross,
>
>     <a lot of irrelevant stuff>
>
>format (although this got delayed by the above).  Because the USENET
>distribution is basically an unfunded "spare time" activity, this had
>to sit on the back burner.  It is still not completely on the front
>burner yet, but we should be able to do something within the next
>month.
>
>				Sincerely,
>
>				Dave Kristofferson
>				GenBank Manager
>
>				kristoff@genbank.bio.net

Can I suggest something that might be a lot easier to implement than the
list of files sent out?  Just put a simple sequence number somewhere in the
mail header.  If these numbers started at 1 and simply counted upwards, you
would know that you had missed News postings and which ones were missing.
A lot of the comp...sources use this mechanism for tracing missing postings.
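
As a rough illustration of the idea, here is a minimal Python sketch of the gap check. It assumes the sequence numbers have already been pulled out of the headers into a list of integers; the function name `missing_sequence_numbers` is hypothetical, and no particular header field is implied.

```python
from typing import Iterable, List


def missing_sequence_numbers(received: Iterable[int]) -> List[int]:
    """Return the numbers absent between 1 and the highest number received."""
    seen = set(received)  # duplicates (e.g. reposted articles) are ignored
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

Note one limitation of this scheme: it can only reveal gaps below the highest number received, so postings lost at the very end of the series go unnoticed until a later posting arrives.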

Keep up the good work - even though it may be incomplete, what you are doing
is incredibly useful to us all.

Pete Humble,                 Internet: PJH%leicester.ac.uk@nsfnet-relay.ac.uk
System Manager,              Bitnet/EARN: PJH%leicester.ac.uk@UKACRL
Computer Centre,             JANET: PJH@UK.AC.LEICESTER
Leicester University
U.K.