[bionet.software] Updates of GenBank via USENET: comparison with V65.0.

smith@mcclb0.med.nyu.edu (10/30/90)

      We have checked the UPDATE banks we have accumulated over the last 3 
months against the new GenBank release, V65.0.  We found that all but 33 of the 
sequences appearing in the UPDATE bank with posting dates before 15th August 
(the drop-dead date for V65) appeared in the distributed bank on tape, and 
were therefore eliminated from the UPDATE bank.  This reduced the size of the 
UPDATE bank by about 50%.  This was the good news.

      Significantly, we found more than 1,000 sequences new to the tape
release which never appeared in our UPDATE bank.  These were either never
posted to UPDATE, lost en route to us, or lost through errors in our
software.  Right now we are not able to determine where these sequences were
lost, or the reason for the loss. 

      We therefore strongly recommend that a method of checking 
the update process be implemented: without such a scheme it will simply be 
impossible to be sure that the updates received are complete and accurate.
The apparent loss rate on the UPDATE feed is too high to ignore.

      Right now it seems that the best method for checking would be to 
distribute, with the UPDATE news feed, a list of the sequences posted to 
USENET each week or every two weeks.  This list should contain the accession
number, the locus name and the posting date and time.  We will try to write a
small program which will check the list of weekly deliveries against the list
of sequences on hand to determine what is missing, and then request the
missing sequences automatically from the server. 
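
      As a rough illustration of the check described above, here is a minimal
sketch in Python.  It is not our VMS software.  The one-line-per-sequence
list format (accession, locus, date, time, separated by blanks), the function
names, and the accession numbers in the usage note are all assumptions made
up for this example, since no 'check' list format exists yet.

```python
from typing import Iterable, List, NamedTuple, Set


class Posting(NamedTuple):
    accession: str
    locus: str
    posted: str  # posting date and time, kept as plain text


def parse_checklist(lines: Iterable[str]) -> List[Posting]:
    """Parse lines of the ASSUMED form: ACCESSION LOCUS DATE TIME."""
    postings = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, short lines, and '#' comment lines.
        if len(fields) < 4 or line.lstrip().startswith("#"):
            continue
        accession, locus, date, time = fields[:4]
        postings.append(Posting(accession.upper(), locus.upper(), f"{date} {time}"))
    return postings


def find_missing(checklist: Iterable[Posting], on_hand: Set[str]) -> List[Posting]:
    """Return the postings whose accession number is not in the local bank."""
    have = {accession.upper() for accession in on_hand}
    return [p for p in checklist if p.accession not in have]
```

      For example, given a (made-up) list containing X00001 and X00002 and a
local bank holding only X00001, find_missing() returns the entry for X00002.
The accession numbers of the returned entries are what would be sent to the
server in a retrieval request.  Matching is done on accession number alone,
on the assumption that locus names may change between postings.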

      I think it is fair to say that the provision of these 'check' messages 
is not too much to ask for, and should be easy to provide.  It is hard to 
imagine that sequences are being posted to the net without any record being 
kept of what was actually sent out, so all that should be needed is to post
these lists.
      
      We have a new version of the VMS software for processing the data feed.
It is available as an archive in two parts (NIGHTLY.1, NIGHTLY.2) on our
MAILSERVer, and also as a single file by anonymous FTP.  The updated version 
fixes a problem associated with updating the bank with weekly batches FTPed 
from GenBank.

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology,  NYU Medical Center,  550 First Ave.,  NYC, 10016|
|Phone: (212) 340-5356: FAX: (212) 340-8139 (Alternate NYUMC) (212) 340-7190|
|E-Mail:  SMITH@NYUMED.BITNET (BITNET),  SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+