[bionet.molbio.genbank] genbank update cycle and submitted data lag time

cb%intron@LANL.GOV (Christian Burks) (11/20/88)

Dear Dr. Smith:

I gather that Dave Benton answered some of your questions...I've
made a stab at answering the others.

Thanks for your concern and interest.

Christian Burks

> Return-Path: <phri!alanine.phri!roy@nyu.edu>
> Received: from rutgers.edu by BIONET-20.BIO.NET with TCP; Sat 5 Nov 88 02:34:36-PST
> Received: by rutgers.edu (5.59/1.15) 
> 	id AA06289; Sat, 5 Nov 88 05:34:19 EST
> Received: by phri.phri (5.51/5.17)
> 	id AA11252; Fri, 4 Nov 88 14:05:43 EST
> Received: by alanine.phri (3.2/5.17)
> 	id AA24967; Fri, 4 Nov 88 14:07:05 EST
> Date: Fri, 4 Nov 88 14:07:05 EST
> From: phri!alanine.phri!roy@nyu.edu (Roy Smith)
> Message-Id: <8811041907.AA24967@alanine.phri>
> To: benton@bionet-20.bio.net, nucall@bionet-20.bio.net
> Subject: Re: Missing entries in GenBank (cooperation with EMBL)
> Cc: roy@alanine.phri
> 
> Dr. Benton,
> 
> 	Thank you for checking this out for me.  While I am glad to know it
> will be in the next release, I do wonder if a 10 month (end of January to
> start of December) lag between publication and entry into the data base is
> too long.  Granted, we're looking at a worst case because we just missed a
> release and thus incurred an extra 3-month delay, but even 7 months seems
> like a long time.  Keep in mind that this 7 or 10 month delay starts
> counting from the date the paper is published; given that most papers take 6
> months or more from submission to when they hit the presses, you're talking
> over a year from the time a sequence is known to when it's on-line.
> 
> 	What is a typical amount of time between publication of a sequence
> in a journal and when it goes out on a GenBank tape?  What amount of time
> is considered "good" by the GenBank staff (i.e. the target delay, beyond
> which subscribers should feel justified complaining about)?
> 

Most data are now getting into a public release of GenBank within
3-5 months of receipt date at LANL.  The major exception is for
data received prior to publication which the author requests we
withhold from public release until some future date (e.g., date
of publication in a journal article); in this case the data are
queued into a public release as soon as that date is reached.

(This may still fall short of the ideal ...but it should be compared
with 2 years ago when the average time from receipt at LANL to
appearance in the database was 12-14 months.  That was clearly
unacceptable, and we've put much effort and resources into
turning that around.)

When should a correspondent feel concerned enough to follow up?
At this point I would suggest that if more than one public
release has gone by since LANL received and acknowledged receipt
of the data (and if the author released the data for public
consumption at that point), they should definitely contact us.
If one submits data initially and doesn't receive an acknowledgement
within two weeks, that should be followed up on immediately.
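
The rule of thumb above can be written down as a small check. This is only a sketch: the function name, its parameters, and the release dates in the usage example are hypothetical, and it assumes that "more than one public release has gone by" means two or more releases dated after both the acknowledgement and the date the author cleared the data for release.

```python
from datetime import date, timedelta
from typing import Optional, Sequence

# Two-week acknowledgement window, as stated in the message above.
ACK_WINDOW = timedelta(weeks=2)


def should_follow_up(
    submitted: date,
    acknowledged: Optional[date],
    release_dates: Sequence[date],
    today: date,
    public_from: Optional[date] = None,
) -> bool:
    """Return True if the submitter should contact the database staff.

    public_from is the date from which the author allows public release
    (hypothetical parameter); it defaults to the acknowledgement date.
    """
    if acknowledged is None:
        # No acknowledgement yet: follow up once two weeks have passed.
        return today - submitted > ACK_WINDOW

    # Assumption: only releases after both the acknowledgement and the
    # author's release date count as "gone by".
    start = max(acknowledged, public_from or acknowledged)
    missed = [d for d in release_dates if start < d <= today]
    return len(missed) > 1


# Hypothetical quarterly release dates, for illustration only.
releases = [date(1988, 3, 1), date(1988, 6, 1), date(1988, 9, 1), date(1988, 12, 1)]
print(should_follow_up(date(1988, 2, 1), date(1988, 2, 10), releases, date(1988, 10, 1)))
```

With these made-up dates, three releases have gone by since the acknowledgement, so the check suggests contacting the staff.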

> 	Are there any plans to have more frequent updates in-between the
> major quarterly releases?  I could envision once a week ftping the latest
> stuff from bionet-20 (or wherever the master copy is maintained, or perhaps
> there could be several repository sites around the country to reduce system
> load and network congestion).  These intermediate updates could be
> unannotated to get them out faster.  I could envision three levels an entry
> would go through.  First, as soon as possible, an unannotated entry is made
> available for ftp.  Second, each time a quarterly tape goes out, all those
> entries in the ftp area which are not yet fully ready would be put on
> the tape as part of the current unannotated section.  Lastly, when entries
> are fully annotated, checked, indexed, and otherwise massaged into their
> final form, they would be merged into the main data base.
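
The three levels proposed above might be modelled roughly as follows. This is a sketch of the suggestion only, not of how GenBank is actually maintained; the `Stage`, `Entry`, and `cut_quarterly_tape` names are hypothetical, and it assumes for simplicity that promotion into the main database is decided at the moment a quarterly tape is cut.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Stage(Enum):
    FTP_UNANNOTATED = 1   # level 1: raw entry available for ftp
    TAPE_UNANNOTATED = 2  # level 2: unannotated section of a quarterly tape
    MAIN_DATABASE = 3     # level 3: annotated, checked, indexed, merged


@dataclass
class Entry:
    name: str
    fully_annotated: bool = False
    stage: Stage = Stage.FTP_UNANNOTATED  # every entry starts in the ftp area


def cut_quarterly_tape(entries: List[Entry]) -> Tuple[List[str], List[str]]:
    """Assign each entry to a tape section and return (main, unannotated) names."""
    main, unannotated = [], []
    for entry in entries:
        if entry.fully_annotated:
            entry.stage = Stage.MAIN_DATABASE
            main.append(entry.name)
        else:
            entry.stage = Stage.TAPE_UNANNOTATED
            unannotated.append(entry.name)
    return main, unannotated


# Made-up entry names, for illustration only.
entries = [Entry("SEQ_A", fully_annotated=True), Entry("SEQ_B")]
main, unannotated = cut_quarterly_tape(entries)
print(main, unannotated)
```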

There are many schemes (including the one you suggest) for getting
incremental data out earlier...in fact, until a year ago we did
distribute an interim (six-week) release that included only
"new" data...this was dropped because very few people requested
subscriptions to it, and those who did admitted (with only 1-2
exceptions) that they didn't use it anyway.
Given the way that we were maintaining the data at that time, these
interim releases were very time consuming for us with little -- as
far as we could see -- benefit reaching the user community.

Over the next year, we will be shifting over to a data maintenance scheme
that will allow for the continuous or almost-continuous updating of
the database for internal maintenance...we hope by the end of the
year to have established some reflection of this continuity in
the distributed data, perhaps with even weekly updates being
available in some distributed form.

> 
> 	The idea is to get the sequence data out as fast as possible to the
> scientists who want to see it.  From what I see, I classify GenBank (and
> the same comments go pretty much for Dayhoff and other similar databases)
> usage into two categories.  First is "I want to know the sequence of XXXX".
> This is straight-forward and if XXXX is not yet in the database, you find
> out fast.  If it's critical that you know about XXXX, you can always call
> the author or something like that.  The second one is the shot-in-the-dark
> search.  This latter one is where you really get killed by slow updates,
> because if you don't find something, you don't know what you missed.  These
> searches are often for sequence homologies; but people just as often say
> "give me all the erythromycin resistance genes" or something like that, for
> which the same comments about the dangers of slow updates apply.
> -------

We share these concerns, and although we've made great strides in
this regard over the past 18 months, we believe the release cycle time
will improve much further over the coming year.

kramerj@bionette.CS.ORST.EDU (Jack Kramer - CMBL) (12/03/88)

I have been working with many molecular biologists for several years doing
much of their computer work. I often get complaints about the interval of
time between the publication of a paper and inclusion of a contained
sequence in the database.  Most often, when I ask if they submitted the
sequence to the database,  the response is that they didn't have time to
fill out the form or didn't think it was necessary.  "It was published,
why shouldn't everyone rush to take care of this most important sequence?"

I think this may be the most common reason for the gap between publication
and availability.  Both Dave and Christian were too kind to scold complainers
in this class.  The original publication does not always have all the
information that should be included in the database.  Even when it does,
it can take an unreasonable amount of time to decipher many of the papers.
Prompt submission of all the required data directly to the correct location
would probably alleviate any rational basis for dissatisfaction with the 
delays.

I would like to relay the thanks from several hundred biologists at
Oregon State University, who have benefited from the excellent work that
has been and is being done by the molecular database staffs.  Handling
even just the volume of data that is submitted properly, I'm sure, calls
for effort well beyond the normal.  And then much is done even beyond that.

THANKS!!!!!!

Jack Kramer
Computational Molecular Biology Laboratory
Oregon State University

Kristofferson@BIONET-20.BIO.NET (David Kristofferson) (12/03/88)

Well put, Jack!  The attitudes that you describe in your message are
encountered time and time again.  You have truly hit the nail right on
the head.  I am sure that the database staff will appreciate your kind
remarks.  Criticism often comes from those with little knowledge of
the magnitude of the efforts involved.

Dave

-------