[bionet.molbio.genbank] GenBank Release 64 was incomplete

dmf@MED.UNC.EDU (08/24/90)

Hi,

I just noted yesterday what others have probably already noted.  However, I
wish to call this to the attention of the few remaining ignorant people.

We noted that some of the GenBank release 64 files were incomplete.  These
had been downloaded about the end of July.  For instance, the GBpln.seq
files had no yeast sequences.  This caused grave misgivings at our institution
which has more than its share of yeast people.  Some other files were also
incomplete.  I checked with a friend who had also done a download at about
the same time.  His files were identical to mine.  I went back to GenBank
and checked to see if new files were available.  Yes, they are, and they are
dated August 13.  Thus, if you downloaded GenBank 64 before August
13, you need to compare your file sizes with those of the files
presently available.

Would it be possible for those of us who FTP files to e-mail a note about the
files we transfer so that GenBank could automatically let us know if there
have been corrections?

Sincerely

Dana

Dana M. Fowlkes
dmf@med.unc.edu

westerm@aclcb.purdue.edu (Rick Westerman) (08/24/90)

In article <9008241347.AA08860@acme.med.unc.edu>, dmf@MED.UNC.EDU writes:
>
>We noted that some of the GenBank release 64 files were incomplete.  These
>had been downloaded about the end of July.  For instance, the GBpln.seq
>files had no yeast sequences.

I downloaded GB rel 64 on the 18th of July. My plant file has 860 yeast-related
sequences in it. 

>... Thus, If you have downloaded Genbank 64 before August
>13, you need to check your files for their sizes compared to those
>presently available.

I run the GCG package from Wisconsin, thus my files have been reformatted and
I can't directly compare file sizes. Would someone from GenBank clarify this
situation, i.e., if the database was downloaded near the end of July, is
it complete or not?

-- Rick

Rick Westerman                        AIDS Center Laboratory for Computational
Internet: westerm@aclcb.purdue.edu    Biochemistry, Biochemistry building,
(317) 494-0505                        Purdue University, W. Lafayette, IN 47907

kristoff@genbank.BIO.NET (David Kristofferson) (08/25/90)

Dana and Rick,

	Your messages have been received, and I'll have someone on the
staff here look into the reported problem.
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (08/25/90)

	Here are some file sizes.  The old file sizes are from a tar
tape that I wrote after downloading the files around July 25.  The
new file sizes are those reported by ftp for files dated Aug. 13.
The +/- indicates whether the old file is larger (+) or smaller (-)
than the new.  Only 3 files are the same size.


                   new                   old

       4067041 Aug 13 22:43 gbbct.seq.Z 4067267 gbbct.seq.Z +
          8304 Aug 13 22:43 gbdat.frm.Z    8299 gbdat.frm.Z -
       2584078 Aug 13 22:44 gbinv.seq.Z 2582469 gbinv.seq.Z -
       1240959 Aug 13 22:44 gbmam.seq.Z 1240711 gbmam.seq.Z -
        102181 Aug 13 22:44 gbnew.txt.Z  102363 gbnew.txt.Z +
       1192042 Aug 13 22:44 gborg.seq.Z 1195380 gborg.seq.Z +
        492669 Aug 13 22:44 gbphg.seq.Z  492669 gbphg.seq.Z
       2681641 Aug 13 22:45 gbpln.seq.Z 2681845 gbpln.seq.Z +
       6246157 Aug 13 22:46 gbpri.seq.Z 6246011 gbpri.seq.Z -
        688692 Aug 13 22:46 gbrna.seq.Z  688692 gbrna.seq.Z
       5476544 Aug 13 22:47 gbrod.seq.Z 5476559 gbrod.seq.Z +
        922073 Aug 13 22:47 gbsdr.txt.Z  920822 gbsdr.txt.Z -
        500417 Aug 13 22:47 gbsyn.seq.Z  500435 gbsyn.seq.Z +
       2385611 Aug 13 22:47 gbuna.seq.Z 2385611 gbuna.seq.Z
       3776117 Aug 13 22:48 gbvrl.seq.Z 3775449 gbvrl.seq.Z -
       1438000 Aug 13 22:48 gbvrt.seq.Z 1437604 gbvrt.seq.Z -

Bill Pearson

smith@mcclb0.med.nyu.edu (08/25/90)

In article <1990Aug24.193617.7712@murdoch.acc.Virginia.EDU>, wrp@biochsn.acc.Virginia.EDU (William R. Pearson) writes:
> 	Here are some file sizes.  The old file sizes are from a tar
> tape that I wrote after downloading the files around July 25.  

If errors are found in sequences distributed in release 64, the corrections 
should be re-posted to UPDATES, even if the correct sequence was already sent 
there.  We have stripped our UPDATE bank, and have probably removed the 
'good' postings.

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology,  NYU Medical Center,  550 First Ave.,  NYC, 10016|
|Phone: (212) 340-5356: FAX: (212) 340-8139 (Alternate NYUMC) (212) 340-7190|
|E-Mail:  SMITH@NYUMED.BITNET (BITNET),  SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+

benton@genbank.BIO.NET (David Benton) (08/26/90)

After checking the GenBank release 64 files which were on-line in the
genbank.bio.net ftp directory (~ftp/pub/db/gb-rel64) from early July
to mid-August, I can say unequivocally that GenBank Release 64.0 was
*not* incomplete, either as it appeared in that directory or as
distributed on mag tape.

In the course of preparing floppy disk-format files from those GenBank
data files, we discovered a systematic error in the way certain feature
locations were formatted in the files.  This error affected about 1250
of the 185,079 features in Release 64.0.  I, therefore, applied a global
correction to the files and replaced the files in the ftp directory with
the corrected files.  All but one or two of the annotated divisions were
affected and the total number of affected entries is probably greater
than 1000.  The number of entries in each division, the number of lines
(and the number of words) in each data file did not change.  The only
change was to certain locations which originally were written as (for
example) 357357 and which should have been 357.  So each of the affected
files has grown smaller by a small number of bytes.  (I think Bill
Pearson's results can be explained by the fact that he compared the
sizes of the compressed files and the amount of L-Z compression is
sensitive to the content.)

Due to my own oversight, I failed to post a notice to this newsgroup
announcing the availability of the new files and the reason for the
correction.  I apologize for the inconvenience this has caused.  While I
am no longer in a position to guarantee that this won't happen in the
future, Dave Kristofferson assures me that the new management of the
project will be more vigilant.

Our philosophy has always been that, since GenBank is a human endeavor,
any snapshot of the database will contain "errors", but we bend all our
efforts toward removing known errors before distribution of releases.
Now that a more continuously updated GenBank is widely available in many
forms, we have attempted to correct errors as soon as they are known to
us and notify recipients of the corrections as soon as possible.  In
general, as Dr. Smith recommended, because these corrections are applied
to single entries (by the GenBank annotation staff at Los Alamos
National Laboratory), the corrected entries are posted to
bionet.molbio.genbank.updates.

In the present case, however, because the corrections were globally
applied to the entire database, I never had the 1000+ entries in my hand
to individually post to the updates newsgroup.  It may be, in cases like
this one, that if extracting each changed entry and posting it is a
requirement for making a change to the database, we will be forced to
decide not to make the corrections until the next release simply because
the overhead (imposed by this requirement) is too great.  I'm sure Dave
Kristofferson will be happy to hear from users of the updates newsgroup
on their requirements for the operation of that group.
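
The remark above about compressed file sizes can be illustrated with a
small sketch.  It uses Python's zlib (DEFLATE) purely as a stand-in for
the L-Z compression applied to the .Z files, which is an assumption, and
the inputs are synthetic rather than GenBank data.  It shows only that
compressed size depends on content, so a longer input can compress to
fewer bytes than a shorter one.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the number of bytes zlib needs for the given data."""
    return len(zlib.compress(data))


# Synthetic inputs (not GenBank data): a short, highly varied byte string
# and a longer, highly repetitive one.
rng = random.Random(0)
varied = bytes(rng.randrange(256) for _ in range(1000))
repetitive = b"357357" * 500  # 3000 bytes

# The longer input ends up smaller once compressed.
print(len(varied), compressed_size(varied))
print(len(repetitive), compressed_size(repetitive))
```

For this reason, comparing the sizes of the uncompressed files is the
more direct check of whether two copies differ.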

Sincerely,

David Benton
GenBank Staff
benton@karyon.bio.net

frist@ccu.umanitoba.ca (08/26/90)

In article <Aug.25.10.35.41.1990.21675@genbank.BIO.NET> of
bionet.molbio.genbank Dave Benton (benton@karyon.bio.net)  writes:

>After checking the GenBank release 64 files which were on-line in the
>genbank.bio.net ftp directory (~ftp/pub/db/gb-rel64) from early July
>to mid-August, I can say unequivocally that GenBank Release 64.0 was
>*not* incomplete, either as it appeared in that directory or as
>distributed on mag tape.  In the course of preparing floppy disk-
>format files from those GenBank data files, we discovered a systematic
>error in the way certain feature locations were formatted in the
>files.  This error affected about 1250 of the 185,079 features in
>Release 64.0.  I, therefore, applied a global correction to the files
>and replaced the files in the ftp directory with the corrected files.

First, I just wanted to make absolutely sure of the meaning of the posting.
Is it correct to say that the location
formatting error you spoke of affected Release 64.0 on ALL media released
from early July to mid Aug, and not just the floppy disk version?
Specifically, would this error be reflected in SUN tar tapes dated Jun
1990?

Also, a quick browse through Release 64.0 indicates another systematic
error.  In all entries that I have looked at in which a gene is divided up
into several exons, the feature key 'mRNA' is used to denote what should be
'prim_transcript'. (An example is shown below.)  In fact, I have not yet
found an example of a mature (i.e., spliced using join()) mRNA in any
entries that I have examined. Admittedly, I have not yet had a chance to
do a thorough search.

While we're on the subject, why do some entries have mRNA and CDS, and
others just CDS?  For example, many cDNAs have both features, which have
identical locations, whereas others have only the CDS.   

Example:

LOCUS       CHKACACB     5462 bp ds-DNA             VRT       04-AUG-1986
DEFINITION  Chicken cardiac alpha-actin gene, clone lambda-AC7, complete cds.
ACCESSION   X02212 K02256
KEYWORDS    actin; alpha-actin; alpha-cardiac actin.
SOURCE      Chicken genomic DNA, clone lambda-AC7 [1],[2].
  ORGANISM  Gallus gallus
            Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Aves;
            Neornithes; Neognathae; Galliformes; Phasianidae; Gallus gallus.
REFERENCE   1 (bases 841 to 897)
  AUTHORS   Chang,K.S., Zimmer,W.E.Jr., Bergsma,D.J., Dodgson,J.B. and
            Schwartz,R.J.
  TITLE     Isolation and characterization of six different chicken actin genes
  JOURNAL   Mol. Cell. Biol. 4, 2498-2508 (1984)
  STANDARD  full staff_review
REFERENCE   2 (bases 1 to 5462)
  AUTHORS   Chang,K.S., Rothblum,K.N. and Schwartz,R.J.
  TITLE     The complete sequence of the chicken alpha-cardiac actin gene: A
            highly conserved vertebrate gene
  JOURNAL   Nucleic Acids Res. 13, 1223-1237 (1985)
  STANDARD  full staff_review
COMMENT     [1] also sequenced part of the 3' terminal fragment.
FEATURES             Location/Qualifiers
     mRNA            299..5075
                     /note="actin mRNA"
     intron          339..820
                     /note="actin mRNA intron A"
     intron          970..1801
                     /note="actin cds intron B"
     intron          2127..2751
                     /note="actin cds intron C"
     intron          2914..3025
                     /note="actin cds intron D"
     intron          3218..4215
                     /note="actin cds intron E"
     intron          4398..4756
                     /note="actin cds intron F"
     CDS             join(841..969,1802..2126,
                     2752..2913,3026..3217,
                     4216..4397,4757..4900)
                     /note="cardiac alpha-actin"
BASE COUNT     1376 a   1280 c   1179 g   1627 t
ORIGIN      3 bp upstream of SmaI site.

===============================================================================
Brian Fristensky                           frist@ccu.umanitoba.ca
Assistant Professor
Dept. of Plant Science
University of Manitoba
Winnipeg, MB R3T 2N2  CANADA
Office phone:                              204-474-6085
FAX:                                       204-275-5128
===============================================================================

benton@genbank.BIO.NET (David Benton) (08/26/90)

Here is a more complete reply to some of the issues raised in Dana
Fowlkes' original posting.

Dana Fowlkes wrote:

> We noted that some of the GenBank release 64 files were incomplete.  These
> had been downloaded about the end of July.  For instance, the GBpln.seq
> files had no yeast sequences.  This caused grave misgivings at our institution
> which has more than its share of yeast people.  Some other files were also
> incomplete.  I checked with a friend who had also done a download at about
> the same time.  His files were identical to mine.  I went back to GenBank
> and checked to see if new files were available.  Yes, they are and are
> dated August 13.  Thus, If you have downloaded Genbank 64 before August
> 13, you need to check your files for their sizes compared to those
> presently available.

Based on Rick Westerman's posting and on our own checking of the
files originally in ~ftp/pub/db/gb-rel64 (recovered from the July 30
backup tape), I must conclude that the files which were there between
mid-July and August 13 were complete.  Although it isn't clear how
Rick counted the "yeast-related sequences" in the plant division, the
number he reported is of the right order.  So, it seems that the
complete files were on-line and at least one person successfully
downloaded them.  Since Dana Fowlkes reports a second incident of
identically incomplete files, the cause would seem to be more serious
than a transient failure of ftp.  I would suggest that anyone who
retrieves files check their sizes after uncompressing them.  The file
gbrel.txt contains a table of the file sizes both before and after
decompression.
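
As a rough sketch of that size check, the Python function below compares
the sizes of local files against a table of expected sizes.  The function
name and the dictionary are hypothetical; the expected values would have
to be copied by hand from gbrel.txt, since no assumption is made here
about how that table is laid out.

```python
import os


def find_size_mismatches(expected_sizes, directory="."):
    """Return (name, expected, actual) for every file whose size differs.

    expected_sizes maps a file name to its expected size in bytes.
    actual is None when the file is missing from the directory.
    """
    mismatches = []
    for name, expected in sorted(expected_sizes.items()):
        path = os.path.join(directory, name)
        actual = os.path.getsize(path) if os.path.exists(path) else None
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches


# Hypothetical usage; the byte count is a placeholder, not a real value:
# for name, expected, actual in find_size_mismatches({"gbpln.seq": 0}):
#     print(name, "expected", expected, "found", actual)
```

An empty result means every listed file is present with the expected
size; it does not by itself prove that the contents are correct.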

Here is the part of the summary of Release 64 plant division
which reports yeast entries (the full summary is in gbrel.txt):

 Organism                               Reports Entries      Bases
 -----------------------------------    ------- -------   --------
 Zygosaccharomyces fermentati                 1       1       5416
 Saccharomycopsis fibuligera                  3       3       9339
 Candida boidinii                             2       2       1863
 Candida glabrata                             3       3       2758
 Candida albicans                             8       6      11668
 Candida tropicalis                          15      12      20761
 Saccharomyces cerevisiae                   976     807    1420238
 Transposable element TY1                    41      36      43719
 Saccharomyces diastaticus                    4       4       4319
 Candida pelliculosa                          1       1       5327
 Candida maltosa                              5       4       8167
 Saccharomyces carlsbergensis                22      19      36227
 Hansenula wingei                             3       3        720
 Saccharomyces fibuligera                     2       2       6761
 Yarrowia lipolytica                          5       4      11065
 Kluyveromyces lactis                        36      27      83929
 Hansenula polymorpha                         3       3       8018
 Kluyveromyces fragilis                       1       1       4193
 Zygosaccharomyces rouxii                     5       3      15025
 Schizosaccharomyces pombe                  110      92     154243
 Pichia pastoris                              3       3        899
 Cephalosporium acremonium                    4       4       2093
 Yeast sp.                                   33      32      15660
 Candida utilis                               4       4       7578
 Saccharomyces uvarum                         1       1       2001
 Kluyveromyces drosophilarum                  1       1       4757
 Saccharomyces rosei                          1       1        278
 Saccharomyces kluyveri                       3       2       2160
 Zygosaccharomyces bailii                     1       1       5415

If the data files were complete, why were new files put on-line on
August 13? After the GenBank Release 64 tapes were shipped and the
files posted in the ftp area, we, during the course of preparing the
CD ROM release, found a number of errors in feature locations.  These
were corrected and the files placed in the ftp directory.  These
corrections had the effect of reducing the number of bytes in each of
the annotated data divisions.  The number of entries and lines in
those files is unchanged.  Likewise, the indexes are unchanged.
Regrettably, I overlooked posting a note to this newsgroup announcing
that new versions of the Release 64 data files had been posted and
why.  I apologize for any inconvenience this has caused GenBank users.
While I am no longer in a position to guarantee that this will not
happen in the future, I can say that our policy in the past has been
to rectify our errors and alert users as soon as possible after
identifying the error.  Dave Kristofferson assures me that the new
GenBank management will be even more vigilant in the future.

> Would it be possible for those  who FTP files to e-mail a note about the 
> files we transfer so that Genbank could automatically let us know if there
> have been corrections?  

Perhaps a better solution would be "automatic" posting to this
newsgroup when corrections and changes are made to the on-line data.
That way those who forget to send the e-mail note will have an
opportunity to receive the notification as well.

Sincerely,

David Benton
GenBank Staff
benton@karyon.bio.net

benton@genbank.BIO.NET (David Benton) (08/27/90)

To Brian Fristensky's question:

> Is it correct to say that the location
> formatting error you spoke of affected Release 64.0 on ALL media released
> from early July to mid Aug, and not just the floppy disk version?
> Specifically, would this error be reflected in SUN tar tapes dated Jun
> 1990?

the answer is, unfortunately, "yes".  The problem was discovered after
those tapes had been shipped.  If you want to patch these location
errors, I'll append an awk script which reads a GenBank .seq file and
writes (to standard output) a sequence file with those of the errors
which occur in simple spans (about 1224 of the 1250) corrected.  If
anyone is interested, I can post a second awk program which detects,
but does not correct, the remaining errors.  If you choose to use the
attached program, I'd recommend diffing the output against the
original file (especially if you've made any changes to that file)
before you throw the original away.  I've tested the program on the
Rel 64.0 files as distributed and found no side effects, but it is
"use at your own risk."
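
For those without a diff utility to hand, the comparison could be
sketched in Python with the standard difflib module.  The function name
and the doubled location in the usage comment are hypothetical examples,
not lines taken from the release.

```python
import difflib


def location_diff(original_lines, patched_lines):
    """Return a unified diff (no context lines) between two line lists."""
    return list(
        difflib.unified_diff(
            original_lines,
            patched_lines,
            fromfile="original",
            tofile="patched",
            n=0,
            lineterm="",
        )
    )


# Hypothetical usage: a doubled "to" position before and after patching.
# before = ["     mRNA            299..50755075"]
# after = ["     mRNA            299..5075"]
# print("\n".join(location_diff(before, after)))
```

Every changed line should differ only in the "to" position of a feature
location; anything else in the diff deserves a closer look before the
original file is discarded.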

By the way, no floppy disk version of Release 64.0 was
distributed. We will be shipping the floppy-format
files for Release 64.1 (corrected), as well as the floppy-format files
for Release 63.0, and the magnetic tape format files for GenBank Rel. 64.1
and GenPept Rel. 64.2 on a CD ROM in about two weeks.  But that is the
subject for another announcement.

The feature table in Rel 64 was created by automatic translation of
the old feature tables (used in Rel 63 and before) to the new feature
table format.  Brian's comments on the use of the prim_transcript and mRNA
feature keys are accurate (as I understand those keys) except that
the prim_transcript is intended to be used to annotate the primary
(initial) transcript, before any processing.  In most cases, I would
guess, the primary transcript is unknown.  I will leave more detailed
(and knowledgeable) comments for the GenBank annotation staff.

David Benton
GenBank
benton@karyon.bio.net

------------------------cut here---------------------------------------

# awk program to find any feature location "to" position which is greater
# than the sequence length, determine if it is a direct repeat ("456456")
# and, if so, divide it (as a string) in half ("456")
# note that this works only on simple spans


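BEGIN {
	alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	numeric = "-1234567890"
}

# remember the sequence length given on the LOCUS line
/^LOCUS / {
	len = $3 + 0
	print
	next
}

# process the feature table, from FEATURES through BASE COUNT
/^FEATURES /,/^BASE COUNT  / {
	# pass through the header lines and any continuation or qualifier
	# line (blank in column 6)
	if ($1 == "FEATURES" || $1 == "BASE" || substr($0,6,1) == " ") {
		print
		next
	}

	inlin = $0
	dot = index(inlin,"..")

	# skip locations that begin with a letter (e.g. join) or have no ".."
	if ((index(alpha,substr(inlin,22,1)) != 0) || (dot == 0)) {
		print
		next
	}

	# split the line after the ".."; move any non-numeric prefix of the
	# "to" position onto the "from" part (stop if nothing is left)
	from = substr(inlin,1,dot+1)
	to = substr(inlin,dot+2)
	while (to != "" && index(numeric,substr(to,1,1)) == 0) {
		from = from substr(to,1,1)
		to = substr(to,2)
	}
	half = length(to) / 2

	# halve the "to" position if it exceeds the sequence length and is a
	# direct repeat
	if ((to + 0 > len) && (substr(to,1,half) == substr(to,half+1))) {
		to = substr(to,1,half)
	}
	print from to
	next
}

# copy every other line unchanged
{ print }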
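
For readers more comfortable with Python, the core test of the awk
program can be sketched as a single function.  This is an illustration
of the same idea rather than a replacement for the script: it handles
one "to" position at a time, and the function name is hypothetical.

```python
def halve_if_doubled(to_position: str, sequence_length: int) -> str:
    """Return the first half of a doubled position such as "357357".

    The position is halved only when it is all digits, has even length,
    exceeds the sequence length, and its two halves are identical.
    """
    half, remainder = divmod(len(to_position), 2)
    if (
        remainder == 0
        and to_position.isdigit()
        and int(to_position) > sequence_length
        and to_position[:half] == to_position[half:]
    ):
        return to_position[:half]
    return to_position


# With a sequence of 5462 bp, "357357" is halved, while "1212" is left
# alone because it does not exceed the sequence length.
print(halve_if_doubled("357357", 5462))
print(halve_if_doubled("1212", 5462))
```

Requiring the position to exceed the sequence length is what keeps
legitimate positions that merely look doubled from being altered.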

kristoff@genbank.BIO.NET (David Kristofferson) (08/27/90)

I contacted Dana Fowlkes to get further information about the problem.
His colleague was also located at the University of North Carolina,
albeit using a different VAX computer running VMS rather than the Ultrix
on Fowlkes' machine.  Both also used different FTP implementations and
both had truncated gbpln.seq and gbvrl.seq.  The only information that
I could obtain was that there were no yeast sequences in gbpln.seq.  I
don't know whether or not their truncated files were exactly the same
size.  

To elaborate on David Benton's posting, we read back in gbpln.seq from
a distribution tape and also retrieved the file gbpln.seq.Z from a
July 30th backup of the ~ftp/pub/db/gb-rel64 directory; both files
(after uncompressing the .Z file) were identical and contained over
800 yeast sequences.  They also matched the file that was on-line for
GOS usage.  

In conclusion, the problem reported by Dana did not exist in the files
that were available here nor in files that were sent out by tape;
while I cannot track down the cause of his problem since it happened
after the data left here, something must have happened either in FTP
transmission over the Internet or upon uncompressing the files.
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net