dmf@MED.UNC.EDU (08/24/90)
Hi,

I just noted yesterday what others have probably already noted. However, I wish to call this to the attention of the few remaining uninformed people.

We noted that some of the GenBank release 64 files were incomplete. These had been downloaded about the end of July. For instance, the GBpln.seq files had no yeast sequences. This caused grave misgivings at our institution, which has more than its share of yeast people. Some other files were also incomplete. I checked with a friend who had also done a download at about the same time. His files were identical to mine. I went back to GenBank and checked to see if new files were available. Yes, they are, and they are dated August 13. Thus, if you downloaded GenBank 64 before August 13, you need to compare the sizes of your files with those of the files presently available.

Would it be possible for those who FTP files to e-mail a note about the files we transfer so that GenBank could automatically let us know if there have been corrections?

Sincerely,
Dana

Dana M. Fowlkes
dmf@med.unc.edu
westerm@aclcb.purdue.edu (Rick Westerman) (08/24/90)
In article <9008241347.AA08860@acme.med.unc.edu>, dmf@MED.UNC.EDU writes:
>
>We noted that some of the GenBank release 64 files were incomplete. These
>had been downloaded about the end of July. For instance, the GBpln.seq
>files had no yeast sequences.

I downloaded GB rel 64 on the 18th of July. My plant file has 860 yeast-related sequences in it.

>... Thus, If you have downloaded Genbank 64 before August
>13, you need to check your files for their sizes compared to those
>presently available.

I run the GCG package from Wisconsin, thus my files have been reformatted and I can't directly compare file sizes.

Would someone from genbank clarify this situation, i.e., if the database was downloaded near the end of July, is it complete or not?

-- Rick

Rick Westerman
AIDS Center Laboratory for Computational Biochemistry
Biochemistry building, Purdue University, W. Lafayette, IN 47907
Internet: westerm@aclcb.purdue.edu
(317) 494-0505
kristoff@genbank.BIO.NET (David Kristofferson) (08/25/90)
Dana and Rick, Your messages have been received, and I'll have someone on the staff here look into the reported problem. -- Sincerely, Dave Kristofferson GenBank Manager kristoff@genbank.bio.net
wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (08/25/90)
Here are some file sizes. The old file sizes are from a tar tape that I wrote after downloading the files around July 25. The new file sizes are those reported by ftp for files dated Aug. 13. The +/- indicates whether the old file is larger (+) or smaller (-) than the new. Only 3 files are the same size.

    new                                   old
4067041 Aug 13 22:43 gbbct.seq.Z      4067267 gbbct.seq.Z +
   8304 Aug 13 22:43 gbdat.frm.Z         8299 gbdat.frm.Z +
2584078 Aug 13 22:44 gbinv.seq.Z      2582469 gbinv.seq.Z +
1240959 Aug 13 22:44 gbmam.seq.Z      1240711 gbmam.seq.Z -
 102181 Aug 13 22:44 gbnew.txt.Z       102363 gbnew.txt.Z +
1192042 Aug 13 22:44 gborg.seq.Z      1195380 gborg.seq.Z +
 492669 Aug 13 22:44 gbphg.seq.Z       492669 gbphg.seq.Z
2681641 Aug 13 22:45 gbpln.seq.Z      2681845 gbpln.seq.Z +
6246157 Aug 13 22:46 gbpri.seq.Z      6246011 gbpri.seq.Z -
 688692 Aug 13 22:46 gbrna.seq.Z       688692 gbrna.seq.Z
5476544 Aug 13 22:47 gbrod.seq.Z      5476559 gbrod.seq.Z +
 922073 Aug 13 22:47 gbsdr.txt.Z       920822 gbsdr.txt.Z -
 500417 Aug 13 22:47 gbsyn.seq.Z       500435 gbsyn.seq.Z +
2385611 Aug 13 22:47 gbuna.seq.Z      2385611 gbuna.seq.Z
3776117 Aug 13 22:48 gbvrl.seq.Z      3775449 gbvrl.seq.Z -
1438000 Aug 13 22:48 gbvrt.seq.Z      1437604 gbvrt.seq.Z +

Bill Pearson
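The comparison described in this post can be sketched in a few lines of Python. This is only an illustration: the function name `size_flag` is an assumption, and the three sample rows are copied from the listing in this post.

```python
# Minimal sketch: flag each file by whether the old copy is larger (+),
# smaller (-), or the same size as the new copy.
# size_flag is a hypothetical helper name, not part of any GenBank tool.

def size_flag(old_size, new_size):
    """Return '+' if old is larger, '-' if old is smaller, '' if equal."""
    if old_size > new_size:
        return "+"
    if old_size < new_size:
        return "-"
    return ""

# (file name, new size in bytes, old size in bytes), taken from the post
rows = [
    ("gbbct.seq.Z", 4067041, 4067267),
    ("gbmam.seq.Z", 1240959, 1240711),
    ("gbphg.seq.Z", 492669, 492669),
]

for name, new, old in rows:
    print(f"{new:>8} {old:>8} {name:<12} {size_flag(old, new)}")
```

Sizes of compressed files only show that two copies differ; they say nothing about which copy is complete.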
smith@mcclb0.med.nyu.edu (08/25/90)
In article <1990Aug24.193617.7712@murdoch.acc.Virginia.EDU>, wrp@biochsn.acc.Virginia.EDU (William R. Pearson) writes:
> Here are some file sizes. The old file sizes are from a tar
> tape that I wrote after downloading the files around July 25.

If errors are found in sequences distributed in release 64, the corrections should be re-posted to UPDATES, even if the correct sequence was already sent there. We have stripped our UPDATE bank, and have probably removed the 'good' postings.

+---------------------------------------------------------------------------+
|Ross Smith, Cell Biology, NYU Medical Center, 550 First Ave., NYC, 10016|
|Phone: (212) 340-5356: FAX: (212) 340-8139 (Alternate NYUMC) (212) 340-7190|
|E-Mail: SMITH@NYUMED.BITNET (BITNET), SMITH@MCCLB0.MED.NYU.EDU (Internet)|
+---------------------------------------------------------------------------+
benton@genbank.BIO.NET (David Benton) (08/26/90)
After checking the GenBank release 64 files which were on-line in the genbank.bio.net ftp directory (~ftp/pub/db/gb-rel64) from early July to mid-August, I can say unequivocally that GenBank Release 64.0 was *not* incomplete, either as it appeared in that directory or as distributed on mag tape.

In the course of preparing floppy disk-format files from those GenBank data files, we discovered a systematic error in the way certain feature locations were formatted in the files. This error affected about 1250 of the 185,079 features in Release 64.0. I, therefore, applied a global correction to the files and replaced the files in the ftp directory with the corrected files. All but one or two of the annotated divisions were affected, and the total number of affected entries is probably greater than 1000. The number of entries in each division and the number of lines (and the number of words) in each data file did not change. The only change was to certain locations which were originally written as (for example) 357357 and should have been 357. So each of the affected files has grown smaller by a small number of bytes. (I think Bill Pearson's results can be explained by the fact that he compared the sizes of the compressed files, and the amount of L-Z compression is sensitive to the content.)

Due to my own oversight, I failed to post a notice to this newsgroup announcing the availability of the new files and the reason for the correction. I apologize for the inconvenience this has caused. While I am no longer in a position to guarantee that this won't happen in the future, Dave Kristofferson assures me that the new management of the project will be more vigilant. Our philosophy has always been that, since GenBank is a human endeavor, any snapshot of the database will contain "errors", but we bend all our efforts toward removing known errors before distribution of releases.
Now that a more continuously updated GenBank is widely available in many forms, we have attempted to correct errors as soon as they are known to us and notify recipients of the corrections as soon as possible. In general, as Dr. Smith recommended, because these corrections are applied to single entries (by the GenBank annotation staff at Los Alamos National Laboratory), the corrected entries are posted to bionet.molbio.genbank.updates. In the present case, however, because the corrections were globally applied to the entire database, I never had the 1000+ entries in my hand to individually post to the updates newsgroup.

It may be, in cases like this one, that if extracting each changed entry and posting it is a requirement for making a change to the database, we will be forced to decide not to make the corrections until the next release simply because the overhead (imposed by this requirement) is too great. I'm sure Dave Kristofferson will be happy to hear from users of the updates newsgroup on their requirements for the operation of that group.

Sincerely,
David Benton
GenBank Staff
benton@karyon.bio.net
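The remark in this post about compressed sizes can be illustrated with a small experiment. The sketch below is an assumption-laden toy: it uses Python's `zlib` (DEFLATE) as a stand-in for the `compress` program that produced the `.Z` files, and the feature lines are invented, not taken from Release 64.

```python
import zlib

def raw_and_compressed(text):
    """Return (uncompressed size, zlib-compressed size) in bytes."""
    data = text.encode("ascii")
    return len(data), len(zlib.compress(data))

# Invented feature lines: every tenth "to" position is written twice,
# in the style of the 357357 example; the corrected text removes the repeat.
before_lines = []
after_lines = []
for i in range(1, 501):
    start, end = i * 3, i * 3 + 354
    good = f"     CDS             {start}..{end}\n"
    bad = f"     CDS             {start}..{end}{end}\n"
    before_lines.append(bad if i % 10 == 0 else good)
    after_lines.append(good)

before = "".join(before_lines)
after = "".join(after_lines)

raw_before, z_before = raw_and_compressed(before)
raw_after, z_after = raw_and_compressed(after)

print("uncompressed:", raw_before, "->", raw_after)
print("compressed:  ", z_before, "->", z_after)
```

The uncompressed text always shrinks and keeps the same number of lines. The change in compressed size depends on the content, so it need not match the change in raw size.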
frist@ccu.umanitoba.ca (08/26/90)
In article <Aug.25.10.35.41.1990.21675@genbank.BIO.NET> of bionet.molbio.genbank Dave Benton (benton@karyon.bio.net) writes:
>After checking the GenBank release 64 files which were on-line in the
>genbank.bio.net ftp directory (~ftp/pub/db/gb-rel64) from early July
>to mid-August, I can say unequivocally that GenBank Release 64.0 was
>*not* incomplete, either as it appeared in that directory or as
>distributed on mag tape. In the course of preparing floppy disk-
>format files from those GenBank data files, we discovered a systematic
>error in the way certain feature locations were formatted in the
>files. This error affected about 1250 of the 185,079 features in
>Release 64.0. I, therefore, applied a global correction to the files
>and replaced the files in the ftp directory with the corrected files.

First, I just wanted to make absolutely sure of the meaning of the posting. Is it correct to say that the location formatting error you spoke of affected Release 64.0 on ALL media released from early July to mid Aug, and not just the floppy disk version? Specifically, would this error be reflected in SUN tar tapes dated Jun 1990?

Also, a quick browse through Release 64.0 indicates another systematic error. In all entries that I have looked at in which a gene is divided up into several exons, the feature key 'mRNA' is used to denote what should be 'prim_transcript'. (An example is shown below.) In fact, I have not yet found an example of a mature (ie. spliced using join()) mRNA in any entries that I have examined. Admittedly, I have not had a chance to really do a thorough search.

While we're on the subject, why do some entries have mRNA and CDS, and others just CDS? For example, many cDNAs have both features, which have identical locations, whereas others have only the CDS.

Example:

LOCUS       CHKACACB     5462 bp ds-DNA             VRT       04-AUG-1986
DEFINITION  Chicken cardiac alpha-actin gene, clone lambda-AC7, complete cds.
ACCESSION   X02212 K02256
KEYWORDS    actin; alpha-actin; alpha-cardiac actin.
SOURCE      Chicken genomic DNA, clone lambda-AC7 [1],[2].
  ORGANISM  Gallus gallus
            Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Aves;
            Neornithes; Neognathae; Galliformes; Phasianidae; Gallus gallus.
REFERENCE   1  (bases 841 to 897)
  AUTHORS   Chang,K.S., Zimmer,W.E.Jr., Bergsma,D.J., Dodgson,J.B. and
            Schwartz,R.J.
  TITLE     Isolation and characterization of six different chicken actin
            genes
  JOURNAL   Mol. Cell. Biol. 4, 2498-2508 (1984)
  STANDARD  full staff_review
REFERENCE   2  (bases 1 to 5462)
  AUTHORS   Chang,K.S., Rothblum,K.N. and Schwartz,R.J.
  TITLE     The complete sequence of the chicken alpha-cardiac actin gene:
            A highly conserved vertebrate gene
  JOURNAL   Nucleic Acids Res. 13, 1223-1237 (1985)
  STANDARD  full staff_review
COMMENT     [1] also sequenced part of the 3' terminal fragment.
FEATURES             Location/Qualifiers
     mRNA            299..5075
                     /note="actin mRNA"
     intron          339..820
                     /note="actin mRNA intron A"
     intron          970..1801
                     /note="actin cds intron B"
     intron          2127..2751
                     /note="actin cds intron C"
     intron          2914..3025
                     /note="actin cds intron D"
     intron          3218..4215
                     /note="actin cds intron E"
     intron          4398..4756
                     /note="actin cds intron F"
     CDS             join(841..969,1802..2126,
                     2752..2913,3026..3217,
                     4216..4397,4757..4900)
                     /note="cardiac alpha-actin"
BASE COUNT     1376 a   1280 c   1179 g   1627 t
ORIGIN      3 bp upstream of SmaI site.

===============================================================================
Brian Fristensky                frist@ccu.umanitoba.ca
Assistant Professor
Dept. of Plant Science
University of Manitoba
Winnipeg, MB R3T 2N2 CANADA
Office phone: 204-474-6085
FAX: 204-275-5128
===============================================================================
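The relationship between the CDS `join()` location and the intron features in the CHKACACB example can be checked mechanically. The sketch below is a hedged illustration; `parse_join` and `gaps_between` are hypothetical helper names, and the parser handles only plain `a..b` spans, not the full location syntax.

```python
import re

def parse_join(location):
    """Parse 'join(a..b,c..d,...)' into a list of (start, end) tuples."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def gaps_between(segments):
    """Return the spans lying between consecutive segments."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]

# CDS location copied from the CHKACACB entry quoted in this post
cds = "join(841..969,1802..2126,2752..2913,3026..3217,4216..4397,4757..4900)"
segments = parse_join(cds)
introns = gaps_between(segments)

for start, end in introns:
    print(f"intron          {start}..{end}")
```

The five gaps correspond to introns B through F in the entry. Intron A (339..820) lies upstream of the first CDS segment, inside the 299..5075 span labelled mRNA, so it cannot be derived from the CDS alone.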
benton@genbank.BIO.NET (David Benton) (08/26/90)
Here is a more complete reply to some of the issues raised in Dana Fowlkes' original posting. Dana Fowlkes wrote:

> We noted that some of the GenBank release 64 files were incomplete. These
> had been downloaded about the end of July. For instance, the GBpln.seq
> files had no yeast sequences. This caused grave misgivings at our institution
> which has more than its share of yeast people. Some other files were also
> incomplete. I checked with a friend who had also done a download at about
> the same time. His files were identical to mine. I went back to GenBank
> and checked to see if new files were available. Yes, they are and are
> dated August 13. Thus, If you have downloaded Genbank 64 before August
> 13, you need to check your files for their sizes compared to those
> presently available.

Based on Rick Westerman's posting and on our own checking of the files originally in ~ftp/pub/db/gb-rel64 (recovered from the July 30 backup tape), I must conclude that the files which were there between mid-July and August 13 were complete. Although it isn't clear how Rick counted the "yeast-related sequences" in the plant division, the number he reported is of the right order. So, it seems that the complete files were on-line and at least one person successfully downloaded them.

Since Dana Fowlkes reports a second incident of identically incomplete files, the cause would seem to be more serious than a transient failure of ftp. I would suggest that anyone who retrieves files check their sizes after uncompressing them. The file gbrel.txt contains a table of the file sizes both before and after decompression.
Here is the part of the summary of Release 64 plant division which reports yeast entries (the full summary is in gbrel.txt):

Organism                            Reports Entries    Bases
----------------------------------- ------- ------- --------
Zygosaccharomyces fermentati              1       1     5416
Saccharomycopsis fibuligera               3       3     9339
Candida boidinii                          2       2     1863
Candida glabrata                          3       3     2758
Candida albicans                          8       6    11668
Candida tropicalis                       15      12    20761
Saccharomyces cerevisiae                976     807  1420238
Transposable element TY1                 41      36    43719
Saccharomyces diastaticus                 4       4     4319
Candida pelliculosa                       1       1     5327
Candida maltosa                           5       4     8167
Saccharomyces carlsbergensis             22      19    36227
Hansenula wingei                          3       3      720
Saccharomyces fibuligera                  2       2     6761
Yarrowia lipolytica                       5       4    11065
Kluyveromyces lactis                     36      27    83929
Hansenula polymorpha                      3       3     8018
Kluyveromyces fragilis                    1       1     4193
Zygosaccharomyces rouxii                  5       3    15025
Schizosaccharomyces pombe               110      92   154243
Pichia pastoris                           3       3      899
Cephalosporium acremonium                 4       4     2093
Yeast sp.                                33      32    15660
Candida utilis                            4       4     7578
Saccharomyces uvarum                      1       1     2001
Kluyveromyces drosophilarum               1       1     4757
Saccharomyces rosei                       1       1      278
Saccharomyces kluyveri                    3       2     2160
Zygosaccharomyces bailii                  1       1     5415

If the data files were complete, why were new files put on-line on August 13?

After the GenBank Release 64 tapes were shipped and the files posted in the ftp area, we, during the course of preparing the CD ROM release, found a number of errors in feature locations. These were corrected and the files placed in the ftp directory. These corrections had the effect of reducing the number of bytes in each of the annotated data divisions. The number of entries and lines in those files is unchanged. Likewise, the indexes are unchanged.

Regrettably, I overlooked posting a note to this newsgroup announcing that new versions of the Release 64 data files had been posted and why. I apologize for any inconvenience this has caused GenBank users.
While I am no longer in a position to guarantee that this will not happen in the future, I can say that our policy in the past has been to rectify our errors and alert users as soon as possible after identifying the error. Dave Kristofferson assures me that the new GenBank management will be even more vigilant in the future.

> Would it be possible for those who FTP files to e-mail a note about the
> files we transfer so that Genbank could automatically let us know if there
> have been corrections?

Perhaps a better solution would be "automatic" posting to this newsgroup when corrections and changes are made to the on-line data. That way those who forget to send the e-mail note will have the opportunity to receive the notification as well.

Sincerely,
David Benton
GenBank Staff
benton@karyon.bio.net
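The advice in this post, to check file sizes after uncompressing, could be automated along the following lines. This is a rough sketch under stated assumptions: the layout of the size table in gbrel.txt is not given in this thread, so the expected sizes are supplied as a plain dictionary, and the file names `gbdemo.seq` and `gbmissing.seq` are invented for the demonstration.

```python
import os
import tempfile

def check_sizes(expected, directory="."):
    """Return (name, expected_size, actual_size) for every mismatch.

    actual_size is None when the file is missing.
    """
    problems = []
    for name, want in expected.items():
        path = os.path.join(directory, name)
        actual = os.path.getsize(path) if os.path.exists(path) else None
        if actual != want:
            problems.append((name, want, actual))
    return problems

# Demonstration with invented files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "gbdemo.seq"), "wb") as fh:
        fh.write(b"LOCUS demo\n")  # 11 bytes
    # Hypothetical expected sizes; real values would come from gbrel.txt.
    expected = {"gbdemo.seq": 11, "gbmissing.seq": 5}
    problems = check_sizes(expected, tmp)

print(problems)
```

A truncated transfer shows up as an actual size smaller than the expected one, which is the failure suspected in this thread.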
benton@genbank.BIO.NET (David Benton) (08/27/90)
To Brian Fristensky's question:

> Is it correct to say that the location
> formatting error you spoke of affected Release 64.0 on ALL media released
> from early July to mid Aug, and not just the floppy disk version?
> Specifically, would this error be reflected in SUN tar tapes dated Jun
> 1990?

the answer is, unfortunately, "yes". The problem was discovered after those tapes had been shipped.

If you want to patch these location errors, I'll append an awk script which reads a GenBank .seq file and writes (to standard output) a sequence file in which the errors that occur in simple spans (about 1224 of the 1250) are corrected. If anyone is interested, I can post a second awk program which detects, but does not correct, the remaining errors. If you choose to use the attached program, I'd recommend diffing the output against the original file (especially if you've made any changes to that file) before you throw the original away. I've tested the program on the Rel 64.0 files as distributed and found no side effects, but it is "use at your own risk."

By the way, no floppy disk version of Release 64.0 was distributed. We will be shipping the floppy-format files for Release 64.1 (corrected), as well as the floppy-format files for Release 63.0, and the magnetic tape format files for GenBank Rel. 64.1 and GenPept Rel. 64.2 on a CD ROM in about two weeks. But that is the subject for another announcement.

The feature table in Rel 64 was created by automatic translation of the old feature tables (used in Rel 63 and before) to the new feature table format. Brian's comments on the use of the prim_transcript and mRNA feature keys are accurate (as I understand those keys) except that the prim_transcript is intended to be used to annotate the primary (initial) transcript, before any processing. In most cases, I would guess, the primary transcript is unknown. I will leave more detailed (and knowledgeable) comments for the GenBank annotation staff.
David Benton
GenBank
benton@karyon.bio.net

------------------------cut here---------------------------------------
# awk program to find any feature location "to" position which is greater
# than the sequence length, determine if it is a direct repeat ("456456")
# and, if so, divide it (as a string) in half ("456")
# note that this works only on simple spans
BEGIN {alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
       numeric = "-1234567890"}
/^LOCUS / {len = $3 + 0
           print
           next}
/^FEATURES /,/^BASE COUNT / {
    if ($1 == "FEATURES" || $1 == "BASE" || substr($0,6,1) == " "){
        print
        next}
    else{
        inlin = $0
        dot = index(inlin,"..")
        if ((index(alpha,substr(inlin,22,1)) != 0) || (dot == 0)){
            print
            next}
        from = substr(inlin,1,dot+1)
        to = substr(inlin,dot+2)
        while (index(numeric,substr(to,1,1)) == 0){
            from = from substr(to,1,1)
            to = substr(to,2)}
        half = length(to) / 2
        if ((to + 0 > len) && (substr(to,1,half) == substr(to,half+1))){
            to = substr(to,1,half)}
        print from to
        next}}
{ print }
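For readers without awk, the core test in the script above can be restated in Python. This is a loose rendering and an assumption on my part, not a line-for-line port: it handles a single feature line, leaves out the FEATURES/BASE COUNT range handling and the column-22 check, and the name `fix_doubled_end` is hypothetical.

```python
def fix_doubled_end(line, seq_len):
    """Halve a doubled 'to' position (e.g. 357357 -> 357) on a simple span.

    The line is returned unchanged unless the text after '..' is a number
    larger than seq_len whose first half equals its second half.
    """
    line = line.rstrip("\n")
    dot = line.find("..")
    if dot == -1:
        return line
    head, to = line[:dot + 2], line[dot + 2:]
    # Move a leading non-digit such as '>' over to the head.
    while to and not to[0].isdigit():
        head, to = head + to[0], to[1:]
    # Only plain, even-length digit strings can be exact repeats.
    if not to.isdigit() or len(to) % 2:
        return line
    half = len(to) // 2
    if int(to) > seq_len and to[:half] == to[half:]:
        return head + to[:half]
    return line

print(fix_doubled_end("     mRNA            299..50755075", 5462))
print(fix_doubled_end("     intron          339..820", 5462))
```

As in the awk script, a repeated number is halved only when it exceeds the sequence length, so a legitimate position such as 1212 in a 5000 bp entry is left alone.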
kristoff@genbank.BIO.NET (David Kristofferson) (08/27/90)
I contacted Dana Fowlkes to get further information about the problem. His colleague was also located at the University of North Carolina, albeit using a different VAX computer running VMS, versus Ultrix on Fowlkes' machine. The two also used different FTP implementations, and both had truncated gbpln.seq and gbvrl.seq. The only information that I could obtain was that there were no yeast sequences in gbpln.seq. I don't know whether or not their truncated files were exactly the same size.

To elaborate on David Benton's posting, we read back in gbpln.seq from a distribution tape and also retrieved the file gbpln.seq.Z from a July 30th backup of the ~ftp/pub/db/gb-rel64 directory; both files (after uncompressing the .Z file) were identical and contained over 800 yeast sequences. They also matched the file that was on-line for GOS usage.

In conclusion, the problem reported by Dana did not exist in the files that were available here nor in files that were sent out by tape; while I cannot track down the cause of his problem since it happened after the data left here, something must have happened either in FTP transmission over the Internet or upon uncompressing the files.

--
Sincerely,
Dave Kristofferson
GenBank Manager
kristoff@genbank.bio.net
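A quick way to tell whether a local copy of gbpln.seq contains yeast entries is to tally its ORGANISM lines. The following is a minimal sketch, assuming each entry carries one ORGANISM line as in the CHKACACB example earlier in the thread. The sample lines and the short genus list are illustrative only; the genera are drawn from the Release 64 summary table.

```python
from collections import Counter

# A few genera from the Release 64 summary; not an exhaustive list.
YEAST_GENERA = ("Saccharomyces", "Schizosaccharomyces", "Candida", "Kluyveromyces")

def count_by_organism(lines):
    """Tally entries by the text following the ORGANISM keyword."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == "ORGANISM":
            counts[fields[1].strip()] += 1
    return counts

def count_genera(counts, genera):
    """Sum the tallies whose organism name starts with one of the genera."""
    return sum(n for name, n in counts.items() if name.split()[0] in genera)

# Invented sample standing in for the lines of a .seq file.
sample = [
    "  ORGANISM  Saccharomyces cerevisiae",
    "//",
    "  ORGANISM  Saccharomyces cerevisiae",
    "//",
    "  ORGANISM  Schizosaccharomyces pombe",
    "//",
    "  ORGANISM  Gallus gallus",
    "//",
]

counts = count_by_organism(sample)
print(count_genera(counts, YEAST_GENERA))
```

On a real file the lines would come from `open("gbpln.seq")`. A count of zero would reproduce the symptom Dana reported, while a count in the hundreds would agree with the summary table.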