[bionet.molbio.genbank] GenBank gets big and PIR format has problems!

TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) (04/10/90)

	I have just received version 63.0 of GenBank with its 33,377 sequences.
I (like many others, I presume) keep it as a single database in PIR format
on my VMS system. Now that GenBank has passed the 32,767 mark (i.e. 2**15-1),
the standard PIR VMS format falls apart, as the format stores the ISNO part
in INTEGER*2 format.
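
To make the overflow concrete, here is a minimal Python sketch of what happens
when 33,377 is forced into a signed 16-bit field. It only illustrates the
arithmetic: the function names are hypothetical, the byte order is an arbitrary
choice, and whether the VMS software reports an error or silently wraps is an
assumption, not something stated in this thread.

```python
import struct

INT16_MAX = 2**15 - 1  # 32767, the largest value a signed 16-bit integer holds


def pack_isno_int16(isno: int) -> bytes:
    """Pack a sequence number as a signed 16-bit integer (like INTEGER*2)."""
    return struct.pack("<h", isno)


def fits_int16(isno: int) -> bool:
    """Return True if the sequence number can be stored in 16 signed bits."""
    try:
        pack_isno_int16(isno)
        return True
    except struct.error:
        # struct refuses values outside -32768..32767 for the 'h' format
        return False


def wrapped_int16(isno: int) -> int:
    """Value left behind if the number wrapped silently (two's complement)."""
    return struct.unpack("<h", struct.pack("<H", isno & 0xFFFF))[0]
```

Under these assumptions, `fits_int16(32767)` is true and `fits_int16(33377)` is
false, while a silent wraparound would turn entry 33,377 into -32159.
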
	Before I start reworking all the software on our system, are the powers
that be at PIR (or elsewhere) settling on a new standard format? Or is the
thing to do just to break up GenBank into its taxonomic parts? I suspect EMBL
will also crash through the 2**15-1 barrier with its next release.
Or is a new generally agreed format about to emerge that will facilitate
more dynamic updating now that we have weekly ftp updates and USENET daily
updates? The current PIR format more or less requires a complete database
reload (or part thereof) every day or week, as the case may be.
	                               Tony Kyne

================================================================================
Dr. Tony Kyne, Head, Computer Sciences Unit,
               The Walter and Eliza Hall Institute of Medical Research,
               P.O. Royal Melbourne Hospital, Victoria, 3050, Australia.
Phone: International +61-3-345-2586       FAX: International +61-3-347-0852
            National 03-345-2586                    National 03-347-0852
Email: ACSnet: tony@wehi.dn.mu.oz        UUCP: uunet!munnari!wehi.dn.mu.oz!tony
     Internet: tony%wehi.dn.mu.oz@uunet.uu.net
      PSIMAIL: PSI%0505233430002::tony
===============================================================================

roy@phri.nyu.edu (Roy Smith) (04/10/90)

TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
> is a new generally agreed format about to emerge that will facilitate
> more dynamic updating now that we have weekly ftp updates and USENET daily
> updates? The current PIR format more or less requires a complete database
> reload (or part thereof) every day or week, as the case may be.

	The need to rebuild the database each time it is updated is a
problem which has not escaped our attention.  I can guess how Ross Smith and
Dave Kristofferson (my partners in crime on the daily updates experiment)
would answer your question, but I'll let them speak for themselves.  As for
my part, what we have done is to keep what is essentially a completely separate
database just for the daily updates.  That keeps the index file
rebuilds manageable.  We currently have a mishmash of all the updates in one
file, but we envision doing something like a 3-tier system.

	For each division of the database (viral, bacterial, etc.), we see
having 3 files.  The first two are the full and new-sequences files as they
come off the GB tape.  The third holds those daily updates that belong to that
division.  Only that last (presumably fairly small) file need be rebuilt
often.  If I understand it right, file 2 is a subset of file 1.  So, if
somebody wants to search the entire database, they only need search 1 and 3.
If they want to search just the new stuff since the last major release (i.e.
run in keeping-up-with-the-Joneses mode), they need to search 2 and 3.
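
The file selection described above can be sketched in a few lines of Python.
This is only an illustration of the 3-tier idea: the file naming scheme, the
mode names "all" and "new", and the function name are all hypothetical, since
the post does not say how the files would be named or selected.

```python
from typing import Dict, List

# Hypothetical file names; the post does not specify a naming scheme.
TIERS: Dict[str, str] = {
    "full": "{division}.full",    # file 1: full division from the GenBank tape
    "new": "{division}.new",      # file 2: new sequences in that release
    "daily": "{division}.daily",  # file 3: daily updates since the release
}


def files_to_search(division: str, mode: str) -> List[str]:
    """Return the files to search for one division.

    mode "all": whole database, files 1 and 3 (file 2 is a subset of file 1).
    mode "new": only material since the last major release, files 2 and 3.
    """
    if mode == "all":
        tiers = ["full", "daily"]
    elif mode == "new":
        tiers = ["new", "daily"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [TIERS[tier].format(division=division) for tier in tiers]
```

In both modes only the daily file changes between tape releases, so that is the
only index that would need frequent rebuilding.
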
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Don't Worry, Be Happy"

Mats.Sundvall@bio.embnet.se (Mats Sundvall) (04/11/90)

In article <1990Apr10.032155.22233@phri.nyu.edu>, roy@phri.nyu.edu (Roy Smith) writes:
> TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
> 
> 	The need to rebuild the database each time it is updated is a
> problem which has not escaped our attention.  I can guess how Ross Smith and
> Dave Kristofferson (my partners in crime on the daily updates experiment)
> would answer your question, but I'll let them speak for themselves.  As for
> my part, what we have done is to keep what is essentially a completely separate
> database just for the daily updates.  That keeps the index file
> rebuilds manageable.  We currently have a mishmash of all the updates in one
> file, but we envision doing something like a 3-tier system.
> 

We have made the changes needed to APPEND entries to the PIR format
database. We use only the GCG package, not NAQ and PSQ, so we do
not know whether this works with them.

The idea is quite simple. You append an entry to the sequential file.
Then you update the byte pointer in the index file to point to the new
entry instead of the old one. This of course leaves you with some
old entries in the sequential file that nothing points to. This is not a big
problem for programs that use the index files to retrieve
entries. You do run into trouble, though, with database searching
programs like wordsearch and FASTA, which read the database sequentially
to run faster. You get several matches to the same sequence, but in the
second round of the program, when it retrieves the sequence, it will fetch
only the right one. This will skew the statistics, but we feel this
is a minor problem compared to the other solutions offered.
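
A toy Python model may help show why indexed retrieval stays correct while a
sequential scan sees duplicates. It is a sketch under loud assumptions: the
class name is hypothetical, entries are stored as one tab-separated line each,
and an in-memory buffer and a dict stand in for the sequential and index files.
None of this is the real PIR layout or the actual code mentioned in this post.

```python
import io
from typing import Dict, Iterator, Tuple


class AppendOnlyStore:
    """Toy model: a sequential file plus an index of entry name -> byte offset."""

    def __init__(self) -> None:
        self.seq = io.BytesIO()          # stands in for the sequential file
        self.index: Dict[str, int] = {}  # stands in for the index file

    def append(self, name: str, sequence: str) -> None:
        """Append an entry and repoint the index at the new copy."""
        offset = self.seq.seek(0, io.SEEK_END)
        self.seq.write(f"{name}\t{sequence}\n".encode("ascii"))
        self.index[name] = offset  # any older copy is now unreferenced

    def fetch(self, name: str) -> str:
        """Indexed retrieval: always returns the most recent copy."""
        self.seq.seek(self.index[name])
        line = self.seq.readline().decode("ascii")
        return line.rstrip("\n").split("\t", 1)[1]

    def scan(self) -> Iterator[Tuple[str, str]]:
        """Sequential read: yields every copy, including stale ones."""
        self.seq.seek(0)
        for raw in self.seq:
            name, sequence = raw.decode("ascii").rstrip("\n").split("\t", 1)
            yield name, sequence
```

If entry X1 is appended, then X2, then an updated X1, `fetch("X1")` returns only
the updated sequence, but `scan()` yields X1 twice. That second hit is the
duplicate match a sequential search program would report.
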

Of course you will need some sort of garbage collection after a while.
There are ways to do this, but at the moment we plan to let the delivery
of new tapes be our garbage collector. We just install the new tape
and start all over again.
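
For sites that cannot wait for the next tape, one possible form of garbage
collection is a compaction pass that copies only the entries the index still
points to. The Python sketch below is an illustration, not the method used
here: the function name is hypothetical, and it assumes a toy layout of one
newline-terminated, tab-separated entry per line with an index that maps each
entry name to its byte offset.

```python
import io
from typing import Dict, Tuple


def compact(seq_bytes: bytes, index: Dict[str, int]) -> Tuple[bytes, Dict[str, int]]:
    """Copy only live entries to a new file image and rebuild the index.

    An entry is live if the index points at its starting byte offset.
    """
    live_offsets = set(index.values())
    out = io.BytesIO()
    new_index: Dict[str, int] = {}
    offset = 0
    for raw in io.BytesIO(seq_bytes):
        if offset in live_offsets:
            name = raw.split(b"\t", 1)[0].decode("ascii")
            new_index[name] = out.tell()  # offset of the entry in the new file
            out.write(raw)
        offset += len(raw)  # advance to the start of the next entry
    return out.getvalue(), new_index
```

After such a pass a sequential search sees each entry once again, at the cost
of rewriting the whole file, which is the reload the append scheme was meant
to avoid doing every day.
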

Of course the problem with duplicate matches occurs only when you get
updates to already existing entries.

Questions about availability of the "fixes" should be addressed to
Peter.Gad@Bio.embnet.SE, who did the actual coding. He may read
this and post some info himself.

> --
> Roy Smith, Public Health Research Institute
> 455 First Avenue, New York, NY 10016
> roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
> "Don't Worry, Be Happy"

	Mats Sundvall
	Biomedical Center
	Uppsala University
	Sweden