TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) (04/10/90)
I have just received version 63.0 of GenBank with its 33,377 sequences. I (like many others, I presume) keep the database as a single database in PIR format on my VMS system. Now that GenBank has passed the 32,000-odd mark (i.e. 2**15 - 1), the standard PIR VMS format falls apart, because the format stores the ISNO field as INTEGER*2.

Before I start reworking all the software on our system: are the powers that be at PIR (or elsewhere) settling on a new standard format? Or is the thing to do simply to break GenBank up into its taxonomic parts? I suspect EMBL will also crash through the 2**15 - 1 barrier with its next release.

Or is a new, generally agreed format about to emerge that will facilitate more dynamic updating, now that we have weekly ftp updates and daily USENET updates? The current PIR format more or less requires a complete database reload (or part thereof) every day or week, as the case may be.

Tony Kyne
================================================================================
Dr. Tony Kyne, Head, Computer Sciences Unit,
The Walter and Eliza Hall Institute of Medical Research,
P.O. Royal Melbourne Hospital, Victoria, 3050, Australia.
Phone: International +61-3-345-2586      FAX: International +61-3-347-0852
       National      03-345-2586              National      03-347-0852
Email: ACSnet:   tony@wehi.dn.mu.oz
       UUCP:     uunet!munnari!wehi.dn.mu.oz!tony
       Internet: tony%wehi.dn.mu.oz@uunet.uu.net
       PSIMAIL:  PSI%0505233430002::tony
================================================================================
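The overflow Tony describes can be demonstrated directly. The sketch below (Python standing in for the VMS Fortran; the database stores ISNO as a signed 16-bit INTEGER*2) shows what happens when a sequence count of 33,377 is squeezed into 16 bits:

```python
import struct

# An INTEGER*2 is a signed 16-bit word: the largest value it can hold
# is 2**15 - 1 = 32767, so GenBank 63.0's 33,377 sequences no longer fit.
print(2**15 - 1)  # 32767

# Packing 33377 into 16 bits and reading it back as a signed value shows
# the wraparound that breaks the ISNO field:
wrapped, = struct.unpack('<h', struct.pack('<H', 33377))
print(wrapped)  # 33377 - 2**16 = -32159
```

Any entry numbered above 32,767 thus comes back negative, which is why the format "falls apart" rather than failing cleanly.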
roy@phri.nyu.edu (Roy Smith) (04/10/90)
TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
> is a new generally agreed format about to emerge that will facilitate
> more dynamic updating now that we have weekly ftp updates and USENET daily
> updates. The current PIR format more or less requires a complete database
> reload (or part thereof) every day or week as the case maybe.

	The need to rebuild the database each time it is updated is a problem which has not escaped our attention. I can guess how Ross Smith and Dave Kristofferson (my partners in crime on the daily-updates experiment) would answer your question, but I'll let them speak for themselves. For my part, what we have done is to keep an essentially separate database just for the daily updates. That keeps the index-file rebuilds to a manageable size.

	We currently have a mishmash of all the updates in one file, but we envision moving to something like a 3-tier system. For each division of the database (viral, bacterial, etc.) we see having three files. The first two are the full and new-sequences files as they come off the GenBank tape. The third holds the daily updates that belong to that division. Only that last (presumably fairly small) file needs to be rebuilt often.

	If I understand it right, file 2 is a subset of file 1. So, if somebody wants to search the entire database, they need only search files 1 and 3. If they want to search just the new material since the last major release (i.e. run in keeping-up-with-the-Joneses mode), they need to search files 2 and 3.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Don't Worry, Be Happy"
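The 3-tier scheme above can be sketched in a few lines. The file names here are invented for illustration; they are not the actual GenBank or PIR file names:

```python
# Sketch of the 3-tier layout for one division of the database.
FULL  = "gb_viral.full"   # tier 1: full release, straight off the GB tape
NEW   = "gb_viral.new"    # tier 2: new sequences since the last release
                          #         (a subset of tier 1)
DAILY = "gb_viral.daily"  # tier 3: daily updates; the only file rebuilt often

def files_to_search(whole_database):
    """Return which tiers a search must cover."""
    if whole_database:
        return [FULL, DAILY]   # tier 2 is a subset of tier 1, so skip it
    return [NEW, DAILY]        # keeping-up-with-the-Joneses mode

print(files_to_search(True))
print(files_to_search(False))
```

Either way, only the small tier-3 file changes daily, so only its indices need frequent rebuilding.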
Mats.Sundvall@bio.embnet.se (Mats Sundvall) (04/11/90)
In article <1990Apr10.032155.22233@phri.nyu.edu>, roy@phri.nyu.edu (Roy Smith) writes:
> TONY@wehi.dn.mu.oz (Tony Kyne, Walter and Eliza Hall Institute) writes:
>
> The need to rebuild the database each time it is updated is a
> problem which has not escaped our attention. I can guess how Ross Smith and
> Dave Kristofferson (my partners in crime on the daily updates experiment),
> would answer your question, but I'll let them talk for themselves. As for
> my part, what we have done is to keep essentially a complete separate
> database just for the daily updates. That makes the size of the index file
> rebuilds managable. We currently have a mishmosh of all the updates in one
> file, but we envision probably doing something like a 3-tier system.

We have made the changes needed to APPEND entries to the PIR-format database. We use only the GCG package, not NAQ or PSQ, so we do not know whether this works with those programs.

The idea is quite simple. You append an entry to the sequential file, then update the byte pointer in the index file to point to the new entry instead of the old one. This of course leaves some old entries in the sequence file with no pointers to them. That is not a big problem for the programs that use the index files to retrieve entries.

You are, of course, in trouble when using database-searching programs like WORDSEARCH and FASTA, which read the database sequentially to run faster: you get several matches to the same sequence. But on the program's second pass, when it retrieves the sequence, it will fetch only the right one. This will skew the statistics, but we feel it is a minor problem compared with the other solutions on offer.

Of course you will need some sort of garbage collection after a while. There are ways to do this, but at the moment we plan to let the delivery of new tapes be our garbage collector: we just install the new tape and start all over again.
Of course, the problem of duplicated matches arises only when you get updates to already-existing entries.

Questions about the availability of the "fixes" should be addressed to Peter.Gad@Bio.embnet.SE, who did the actual coding. He may read this and post some information himself.

Mats Sundvall
Biomedical Center
Uppsala University
Sweden
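The append-and-repoint scheme Mats describes can be sketched in miniature. This is an illustrative Python sketch, not the actual GCG/PIR code; the function names and the (offset, length) index layout are assumptions:

```python
# Sketch of the append-and-repoint update scheme described above.
# `index` maps entry name -> (byte offset, length) into the sequential
# file; this layout is an assumption, not the real PIR index format.

def append_entry(seq_path, index, name, entry_text):
    """Append a (possibly revised) entry and repoint the index at it.

    Any earlier copy of the entry stays in the file as unreferenced
    garbage until the next full release tape replaces everything."""
    with open(seq_path, "a") as f:
        f.seek(0, 2)              # position at end of file
        offset = f.tell()
        f.write(entry_text)
    index[name] = (offset, len(entry_text))  # old offset is abandoned

def fetch_entry(seq_path, index, name):
    """Retrieve an entry via the index; always finds the newest copy."""
    offset, length = index[name]
    with open(seq_path) as f:
        f.seek(offset)
        return f.read(length)
```

A sequential scan of the file still sees both the old and the new copy of a revised entry, which is exactly the duplicate-match behaviour noted above for WORDSEARCH and FASTA; index-based retrieval, by contrast, always lands on the current copy.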