[bionet.molbio.genome-program] local copies of genbank

cjv@mace.cc.purdue.edu (westerman) (03/21/90)

As a system manager who plans to keep the GenBank database online on my
systems, and who plans *not* to use, except rarely, the fasta and
retrieval capabilities of genbank.bio.net, I'd like to respond to Dave
Kristofferson's recent posting on why he thinks local copies of the
database should be discouraged.

First, my circumstances:

   1) We are running the GCG (Wisconsin) sequence analysis package on
	VAX/VMS systems.

   2) My systems are not overloaded; we have spare CPU power and disk
	space.

   3) I do weekly updates of the database via ftp. This takes about 5
        minutes of my time and about 1 1/2 hours of machine time (done
        in the background at very low priority, maybe 15 minutes of
        actual CPU time). A minimal sketch of such an update job appears
        after this list.

   4) I have looked at/installed Clark's shells. They are very nice
	and hide the "dirty details" from the user.
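
A minimal sketch of the weekly update job mentioned in item 3, written in
Python for illustration.  The host name, remote directory, and file name
are assumptions standing in for whatever the actual GenBank FTP arrangement
is; on a VMS system the real job would more likely be a batch FTP command
file run at low priority.

# Fetch the weekly GenBank update file by anonymous FTP (sketch only).
# HOST, REMOTE_DIR, and REMOTE_FILE are placeholders, not confirmed paths.
from ftplib import FTP

HOST = "genbank.bio.net"       # assumption: the server discussed in this thread
REMOTE_DIR = "genbank"         # assumption: remote directory with the updates
REMOTE_FILE = "gbupd.seq"      # assumption: name of the weekly update file
LOCAL_FILE = "gbupd.seq"

def fetch_weekly_update():
    ftp = FTP(HOST)
    ftp.login()                            # anonymous login
    ftp.cwd(REMOTE_DIR)
    with open(LOCAL_FILE, "wb") as out:
        ftp.retrbinary("RETR " + REMOTE_FILE, out.write)
    ftp.quit()

if __name__ == "__main__":
    fetch_weekly_update()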


My objections to using genbank server are threefold:

   1) Time.
        While it only takes a little more time to retrieve a database
        entry from the server than from our local database (I estimate
        twice as long, which isn't bad considering the emailing that
        needs to be done), this delay is irritating when you are sitting
        looking at a blank CRT.

        While I haven't done fasta timing tests, I suspect that the genbank
        computer is faster than mine; on the other hand, having 4 computers
        at my disposal means I can do 4 searches simultaneously. In any case,
        fasta searching is not time critical -- a search of 1/2 hour (via
        genbank) or 2 hours (maximum via my computers) still means that I
        must walk away from my desk and/or do something else; either way
        I am not sitting around just waiting (unlike in the retrieval case
        above).

   2) Formatting
        Retrieval results from genbank come back in a form that I cannot
        immediately use for further processing; instead, I must extract
        the sequence from my mail and then convert it to a form the GCG
        package can use. Granted, these steps are minor, but they are
        extra steps and irritating because of that.

   3) Other uses of the database
	I have other programs that need to access the entire database
	besides fasta. One of these is the GCG program "FIND", which finds 
	short matches in sequences; one of my group is using this program
	to try to find various promoter sites. By having a local copy of
	the database, we can do theoretical analysis of the database.
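
To make objection 3 concrete, here is a minimal sketch of the kind of
whole-database short-match scan it describes.  This is not the GCG FIND
program; the flat-file name, the motif, and the simple mismatch rule are
assumptions chosen only for illustration.

# Scan a GenBank-style flat file for short, approximate matches to a motif.
def entries(path):
    """Yield (locus, sequence) pairs from a GenBank-style flat file."""
    locus, seq, in_seq = None, [], False
    with open(path) as fh:
        for line in fh:
            if line.startswith("LOCUS"):
                locus, seq, in_seq = line.split()[1], [], False
            elif line.startswith("ORIGIN"):
                in_seq = True
            elif line.startswith("//"):
                if locus:
                    yield locus, "".join(seq).upper()
                locus, in_seq = None, False
            elif in_seq:
                # sequence lines: keep the bases, drop position numbers/blanks
                seq.append("".join(c for c in line if c.isalpha()))

def scan(path, motif, max_mismatch=1):
    """Print every position where motif matches with <= max_mismatch errors."""
    motif = motif.upper()
    for locus, seq in entries(path):
        for i in range(len(seq) - len(motif) + 1):
            window = seq[i:i + len(motif)]
            if sum(a != b for a, b in zip(window, motif)) <= max_mismatch:
                print(locus, i + 1, window)

if __name__ == "__main__":
    scan("gbpri.seq", "TATAAT", max_mismatch=1)   # hypothetical file and motif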


A further comment:

   4) I suspect that the reason genbank is currently a feasible option
      is that it is not overloaded, much in the same manner as my system
      is able to handle at least 4 fasta searches at a time; however,
      if we started getting over 6 searches we would start bogging down,
      and if genbank started getting over XXX (60? ten times my load?)
      searches at a time, they would bog down too. (BTW: I have about a 10
      MIPS system.)


I wish I could contribute further to this thread of netnews, but I am off on
vacation for a week or so. 

-- Rick


-- 

Rick Westerman                        AIDS Center Laboratory for Computational
Internet: cjv@mace.cc.purdue.edu      Biochemistry, Biochemistry building,
(317) 494-0505                        Purdue University, W. Lafayette, IN 47907

kristoff@genbank.BIO.NET (David Kristofferson) (03/21/90)

Rick,

I'm sure that there are lots of machines which are not yet overloaded.
The issue will begin to develop as the Genome Project starts to
produce larger amounts of sequence data than currently comes in each
day.

I also was not questioning whether or not it was difficult to do the
updates; the issue will be disk space and CPU power as the database
grows.

You mentioned some of the problems with extracting sequences
out of e-mail messages, but I should remind you that users can also
directly access the IRX program on genbank.bio.net and download
sequences of interest directly without mail headers.  The time of
less than half an hour applies just to FASTA searches.  E-mail
retrieval of sequences takes about 2-3 minutes based on some tests
that I have run from the east coast to our machine in California.

Regarding formatting, it is true that the sequences come in a form
that is not immediately usable by your commercial software, but this
also holds true for the daily updates over USENET or the weekly FTP
files.  Reformatting for GCG must be done in any case.  Of course, it
is undoubtedly easier for the systems manager to process a whole block
of data at a time, but it is also trivial to have a small script which
users can run to do this on sequences of interest.
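
As an illustration of such a script, here is a minimal sketch that pulls
the bare sequence out of the GenBank entry embedded in a mailed reply
(everything before the ORIGIN line, including the mail headers, is simply
ignored).  The file names are hypothetical, and the final conversion into
GCG's own sequence format is assumed to be handled by the package's
reformatting tools rather than by this script.

# Extract the sequence portion of a mailed GenBank entry (sketch only).
import sys

def extract_sequence(mail_file):
    """Return the bases found between ORIGIN and // in a mailed entry."""
    seq, in_seq = [], False
    with open(mail_file) as fh:
        for line in fh:
            if line.startswith("ORIGIN"):
                in_seq = True
            elif line.startswith("//"):
                in_seq = False
            elif in_seq:
                # keep only the bases, dropping position numbers and blanks
                seq.append("".join(c for c in line if c.isalpha()))
    return "".join(seq)

if __name__ == "__main__":
    # usage: python extract_seq.py reply.mail > entry.seq
    sys.stdout.write(extract_sequence(sys.argv[1]) + "\n")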

We are in agreement about the need for a local copy of the database if
you run local analyses on the whole database other than FASTA and IRX.
I should point out, however, that the functionality which you are
using is also available on the GOS computer for those who get accounts
on the system (the QUEST program).

Regarding overload, this will undoubtedly happen here eventually too.
As compared to a 10 MIPS machine, our system has four 22 MIPS
processors.  We have not been bogged down yet and can handle much more
than 4 fasta searches at a time.  Part of my point though ***which
holds true even if you do have a local copy of GenBank*** is that you
can save your local CPU power by offloading FASTA searches to our
machines.  That way your users will have better response for their
other uses of your local software.

Have a good vacation!
-- 
				Sincerely,

				Dave Kristofferson
				GenBank On-line Service Manager

				kristoff@genbank.bio.net

roy@phri.nyu.edu (Roy Smith) (03/21/90)

cjv@mace.cc.purdue.edu (Rick Westerman) writes:
> I'd like to respond to Dave Kristofferson's recent posting on why he
> thinks local copies of the database should be discouraged.

        Both Dave and Rick make good points.  This is basically the
centralized vs. distributed computing argument all over again.  The latest
incarnation of this argument is raging right now in (I think) comp.arch,
and involves X-terminals vs. workstations, but the gist is the same.  For
some people, one solution will be the right one, but not for everybody.
For somebody with something like a PC/AT class machine with a 40 Meg disk,
I don't think there is any doubt that accessing a central fasta server is
the only rational way to go.  I am almost totally ignorant of the ways of
PC's, but I assume that there is some sort of software one can run on a PC
to allow you to send and receive mail.

	I suspect, however, that if everybody who currently maintains their
own GB database were to suddenly switch to banging on the bionet Solbourne
for fasta searches, that machine would quickly roll over and die.  That may
not be a fair argument, however, since presumably the switch would be gradual
and you can always buy more Solbournes to keep up with the load.  But then
the question becomes, if we (in the global sense) are going to buy 10 more
Solbournes for the central fasta site, why not buy me and 9 other research
institutes Solbournes instead and let us use them to number crunch when
we're not running fasta?

        Of course, while that might make sense, it's not as though I can use
the 100+ Meg of disk genbank takes up for something else when I'm not doing
fasta searches, or at least not without a lot of fuss to shuffle things to
tape, or what have you.  One could, I guess, just NFS mount the genbank
partition directly from the bionet server, but that's probably not a good
way to make use of network bandwidth (although, with the plans afoot to
make the whole NSFNET 45 Mbps or even 1 Gbps, we're going to have to
find something to do with all that bandwidth!).  On the other hand, it
might make sense for a half dozen sites within a single University-wide 10
Mbps LAN to share one copy on disk.

> other programs that need to access the entire database besides fasta

	This one's the kicker.  While I think Dave is right that the vast
majority of what people want to do with genbank is run fasta, there are
enough other uses to make me want to keep my own copy.  For example, we
have a program given to us years ago by Jim Fickett which parses the
genbank features table and generates a protein database by translating the
annotated reading frames.  People can then search that derived database.
Yes, a lot of our derived database overlaps with dayhoff/PIR, but there is
a lot of stuff which doesn't make it into PIR.  You could run tfasta, but
people around here say they prefer using fasta on the derived database,
claiming that it finds things that tfasta doesn't.  In practice, if people
are serious about doing a protein search, they use all three methods and
merge the results.
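
For readers who have not seen such a program, here is a minimal sketch of
the translation step it needs.  It is not the Fickett program itself: the
codon table is just the standard genetic code, and parsing the features
table (joins, complements, partial coding regions) is assumed to have been
done already.

# Translate a coding sequence into protein using the standard genetic code.
CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(cds):
    """Translate a coding sequence (5'->3', frame 1) into protein."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")   # X marks ambiguous codons
        if aa == "*":                             # stop codon ends the frame
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCTTGGTAA"))   # prints "MAW"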

	One thing that strikes me about genbank is that it's about an order
of magnitude bigger than it has to be.  If all you want to do is run fasta
locally, you don't need the annotations.  Right there, you cut the size of
the files in half.  Next, with the database stored as ascii, each base
takes up 8 bits when it really only needs 2.  Another factor of 4 savings.
Maybe what people like me should be doing is storing a binary version of
just the sequence data to run fasta against and throwing away the ascii
files?  I could then retrieve the full annotated ascii version of any
interesting loci from a central server after a fasta run is finished.  PIR,
by the way, is even worse.  For some reason that I have never figured out,
they put a blank space between every residue in the ascii version of the
data base!  This makes the files bigger without adding any information.
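
For what it is worth, the 2-bits-per-base packing is easy to sketch.  The
sketch below assumes unambiguous A/C/G/T only; real GenBank entries also
contain ambiguity codes (N, R, Y, and so on) that a practical scheme would
have to handle, and fasta itself would of course need to be taught to read
the packed format.

# Pack an A/C/G/T string into 2 bits per base, 4 bases per byte.
CODE   = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def pack(seq):
    """Return (packed bytes, original length) for an A/C/G/T string."""
    seq = seq.upper()
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))   # pad the final, partial byte
        out.append(byte)
    return bytes(out), len(seq)

def unpack(data, length):
    """Recover the original string from packed bytes and its length."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DECODE[(byte >> shift) & 3])
    return "".join(bases[:length])

if __name__ == "__main__":
    packed, n = pack("ACGTACG")
    assert unpack(packed, n) == "ACGTACG"
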
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"