cjv@mace.cc.purdue.edu (westerman) (03/21/90)
As a system manager who plans to keep the GenBank database online on my systems, and who plans *not* to use the fasta and retrieval capabilities of genbank.bio.net except rarely, I'd like to respond to Dave Kristofferson's recent posting on why he thinks local copies of the database should be discouraged.

First, my circumstances:

1) We are running the GCG (Wisconsin) sequence analysis package on VAX/VMS systems.

2) My systems are not overloaded; we have spare CPU power and disk space.

3) I do weekly updates of the database via ftp. This takes about 5 minutes of my time and about 1 1/2 hours of machine time (done in the background at very low priority, maybe 15 minutes of actual CPU time).

4) I have looked at and installed Clark's shells. They are very nice and hide the "dirty details" from the user.

My objections to using the genbank server are threefold:

1) Time. While it takes only a little more time to retrieve a database entry from the server than from our local database (I estimate twice as long, which isn't bad considering the emailing that needs to be done), the delay is irritating when you're sitting looking at a blank CRT. I haven't done fasta timing tests, and I suspect that the genbank computer is faster than mine; on the other hand, having 4 computers at my disposal means I can do 4 searches simultaneously. In any case, fasta searching is not time critical -- a search of 1/2 hour (via genbank) or 2 hours (maximum via my computers) still means that I must walk away from my desk and/or do something else; either way I am not sitting around just waiting (unlike in the retrieval case above).

2) Formatting. Retrieval results from genbank come back in a form that I cannot immediately use for further processing; instead I must extract the sequence from my mail and then convert it to a form the GCG package can use. Granted, these steps are minor, but they are extra steps and irritating because of that.

3) Other uses of the database. I have other programs besides fasta that need to access the entire database. One of these is the GCG program "FIND", which finds short matches in sequences; one of my group is using this program to try to find various promoter sites. By having a local copy of the database, we can do theoretical analysis of the database.

A further comment:

4) I suspect that the reason genbank is currently a feasible option is that it is not overloaded, much as my system is able to handle a minimum of 4 fasta searches at a time; however, if we started getting over 6 searches we would start bogging down, and if genbank started getting over XXX (60? ten times my load?) searches at a time, it would bog down too. (BTW: I have about a 10 MIPS system.)

I wish I could contribute further to this thread of netnews, but I am off on vacation for a week or so.

-- Rick

--
Rick Westerman                             Internet: cjv@mace.cc.purdue.edu
AIDS Center Laboratory for Computational Biochemistry
Biochemistry Building, Purdue University, W. Lafayette, IN 47907
(317) 494-0505
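Rick's weekly update amounts to little more than an automated file transfer plus the usual GCG reformatting. Below is a minimal sketch of that kind of fetch; the host name, directory, and file name are hypothetical placeholders, not the actual GenBank distribution layout.

    # Sketch of an automated weekly update fetch.  The host, directory,
    # and file names are hypothetical -- not the real distribution layout.
    from ftplib import FTP

    HOST = "ftp.example.edu"          # hypothetical distribution host
    REMOTE_DIR = "/genbank/updates"   # hypothetical update directory
    FILES = ["gbupd.seq"]             # hypothetical flat-file name

    def fetch_updates():
        ftp = FTP(HOST)
        ftp.login()                   # anonymous login
        ftp.cwd(REMOTE_DIR)
        for name in FILES:
            with open(name, "wb") as out:
                # Stream each file to local disk in blocks.
                ftp.retrbinary("RETR " + name, out.write)
        ftp.quit()

    if __name__ == "__main__":
        fetch_updates()

Run from a weekly batch job, a script along these lines keeps the hands-on time down to roughly the few minutes Rick reports.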
kristoff@genbank.BIO.NET (David Kristofferson) (03/21/90)
Rick,

I'm sure that there are lots of machines which are not yet overloaded. The issue will begin to develop as the Genome Project starts to produce much larger amounts of sequence data than currently come in each day. I also was not questioning whether it is difficult to do the updates; the issue will be disk space and CPU power as the database grows.

You mentioned some of the problems with extracting sequences out of e-mail messages, but I should remind you that users can also access the IRX program on genbank.bio.net directly and download sequences of interest without mail headers. The time of less than half an hour applies just to FASTA searches; e-mail retrieval of sequences takes about 2-3 minutes, based on some tests that I have run from the east coast to our machine in California.

Regarding formatting, it is true that the sequences come in a form that is not immediately usable by your commercial software, but this also holds true for the daily updates over USENET or the weekly FTP files. Reformatting for GCG must be done in any case. Of course, it is undoubtedly easier for the systems manager to process a whole block of data at a time, but it is also trivial to have a small script which users can run to do this on sequences of interest.

We are in agreement about the need for a local copy of the database if you run local analyses on the whole database other than FASTA and IRX. I should point out, however, that the functionality which you are using is also available on the GOS computer for those who get accounts on the system (the QUEST program).

Regarding overload, this will undoubtedly happen here eventually too. As compared to a 10 MIPS machine, our system has four 22 MIPS processors. We have not been bogged down yet and can handle many more than 4 FASTA searches at a time. Part of my point, though, ***which holds true even if you do have a local copy of GenBank*** is that you can save your local CPU power by offloading FASTA searches to our machines. That way your users will get better response for their other uses of your local software.

Have a good vacation!

--
Sincerely,

Dave Kristofferson
GenBank On-line Service Manager

kristoff@genbank.bio.net
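The "small script" Dave mentions could be as little as the sketch below: it skips everything in a saved server reply up to the ORIGIN line of the GenBank flat-file entry (so mail headers fall away automatically) and emits the bare sequence. Writing a proper GCG-format header is deliberately left out, since those details are package-specific; this is only an illustration of the idea, not any existing tool.

    # Sketch of a per-sequence cleanup filter: pull the raw sequence out
    # of a GenBank flat-file entry saved from a mail reply.  Emits plain
    # sequence text; adding a proper GCG header is left to the reader.
    import sys

    def extract_sequence(lines):
        seq = []
        in_origin = False
        for line in lines:
            if line.startswith("ORIGIN"):
                in_origin = True           # sequence block starts here
                continue
            if in_origin:
                if line.startswith("//"):  # end-of-entry marker
                    break
                # Drop position numbers and spacing, keep the bases.
                seq.append("".join(c for c in line if c.isalpha()))
        return "".join(seq)

    if __name__ == "__main__":
        print(extract_sequence(sys.stdin.read().splitlines()))

Used as a filter (saved mail message in, bare sequence out), something like this collapses the two manual steps Rick objects to into one command.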
roy@phri.nyu.edu (Roy Smith) (03/21/90)
cjv@mace.cc.purdue.edu (Rick Westerman) writes:
> I'd like to respond to Dave Kristofferson's recent posting on why he
> thinks local copies of the database should be discouraged.

Both Dave and Rick make good points. This is basically the centralized vs. distributed computing argument all over again. The latest incarnation of this argument is raging right now in (I think) comp.arch, and involves X terminals vs. workstations, but the gist is the same. For some people one solution will be the right one, but not for everybody. For somebody with something like a PC/AT class machine with a 40 Meg disk, I don't think there is any doubt that accessing a central fasta server is the only rational way to go. I am almost totally ignorant of the ways of PCs, but I assume that there is some sort of software one can run on a PC to send and receive mail.

I suspect, however, that if everybody who currently maintains their own GenBank database were to suddenly switch to banging on the bionet Solbourne for fasta searches, that machine would quickly roll over and die. That may not be a fair argument, however, since presumably the switch would be gradual and you can always buy more Solbournes to keep up with the load. But then the question becomes: if we (in the global sense) are going to buy 10 more Solbournes for the central fasta site, why not buy me and 9 other research institutes Solbournes instead and let us use them to number crunch when we're not running fasta? Of course, while that might make sense, it's not the case that while I'm not doing fasta searches I can use the 100+ Meg of disk GenBank takes up for something else -- at least not without a lot of fuss shuffling things to tape, or what have you. One could, I guess, just NFS-mount the genbank partition directly from the bionet server, but that's probably not a good use of network bandwidth (although, with plans afoot to make the whole NSFNET 45 Mbps or even 1 Gbps, we're going to have to find something to do with all that bandwidth!). On the other hand, it might make sense for a half dozen sites within a single university-wide 10 Mbps LAN to share one copy on disk.

> other programs that need to access the entire database besides fasta

This one's the kicker. While I think Dave is right that the vast majority of what people want to do with GenBank is run fasta, there are enough other uses to make me want to keep my own copy. For example, we have a program given to us years ago by Jim Fickett which parses the GenBank features table and generates a protein database by translating the annotated reading frames. People can then search that derived database. Yes, a lot of our derived database overlaps with Dayhoff/PIR, but there is a lot of stuff which doesn't make it into PIR. You could run tfasta, but people around here say they prefer using fasta on the derived database, claiming that it finds things tfasta doesn't. In practice, if people are serious about doing a protein search, they use all three methods and merge the results.

One thing that strikes me about GenBank is that it's about an order of magnitude bigger than it has to be. If all you want to do is run fasta locally, you don't need the annotations. Right there, you cut the size of the files in half. Next, with the database stored as ASCII, each base takes up 8 bits when it really only needs 2. Another factor of 4 savings.
Maybe what people like me should be doing is storing a binary version of just the sequence data to run fasta against, and throwing away the ASCII files? I could then retrieve the full annotated ASCII version of any interesting loci from a central server after a fasta run is finished.

PIR, by the way, is even worse. For some reason that I have never figured out, they put a blank space between every residue in the ASCII version of the database! This makes the files bigger without adding any information.

--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"
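To put a number on Roy's point about 2-bit storage, here is a minimal sketch of the packing he describes: four bases per byte. It handles only unambiguous A/C/G/T -- a real scheme would need an escape for ambiguity codes such as N -- and it is not the storage format of any existing package, just an illustration of the arithmetic.

    # Sketch of 2-bit nucleotide packing: four bases per byte.
    # Only unambiguous A/C/G/T are handled; ambiguity codes (N, R, Y, ...)
    # would need a side table or an escape mechanism in a real scheme.
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq):
        packed = bytearray()
        byte = 0
        for i, base in enumerate(seq.upper()):
            byte = (byte << 2) | CODE[base]
            if i % 4 == 3:                 # every fourth base, flush a byte
                packed.append(byte)
                byte = 0
        if len(seq) % 4:                   # pad out a final partial byte
            byte <<= 2 * (4 - len(seq) % 4)
            packed.append(byte)
        return bytes(packed)

    # Eight ASCII bases (8 bytes) pack into 2 bytes: the factor-of-4 savings.
    assert len(pack("gatcgatc")) == 2

Combined with dropping the annotations, that is roughly the order-of-magnitude reduction Roy estimates for a search-only local copy.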