clark@mshri.utoronto.ca (03/21/90)
Since there has been some discussion recently on the relative merits of keeping copies of the DNA databases locally compared to accessing a central server, I thought I would put in my $0.02 worth.

The group I support is probably typical of molecular biologists, who want access to the data primarily for two reasons: to run fasta-type searches against their new clones to identify similar sequences that have been previously reported, and to avoid typing in published sequences when they want to identify certain features they contain, such as restriction sites (when using them in constructs), or to compare various motifs in members of a gene family. For the former, I encourage them to use the excellent fasta service that GenBank provides via email, and for the latter they use the local database, which is updated quarterly. I don't anticipate that it will be necessary to have the local database updated daily for these functions, since we can always get the few sequences that we might need from GenBank if they aren't in our latest quarterly release.

With the services that GenBank provides at the moment, it is essential that we maintain a local copy, for several reasons. As has been previously mentioned, it is much more convenient and faster to retrieve a local sequence than to get it mailed from GenBank, and we do some types of database searching that GenBank doesn't provide (the GCG Find program). Furthermore, if need be, I can write a program to do any kind of database analysis I want. An obvious example is to locate all the splice sites or TATA boxes, as identified in the SITES table, and generate a consensus sequence for certain species or classes of genes. I think it will be a long time before GenBank can provide us with software tools via email that are flexible enough to anticipate everyone's needs.

Another very important consideration is the long-term stability of the GenBank email services provided by a government contractor.
Considering the recent demise of Bionet, and the switch of the GenBank contract from BBN to IntelliGenetics, who can say what will be available 5 years from now? With our local database, all funding for network distribution could completely stop and it wouldn't affect us that much.

There are basically two arguments against maintaining a local database, namely the cost of disk storage and the expense of distribution. Personally I feel that the first is a red herring because magnetic disk storage costs are at the moment pretty reasonable and decreasing rapidly, and soon all sorts of optical technology will be available which will both reduce the cost and increase the capacity, probably by at least an order of magnitude. No one can accuse the computer hardware people of sleeping while the students and post-docs are in the lab sequencing!

My point is that we need to have access to both locally maintained databases and an up-to-date central one.

BTW, I wish people would stop talking about how many MIPS their machines can pull (for the biologists in the audience, MIPS stands for Meaningless Idiotic Parameters for Suckers). Maybe we need a new term that will help us compare the speeds of various machines. MFLOPS isn't really appropriate since most comparisons don't use a lot of floating point math. How about "Millions of nucleotides compared (with fasta) per second per 1000-base-long query sequence"?

Stephen Clark
clark@mshri.utoronto.ca (Internet)
sinai@utoroci (Netnorth/Bitnet)

"We should be quite remiss not to emphasize that despite the popularity of secondary structural prediction schemes, and the almost ritual performance of these calculations, the information available from this is of limited reliability. This is true even of the best methods now known, and much more so of the less successful methods commonly available in sequence analysis packages.
Running a secondary structure prediction on a newly-determined sequence just because everyone else does so, is to be deplored, and the fact that the results of such predictions are generally ignored is insufficient justification for doing and publishing them." - Arthur Lesk, 1988