[bionet.molbio.genome-program] Central vs. Local databases

clark@mshri.utoronto.ca (03/21/90)

   Since there has been some discussion recently on the relative merits of 
keeping copies of the DNA databases locally compared to accessing a central 
server, I thought I would put in my $0.02 worth.

   The group I support is probably typical: molecular biologists who want 
to have access to the data primarily for two reasons. The first is to run 
fasta-type searches against their new clones to identify similar sequences 
that have been previously reported. The second is to avoid typing in 
published sequences when they want to identify certain features they 
contain, such as restriction sites (when using the sequences in constructs), 
or to compare various motifs in members of a gene family. For the former, I 
encourage them to use the excellent fasta 
service that GenBank provides via email, and for the latter they use the 
local database, which is updated quarterly. I don't anticipate that it will 
be necessary to have the local database updated daily for these functions 
since we can always get the few sequences that we might need from GenBank 
if they aren't in our latest quarterly release. 
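
   As an illustration of the second use, here is a minimal Python sketch 
of scanning a locally stored sequence for restriction sites. The function 
name find_sites and the small enzyme table are my own choices for the 
example, not part of any GenBank or GCG tool. The three recognition 
sequences are the standard ones for those enzymes, and since all three are 
palindromic, scanning one strand is enough.

```python
import re

# Recognition sequences for three common enzymes. All three are
# palindromic, so a hit on one strand is also a hit on the other.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq, enzymes=ENZYMES):
    """Return {enzyme: [0-based start positions]} for each site in seq."""
    seq = seq.upper()
    hits = {}
    for enzyme, pattern in enzymes.items():
        # The lookahead makes overlapping occurrences show up too.
        lookahead = "(?=" + pattern + ")"
        hits[enzyme] = [m.start() for m in re.finditer(lookahead, seq)]
    return hits


if __name__ == "__main__":
    # A made-up test string, not a real database entry.
    print(find_sites("ccGAATTCaaGGATCCttGAATTC"))
```

The point is only that a scan like this takes seconds when the sequence is 
already on a local disk.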

   With the services that GenBank provides at the moment, it is essential
that we maintain a local copy, for several reasons. As has been previously
mentioned, it is much more convenient and faster to retrieve a local
sequence than to get it mailed from GenBank, and we do some types of
database searching that GenBank doesn't provide us with (the GCG Find
program). Furthermore, if need be, I can write a program to do any kind of
database analysis I want. An obvious example is to locate all the splice
sites or TATA boxes, as identified in the SITES table, and generate a
consensus sequence for certain species or classes of genes. I think it will
be a long time before GenBank can provide us with software tools via email
that are so flexible that they can anticipate everyone's needs. Another very
important consideration is the long-term stability of the GenBank email
services provided by a government contractor. Considering the recent demise
of Bionet, and the switch of the GenBank contract from BBN to
IntelliGenetics, who can say what will be available 5 years from now? With
our local database, all funding for network distribution could completely
stop and it wouldn't affect us that much. 
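
   To make the consensus example concrete, here is a rough Python sketch. 
It assumes the sites have already been pulled out of the local database as 
equal-length windows of sequence around each annotated position; how to 
read the SITES table itself is left out, since that depends on the release 
format. The function name consensus and the min_fraction threshold are 
hypothetical choices for the illustration.

```python
from collections import Counter


def consensus(windows, min_fraction=0.5):
    """Per-column consensus of equal-length sequence windows.

    A column gets its most common base if that base occurs in at least
    min_fraction of the windows; otherwise the column is reported as 'N'.
    """
    if not windows:
        raise ValueError("no windows given")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")

    result = []
    for column in zip(*windows):
        counts = Counter(base.upper() for base in column)
        base, count = counts.most_common(1)[0]
        result.append(base if count / len(windows) >= min_fraction else "N")
    return "".join(result)


if __name__ == "__main__":
    # Toy windows invented for the example, not real annotated sites.
    toy = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
    print(consensus(toy))
    print(consensus(toy, min_fraction=0.8))
```

Grouping the windows by species or gene class before calling the function 
would give the per-group consensus described above.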

   There are basically two arguments against maintaining a local database, 
namely, the cost of disk storage and the cost of distribution. 
Personally I feel that the first is a red herring because magnetic disk 
storage costs are at the moment pretty reasonable and decreasing rapidly, 
and soon all sorts of optical technology will be available which will both
reduce the cost and increase the capacity, probably by at least an order of
magnitude. No one can accuse the computer hardware people of sleeping
while the students and post-docs are in the lab sequencing! 

   My point is that we need to have access to both locally maintained 
databases and an up-to-date central one.

   BTW, I wish people would stop talking about how many MIPS their machines 
can pull (for the biologists in the audience, MIPS stands for Meaningless 
Idiotic Parameters for Suckers). Maybe we need a new term that will help us 
compare the speeds of various machines. MFLOPS isn't really appropriate 
since most comparisons don't use a lot of floating point math. How about 
"Millions of nucleotides compared (with fasta) per second per 1000
base-long query sequence"? 

Stephen Clark

clark@mshri.utoronto.ca  (Internet)
sinai@utoroci            (Netnorth/Bitnet)

"We should be quite remiss not to emphasize that despite the popularity of 
secondary structural prediction schemes, and the almost ritual performance 
of these calculations, the information available from this is of limited 
reliability. This is true even of the best methods now known, and much more 
so of the less successful methods commonly available in sequence analysis 
packages. Running a secondary structure prediction on a newly-determined 
sequence just because everyone else does so, is to be deplored, and the 
fact that the results of such predictions are generally ignored is 
insufficient justification for doing and publishing them."
   - Arthur Lesk, 1988