[bionet.molbio.genome-program] GenBank, FASTA, life and the universe.

MJB1@biology.cambridge.ac.uk (03/22/90)

                  DISTRIBUTION OF GENOME DATA

I would like to make two points in relation to the current debate on the
distribution of nucleic acid sequence data.

(1) I think it is essential to set up a number of centres which accept
experimentally determined nucleic acid sequence data from a given catchment
area and redistribute it to other similar centres.  This happens to an extent
at present with EMBL, GenBank and DDBJ. These centres should then feed other
sites where the data are used for scientific studies. Again this is already
going on.  We are probably in the process of working towards a situation
where the whole of this infrastructure is in place worldwide and is
implemented in a uniform manner. It will be a multi-vendor environment
running the successor of Unix (which may be called Unix?) and the successor
of TCP/IP (which may be ISO/OSI). If there are format differences between
various parts of the world (e.g. EMBL versus GenBank) there should be filters
(like network gateways) which can convert from one to the other without
people having to intervene. (I can't help thinking a single format might be
nicer, but that goes against the human desire for a rich diversity in life,
cf. Babel).  People will want data locally (or at least subsets of it) so
they can play with it as they wish. Progress in computer technology will
always beat progress in nucleic acid sequencing unless a dramatically new
technique is discovered for reading the genome. I expect my machine to have
512 processors of at least 100 MIPS each, with vast amounts of RAM and
1000 GigaBytes of storage connected to a sequence data distribution
centre at 200 Mbits/second. (This is probably an underestimate as it's more
or less feasible today).  I firmly believe that computing should be distributed
and that the network is crucial (I was going to say the computer, but someone
already thought of that).
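
As a rough illustration of the "filter" idea in point (1), here is a minimal
sketch in Python. It is not a real converter: the function name
embl_to_genbank and the three-entry mapping table are hypothetical, and only
a handful of line types are handled. It relies on two facts about the flat
file layouts: EMBL marks header lines with two-letter codes (ID, AC, DE, SQ)
where GenBank uses keywords (LOCUS, ACCESSION, DEFINITION, ORIGIN), and EMBL
puts a running count at the end of each sequence line where GenBank puts the
start position at the front.

```python
# Hypothetical sketch: a greatly simplified EMBL-to-GenBank-style filter.
# Only a few header line types and the sequence block are handled; the
# field contents are copied across unchanged, not reformatted.

EMBL_TO_GENBANK = {
    "ID": "LOCUS",
    "AC": "ACCESSION",
    "DE": "DEFINITION",
}


def embl_to_genbank(text):
    """Convert one simplified EMBL-style record to GenBank-style lines."""
    out = []
    in_sequence = False
    position = 1  # 1-based index of the next base to be written
    for line in text.splitlines():
        code, body = line[:2], line[2:].strip()
        if code == "//":
            # Both formats end a record with a line containing "//".
            out.append("//")
            in_sequence = False
        elif code == "SQ":
            out.append("ORIGIN")
            in_sequence = True
        elif in_sequence:
            # Keep only the letters, dropping spaces and the trailing count.
            bases = "".join(ch for ch in line if ch.isalpha())
            blocks = [bases[i:i + 10] for i in range(0, len(bases), 10)]
            out.append(f"{position:>9} " + " ".join(blocks))
            position += len(bases)
        elif code in EMBL_TO_GENBANK:
            out.append(f"{EMBL_TO_GENBANK[code]:<12}{body}")
        # Every other line type is silently dropped in this sketch.
    return "\n".join(out)
```

A production filter would also have to translate feature tables, references
and the differing conventions inside each field, which is where most of the
real work lies.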

(2) If people seriously believe that in five years' time they will be running
FASTA on all determined sequences every time they determine a new bit of their
own, I am amazed, astonished, speechless...
It is not very sensible even today, but as it only takes 5 minutes to run 1 kb
against the whole of GenBank (at least on my Sun 4/390 rated at 17 MIPS),
it is not cost effective to spend months devising a better method.
[Yes, I am not ashamed of my mippage and it's not so meaningless as sometimes
claimed. Some machines are faster than others. The figure for a VAX 8350
is 45 minutes not 5 minutes so something is different!]
This is because FASTA is such a good program. In the long term it is not the
way to identify sequences, however. The way to do it is not totally obvious
and this is a subject for research in connection with genome projects.
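
The timing figures in point (2) can be put side by side with a little
arithmetic. The sketch below is only illustrative and rests on two labelled
assumptions that the post does not state: that run time is inversely
proportional to the MIPS rating, and that search time grows linearly with
database size. The function names are hypothetical.

```python
# Figures quoted in the post.
SUN_MINUTES = 5    # 1 kb query against GenBank on a Sun 4/390
VAX_MINUTES = 45   # the same search on a VAX 8350
SUN_MIPS = 17      # quoted rating of the Sun 4/390


def speed_ratio(slow_minutes, fast_minutes):
    """How many times faster the quicker machine completes the search."""
    return slow_minutes / fast_minutes


def implied_mips(reference_mips, reference_minutes, other_minutes):
    """Rating implied for the other machine.

    Assumption: run time is inversely proportional to the MIPS rating.
    """
    return reference_mips * reference_minutes / other_minutes


def projected_minutes(current_minutes, db_growth, speedup):
    """Search time after the database and the machine have both grown.

    Assumption: search time is linear in database size.
    """
    return current_minutes * db_growth / speedup


print(speed_ratio(VAX_MINUTES, SUN_MINUTES))
print(implied_mips(SUN_MIPS, SUN_MINUTES, VAX_MINUTES))
```

Under these assumptions the Sun is 9 times faster, and a search stays at its
current length only as long as machine speed grows by the same factor as the
database.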

Martin Bishop
MRC Molecular Genetics Unit
Cambridge, UK.