MJB1@biology.cambridge.ac.uk (03/22/90)
DISTRIBUTION OF GENOME DATA

I would like to make two points in relation to the current debate on the distribution of nucleic acid sequence data.

(1) I think it is essential to set up a number of centres, each of which accepts experimentally determined nucleic acid sequence data from a given catchment area and redistributes it to other similar centres. This happens to an extent at present with EMBL, GenBank and DDBJ. These centres should then feed other sites where the data are used for scientific studies. Again, this is already going on. We are probably working towards a situation where the whole of this infrastructure is in place worldwide and is implemented in a uniform manner. It will be a multi-vendor environment running the successor of Unix (which may be called Unix?) and the successor of TCP/IP (which may be ISO/OSI).

If there are format differences between various parts of the world (e.g. EMBL versus GenBank) there should be filters (like network gateways) which can convert from one to the other without people having to intervene. (I can't help thinking a single format might be nicer, but that goes against the human desire for a rich diversity in life, cf. Babel.)

People will want data locally (or at least subsets of it) so they can play with it as they wish. Progress in computer technology will always beat progress in nucleic acid sequencing unless a dramatically new technique is discovered for reading the genome. I expect my machine to have 512 processors of at least 100 MIPS each, with vast amounts of RAM and 1000 gigabytes of storage, connected to a sequence data distribution centre at 200 Mbits/second. (This is probably an underestimate, as it's more or less feasible today.) I firmly believe that computing should be distributed and that the network is crucial (I was going to say the computer, but someone already thought of that).
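The format filters suggested under point (1) might look roughly like the sketch below. It is only an illustration of the gateway idea, not a real converter. The function name embl_to_genbank and the mapping table are hypothetical, only a handful of line types are handled, and header text is copied across unchanged (a real LOCUS line, for instance, has its own field layout which is not attempted here).

```python
import re

# Hypothetical and deliberately incomplete mapping from EMBL two-letter
# line codes to GenBank-style keywords. Real entries have many more types.
EMBL_TO_GENBANK = {
    "ID": "LOCUS",
    "AC": "ACCESSION",
    "DE": "DEFINITION",
    "KW": "KEYWORDS",
    "OS": "SOURCE",
}


def embl_to_genbank(lines):
    """Yield GenBank-style lines for a stream of EMBL-style lines."""
    in_sequence = False
    position = 1  # 1-based index of the next base to be written
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # End of entry: reset state so the next entry starts cleanly.
            in_sequence = False
            position = 1
            yield "//"
        elif in_sequence:
            # Keep letters only: drops spacing and the trailing base count.
            bases = re.sub(r"[^A-Za-z]", "", line).lower()
            for start in range(0, len(bases), 60):
                chunk = bases[start:start + 60]
                groups = " ".join(
                    chunk[i:i + 10] for i in range(0, len(chunk), 10)
                )
                yield f"{position:>9} {groups}"
                position += len(chunk)
        elif line.startswith("SQ"):
            in_sequence = True
            yield "ORIGIN"
        else:
            code, text = line[:2], line[2:].strip()
            keyword = EMBL_TO_GENBANK.get(code)
            if keyword is not None:
                yield f"{keyword:<12}{text}"
            # Unmapped line types (e.g. XX spacers) are silently dropped.
```

Because the filter reads lines and yields lines, it could sit between two centres as a stream, e.g. `sys.stdout.writelines(l + "\n" for l in embl_to_genbank(sys.stdin))`, with nobody having to intervene by hand.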

(2) If people seriously believe that in five years' time they will be running FASTA on all determined sequences every time they determine a new bit of their own, I am amazed, astonished, speechless... It is not very sensible even today, but as it only takes 5 minutes to run 1 kb against the whole of GenBank (at least on my Sun 4/390 rated at 17 MIPS) it is not cost-effective to spend months devising a better method. [Yes, I am not ashamed of my mippage, and it's not so meaningless as sometimes claimed. Some machines are faster than others. The figure for a VAX 8350 is 45 minutes, not 5 minutes, so something is different!] This is because FASTA is such a good program. In the long term, however, it is not the way to identify sequences. The way to do it is not totally obvious, and this is a subject for research in connection with genome projects.

Martin Bishop
MRC Molecular Genetics Unit
Cambridge, UK