daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Subject: Computers and the Sequencing of Large Genomes

Currently the proposal to obtain the complete and ordered nucleotide sequence of the human genome comes immediately to mind when this topic is raised. Leaving aside the question of whether it is sensible to set such a goal as a specific and highly directed project, the need for automated data acquisition, processing and analysis is obvious.

The least difficult of the problems in dealing with the data that would be acquired are their storage and straightforward single-sequence comparison searches against the database. 3,000 million bases, at two bits per character (ignoring ambiguity coding), come to 750 Mbytes; dealing with peptide sequences is slightly more complex. Allowing for storage overheads, two current CD-ROMs would easily contain the lot. Add to that the additional information one would want the database to contain, and probably some half dozen 120mm CD-ROMs would be sufficient. With more advanced optical technology and 300-400mm discs, a single disc would be more than adequate. The slowness of data transfer from optical media would, even now, be no particular obstacle: it could be overcome using even sub-Gbyte magnetic discs and high-speed cache memory as intermediate stages. At this stage not even particularly large sums would be involved.

-------
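As a quick check of the arithmetic above, a minimal sketch in Python (my illustration; the 650 Mbyte per-disc capacity is an assumption, as the post does not state one):

    # Back-of-envelope check of the storage estimate.
    GENOME_BASES = 3_000_000_000   # ~3,000 million bases in the human genome
    BITS_PER_BASE = 2              # A, C, G, T -> 2 bits, ignoring ambiguity codes
    CDROM_MBYTES = 650             # assumed capacity of one 120mm CD-ROM

    raw_mbytes = GENOME_BASES * BITS_PER_BASE / 8 / 1_000_000
    print(f"raw sequence: {raw_mbytes:.0f} Mbytes")          # -> 750 Mbytes
    print(f"discs needed: {raw_mbytes / CDROM_MBYTES:.2f}")  # -> ~1.15, i.e. two discs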
daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Subject: Computers and the Sequencing of Large Genomes

The much more difficult problems that require solving in dealing with the 'large genome problem' are concerned with:

1) speed and reliability of data acquisition,
2) rapid and efficient ordering of the acquired raw sequence data (see the sketch after this list),
3) throughput of the data into the common database as to speed, reliability and expense,
4) proper annotation of the data,
5) digestible and useful representation of the data.

-------
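To make point 2 concrete, here is a minimal sketch (my illustration, not anything proposed in the post) of the simplest possible ordering strategy: greedily merge the pair of fragments with the longest exact suffix-prefix overlap. Real raw data bring sequencing errors, repeats and unknown strand orientation, which is exactly what makes the problem hard.

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that equals a prefix of b."""
        for n in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def greedy_assemble(fragments):
        """Repeatedly merge the fragment pair with the largest exact overlap."""
        frags = list(fragments)
        while len(frags) > 1:
            best_n, best_i, best_j = 0, 0, 1
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i != j and overlap(a, b) > best_n:
                        best_n, best_i, best_j = overlap(a, b), i, j
            if best_n == 0:
                return ''.join(frags)  # no overlaps left: give up and concatenate
            merged = frags[best_i] + frags[best_j][best_n:]
            frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
            frags.append(merged)
        return frags[0]

    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(reads))  # -> ATTAGACCTGCCGGAATAC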
daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Forwarded from: A.F.W.Coulson@EDINBURGH.AC.UK
Date: 04 Dec 87 20:28:06 GMT
Subject: CSLG Discussion or Conference

Searching large databases for sequence similarities.

Will the growth in sequence data overwhelm our ability to deal with it computationally? One can always write a science fiction scenario in which it does (suppose those Japanese robot factories get up to 10^7 bp/day? You can manage that? Then how about if they get up to 10^8? ...) but I really think it is counterproductive to do so. At present the sequence databases are still quite modest in size in computing terms, and they are providing lots of us with a rich new field for research. Even with the machines we have now, there wouldn't really be any problem until the databases are ten times their present size.

As long as one can formulate what one wants to do in reasonably concrete terms, I'm pretty sure that both computer science and the granting agencies will have no great difficulty in continuing to provide us with what we need in software tools and hardware respectively. It's primarily an engineering problem (though there is also useful and interesting research to be done), and all it will take to solve is money. Nothing I've seen so far suggests that the total will be more than a substantial fraction of what the data acquisition will cost.

-------
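To put the scale of the search problem in concrete terms, a minimal sketch of an alignment-free scan in the spirit of the k-tuple prefilter used by programs of the period such as FASTP (the function name, the database contents and the choice of k are illustrative assumptions, not anything from the post). The work grows linearly with total database size, so a tenfold larger database means roughly a tenfold longer scan:

    def kmer_scan(query, database, k=8):
        """Rank database entries by the number of query k-mers they contain.

        A crude stand-in for a similarity search: cost is proportional
        to the total length of the database.
        """
        query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
        best_name, best_score = None, -1
        for name, seq in database.items():
            shared = sum(1 for i in range(len(seq) - k + 1)
                         if seq[i:i + k] in query_kmers)
            if shared > best_score:
                best_name, best_score = name, shared
        return best_name, best_score

    db = {
        "entryA": "ATGGCGTACGTTAGCAGGCTTACGATCG",
        "entryB": "TTTTCCCCGGGGAAAATTTTCCCCGGGG",
    }
    print(kmer_scan("GCGTACGTTAGCAGG", db))  # -> ('entryA', 8)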