daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Subject: Computers and the Sequencing of Large Genomes

Currently the proposal to obtain the complete and ordered nucleotide sequence of the human genome comes immediately to mind when this topic is raised. Leaving aside the question of whether it is sensible to set such a goal as a specific and highly directed project, the need for automated data acquisition, processing and analysis is obvious.

The least difficult of the problems in dealing with the data that would be acquired are their storage and straightforward single-sequence comparison searches against the database. 3,000 million bases, at two bits per character (ignoring ambiguity coding), come to 750 Mbytes; dealing with peptide sequences is slightly more complex. Allowing for storage overheads, two current CD-ROMs would easily contain the lot. Add to that the additional information one would want the database to contain, and probably some half dozen 120mm CD-ROMs would be sufficient. With more advanced optical technology and 300-400mm discs, a single disc would be more than adequate. The slowness of data transfer from optical media would, even now, be no particular obstacle: it could be overcome using even sub-Gbyte magnetic discs and high-speed cache memory as intermediate stages. At this stage not even particularly large sums would be involved.

-------
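As a quick check of the arithmetic above, a minimal sketch in Python (my illustration; the 650 Mbyte per-disc capacity is an assumption, as the post does not state one):

    # Back-of-envelope check of the storage estimate.
    GENOME_BASES = 3_000_000_000   # ~3,000 million bases in the human genome
    BITS_PER_BASE = 2              # A, C, G, T -> 2 bits, ignoring ambiguity codes
    CDROM_MBYTES = 650             # assumed capacity of one 120mm CD-ROM

    raw_mbytes = GENOME_BASES * BITS_PER_BASE / 8 / 1_000_000
    print(f"raw sequence: {raw_mbytes:.0f} Mbytes")          # -> 750 Mbytes
    print(f"discs needed: {raw_mbytes / CDROM_MBYTES:.2f}")  # -> ~1.15, i.e. two discs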
daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Subject: Computers and the Sequencing of Large Genomes

The much more difficult problems that require solving in dealing with the 'large genome problem' are concerned with:

1) speed and reliability of data acquisition,
2) rapid and efficient ordering of the acquired raw sequence data (see the sketch after this list),
3) throughput of the data into the common database as to speed, reliability and expense,
4) proper annotation of the data,
5) digestible and useful representation of the data.

-------
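To make point 2 concrete, here is a minimal sketch (my illustration, not anything proposed in the post) of the simplest possible ordering strategy: greedily merge the pair of fragments with the longest exact suffix-prefix overlap. Real raw data bring sequencing errors, repeats and unknown strand orientation, which is exactly what makes the problem hard.

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that equals a prefix of b."""
        for n in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def greedy_assemble(fragments):
        """Repeatedly merge the fragment pair with the largest exact overlap."""
        frags = list(fragments)
        while len(frags) > 1:
            best_n, best_i, best_j = 0, 0, 1
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i != j and overlap(a, b) > best_n:
                        best_n, best_i, best_j = overlap(a, b), i, j
            if best_n == 0:
                return ''.join(frags)  # no overlaps left: give up and concatenate
            merged = frags[best_i] + frags[best_j][best_n:]
            frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
            frags.append(merged)
        return frags[0]

    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(reads))  # -> ATTAGACCTGCCGGAATAC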
daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Forwarded from: A.F.W.Coulson@EDINBURGH.AC.UK
Date: 04 Dec 87 20:28:06 GMT
Subject: CSLG Discussion or Conference

Searching large databases for sequence similarities.

Will the growth in sequence data overwhelm our ability to deal with it computationally? One can always write a science fiction scenario in which it does (suppose those Japanese robot factories get up to 10^7 bp/day? You can manage that? Then how about if they get up to 10^8? ...) but I really think it is counterproductive to do so. At present the sequence databases are still quite modest in size in computing terms, and they are providing lots of us with a rich new field for research. Even with the machines we have now, there wouldn't really be any problem until the databases are ten times their present size.

As long as one can formulate what one wants to do in reasonably concrete terms, I'm pretty sure that both computer science and the granting agencies will have no great difficulty in continuing to provide us with what we need in software tools and hardware respectively. It's primarily an engineering problem (though there is also useful and interesting research to be done), and all it will take to solve is money. Nothing I've seen so far suggests that the total will be more than a substantial fraction of what the data acquisition will cost.

-------
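To put the scale of the search problem in concrete terms, a minimal sketch of an alignment-free scan in the spirit of the k-tuple prefilter used by programs of the period such as FASTP (the function name, the database contents and the choice of k are illustrative assumptions, not anything from the post). The work grows linearly with total database size, so a tenfold larger database means roughly a tenfold longer scan:

    def kmer_scan(query, database, k=8):
        """Rank database entries by the number of query k-mers they contain.

        A crude stand-in for a similarity search: cost is proportional
        to the total length of the database.
        """
        query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
        best_name, best_score = None, -1
        for name, seq in database.items():
            shared = sum(1 for i in range(len(seq) - k + 1)
                         if seq[i:i + k] in query_kmers)
            if shared > best_score:
                best_name, best_score = name, shared
        return best_name, best_score

    db = {
        "entryA": "ATGGCGTACGTTAGCAGGCTTACGATCG",
        "entryB": "TTTTCCCCGGGGAAAATTTTCCCCGGGG",
    }
    print(kmer_scan("GCGTACGTTAGCAGG", db))  # -> ('entryA', 8)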