daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA> 4-Dec-87 12:33:43-PST,9076;000000000001 Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK> Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri, 04 Dec 87 20:27:53 GMT Via: UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT Received: Via: 000015001006.FTP.MAIL; 4 DEC 87 20:27:44 GMT Date: 04 Dec 87 20:28:06 gmt From: A.F.W.Coulson@EDINBURGH.AC.UK Subject: CSLG Discussion or Conference To: MAULIK%arpa.bionet-20%RL.earn Message-ID: <04 Dec 87 20:28:06 gmt 100798@EMAS-A> Searching large databases for sequence similarities. Ellis Golub asks whether good searching methods find anything of biological interest. You bet they do! Our best case so far is still in press with PNAS (Bownes, M., Shirras,A., Blair,M., Collins,J. and Coulson,A. "Yolk protein degradation times release of ecdysteroid in insect embryogenesis."), but two that have already appeared are in Nature 1987,326, 614-617 and 328,766. We have not applied this method very much so far to DNA databases; the main reaon for this is lack of resources (= personnel), but there are also scientific reasons. For any cistron, it is always better to perform searches and comparisons at the protein level (see Collins J.F. and Coulson,A.F.W. "Molecular sequence comparison and alignment" in Bishop,M.J. and Rawlings,C.J. "Nucleic Acid and Protein Sequence Analysis a Practical Approach", IRL, Oxford, 1987, pp 323-358, if you need convincing); and in any case there is really no scope for sophisticated matching scores in nucleic acid comparisons. Nor do we have much general information about the accepatibility of gaps in non-coding nucleic acid sequences. My feeling is that pattern detection methods are really what is appropriate for nucleic acid database searching, rather than the NWS algorithms. -------
daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA> 4-Dec-87 12:33:43-PST,9076;000000000001 Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK> Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri, 04 Dec 87 20:27:53 GMT Via: UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT Received: Via: 000015001006.FTP.MAIL; 4 DEC 87 20:27:44 GMT Date: 04 Dec 87 20:28:06 gmt From: A.F.W.Coulson@EDINBURGH.AC.UK Subject: CSLG Discussion or Conference To: MAULIK%arpa.bionet-20%RL.earn Message-ID: <04 Dec 87 20:28:06 gmt 100798@EMAS-A> May I make some comment on questions raised in the CSLG discussion? Searching large databases for sequence similarities. It is clear from several contributions that it is still not generally appreciated, even in the computer-sophisticated US, how grossly inefficient implementations of the exhaustive searching methods (NWS algorithms) are on machines of conventional architecture -- including conventional supercomputers. For some time we have been routinely running exhaustive searches of the protein sequence database, using the "Best Local Homology" algorithm of Smith and Waterman to compare the query sequence in turn with every known sequence, and returning a list of the best 4000+ similarities in the complete surface of comparison. This program will use any arbitrarily defined scoring scheme for both residue comparisons and indel penalties; and it takes 1-2 secs of cpu time per residue of the query sequence. The algorithm is essentially the same as that used in the second stage of the FASTP program; the key difference is that we can afford to run this exhaustive algorithm on the entire database, while FASTP attempts to filter out a subset of the database by a rapid and "approximate" search method, and runs the full search only on this subset. This program runs on the ICL Distributed Array Processor ("DAP"; DAP's are now being marketed by an ICL spin-off company called AMT), a massively parallel SIMD machine, which is orders of magnitude cheaper than any of the other high performance machines whose names have been invoked in this correspondence so far, because of the simplicity of the processors at each node; each processor in a DAP is a single bit full adder. Numerical calculation can be made to run on the DAP with great efficiency (returning an overall performance on these applications comparable to a Cray-1), but the architecture is obviously best suited to applications (such as the database search) involving a lot of logical and integer operations, and for these applications the machine out-performs anything we've heard of. -------
daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA> 4-Dec-87 12:33:43-PST,9076;000000000001 Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK> Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri, 04 Dec 87 20:27:53 GMT Via: UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT Received: Via: 000015001006.FTP.MAIL; 4 DEC 87 20:27:44 GMT Date: 04 Dec 87 20:28:06 gmt From: A.F.W.Coulson@EDINBURGH.AC.UK Subject: CSLG Discussion or Conference To: MAULIK%arpa.bionet-20%RL.earn Message-ID: <04 Dec 87 20:28:06 gmt 100798@EMAS-A> Searching large databases for sequence similarities. As far as the molecular biologist user is concerned, the important point about this efficiency is that it is practicable to perform exhaustive searches repeatedly and interactively; our experience is that database searching is not best regarded as a one-off process -- much more of biological value is to be found by modifying search conditions (changing scoring schemes, limiting the search to some regions of the query sequence or the database, etc) in the light of the results obtained with the first set of conditions used. The appropriate measure of efficiency in computing terms is the cost of doing the calculation. This summer we were able to make one comparison in a particularly straightforward way, by running a search using FASTP on a VAX, and using our program on a DAP in the Edinburgh Regional Computing Centre. The point was that ERCC sold computing time on both types of machine on essentially the same basis as far as their financial accounting was concerned, so we could turn the cpu times into strictly comparable sums of money. It turned out that the exhaustive search was not only better and faster than FASTP, but actually slightly cheaper. I should say of course that ERCC is not a commercial operation, and that academic users at Edinburgh do not have to find real money for their use of the Centre's facilities; also that the terms on which ERCC acquired the two machines are not comparable: the only firm conclusion I want to argue is that it is quite clear that exhaustive searches on the appropriate architecture are at least comparable in cost to slower, approximate searches on conventional machines of inappropriat architecture. -------
daemon@ig.UUCP (12/05/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA> 4-Dec-87 12:33:43-PST,9076;000000000001 Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK> Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri, 04 Dec 87 20:27:53 GMT Via: UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT Received: Via: 000015001006.FTP.MAIL; 4 DEC 87 20:27:44 GMT Date: 04 Dec 87 20:28:06 gmt From: A.F.W.Coulson@EDINBURGH.AC.UK Subject: CSLG Discussion or Conference To: MAULIK%arpa.bionet-20%RL.earn Message-ID: <04 Dec 87 20:28:06 gmt 100798@EMAS-A> Searching large databases for sequence similarities. Since Alex Reisner has mentioned transputers, I would extend what he say somewhat:- Viewed as a general purpose computing machine, a single T800 has comparable power to a VAX (?750 ?780); the NWS algorithms for a single pairwise comparison will run in a practicable time on a VAX. So the database search problem can be farmed out to a heap of transputers, using one for each pairwise comparison. If you have 1000 transputers (the planned size of the Edinburgh Concurrent Supercomputer; the first 200 are being installe at present), and 6000 sequences in the database, each machine only has to do six jobs. So the problem can be mapped in a simple way onto ECS, and no doubt we shall do some searches this way, because why not? But this application (as I've said already) doesn't use the general purpose computing facilities of the transputer processor, so though it may be an effective solution, it will not be a cost-effective one compared to the DAP. Of course, this may not matter to individuals who do not themselves have to find the true economic cost of the computing they do...... I don't know much about the Connection Machine apart from what I read i Scientific American, but I think a similar point may apply. As I understand it, it will be possible to use a similar mapping to that used in the DAP search program on the Connection Machine, but the only communication this needs is between each node and two neighbours, so that the expensively-provided richness and flexibility of interconnection may not be used very effectively. As I say, I don't know much about it, and if anybody has an idea for mapping the problem in a way which makes better use of the particular strengths of this machine (or of a transputer array), I shall be interested to hear it. -------