[bionet.molbio.news] CSLG|COMMENTARY: From Andrew Coulson

daemon@ig.UUCP (12/05/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

 4-Dec-87 12:33:43-PST,9076;000000000001
Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK>
Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST
Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT
Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri,
 04 Dec 87 20:27:53 GMT
Via:        UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT
Received:
Via:        000015001006.FTP.MAIL;  4 DEC 87 20:27:44 GMT
Date:       04 Dec 87  20:28:06 gmt
From:       A.F.W.Coulson@EDINBURGH.AC.UK
Subject:    CSLG Discussion or Conference
To:         MAULIK%arpa.bionet-20%RL.earn
Message-ID: <04 Dec 87  20:28:06 gmt  100798@EMAS-A>


       Searching large databases for sequence similarities.

         Ellis Golub asks whether good searching methods find anything of
biological interest. You bet they do! Our best case so far is still in press
with PNAS (Bownes, M., Shirras,A., Blair,M., Collins,J. and Coulson,A.
"Yolk protein degradation times release of ecdysteroid in insect
embryogenesis."), but two that have already appeared are in Nature 1987,326,
614-617 and 328,766.
         We have not applied this method very much so far to DNA databases; the
main reaon for this is lack of resources (= personnel), but there are also
scientific reasons.
For any cistron, it is always better to perform searches and comparisons at
the protein level (see Collins J.F. and Coulson,A.F.W. "Molecular sequence
comparison and alignment" in  Bishop,M.J. and Rawlings,C.J. "Nucleic Acid
and Protein Sequence Analysis a Practical Approach", IRL, Oxford, 1987, pp
323-358, if you need convincing); and in any case there is really no scope
for sophisticated matching scores in nucleic acid comparisons.
Nor do we have much general information about the accepatibility
of gaps in non-coding nucleic acid sequences.  My feeling
is that pattern detection methods are really what is appropriate for
nucleic acid database searching, rather than the NWS algorithms.
-------

daemon@ig.UUCP (12/05/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

 4-Dec-87 12:33:43-PST,9076;000000000001
Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK>
Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST
Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT
Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri,
 04 Dec 87 20:27:53 GMT
Via:        UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT
Received:
Via:        000015001006.FTP.MAIL;  4 DEC 87 20:27:44 GMT
Date:       04 Dec 87  20:28:06 gmt
From:       A.F.W.Coulson@EDINBURGH.AC.UK
Subject:    CSLG Discussion or Conference
To:         MAULIK%arpa.bionet-20%RL.earn
Message-ID: <04 Dec 87  20:28:06 gmt  100798@EMAS-A>

May I make some comment on questions raised in the CSLG discussion?

       Searching large databases for sequence similarities.

       It is clear from several contributions that it is still not generally
appreciated, even in the computer-sophisticated US, how grossly inefficient
implementations of the exhaustive searching methods (NWS algorithms) are on
machines of conventional architecture -- including conventional supercomputers.
       For some time we have been routinely running exhaustive searches of the
protein sequence database, using the "Best Local Homology" algorithm of
Smith and Waterman to compare the query sequence in turn with every known
sequence, and returning a list of the best 4000+ similarities in the complete
surface of comparison.   This program will use any arbitrarily defined scoring
scheme for both residue comparisons and indel penalties; and it takes 1-2 secs
of cpu time per residue of the query sequence.  The algorithm is essentially
the same as that used in the second stage of the FASTP program; the key
difference is that we can afford to run this exhaustive algorithm on the
entire database, while FASTP attempts to filter out a subset of the database
by a rapid and "approximate" search method, and runs the full search only on
this subset.
       This program runs on the ICL Distributed Array Processor ("DAP"; DAP's
are now being marketed by an ICL spin-off company called AMT), a massively
parallel SIMD machine, which is orders of magnitude cheaper than any of
the other high performance machines whose names have been invoked in this
correspondence so far, because of the simplicity of the processors at each
node; each processor in a DAP is a single bit full adder.  Numerical calculation
can be made to run on the DAP with great efficiency (returning an overall
performance on these applications comparable to a Cray-1), but the architecture
is obviously best suited to applications (such as the database search) involving
a lot of logical and integer operations, and for these applications
the machine out-performs anything we've heard of.
-------

daemon@ig.UUCP (12/05/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

 4-Dec-87 12:33:43-PST,9076;000000000001
Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK>
Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST
Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT
Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri,
 04 Dec 87 20:27:53 GMT
Via:        UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT
Received:
Via:        000015001006.FTP.MAIL;  4 DEC 87 20:27:44 GMT
Date:       04 Dec 87  20:28:06 gmt
From:       A.F.W.Coulson@EDINBURGH.AC.UK
Subject:    CSLG Discussion or Conference
To:         MAULIK%arpa.bionet-20%RL.earn
Message-ID: <04 Dec 87  20:28:06 gmt  100798@EMAS-A>

       Searching large databases for sequence similarities.

       As far as the molecular biologist user is concerned, the important point
about this efficiency is that it is practicable to perform exhaustive searches
repeatedly and interactively; our experience is that database searching is not
best regarded as a one-off process -- much more of biological value is to be
found by modifying search conditions (changing scoring schemes, limiting the
search to some regions of the query sequence or the database, etc) in the light
of the results obtained with the first set of conditions used.
       The appropriate measure of efficiency in computing terms is the cost of
doing the calculation.  This summer we were able to make one comparison in
a particularly straightforward way, by running a search using FASTP on a VAX,
and using our program on a DAP in the Edinburgh Regional Computing Centre.
The point was that ERCC sold computing time on both types of machine on
essentially the same basis as far as their financial accounting was
concerned, so we could turn the cpu times into strictly comparable sums of
money.  It turned out that the exhaustive search was not only better and
faster than FASTP, but actually slightly cheaper.  I should say of course that
ERCC is not a commercial operation, and that academic users at Edinburgh do
not have to find real money for their use of the Centre's facilities; also that
the terms on which ERCC acquired the two machines are not comparable:
the only firm conclusion I want to argue is that it is quite clear that
exhaustive searches on the appropriate architecture are at least comparable
in cost to slower, approximate searches on conventional machines of inappropriat
architecture.
-------

daemon@ig.UUCP (12/05/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

 4-Dec-87 12:33:43-PST,9076;000000000001
Return-Path: <@WISCVM.WISC.EDU:A.F.W.Coulson@EDINBURGH.AC.UK>
Received: from WISCVM.WISC.EDU by BIONET-20.ARPA with TCP; Fri 4 Dec 87 12:33:28-PST
Received: from UKACRL.BITNET by WISCVM.WISC.EDU ; Fri, 04 Dec 87 14:34:41 CDT
Received: from RL.IB by UKACRL.BITNET (Mailer X1.25) with BSMTP id 1126; Fri,
 04 Dec 87 20:27:53 GMT
Via:        UK.AC.RL.EARN; Fri, 04 Dec 87 20:27:52 GMT
Received:
Via:        000015001006.FTP.MAIL;  4 DEC 87 20:27:44 GMT
Date:       04 Dec 87  20:28:06 gmt
From:       A.F.W.Coulson@EDINBURGH.AC.UK
Subject:    CSLG Discussion or Conference
To:         MAULIK%arpa.bionet-20%RL.earn
Message-ID: <04 Dec 87  20:28:06 gmt  100798@EMAS-A>


       Searching large databases for sequence similarities.

        Since Alex Reisner has mentioned transputers, I would extend what he say
somewhat:-
        Viewed as a general purpose computing machine, a single T800 has
comparable power to a VAX (?750 ?780); the NWS algorithms for a single pairwise
comparison will run in a practicable time on a VAX.
So the database search problem can be farmed out to a heap of transputers, using
one for each pairwise comparison. If you have 1000 transputers (the planned
size of the Edinburgh Concurrent Supercomputer; the first 200 are being installe
at present), and 6000 sequences in the database, each machine only has to do
six jobs.  So the problem can be mapped in a simple way onto ECS, and no
doubt we shall do some searches this way, because why not?  But this application
(as I've said already) doesn't use the general purpose computing facilities of
the transputer processor,  so though it may be an effective solution, it will
not be a cost-effective one compared to the DAP.  Of course, this may not
matter to individuals who do not themselves have to find the true economic
cost of the computing they do......
         I don't know much about the Connection Machine apart from what I read i
Scientific American, but I think a similar point may apply.  As I understand
it, it will be possible to use a similar mapping to that used in the DAP search
program on the Connection Machine, but the only communication this needs is
between each node and two neighbours, so that the expensively-provided richness
and flexibility of interconnection may not be used very effectively.  As I say,
I don't know much about it, and if anybody has an idea for mapping the problem
in a way which makes better use of the particular strengths of this machine
(or of a transputer array), I shall be interested to hear it.
-------