[sci.research] Fast restriction analysis program available

roy@phri.UUCP (Roy Smith) (05/27/89)

	A program to perform restriction site analysis on dna sequences,
called renzyme, is now being made available.  The algorithm was described in
the paper cited below.  In a nutshell, it's about 20 times faster than other
common programs.  If you ever want to find restriction sites in sequences of
non-trivial length, you want this program.

	The source code is copyrighted, but is being made available at no
charge to non-profit, academic institutions for research purposes.  You can
get it by anonymous ftp from goober.phri.nyu.edu (128.122.136.10) in the
directory ~ftp/pub/seq.  You want to get the following files:

-rw-r--r--  1 roy          1146 May 22 17:52 Acknowledgements
-rw-r--r--  1 roy          1349 May 25 13:55 Copyright
-rw-r--r--  1 roy          5594 May 26 20:58 Setup
-rw-r--r--  1 roy        204645 May 26 20:53 renzyme.tar.Z
-rw-r--r--  1 roy         29321 May 22 17:52 seqlib.tar.Z

	The first 3 files are pretty obvious.  The two compressed tar files
are the sources for the renzyme program and some related utilities, and a
library of routines which you need to compile the other programs.  The source
trees are set up to maintain multiple binaries from a single set of sources
so the Makefiles are a tad complex; the Setup file should explain everything.
The code is all in C and is known to compile and run under SunOS-3.5 and
MtXinu 4.3BSD/NFS.  I don't expect any major problems porting to other
systems.  If you ftp this stuff, drop me a note so I know who has it.  For
details of the algorithm, read:

%A R. Smith
%T A finite state machine algorithm for finding restriction sites
and other pattern matching applications
%J CABIOS
%V 4
%N 4
%D 1988
%X Existing algorithms for finding restriction endonuclease recognition sites
use brute-force algorithms which run in time O(NM) where N is the number of
nucleotides in the sequence under analysis and M is the total number of
nucleotides in all the different sites being searched for.  This paper
presents a deterministic finite state machine algorithm which runs in time
O(N).  Memory use can be as high as O(M^4) but a slight modification to the
basic algorithm can impose a theoretical upper bound of O(M) at the cost of
some added complexity in the execution of the state machine.  The algorithm
can operate with a single pass through the sequence under analysis, with no
need to ever back up or (for non-circular sequences) store more than a single
input character at a time.  This type of algorithm can be adapted to many
pattern matching tasks and is simple enough to implement in hardware that it
could, for example, be built into a disk controller as part of a specialized
database machine.
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu
"The connector is the network"

roy@phri.UUCP (Roy Smith) (05/28/89)

	If you ftp'd my "renzyme" program as announced in <3782@phri.UUCP>
and tried to do a "make depend" you won't have gotten very far.  I forgot
to include the file "config.h", which both the library and applications
depend on.   It's now in our ftp directory for you to pick up.  You want to
put it in a directory one level higher than the root of the seqlib or
renzyme directory (or change all the #include "../../config.h" lines).
Sorry for the confusion.
-- 
Roy Smith, System Administrator
Public Health Research Institute
{allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu
"The connector is the network"