roy@phri.UUCP (Roy Smith) (05/27/89)
A program to perform restriction site analysis on dna sequences, called renzyme, is now being made available. The algorithm was described in the paper cited below. In a nutshell, it's about 20 times faster than other common programs. If you ever want to find restriction sites in sequences of non-trivial length, you want this program. The source code is copyrighted, but is being made available at no charge to non-profit, academic institutions for research purposes. You can get it by anonymous ftp from goober.phri.nyu.edu (128.122.136.10) in the directory ~ftp/pub/seq. You want to get the following files: -rw-r--r-- 1 roy 1146 May 22 17:52 Acknowledgements -rw-r--r-- 1 roy 1349 May 25 13:55 Copyright -rw-r--r-- 1 roy 5594 May 26 20:58 Setup -rw-r--r-- 1 roy 204645 May 26 20:53 renzyme.tar.Z -rw-r--r-- 1 roy 29321 May 22 17:52 seqlib.tar.Z The first 3 files are pretty obvious. The two compressed tar files are the sources for the renzyme program and some related utilities, and a library of routines which you need to compile the other programs. The source trees are set up to maintain multiple binaries from a single set of sources so the Makefiles are a tad complex; the Setup file should explain everything. The code is all in C and is known to compile and run under SunOS-3.5 and MtXinu 4.3BSD/NFS. I don't expect any major problems porting to other systems. If you ftp this stuff, drop me a note so I know who has it. For details of the algorithm, read: %A R. Smith %T A finite state machine algorithm for finding restriction sites and other pattern matching applications %J CABIOS %V 4 %N 4 %D 1988 %X Existing algorithms for finding restriction endonuclease recognition sites use brute-force algorithms which run in time O(NM) where N is the number of nucleotides in the sequence under analysis and M is the total number of nucleotides in all the different sites being searched for. This paper presents a deterministic finite state machine algorithm which runs in time O(N). Memory use can be as high as O(M^4) but a slight modification to the basic algorithm can impose a theoretical upper bound of O(M) at the cost of some added complexity in the execution of the state machine. The algorithm can operate with a single pass through the sequence under analysis, with no need to ever back up or (for non-circular sequences) store more than a single input character at a time. This type of algorithm can be adapted to many pattern matching tasks and is simple enough to implement in hardware that it could, for example, be built into a disk controller as part of a specialized database machine. -- Roy Smith, System Administrator Public Health Research Institute {allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu "The connector is the network"
roy@phri.UUCP (Roy Smith) (05/28/89)
If you ftp'd my "renzyme" program as announced in <3782@phri.UUCP> and tried to do a "make depend" you won't have gotten very far. I forgot to include the file "config.h", which both the library and applications depend on. It's now in our ftp directory for you to pick up. You want to put it in a directory one level higher than the root of the seqlib or renzyme directory (or change all the #include "../../config.h" lines). Sorry for the confusion. -- Roy Smith, System Administrator Public Health Research Institute {allegra,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@phri.nyu.edu "The connector is the network"