jes@mbio.med.upenn.edu (Joe Smith) (05/03/91)
I'm experimenting with perl to find patterns in DNA sequences. So
far, the experiment is partially successful. Can anyone suggest
improvements? The DNA is represented by a long scalar string
(5000-10,000 characters is not uncommon), in which I want to find
instances of a pattern. Here's a first draft:
#!/usr/local/bin/perl
# a test sequence
$seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;
# now the search...
$n = ($seq =~ s/(CC[AT]+)/
push(@sites, $1), # record the actual match
push(@positions, length $`), # and its position (1)
$1/gei # don't change anything (2)
);
print "$n sites:\n";
for ($[..$#sites) {
print " $_: $positions[$_], '$sites[$_]'\n";
}
__END__
Here's what I get:
7 sites:
0: 0, 'CCAA'
1: 0, 'CCTT'
2: 0, 'CCTTT'
3: 0, 'CCA'
4: 0, 'CCTT'
5: 0, 'CCAA'
6: 0, 'CCT'
Note that
1) The 'length $`' doesn't seem to work while the search is going
on. Keeping track of the positions that matched is critical, and
carving the sequence into substrings is likely to be messy and
slow. Did I miss something simple? Would it be possible/useful
to have perl update a variable with the offset of the beginning
of the match?
2) Having to replace the matched pattern with itself seems very
inefficient, (especially when processing a 10Kb string!). Is
there any way of doing a similar operation with m//, or perhaps
tricking s/// into not doing any replacement?
Thanks for any suggestions,
<Joe
--
Joe Smith
University of Pennsylvania jes@mbio.med.upenn.edu
Dept. of Biochemistry and Biophysics (215) 898-8348
Philadelphia, PA 19104-6059
lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/04/91)
In article <JES.91May3121528@mbio.med.upenn.edu> jes@mbio.med.upenn.edu (Joe Smith) writes:
:
: I'm experimenting with perl to find patterns in DNA sequences. So
: far, the experiment is partially successful. Can anyone suggest
: improvements? The DNA is represented by a long scalar string
: (5000-10,000 characters is not uncommon), in which I want to find
: instances of a pattern. Here's a first draft:
:
: #!/usr/local/bin/perl
:
: # a test sequence
: $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
: 'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
: 'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;
:
: # now the search...
: $n = ($seq =~ s/(CC[AT]+)/
: push(@sites, $1), # record the actual match
: push(@positions, length $`), # and its position (1)
: $1/gei # don't change anything (2)
: );
:
: print "$n sites:\n";
: for ($[..$#sites) {
: print " $_: $positions[$_], '$sites[$_]'\n";
: }
: __END__
:
: Here's what I get:
:
: 7 sites:
: 0: 0, 'CCAA'
: 1: 0, 'CCTT'
: 2: 0, 'CCTTT'
: 3: 0, 'CCA'
: 4: 0, 'CCTT'
: 5: 0, 'CCAA'
: 6: 0, 'CCT'
:
: Note that
:
: 1) The 'length $`' doesn't seem to work while the search is going
: on. Keeping track of the positions that matched is critical, and
: carving the sequence into substrings is likely to be messy and
: slow. Did I miss something simple? Would it be possible/useful
: to have perl update a variable with the offset of the beginning
: of the match?
In 4.003, $` is unfortunately broken within s/// because of a fix to
something else. That'll be fixed in patch 4. As a workaround you
could use length($seq) - length($') - length($1). Yeah, blech...
Alternately, you could use split.
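A minimal sketch of one way the split route could be read, not necessarily what Larry had in mind. It assumes a perl in which split returns the text matched by capturing parentheses along with the pieces between matches, and it uses "my", "use strict" and statement modifiers from later perl releases. The names $offset and $is_match are made up for the illustration. The position of each site is recovered by adding up the lengths of the pieces that precede it, so neither $` nor a self-replacing s/// is needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The test sequence from the original post.
my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my (@sites, @positions);
my $offset   = 0;   # running offset of the current piece within $seq
my $is_match = 0;   # pieces alternate: non-match, match, non-match, ...

# With exactly one pair of capturing parentheses in the pattern, split
# returns each delimiter (the matched site) after the piece preceding it.
for my $piece (split /(CC[AT]+)/i, $seq) {
    if ($is_match) {
        push @sites, $piece;        # the actual match
        push @positions, $offset;   # zero-based offset where it starts
    }
    $offset += length $piece;
    $is_match = !$is_match;
}

printf "%d sites:\n", scalar @sites;
print " $_: $positions[$_], '$sites[$_]'\n" for 0 .. $#sites;
```

The alternation trick relies on the pattern having exactly one capture group; with more groups, split would return more than one extra field per match and the bookkeeping would have to change.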
: 2) Having to replace the matched pattern with itself seems very
: inefficient, (especially when processing a 10Kb string!). Is
: there any way of doing a similar operation with m//, or perhaps
: tricking s/// into not doing any replacement?
An option to s/// not to do replacement would be interesting, though
there might be a better way--maybe an initial offset for m//, or
allowing patterns in index().
Larry
jes@mbio.med.upenn.edu (Joe Smith) (05/06/91)
In article <1991May4.010307.11792@jpl-devvax.jpl.nasa.gov> lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) writes:
> In 4.003, $` is unfortunately broken within s/// because of a fix to
> something else. That'll be fixed in patch 4. As a workaround you
> could use length($seq) - length($') - length($1). Yeah, blech...
Oops, I just saw the other bug report.
The book says that $` and $' aren't generated unless they're
mentioned. In an expression like length $`, does perl actually
generate a copy of this piece of the original string just to determine
the match position, or is the length just calculated from the match
position?
> An option to s/// not to do replacement would be interesting, though
And to think I've always wondered how such wonderfully
convoluted language design happens...
> there might be a better way--maybe an initial offset for m//, or
> allowing patterns in index().
This seems more direct, but the ability to bind an action to the match
is also very important. Perhaps access to a primitive pattern
matching function (the way splice is related to push, pop, etc.) would
be reasonable:
    $options = "oi";
    for ($pos = 0;
         $matched = regexp( $pattern, $seq, $pos, $options );
         $pos += $matched ) {
        push( @positions, $pos );
        push( @sites, subst($seq, $pos, $matched ));
    }
I think you could replicate the match and substitute capabilities with
that (subst($seq, $pos, $matched ) = "your replacement here..."), with
arbitrary actions on a match and without a lot of extra string copies.
It'll never fit on one line, but I could live with that.
<Joe
--
Joe Smith
University of Pennsylvania jes@mbio.med.upenn.edu
Dept. of Biochemistry and Biophysics (215) 898-8348
Philadelphia, PA 19104-6059
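A rough sketch of how a matching primitive along the lines Joe proposes might be approximated. It is not an existing built-in: match_from is a hypothetical helper name, and it leans on features of later perl releases (assignable pos(), qr//, and the @- and @+ offset arrays), none of which are part of the perl discussed in this thread. Unlike the regexp() in the proposal, the helper returns the start offset as well as the length, on the assumption that a match need not begin exactly at the offset where the search starts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not a Perl built-in: search the string referenced by
# $seq_ref for $pattern, starting at offset $pos. Returns (start, length) of
# the first match at or after $pos, or an empty list if there is none.
sub match_from {
    my ($pattern, $seq_ref, $pos) = @_;
    pos($$seq_ref) = $pos;                    # where the next //g search begins
    return unless $$seq_ref =~ /$pattern/g;
    return ($-[0], $+[0] - $-[0]);            # offsets of the whole match
}

# The test sequence from the original post.
my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my (@sites, @positions);
my $pos = 0;
while (my ($start, $len) = match_from(qr/CC[AT]+/i, \$seq, $pos)) {
    push @positions, $start;
    push @sites, substr($seq, $start, $len);
    $pos = $start + ($len || 1);   # step past the match; avoid stalling on an empty one
}

printf "%d sites:\n", scalar @sites;
print " $_: $positions[$_], '$sites[$_]'\n" for 0 .. $#sites;
```

The string is passed by reference so that a 10Kb sequence is not copied on every call, which was one of the concerns raised in the proposal.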
lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/07/91)
In article <JES.91May5190550@mbio.med.upenn.edu> jes@mbio.med.upenn.edu (Joe Smith) writes:
: The book says that $` and $' aren't generated unless they're
: mentioned. In an expression like length $`, does perl actually
: generate a copy of this piece of the original string just to determine
: the match position, or is the length just calculated from the match
: position?
Unfortunately, "length $`" counts as a mention currently. I've wanted
to optimize that out for some time, but haven't collected enough of
those little round tuits.
: > An option to s/// not to do replacement would be interesting, though
:
: And to think I've always wondered how such wonderfully
: convoluted language design happens...
Yah.
: > there might be a better way--maybe an initial offset for m//, or
: > allowing patterns in index().
:
: This seems more direct, but the ability to bind an action to the match
: is also very important. Perhaps access to a primitive pattern
: matching function (the way splice is related to push, pop, etc.) would
: be reasonable:
:
: $options = "oi";
:
: for ($pos = 0;
: $matched = regexp( $pattern, $seq, $pos, $options );
: $pos += $matched ) {
: push( @positions, $pos );
: push( @sites, subst($seq, $pos, $matched ));
: }
:
: I think you could replicate the match and substitute capabilities with
: that (subst($seq, $pos, $matched ) = "your replacement here..."), with
: arbitrary actions on a match and without a lot of extra string copies.
: It'll never fit on one line, but I could live with that.
I'm thinking more along the lines of the g// operator in ed:
    while ($foo =~ g/pat/) {
        push(@positions, length $`);
        push(@sites, $&);
    }
g// would have to have its own built in state. I'm not sure what it
should do if $foo changes...
Perhaps it ought to be m/pat/g instead.
Larry
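A minimal sketch of the loop above using the m/pat/g spelling floated at the end of the post. It assumes a later perl in which m//g in scalar context resumes from where the previous match on the same string left off; the g// operator as written in the post is a proposal, not something the perl of this thread provides. The sequence and pattern are the ones from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The test sequence from the original post.
my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my (@sites, @positions);

# Each pass through the loop finds the next match; nothing is replaced,
# so $seq is never rewritten.
while ($seq =~ /CC[AT]+/gi) {
    push @positions, length $`;   # text before the match gives its offset
    push @sites, $&;              # the matched text itself
}

printf "%d sites:\n", scalar @sites;
print " $_: $positions[$_], '$sites[$_]'\n" for 0 .. $#sites;
```

This covers both points from the original question: the match position is available inside the loop body, and no self-replacement is required.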