[comp.lang.perl] global pattern matching question

jes@mbio.med.upenn.edu (Joe Smith) (05/03/91)

I'm experimenting with perl to find patterns in DNA sequences.  So
far, the experiment is partially successful.  Can anyone suggest
improvements?  The DNA is represented by a long scalar string
(5000-10,000 characters is not uncommon), in which I want to find
instances of a pattern.  Here's a first draft:

#!/usr/local/bin/perl

# a test sequence
$seq =  'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
	'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
	'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;

# now the search...
$n = ($seq =~ s/(CC[AT]+)/
	push(@sites, $1),			# record the actual match
	push(@positions, length $`),		# and its position (1)
	$1/gei					# don't change anything (2)
);

print "$n sites:\n";
for ($[..$#sites) {
	print "  $_: $positions[$_], '$sites[$_]'\n";
}
__END__

Here's what I get:

7 sites:
  0: 0, 'CCAA'
  1: 0, 'CCTT'
  2: 0, 'CCTTT'
  3: 0, 'CCA'
  4: 0, 'CCTT'
  5: 0, 'CCAA'
  6: 0, 'CCT'

Note that

  1) The 'length $`' doesn't seem to work while the search is going
     on.  Keeping track of the positions that matched is critical, and
     carving the sequence into substrings is likely to be messy and
     slow.  Did I miss something simple?  Would it be possible/useful
     to have perl update a variable with the offset of the beginning
     of the match?

  2) Having to replace the matched pattern with itself seems very
     inefficient (especially when processing a 10Kb string!).  Is
     there any way of doing a similar operation with m//, or perhaps
     tricking s/// into not doing any replacement?

Thanks for any suggestions,
<Joe

--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/04/91)

In article <JES.91May3121528@mbio.med.upenn.edu> jes@mbio.med.upenn.edu (Joe Smith) writes:
: 
: I'm experimenting with perl to find patterns in DNA sequences.  So
: far, the experiment is partially successful.  Can anyone suggest
: improvements?  The DNA is represented by a long scalar string
: (5000-10,000 characters is not uncommon), in which I want to find
: instances of a pattern.  Here's a first draft:
: 
: #!/usr/local/bin/perl
: 
: # a test sequence
: $seq =  'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
: 	'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
: 	'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC' ;
: 
: # now the search...
: $n = ($seq =~ s/(CC[AT]+)/
: 	push(@sites, $1),			# record the actual match
: 	push(@positions, length $`),		# and its position (1)
: 	$1/gei					# don't change anything (2)
: );
: 
: print "$n sites:\n";
: for ($[..$#sites) {
: 	print "  $_: $positions[$_], '$sites[$_]'\n";
: }
: __END__
: 
: Here's what I get:
: 
: 7 sites:
:   0: 0, 'CCAA'
:   1: 0, 'CCTT'
:   2: 0, 'CCTTT'
:   3: 0, 'CCA'
:   4: 0, 'CCTT'
:   5: 0, 'CCAA'
:   6: 0, 'CCT'
: 
: Note that
: 
:   1) The 'length $`' doesn't seem to work while the search is going
:      on.  Keeping track of the positions that matched is critical, and
:      carving the sequence into substrings is likely to be messy and
:      slow.  Did I miss something simple?  Would it be possible/useful
:      to have perl update a variable with the offset of the beginning
:      of the match?

In 4.003, $` is unfortunately broken within s/// because of a fix to
something else.  That'll be fixed in patch 4.  As a workaround you
could use length($seq) - length($')  - length($1).  Yeah, blech...
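
A minimal sketch of that workaround applied to the test sequence,
assuming a Perl 5 interpreter rather than 4.003.  The variable $len is
an addition not in the original post: it caches length($seq) before the
substitution starts, so the arithmetic does not depend on the state of
$seq while s/// is running.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

# Assumption: take the total length once, up front.
my $len = length $seq;
my (@sites, @positions);

# Start of match = total length - text after the match - the match itself.
my $n = ($seq =~ s/(CC[AT]+)/
    push(@sites, $1),
    push(@positions, $len - length($') - length($1)),
    $1/gei);

print "$n sites:\n";
for my $i (0 .. $#sites) {
    print "  $i: $positions[$i], '$sites[$i]'\n";
}
```

With this sequence the recorded 0-based positions should come out as
48, 60, 76, 83, 98, 103 and 118.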

Alternately, you could use split.
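
One possible reading of the split suggestion, sketched under the
assumption of a Perl 5 interpreter: with capturing parentheses in the
pattern, split returns the matched separators along with the text
between them, and a running offset recovers each position.  The names
@pieces and $offset are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

# Even indexes hold the text between sites, odd indexes hold the sites.
my @pieces = split(/(CC[AT]+)/i, $seq);

my (@sites, @positions);
my $offset = 0;                      # characters consumed so far
for my $i (0 .. $#pieces) {
    if ($i % 2) {                    # an odd index is a matched site
        push @sites, $pieces[$i];
        push @positions, $offset;
    }
    $offset += length $pieces[$i];
}

print scalar(@sites), " sites:\n";
for my $i (0 .. $#sites) {
    print "  $i: $positions[$i], '$sites[$i]'\n";
}
```

This leaves $seq untouched, at the cost of copying the sequence into
the list of pieces.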

:   2) Having to replace the matched pattern with itself seems very
:      inefficient (especially when processing a 10Kb string!).  Is
:      there any way of doing a similar operation with m//, or perhaps
:      tricking s/// into not doing any replacement?

An option to s/// not to do replacement would be interesting, though
there might be a better way--maybe an initial offset for m//, or
allowing patterns in index().

Larry

jes@mbio.med.upenn.edu (Joe Smith) (05/06/91)

In article <1991May4.010307.11792@jpl-devvax.jpl.nasa.gov>
lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) writes:

> In 4.003, $` is unfortunately broken within s/// because of a fix to
> something else.  That'll be fixed in patch 4.  As a workaround you
> could use length($seq) - length($')  - length($1).  Yeah, blech...

Oops, I just saw the other bug report.

The book says that $` and $' aren't generated unless they're
mentioned.  In an expression like length $`, does perl actually
generate a copy of this piece of the original string just to determine
the match position, or is the length just calculated from the match
position?

> An option to s/// not to do replacement would be interesting, though

And to think I've always wondered how such wonderfully
convoluted language design happens...

> there might be a better way--maybe an initial offset for m//, or
> allowing patterns in index().

This seems more direct, but the ability to bind an action to the match
is also very important.  Perhaps access to a primitive pattern
matching function (the way splice is related to push, pop, etc.) would
be reasonable:

   $options = "oi";

   for ($pos = 0;
	   $matched = regexp( $pattern, $seq, $pos, $options );
	   $pos += $matched ) {
		   push( @positions, $pos );
		   push( @sites, substr($seq, $pos, $matched ));
   }

I think you could replicate the match and substitute capabilities with
that (substr($seq, $pos, $matched ) = "your replacement here..."), with
arbitrary actions on a match and without a lot of extra string copies.
It'll never fit on one line, but I could live with that.

<Joe
--
 Joe Smith
 University of Pennsylvania                    jes@mbio.med.upenn.edu
 Dept. of Biochemistry and Biophysics          (215) 898-8348
 Philadelphia, PA 19104-6059

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/07/91)

In article <JES.91May5190550@mbio.med.upenn.edu> jes@mbio.med.upenn.edu (Joe Smith) writes:
: The book says that $` and $' aren't generated unless they're
: mentioned.  In an expression like length $`, does perl actually
: generate a copy of this piece of the original string just to determine
: the match position, or is the length just calculated from the match
: position?

Unfortunately, "length $`" counts as a mention currently.  I've wanted
to optimize that out for some time, but haven't collected enough of
those little round tuits.
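
For comparison, Perl 5 releases from 5.6 on expose match offsets
through the @- and @+ arrays, so the position can be read without
mentioning $` at all.  A minimal sketch, assuming such an interpreter;
none of this was available in the Perl 4 under discussion, and $start
and $end are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my ($start, $end);
if ($seq =~ /CC[AT]+/i) {
    $start = $-[0];   # offset where the whole match begins
    $end   = $+[0];   # offset just past the end of the match
    print "first site at $start, length ", $end - $start, "\n";
}
```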

: > An option to s/// not to do replacement would be interesting, though
: 
: And to think I've always wondered how such wonderfully
: convoluted language design happens...

Yah.

: > there might be a better way--maybe an initial offset for m//, or
: > allowing patterns in index().
: 
: This seems more direct, but the ability to bind an action to the match
: is also very important.  Perhaps access to a primitive pattern
: matching function (the way splice is related to push, pop, etc.) would
: be reasonable:
: 
:    $options = "oi";
: 
:    for ($pos = 0;
: 	   $matched = regexp( $pattern, $seq, $pos, $options );
: 	   $pos += $matched ) {
: 		   push( @positions, $pos );
: 		   push( @sites, substr($seq, $pos, $matched ));
:    }
: 
: I think you could replicate the match and substitute capabilities with
: that (substr($seq, $pos, $matched ) = "your replacement here..."), with
: arbitrary actions on a match and without a lot of extra string copies.
: It'll never fit on one line, but I could live with that.

I'm thinking more along the lines of the g// operator in ed:

	while ($foo =~ g/pat/) {
	    push(@positions, length $`);
	    push(@sites, $&);
	}

g// would have to have its own built in state.  I'm not sure what it
should do if $foo changes...

Perhaps it ought to be m/pat/g instead.
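
A sketch of how that loop reads with m/pat/g in scalar context, as
Perl 5 provides it (an assumption about the interpreter; the thread
itself only floats the idea).  pos() reports where the last match
ended, so the start is pos minus the match length.  A capture group
stands in for $& here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'TATAGTGAGTCGTATTACAATTCACTGGCCGTCGTTTTACAACGTCG' .
          'CCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGT' .
          'TCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGACGCGCC';

my (@sites, @positions);

# In scalar context, each m//g call resumes where the previous one stopped.
while ($seq =~ /(CC[AT]+)/gi) {
    push @sites, $1;
    push @positions, pos($seq) - length($1);
}

print scalar(@sites), " sites:\n";
for my $i (0 .. $#sites) {
    print "  $i: $positions[$i], '$sites[$i]'\n";
}
```

The match state lives with $seq itself; assigning to $seq inside the
loop resets pos() and restarts the search from the beginning.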

Larry