[comp.lang.perl] pattern matching on binary data

jmm@eci386.uucp (John Macdonald) (06/28/90)

This is mostly a cautionary tale (although Larry may decide
to treat it as a bug report if he wishes).

I was recently doing a conversion script that processed a file
that could contain mixed text and binary - where the text was
at the beginning of the file (a leading #! line and some other
stuff).  The conversion would result in a file which had the
same leading header lines, but would have the binary stuff run
through a filter (decrypt, or uncompress, etc.).

I was trying to process the leading text lines as follows:

---- start ----
sub splitline {
    if( $buf =~ /\n/ ) {
	$line = "$`\n";
	$buf = "$'";
    } else {
	$line = "";
    }
}

open( curin, "file" ) || die "Can't open file";

exit unless read( curin, $buf, 1024 );

do splitline();

if( $line =~ /^#!/ ) {
    print $line;
    do splitline();
}

# check for other possible leading text lines ...
#  ...

# now filter the binary
open( FILTER, '|filterprog' );
select( FILTER );

print $buf;

while( read( curin, $buf, 1024 ) ) {
    print $buf;
}

close( FILTER );
---- end ----

The problem occurred in the splitline function - when it found
a text line it would correctly set $line, but the assignment of

	$buf = "$'";

did not set $buf to everything after the match, but stopped at
a null byte.  I'll let Larry decide whether he wants to consider
this to be a bug.  (While it would be possible to handle this
special case without (presumably) too much hassle, trying to
allow for all possible variations of binary data being processed
by regexp might be rather tough.  I'm sure Randal will be able
to find fertile ground for obfuscated signatures in the
possibilities...)
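
A minimal check for this behaviour, assuming a present-day perl 5
rather than the 1990 release discussed here.  The sample buffer and
the variable names are made up for illustration.  It matches the first
newline and compares the length of $' with the number of bytes that
really follow the match:

```perl
use strict;
use warnings;

# Made-up sample data: one text line, then bytes that include a NUL.
my $buf = "header\n\x01\x00\x02tail";

my ($after, $expected);
if ($buf =~ /\n/) {
    $after    = $';                              # what the regexp engine reports
    $expected = length($buf) - length($`) - 1;   # bytes really left after "\n"
    printf "\$' holds %d of %d bytes\n", length($after), $expected;
}
```

If the two numbers differ, $' was cut short somewhere in the binary
part of the buffer.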

It was easy enough to work around in my case; I just used:

	$buf = substr( $buf, length($line), length($buf)-length($line) );

instead.
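
The same idea can be taken one step further so that the regexp engine
never sees the binary tail at all.  What follows is only a sketch,
assuming a present-day perl 5; split_line is a hypothetical name, not
part of the script above, and it returns its results instead of
setting globals:

```perl
use strict;
use warnings;

# Hypothetical variant of splitline(): returns (line, rest) and uses
# index() and substr() only, so neither $` nor $' is involved.
sub split_line {
    my ($buf) = @_;
    my $pos = index($buf, "\n");
    return ("", $buf) if $pos < 0;          # no newline: empty line, buffer untouched
    my $line = substr($buf, 0, $pos + 1);   # up to and including the newline
    my $rest = substr($buf, $pos + 1);      # everything after it, NUL bytes included
    return ($line, $rest);
}

# Made-up sample: a #! line followed by bytes that include a NUL.
my $buf = "#!/bin/sh\n\x01\x00\x02\nmore";
my ($line, $rest) = split_line($buf);
printf "line is %d bytes, rest is %d bytes\n", length($line), length($rest);
```

As in the original splitline, a buffer with no newline yields an empty
line and is passed back unchanged.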
-- 
Algol 60 was an improvement on most          | John Macdonald
of its successors - C.A.R. Hoare             |   jmm@eci386

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (06/29/90)

In article <1990Jun28.142155.12170@eci386.uucp> jmm@eci386.UUCP (John Macdonald) writes:
: This is mostly a cautionary tale (although Larry may decide
: to treat it as a bug report if he wishes).

I do.  $' should work on binary data.  It will be fixed in the next patch.

Larry