markb@agora.uucp (Mark Biggar) (07/11/90)
As Larry said, the simplest way to write a tokenizer in perl is to use
s/^...// to chop a token off the front of your string. With that in mind,
the following is a set of useful regular expressions for this purpose:
    m|/\*[^*]*\*+([^/][^*]*\*+)*/|
matches just the first C-style non-nested comment on the line
    /"[^"\\]*(\\.[^"\\]*)*"/
matches just the first " string on the line, with \ escapes
    /("[^"]*")+/
matches just the first string on the line, Ada style ("" inside the
string stands for a literal ")
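
To show how these fit the chop-off-the-front approach, here is a rough
sketch of a tokenizer loop. The sub name tokenize, the token labels
(COMMENT, STRING, WORD, CHAR) and the WORD and CHAR rules are made up
for illustration; only the comment and string patterns come from the
list above.

```perl
# Hypothetical tokenizer: repeatedly chop one token off the front of
# the input with s/^...// until nothing is left.
sub tokenize {
    my($text) = @_;
    my(@tokens);
    while ($text ne '') {
        if ($text =~ s|^\s+||) {
            next;                       # skip leading whitespace
        } elsif ($text =~ s|^(/\*[^*]*\*+([^/][^*]*\*+)*/)||) {
            push(@tokens, "COMMENT:$1");
        } elsif ($text =~ s/^("[^"\\]*(\\.[^"\\]*)*")//) {
            push(@tokens, "STRING:$1");
        } elsif ($text =~ s/^(\w+)//) {
            push(@tokens, "WORD:$1");   # assumed rule, not from the post
        } else {
            # whitespace was stripped above, so . always matches here
            $text =~ s/^(.)//;
            push(@tokens, "CHAR:$1");
        }
    }
    @tokens;
}

# Usage: print one token per line.
print join("\n", &tokenize('foo /* a ** b */ "x\"y" ;')), "\n";
```

Every pattern is anchored with ^, so each pass either removes a token
from the front of the string or falls through to the next rule, and the
final single-character rule guarantees the loop makes progress.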
You can use the following to translate C-style \ escapes in a string matched
by the C-style string RE above. NOTE: the order of the alternatives in the
RE below is significant; the octal and hex cases must be tried before the
single-character fallback.
s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg;
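sub trans {
    local($oct,$hex,$single) = @_;
    # pack "C" (unsigned char) is used so values above 127 are not
    # treated as out of range for a signed char
    if ($oct ne '') {
        pack("C",oct($oct));    # octal escape, e.g. \101
    } elsif ($hex ne '') {
        pack("C",hex($hex));    # hex escape, e.g. \x41
    } else { # singleton case must have matched if others didn't
        # $trans is a string indexed by character code that holds the
        # replacement for each escape character
        substr($trans,ord($single),1);
        # def of $trans left as exercise for reader :-)
    }
}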
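
For anyone who would rather not do the exercise, here is one possible
way to build $trans. This is only a guess at what is wanted: an identity
table over all 256 character codes, with the slots for the usual C
escape letters overwritten. The choice of escape letters and the wrapper
sub unescape are my assumptions, and sub trans is repeated so the
example runs on its own.

```perl
# Assumed definition of $trans: identity mapping for codes 0..255 ...
$trans = pack("C256", 0 .. 255);
# ... then patch in the C escape letters (first char of each pair is
# the escape letter, second char is its replacement).
foreach $pair ("n\n", "t\t", "r\r", "b\b", "f\f", "a\a", "v\013") {
    substr($trans, ord($pair), 1) = substr($pair, 1, 1);
}

sub trans {
    local($oct, $hex, $single) = @_;
    if ($oct ne '') {
        pack("C", oct($oct));           # octal escape
    } elsif ($hex ne '') {
        pack("C", hex($hex));           # hex escape
    } else {
        substr($trans, ord($single), 1); # table lookup for the rest
    }
}

# Hypothetical wrapper: translate all \ escapes in a string body.
sub unescape {
    local($s) = @_;
    $s =~ s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg;
    $s;
}

# Usage: the single-quoted argument contains literal backslashes.
print &unescape('tab:\there \101\x42'), "\n";
```

Because the table starts out as the identity mapping, any escape that is
not listed (such as \\ or \") simply yields the character after the
backslash.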
--
Perl's Maternal Uncle
Mark Biggar