[comp.lang.perl] Tokenizing in Perl

markb@agora.uucp (Mark Biggar) (07/11/90)

As Larry said, the simplest way to write a tokenizer in perl is to use
s/^...// to chop a token off the front of your string.  With that in mind,
the following is a set of useful regular expressions for this purpose:

m|/\*[^*]*\*+([^/][^*]*\*+)*/|
	matches just the first C-style non-nested comment on a line

/"[^"\\]*(\\.[^"\\]*)*"/
	matches just the first " string on a line, allowing \ escapes

/("[^"]*")+/
	matches just the first Ada-style string on a line (a doubled ""
	inside the string stands for one quote)
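
The s/^...// technique can be combined with the patterns above into a
small tokenizer loop.  What follows is only a sketch, assuming Perl 5:
the tokenize routine, its token type names, and the rules for words and
punctuation are hypothetical, added for illustration.  The comment and
string patterns are the ones listed above, anchored with ^.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split one line of
# C-like text into [type, text] pairs by repeatedly chopping a token
# off the front of the string with s/^...//.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ($line ne '') {
        if ($line =~ s/^\s+//) {
            next;                           # skip leading whitespace
        } elsif ($line =~ s|^(/\*[^*]*\*+([^/][^*]*\*+)*/)||) {
            push @tokens, ['comment', $1];  # C-style comment
        } elsif ($line =~ s/^("[^"\\]*(\\.[^"\\]*)*")//) {
            push @tokens, ['string', $1];   # " string with \ escapes
        } elsif ($line =~ s/^(\w+)//) {
            push @tokens, ['word', $1];     # identifier or number
        } else {
            $line =~ s/^(.)//s;             # any other single character
            push @tokens, ['punct', $1];
        }
    }
    return @tokens;
}

# Print one token per line as "type<TAB>text".
for my $tok (tokenize('foo = "a\"b"; /* hi */ x')) {
    print "$tok->[0]\t$tok->[1]\n";
}
```

Every branch removes at least one character from the front of the
string, so the loop always terminates.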

You can use the following to translate C-style \ escapes in a string matched
	by the RE above. NOTE: the order of the alternatives in the RE below
	is significant: the octal and hex cases must be tried before the
	catch-all (.).


s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg;
sub trans {
	local($oct,$hex,$single) = @_;
	if ($oct ne '') {		# \ooo octal escape
		pack("C",oct($oct));
	} elsif ($hex ne '') {		# \xhh hex escape
		pack("C",hex($hex));
	} else { #singleton case must have matched if others didn't
		substr($trans,ord($single),1);
			# $trans is a lookup string indexed by character code;
			# def of $trans left as exercise for reader :-)
	}
}
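
For readers who would rather skip the exercise, here is one possible
definition of $trans, assuming Perl 5 and 8-bit characters.  The %escape
table, the unescape wrapper, and the choice to keep only the low byte of
an oversized octal or hex value are assumptions made for illustration;
they are not part of the code above.

```perl
use strict;
use warnings;

# Assumption: $trans is a 256-character string indexed by character
# code.  Every character maps to itself, except the letters that name
# a C escape, which map to the matching control character.
our $trans = join '', map { chr($_) } 0 .. 255;
my %escape = (
    'a' => "\a", 'b' => "\b", 'f' => "\f",
    'n' => "\n", 'r' => "\r", 't' => "\t", 'v' => "\x0b",
);
substr($trans, ord($_), 1) = $escape{$_} for keys %escape;

# Same logic as the sub in the post, with explicit returns and
# defined() checks so it runs cleanly under warnings.
sub trans {
    my ($oct, $hex, $single) = @_;
    return pack('C', oct($oct) & 0xff) if defined $oct && $oct ne '';
    return pack('C', hex($hex) & 0xff) if defined $hex && $hex ne '';
    return substr($trans, ord($single), 1);
}

# Hypothetical wrapper: apply the substitution to a copy of the string.
sub unescape {
    my ($s) = @_;
    $s =~ s/\\(([0-7]{1,3})|x([\da-fA-F]+)|(.))/&trans($2,$3,$4)/eg;
    return $s;
}

print unescape('tab:\there, quote:\", hex:\x41, octal:\101'), "\n";
```

Any character without an entry in %escape translates to itself, so \"
and \\ come out as a plain quote and a plain backslash.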

--
Perl's Maternal Uncle
Mark Biggar