[comp.lang.perl] User Definable Character Classes

tneff@bfmny0.UU.NET (Tom Neff) (03/15/90)

A feature I'd like to see added: a few user definable character classes.
Call them \X, \Y, \Z.  These could be used in regexps without the
additional overhead or confusion of using $vars.

Example: I define \X to be [\w-$_] -- with whatever syntax.

Now I can have complex substitutions

		s/<([.\X]+!)+(\X+\.[.\X]+!)/<$2/;
	
with good performance.  It seems straightforward to modify regcomp.c
to use \X thru \Z if defined.
-- 
Perestroika: could   \O\     Tom Neff
 it happen here?      \O\    uunet.uu.net!bfmny0!tneff
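
[For reference, a minimal sketch of the $var approach the post wants to
avoid. The variable $X and the helper strip_route are hypothetical names,
and the class body (word characters plus hyphen) is only an assumed
stand-in for the proposed \X. The /o modifier asks Perl to compile the
interpolated pattern once.]

```perl
use strict;
use warnings;

# Assumed stand-in for \X: the body of a character class holding
# word characters and the hyphen.
my $X = '\w\-';

# Strip the leading bang-path hops in front of a dotted host name,
# following the substitution in the post.
sub strip_route {
    my ($addr) = @_;
    # ${X} is interpolated into the pattern; /o compiles it only once.
    $addr =~ s/<([.${X}]+!)+([${X}]+\.[.${X}]+!)/<$2/o;
    return $addr;
}

print strip_route('<foo!bar!bfmny0.uu.net!tneff>'), "\n";
```

[Because the variable holds only the class body, it can be used both
alone as [${X}] and combined with other characters as [.${X}].]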

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (03/15/90)

In article <15253@bfmny0.UU.NET> tneff@bfmny0.UU.NET (Tom Neff) writes:
: A feature I'd like to see added: a few user definable character classes.
: Call them \X, \Y, \Z.  These could be used in regexps without the
: additional overhead or confusion of using $vars.
: 
: Example: I define \X to be [\w-$_] -- with whatever syntax.
: 
: Now I can have complex substitutions
: 
: 		s/<([.\X]+!)+(\X+\.[.\X]+!)/<$2/;
: 	
: with good performance.  It seems straightforward to modify regcomp.c
: to use \X thru \Z if defined.


I'd probably make that \x, \y and \z, with \X, \Y and \Z being the negations.

I'm looking carefully at ways to integrate better argument parsing with
regular expressions.  Right now it's not possible to, say, swap the
second and third arguments of a function call, at least not with complete
generality.  You need to have some means of tokenizing, and rejecting
commas and right parens that are inside parens or quotes or comments, or,
depending on the language, after backslashes or dollar signs.

I've got some ideas, but I'm open to suggestions.  I don't think mere
syntax tables a la emacs are good enough.

Larry
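
[A rough sketch of the tokenizing step described above, written in Perl
itself rather than as a regex extension. The names split_args and
swap_2_3 are hypothetical. It assumes the text between the outer
parentheses has already been extracted, and it handles only nested
parentheses and quoted strings with backslash escapes; comments and
dollar signs are not handled.]

```perl
use strict;
use warnings;

# Split an argument list on top-level commas only. Commas inside
# parentheses or inside quoted strings are kept as part of the argument.
sub split_args {
    my ($text) = @_;
    my @args  = ('');
    my $depth = 0;
    # Each token is a whole quoted string or one single character.
    while ($text =~ /\G("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|.)/gs) {
        my $tok = $1;
        if    ($tok eq '(') { $depth++ }
        elsif ($tok eq ')') { $depth-- }
        if ($tok eq ',' && $depth == 0) {
            push @args, '';         # a top-level comma starts a new argument
        } else {
            $args[-1] .= $tok;      # everything else belongs to the current one
        }
    }
    return @args;
}

# Swap the second and third arguments, leaving the rest untouched.
sub swap_2_3 {
    my ($args_text) = @_;
    my @args = split_args($args_text);
    @args[1, 2] = @args[2, 1] if @args >= 3;
    return join ',', @args;
}

print swap_2_3('a, f(b, c), "x,y"'), "\n";
```

[Whitespace stays attached to the argument it precedes, so the swapped
list keeps the original spacing by position.]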

chip@tct.uucp (Chip Salzenberg) (03/17/90)

According to lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall):
>I'm looking carefully at ways to integrate better argument parsing with
>regular expressions.  Right now it's not possible to, say, swap the
>second and third arguments of a function call, at least not with complete
>generality.  You need to have some means of tokenizing, and rejecting
>commas and right parens that are inside parens or quotes or comments, or,
>depending on the language, after backslashes or dollar signs.

Sounds like a job for:

	lex.pl

:-)

What we could use here is a full-blown lexical analysis engine
integrated into the Perl language.

For example, the RCS file format is an ASCII stream with "@"
delimiters, where "@@" means a literal "@".  I've often wondered how
to write Perl to interpret such a file without calling getc thousands
of times.

Then again, if I want flex, I know where to find it.
-- 
Chip Salzenberg at ComDev/TCT   <chip%tct@ateng.com>, <uunet!ateng!tct!chip>
          "The Usenet, in a very real sense, does not exist."

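[A minimal sketch of pulling "@"-delimited strings out of text that has
already been read into memory, so that no per-character getc calls are
needed. The name rcs_strings is hypothetical, and the input is assumed
to be well formed: every string is closed, and "@" appears only as a
delimiter or doubled inside a string. This is not a full RCS parser.]

```perl
use strict;
use warnings;

# Return the contents of every @...@ string in $text, with each
# doubled "@@" turned back into a single literal "@".
sub rcs_strings {
    my ($text) = @_;
    my @out;
    # Body: runs of non-@ characters, optionally joined by "@@" pairs.
    # The closing @ must not be followed by another @.
    while ($text =~ /\@([^\@]*(?:\@\@[^\@]*)*)\@(?!\@)/g) {
        my $s = $1;
        $s =~ s/\@\@/\@/g;
        push @out, $s;
    }
    return @out;
}

print join('|', rcs_strings('desc @mail joe@@example.com@ text @@')), "\n";
```

[The "@" characters are backslashed throughout so that Perl does not
try to interpolate an array into the pattern.]
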
jbw@bucsf.bu.edu (Joe Wells) (03/17/90)

In article <7420@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:

   I don't think mere syntax tables a la emacs are good enough.

Aye, I second that!

-- 
Joe Wells <jbw@bu.edu>

hakanson@ogicse.ogi.edu (Marion Hakanson) (03/18/90)

In article <260176EA.C58@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:
>What we could use here is a full-blown lexical analysis engine
>integrated into the Perl language.
>
>For example, the RCS file format is an ASCII stream with "@"
>delimiters, where "@@" means a literal "@".  I've often wondered how
>to write Perl to interpret such a file without calling getc thousands
>of times.

Take it from someone who's been down this road.  You DON'T want to call
getc lots of times.  I converted my perl "dnslex" to C, and the C version
ran probably 300 times faster.  But putting the lex-er in a separate
program works quite well, esp. with Perl's nice way of opening pipes.

However, the @/@@ problem isn't so tough, as long as you don't have to
worry about backslash-escapes (where the backslash could be escaped).
Even that can be done in the absence of other quoting mechanisms, but it
is not pretty (see below).

Here's a routine I wrote to split on a comma (instead of an @), with
it doubled for a literal comma.  Maybe it will give an idea....

# The arg may be of the form 'part1,part2', where ',' is
# the first un-doubled comma (later commas are not processed).

sub commasplit {
    local ($_) = @_;
    local ($first,$secnd);

    $first = '';
    $secnd = '';
    
    commasplit: while ( /,/ ) {
        $first .= $`;	# before the comma
        $_ = $';	# and after it

        if ( s/^,// ) {	# turn double into a single & continue
            $first .= ',';
        } else {	# make the split
            $secnd = $_;
            $_ = '';	# remainder is now in $secnd; nothing left to append
            last commasplit;
        }
    }
    $first .= $_;	# in case no single comma was found
    ($first,$secnd);
}


-- 
Marion Hakanson         Domain: hakanson@cse.ogi.edu
                        UUCP  : {hp-pcd,tektronix}!ogicse!hakanson
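
[For comparison, a sketch of the same split done with a single pattern.
The name commasplit_re is hypothetical. It relies on the atomic group
(?>...), a later addition to Perl's regex syntax that was not available
when this thread was written, to stop the pattern from backtracking
into the middle of a doubled comma.]

```perl
use strict;
use warnings;

# Split on the first un-doubled comma; ",," stands for a literal comma.
# Later commas in the second part are left unprocessed.
sub commasplit_re {
    my ($str) = @_;
    my ($first, $second) = ($str, '');
    # The atomic group takes the longest run of non-commas and doubled
    # commas; the next character must then be the single comma.
    if ($str =~ /^((?>(?:[^,]|,,)*)),(.*)$/s) {
        ($first, $second) = ($1, $2);
    }
    $first =~ s/,,/,/g;     # turn each doubled comma into a single one
    return ($first, $second);
}

my ($a, $b) = commasplit_re('a,,b,c,d');
print "$a | $b\n";
```

[If no single comma is found, the whole argument comes back as the
first part with its doubled commas collapsed, and the second part is
empty.]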