[comp.lang.perl] Recursive descent parsers in perl?

yukngo@obelix.gaul.csd.uwo.ca (Cheung Yukngo) (06/28/90)

When I was writing a perl program to parse rfc822 mail headers, I
noticed that the grammar to describe the header is actually
context-free. The comments in a header are delimited by "(" and ")"
and can be nested.  Using regular expressions to recognize them may or
may not be painful; but I wonder how difficult it is to have a general
recursive descent parser generator facility in perl. Am I asking too
much?

vixie@decwrl.dec.com (Paul A Vixie) (06/30/90)

Recursive-descent parsers are hard unless you can implement a "get next token"
operator.  Even "get next character" is helpful.  Perl doesn't have an easy
way to do either, though somewhere among the thousands of messages in Larry's
"look at it someday" mail folder, he has several from me suggesting ways to
tokenizing fairly elegantly.  What's needed is something like "split" only
inside-out and backwards.  Larry?
--
Paul Vixie
DEC Western Research Lab	<vixie@wrl.dec.com>
Palo Alto, California		...!decwrl!vixie

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (07/02/90)

In article <VIXIE.90Jun30105204@volition.pa.dec.com> vixie@decwrl.dec.com (Paul A Vixie) writes:
: Recursive-descent parsers are hard unless you can implement a "get next token"
: operator.  Even "get next character" is helpful.  Perl doesn't have an easy
: way to do either, though somewhere among the thousands of messages in Larry's
: "look at it someday" mail folder, he has several from me suggesting ways to
: tokenizing fairly elegantly.  What's needed is something like "split" only
: inside-out and backwards.  Larry?

The current way to write a tokener in Perl is to just start hacking tokens
off the front of $_ with s/^token//.  The first version of the perl debugger
worked this way, before I got smart and made the perl compiler do my parsing.
Substituting tokens off the front of a string is in general fairly efficient--
all it does internally is move up the pointer and decrease the length.

But I've hankered for a way to tokenize better too.  Whatever I do has to
be general though, and I don't think even emacs-style syntax tables are
smart enough.  So for the moment, just keep hacking tokens off the front.

Larry

worley@compass.com (Dale Worley) (07/03/90)

   From: yukngo@obelix.gaul.csd.uwo.ca (Cheung Yukngo)

   The comments in a header are delimited by "(" and ")"
   and can be nested.  Using regular expressions to recognize them may or
   may not be painful;

I don't know if this has been covered before, but nested structures of
any kind cannot be recognized by regular expressions.

Dale Worley		Compass, Inc.			worley@compass.com
--
"If you could have any amount of money... How much would you want?"
"All of it."

hakanson@ogicse.ogc.edu (Marion Hakanson) (07/03/90)

In article <8563@jpl-devvax.JPL.NASA.GOV> lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) writes:
>. . .
>The current way to write a tokener in Perl is to just start hacking tokens
>off the front of $_ with s/^token//.  The first version of the perl debugger

But if things are really complicated, you can do what I did, which is
to write a lexical analyzer in C (or lex, or flex), and have that spit
out an expression with some special delimiter between tokens, and then
use split() to break the entire expression into an array.  I found
this to work well for parsing DNS (Domain) master files, as the lex-er
dealt with quoting, continuation lines, escaped characters, etc.,
producing an equivalent simplified ("canonical") form which was easy
for Perl to split up efficiently.  My next iteration of the software
(if there is one) will likely move even more of the parsing into C (or
yacc, or whatever).

One other technique I've used is in designing the syntax of
configuration files or tables for Perl programs.  I've taken to just
specifying them using Perl syntax, so I can just "do" the files to
parse them.  If you're careful, config-file maintainers won't even
know they're writing Perl code.

-- 
Marion Hakanson         Domain: hakanson@cse.ogi.edu
                        UUCP  : {hp-pcd,tektronix}!ogicse!hakanson

jbw@zeb.uswest.com (Joe Wells) (07/06/90)

In article <YUKNGO.90Jun28100354@obelix.gaul.csd.uwo.ca> yukngo@obelix.gaul.csd.uwo.ca (Cheung Yukngo) writes:

   When I was writing a perl program to parse rfc822 mail headers, I
   noticed that the grammar to describe the header is actually
   context-free. The comments in a header are delimited by "(" and ")"
   and can be nested.  Using regular expressions to recognize them may or
   may not be painful; but I wonder how difficult it is to have a general
   recursive descent parser generator facility in perl. Am I asking too
   much?

Look at the file lisp/rfc822.el in the GNU Emacs distribution.  This is a
"hairy" recursive descent parser for rfc822 addresses that uses regular
expressions as much as it can.  Of course, it will help if you know Lisp.

You could probably translate this code into perl with only a few months
work ...  It's only 281 lines of code.

-- 
Joe Wells <jbw@uswest.com>