[comp.compilers] Lex surrogates

db@NSS.CS.UCL.AC.UK (Dave Berry) (02/06/89)

The following references describe scanner generators that are claimed to
produce smaller and faster scanners than Lex.  None of them seem to have
caught on (admittedly the references are fairly recent).  Does anyone have
any experience with these or similar tools?  What problems do they have?
 Do people use Lex just because it's there?

If Lex is as bad as these articles (and my experience) suggest, I'm surprised
that GNU are using it for an optimising compiler.

@Article(gla,
        Author="V. P. Heuring",
        Title="The Automatic Generation of Fast Lexical Analysers",
        Journal="Software Practice and Experience",
        Volume="16",
        Number="9",
        Pages="801-808",
        Month="Sep",
        Year="1986")

@Article(mkscan,
        Author="R. Nigel Horspool and Michael R. Levy",
        Title="{\it Mkscan} -- An Interactive Scanner Generator",
        Journal="Software Practice and Experience",
        Volume="17",
        Number="6",
        Pages="369-379",
        Month="June",
        Year="1987")

@Article(schnorf,
        Author="P. Schnorf",
        Title="Dynamic Instantiation and Configuration of Functionally Extended,
 Efficient Lexical Analysers",
        Journal="{SIGPLAN} Notices",
        Volume="23",
        Number="10",
        Pages="93-102",
        Month="Oct",
        Year="1988")

Dave Berry,	Laboratory for Foundations of Computer Science, Edinburgh.
		db%lfcs.ed.ac.uk@nss.cs.ucl.ac.uk
		<Atlantic Ocean>!mcvax!ukc!lfcs!db
[Lex was a summer student's project that escaped from the lab. I can easily
believe that there are lots of better lexer generators, since although lex's
theory is sound, its implementation leaves a lot to be desired. Gnu isn't
using it, since it's licensed AT&T code, but is probably using a
reimplementation that works. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

schmidt@ORION.CF.UCI.EDU (Douglas C. Schmidt) (02/06/89)

In article <3286@ima.ima.isc.com> Dave Berry <lfcs.edinburgh.ac.uk!db writes:
++If Lex is as bad as these articles (and my experience) suggest, I'm surprised
++that GNU are using it for an optimising compiler.

Arrrrrggghh.  I can't take the spread of misinformation any more!  GNU
C and GNU C++ *don't* use LEX or FLEX or GNULEX (no, it doesn't
exist).  The lexical analyzers are hand coded (I know, because I wrote
the perfect hash function that recognizes GNU C and C++ reserved words
in O(1) time).  Please get the facts straight on this (which shouldn't
be too hard, since the source is freely available ;-)).

Doug Schmidt

PS: They do use an LALR BISON (YACC) grammar, however.
[From "Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU>]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

wpl@PRC.Unisys.COM (William P Loftus) (02/07/89)

Many of the lexical analyzer generators gain their speed from limiting
the specification power of their tools. They are taking advantage of
the fact that most languages don't need a general RE processor. In my
opinion, this limits the use of such tools to traditional language
lexical analyzers (they are not for what Aho et al. call swiss army
knives---see the dragon book).   However, there is a full RE lexical
analyzer generator called Flex.

Flex is a lex clone, but much faster.

Here is a file contained in the distribution of flex (which is
freely available from several ftp and uucp sites) that describes
timing:

--
flex vs. lex timings for a C tokenizer which includes keywords:

Generation times:

	lex	    83.0 secs
	flex	     3.9
	flex -cfe    7.1	# uncompressed table, equivalence classes
	flex -cf    15.0	# uncompressed table, no equivalence classes

Scanner object file sizes:

	lex	   41.0K bytes
	flex	    9.4K
	flex -cfe  49.6K
	flex -cf  126.5K

Running times on a 28,088 line input (685K characters):

	lex	   29.8 secs
	flex	   19.3
	flex -cfe   9.0
	flex -cf    7.8

The timings were made on a Sun 3/60.  All times are user + system CPU time,
and don't include hashing of identifiers.

Summary:

    For about the same sized scanner, you get a factor of 3 in performance.
    For a 30% faster scanner, you get a scanner 1/4th the size, and it's
	generated in 1/20th the time.
    For a scanner that's 3 times larger, you get a factor of 3.8 in
	performance.
--

There is a paper by Vern Paxson that describes how he obtains these speeds (I
don't have the paper with me at the moment, so I can't tell you the title).

William Loftus
--
William P Loftus                        UUCP:   wpl@burdvax.UUCP
Unisys/Paoli Research Center            ARPA:   wpl@anarchy.prc.unisys.com
PO Box 517                              215-354-0614 (home) 
Paoli, PA 19301                         215-648-7248 (work)
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

vern@pistachio.ee.lbl.gov (Vern Paxson) (02/07/89)

lfcs.edinburgh.ac.uk!db@NSS.CS.UCL.AC.UK (Dave Berry) asked about scanner
generators that claim to produce smaller and faster scanners than Lex,
listing three references of which "None of them seem to have caught on."

Allow me if I may to plug flex, which is my lex rewrite claiming to produce
smaller and faster scanners than Lex.  It has caught on fairly well, to
judge from the email and bug reports I receive, and I've been told that
UCB's CSRG will be using it to replace lex in subsequent BSD releases.  It's
available as flex.shar.Z for anonymous ftp from rtsg.ee.lbl.gov (128.3.254.68);
it's also available (sorry, I don't know the exact name) from prep.ai.mit.edu
(128.52.32.14), though as schmidt@ORION.CF.UCI.EDU (Douglas C. Schmidt) writes:

> Arrrrrggghh.  I can't take the spread of misinformation any more!  GNU
> C and GNU C++ *don't* use LEX or FLEX or GNULEX (no, it doesn't
> exist).

I.e., it's not part of the GNU project.  It \is/ "free" software in
more-or-less the same spirit, however, which I think is why they are
distributing it.

		Vern

	Vern Paxson				vern@lbl-csam.arpa
	Real Time Systems			ucbvax!lbl-csam.arpa!vern
	Lawrence Berkeley Laboratory		(415) 486-6580
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

rsalz@BBN.COM (Rich Salz) (02/08/89)

In <3295@ima.ima.isc.com> Vern Paxson <vern@pistachio.ee.lbl.gov> writes:
>Allow me if I may to plug flex, which is my lex rewrite claiming to produce
>smaller and faster scanners than Lex....

Vern is much too modest.  Flex is FAST.

If you need flex, and can do without the couple of missing features
(right-context awareness?  I forget), then you should get it.  As he
pointed out, the FSF folks distribute it as it's got a similar licensing
philosophy.

As for the general question: "it's true what they say about lex."   Most
folks know about this comment because Peter Honeyman stuck it in the
change log of pathalias.  It turns out to basically not be all that true.
Van Jacobson (also of LBL) spent a few some time tuning lex and spoke
about it at the Washington Usenix conference.  As John pointed out, the
problem with Lex is that it was so space-conscious that the generated
code ran real slowly.  By opening up the tables and a few other changes,
he was able to get the innermost loop to compile into a single 68000
instructions.   He got an ovation when he showed timing charts where
lex was just a shade slower than cat(1).  Van's changes were never
made available, but it doesn't matter:  Use Flex. :-)
	/rich $alz
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

ken@cs.rochester.edu (Ken Yap) (02/09/89)

This is straying from the subject but it is worth noting that one way
flex gets its speed is by returning a pointer to the string comprising
the token in place (no string copying).  This can bite you if you are
using flex with yacc/bison and you don't copy the string into your own
variable right away. After the next terminal has been matched is too
late.

But I recommend flex heartily.
[From Ken Yap <ken@cs.rochester.edu>]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

mike@arizona.edu (Mike Coffin) (02/10/89)

> Vern is much too modest.  Flex is FAST.

I agree.  Not only are the scanners fast, but it generates them
very quickly --- much faster than LEX.  FLEX is well worth getting.
-- 
Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2}!arizona!mike
Tucson, AZ  85721			(602)621-2858
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

tower@bu-cs.BU.EDU (Leonard H. Tower Jr.) (02/11/89)

In article <3290@ima.ima.isc.com> "Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU> writes:
|[Gnu C compilers use a hand coded lexer.]
|
|Doug Schmidt
|
|PS: They do use an LALR BISON (YACC) grammar, however.

Doug is a 100% here, but he doesn't know the full scoop.

GNU is distributing FLEX on it's tapes, and plans to include it as
part of GNU, becasue it's the fastest freely re-distributable scanner
that is coded well and cleanly, that we knew of.

Lexers do have use outside of compilers ...  ;-}

enjoy -len 
[From tower@bu-cs.BU.EDU (Leonard H. Tower Jr.)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

pardo@june.cs.washington.edu (David Keppel) (02/11/89)

>"Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU> writes:
>|[Gnu C compilers use a hand coded lexer.]

tower@bu-cs.BU.EDU (Leonard H. Tower Jr.) writes:
>[GNU is distributing FLEX]
>Lexers do have use outside of compilers ...  ;-}

Since this is comp.compilers ...

Compilers have uses outside of programming languages.  Or put another
way, compilers (and compiler pieces, and interpreters, ...) are useful
for writing ``little languages''.  A formal lexer and parser is often
far more reliable than an ad-hoc parser.  Remember, the keystrokes you
type to your editor are both (a) a programing language and (b) get
parsed and executed.

Jeekers.
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

henry@zoo.toronto.edu (Henry Spencer) (02/11/89)

>...one way
>flex gets its speed is by returning a pointer to the string comprising
>the token in place (no string copying).  This can bite you if you are
>using flex with yacc/bison and you don't copy the string into your own
>variable right away. After the next terminal has been matched is too
>late.

The stuff lex puts in yytext[] also changes for each terminal, and hence
also must be saved immediately if you want to use it.  I don't understand
why the lack of copying makes a practical difference.

                                     Henry Spencer at U of Toronto Zoology
                                 uunet!attcan!utzoo!henry henry@zoo.toronto.edu
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

holt@turing.toronto.edu (Ric Holt) (02/14/89)

In reply to Henry's comments ....

>The stuff lex puts in yytext[] also changes for each terminal, and hence
>also must be saved immediately if you want to use it.  I don't understand
>why the lack of copying makes a practical difference.
>
>                                     Henry Spencer at U of Toronto Zoology
>                                 uunet!attcan!utzoo!henry henry@zoo.toronto.edu

In a FAST scanner, every machine instruction counts. I'd like to know how
fast, on an instruction basis, FLEX and LEX are. The scanner for the new
Turing compiler (written in Turing Plus) seems to execute about 129 machine
instructions per token, including hashing identifiers and entering them into
buckets (as well as skipping white space and comments). These are MC680x0
instructions. A token is something like a keyword, an operator, an identifier,
etc. In a scanner like this, copying extra bytes appreciably slows down
things.

		Ric Holt
[From Ric Holt <holt@turing.toronto.edu>]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

henry@zoo.toronto.edu (02/16/89)

>>The stuff lex puts in yytext[] also changes for each terminal, and hence
>>also must be saved immediately if you want to use it.  I don't understand
>>why the lack of copying makes a practical difference.
>
>In a FAST scanner, every machine instruction counts...

Agreed, and I think FLEX does the right thing by not copying -- there
never was a good reason for it, except that LEX was a quick hack that
didn't ever get worked over properly for performance.

The original discussion was about how it affected the programming
semantics, not the performance, though, which is why I said what I did.

                                     Henry Spencer at U of Toronto Zoology
                                 uunet!attcan!utzoo!henry henry@zoo.toronto.edu
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

grosch@unido.irb.informatik.uni-dortmund.de (Josef Grosch) (02/18/89)

1. Van Jacobson's loop

In <3299@ima.ima.isc.com> Rich Salz <rsalz@BBN.COM> writes:
> Van Jacobson (also of LBL) spent a few some time tuning lex ...
> he was able to get the innermost loop to compile into a single 68000
> instructions. ...
> lex was just a shade slower than cat(1).  Van's changes were never
> made available, ...

I would really like to see this one instruction loop!
Otherwise it sounds like an unbelievable rumour, because the fastest
hand-coded loop for scanning that I can imagine has 4 instructions.
For a general automaton I as well as others need at least 5 instructions.

2. Rex: a fast scanner generator

Rex is like flex an improved remake of Lex. Additional features are the
automatic calculation of the line and column position of every token and a
mechanism to handle include files. REJECT has been sacrificed.
Rex generated scanners are always 20-30% faster than the fastest flex
version. The tables are always as small as possible. Depending on the
problem the table size can be only 5-77% and the generation time only
50-70% of the flex figures. For more info see below. Some figures:

                         Lex             Flex            Rex
----------------------------------------------------------------
speed [lines/min.]
   without hashing       36,400          139,000         182,700
   with hashing          34,700          118,000         141,400
table size [bytes]       39,200          57,300          4,400
scanner size [bytes]     43,800          64,100          11,200
generation time [sec.]   73.7            7.2             4.9

3. Lalr + Ell: fast parser generators

Not only Lex is slow, Yacc also. We have an Yacc remake called Lalr
that generates parsers that are 2 to 3 times faster. Moreover automatic
error handling is provided which includes error reporting, recovery, and
repair. Ell generated parsers beat Yacc by a factor of 4. Ell processes
LL(1) grammars and generates conventional recursive descent parsers.
For more info see below. Some figures:

                    Bison     Yacc      PGS       Lalr      Ell
-------------------------------------------------------------------
speed [lines/min.]  105,000   184,000   200,000   385,000   700,000
table size [bytes]  8,004     10,364    11,268    11,795    -
parser size [bytes] 11,136    12,548    17,616    17,416    14,344
gen. time [sec.]    5.0       19.6      69.5      29.6      6.4

4. I'm wondering that nobody is talking about ALEX:

%A H. M\\*:ossenb\\*:ock
%T Alex - A Simple and Efficient Scanner Generator
%J SIGPLAN
%V 21
%N 12
%P 139-148
%D DEC 1986

5. Announcement

      Rex, Lalr and Ell - Compiler Construction Tools
      ===============================================

     Rex (Regular EXpression tool) is a scanner  generator  whose
specifications  are  based  on  regular expressions and arbitrary
semantic actions written in one of  the  target  languages  C  or
Modula-2.  As  scanners sometimes have to consider the context to
unambiguously recognize a token the right context can  be  speci-
fied by an additional regular expression and the left context can
be handled by so-called  start  states.  The  generated  scanners
automatically  compute the line and column position of the tokens
and offer an efficient mechanism  to  normalize  identifiers  and
keywords  to upper or lower case letters. The scanners are table-
driven and run at a speed of 180,000 to 195,000 lines per  minute
on a MC 68020 processor.

     Lalr is a LALR(1) parser generator accepting grammars  writ-
ten  in  extended BNF notation which may be augmented by semantic
actions expressed by statements of the target language. The  gen-
erator  provides  a  mechanism  for  S-attribution,  that is syn-
thesized attributes can be computed during parsing.  In  case  of
LR-conflicts  unlike  other tools Lalr provides not only informa-
tion about an internal state consisting of a set of items but  it
prints a derivation tree which is much more useful to analyze the
problem. Conflicts can be resolved by specifying  precedence  and
associativity of operators and productions. The generated parsers
include automatic  error  recovery,  error  messages,  and  error
repair.  The  parsers  are  table-driven  and  run  at a speed of
400,000 lines per minute. Currently parsers can be  generated  in
the target languages C and Modula-2.

     Ell is a LL(1) parser generator accepting the same  specifi-
cation  language  as  Lalr except that the grammars must obey the
LL(1) property. It  is  possible  to  evaluate  an  L-attribution
during parsing. The generated  parsers  include  automatic  error
recovery,  error  messages,  and  error  repair  like  Lalr.  The
parsers are implemented following the  recursive  descent  method
and  reach a speed of 700,000 lines per minute. The possible tar-
get languages are again C and Modula-2.

     All the tools are implemented in Modula-2.
A comparison of the above tools with the corresponding UNIX
tools shows that significant improvements have been achieved:
Rex generated scanners are 4 times faster than those of LEX.
Lalr generated parsers are 2-3 times faster than those of YACC.
Ell generated parsers are 4 times faster than those of YACC.

The tools are public domain. They have been developed in 1987.
They have been tested by generating scanners and parsers for
e. g. Pascal, Modula, Oberon, Ada and found stable.

The tools have been developed using our own Modula-2 compiler called
MOCKA on a MC 68020 based UNIX workstation. They have been ported
to the SUN workstation and been compiled successfully using the SUN
Modula-2 compiler. They also run on VAX/BSD UNIX and VAX/ULTRIX machines.
This should assure a reasonable level of portability.

I would be pleased to send the programs to everybody that is interested.
To minimize the costs (at least for me) I suggest to send a blank SUN
streamer tape. 1/2" magtapes are possible, too, but cause more work for me.
I will return the tape with the sources organized as a UNIX file tree.
(tar format)
It would be nice if you could include a cheque of US $ 10 or equivalent
to cover return postage.

The tape will contain user manuals for each tool.
Papers about Rex and Lalr will be published soon.

I have submitted the sources to the usenet newsgroup comp.sources.unix.

Josef Grosch
GMD - Forschungsstelle an der Universitaet Karlsruhe
Haid-und-Neu-Str. 7
D-7500 Karlsruhe 1
West Germany

Phone: +721-662226
Email: grosch@gmdka.UUCP
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request