db@NSS.CS.UCL.AC.UK (Dave Berry) (02/06/89)
The following references describe scanner generators that are claimed to
produce smaller and faster scanners than Lex.  None of them seem to have
caught on (admittedly the references are fairly recent).  Does anyone have
any experience with these or similar tools?  What problems do they have?
 Do people use Lex just because it's there?
If Lex is as bad as these articles (and my experience) suggest, I'm surprised
that GNU are using it for an optimising compiler.
@Article(gla,
        Author="V. P. Heuring",
        Title="The Automatic Generation of Fast Lexical Analysers",
        Journal="Software Practice and Experience",
        Volume="16",
        Number="9",
        Pages="801-808",
        Month="Sep",
        Year="1986")
@Article(mkscan,
        Author="R. Nigel Horspool and Michael R. Levy",
        Title="{\it Mkscan} -- An Interactive Scanner Generator",
        Journal="Software Practice and Experience",
        Volume="17",
        Number="6",
        Pages="369-379",
        Month="June",
        Year="1987")
@Article(schnorf,
        Author="P. Schnorf",
        Title="Dynamic Instantiation and Configuration of Functionally Extended,
 Efficient Lexical Analysers",
        Journal="{SIGPLAN} Notices",
        Volume="23",
        Number="10",
        Pages="93-102",
        Month="Oct",
        Year="1988")
Dave Berry,	Laboratory for Foundations of Computer Science, Edinburgh.
		db%lfcs.ed.ac.uk@nss.cs.ucl.ac.uk
		<Atlantic Ocean>!mcvax!ukc!lfcs!db
[Lex was a summer student's project that escaped from the lab. I can easily
believe that there are lots of better lexer generators, since although lex's
theory is sound, its implementation leaves a lot to be desired. Gnu isn't
using it, since it's licensed AT&T code, but is probably using a
reimplementation that works. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-requestschmidt@ORION.CF.UCI.EDU (Douglas C. Schmidt) (02/06/89)
In article <3286@ima.ima.isc.com> Dave Berry <lfcs.edinburgh.ac.uk!db writes:
++If Lex is as bad as these articles (and my experience) suggest, I'm surprised
++that GNU are using it for an optimising compiler.
Arrrrrggghh.  I can't take the spread of misinformation any more!  GNU
C and GNU C++ *don't* use LEX or FLEX or GNULEX (no, it doesn't
exist).  The lexical analyzers are hand coded (I know, because I wrote
the perfect hash function that recognizes GNU C and C++ reserved words
in O(1) time).  Please get the facts straight on this (which shouldn't
be too hard, since the source is freely available ;-)).
Doug Schmidt
PS: They do use an LALR BISON (YACC) grammar, however.
[From "Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU>]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-requestwpl@PRC.Unisys.COM (William P Loftus) (02/07/89)
Many of the lexical analyzer generators gain their speed from limiting
the specification power of their tools. They are taking advantage of
the fact that most languages don't need a general RE processor. In my
opinion, this limits the use of such tools to traditional language
lexical analyzers (they are not for what Aho et al. call swiss army
knives---see the dragon book).   However, there is a full RE lexical
analyzer generator called Flex.
Flex is a lex clone, but much faster.
Here is a file contained in the distribution of flex (which is
freely available from several ftp and uucp sites) that describes
timing:
--
flex vs. lex timings for a C tokenizer which includes keywords:
Generation times:
	lex	    83.0 secs
	flex	     3.9
	flex -cfe    7.1	# uncompressed table, equivalence classes
	flex -cf    15.0	# uncompressed table, no equivalence classes
Scanner object file sizes:
	lex	   41.0K bytes
	flex	    9.4K
	flex -cfe  49.6K
	flex -cf  126.5K
Running times on a 28,088 line input (685K characters):
	lex	   29.8 secs
	flex	   19.3
	flex -cfe   9.0
	flex -cf    7.8
The timings were made on a Sun 3/60.  All times are user + system CPU time,
and don't include hashing of identifiers.
Summary:
    For about the same sized scanner, you get a factor of 3 in performance.
    For a 30% faster scanner, you get a scanner 1/4th the size, and it's
	generated in 1/20th the time.
    For a scanner that's 3 times larger, you get a factor of 3.8 in
	performance.
--
There is a paper by Vern Paxson that describes how he obtains these speeds (I
don't have the paper with me at the moment, so I can't tell you the title).
William Loftus
--
William P Loftus                        UUCP:   wpl@burdvax.UUCP
Unisys/Paoli Research Center            ARPA:   wpl@anarchy.prc.unisys.com
PO Box 517                              215-354-0614 (home) 
Paoli, PA 19301                         215-648-7248 (work)
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-requestvern@pistachio.ee.lbl.gov (Vern Paxson) (02/07/89)
lfcs.edinburgh.ac.uk!db@NSS.CS.UCL.AC.UK (Dave Berry) asked about scanner generators that claim to produce smaller and faster scanners than Lex, listing three references of which "None of them seem to have caught on." Allow me if I may to plug flex, which is my lex rewrite claiming to produce smaller and faster scanners than Lex. It has caught on fairly well, to judge from the email and bug reports I receive, and I've been told that UCB's CSRG will be using it to replace lex in subsequent BSD releases. It's available as flex.shar.Z for anonymous ftp from rtsg.ee.lbl.gov (128.3.254.68); it's also available (sorry, I don't know the exact name) from prep.ai.mit.edu (128.52.32.14), though as schmidt@ORION.CF.UCI.EDU (Douglas C. Schmidt) writes: > Arrrrrggghh. I can't take the spread of misinformation any more! GNU > C and GNU C++ *don't* use LEX or FLEX or GNULEX (no, it doesn't > exist). I.e., it's not part of the GNU project. It \is/ "free" software in more-or-less the same spirit, however, which I think is why they are distributing it. Vern Vern Paxson vern@lbl-csam.arpa Real Time Systems ucbvax!lbl-csam.arpa!vern Lawrence Berkeley Laboratory (415) 486-6580 -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
rsalz@BBN.COM (Rich Salz) (02/08/89)
In <3295@ima.ima.isc.com> Vern Paxson <vern@pistachio.ee.lbl.gov> writes: >Allow me if I may to plug flex, which is my lex rewrite claiming to produce >smaller and faster scanners than Lex.... Vern is much too modest. Flex is FAST. If you need flex, and can do without the couple of missing features (right-context awareness? I forget), then you should get it. As he pointed out, the FSF folks distribute it as it's got a similar licensing philosophy. As for the general question: "it's true what they say about lex." Most folks know about this comment because Peter Honeyman stuck it in the change log of pathalias. It turns out to basically not be all that true. Van Jacobson (also of LBL) spent a few some time tuning lex and spoke about it at the Washington Usenix conference. As John pointed out, the problem with Lex is that it was so space-conscious that the generated code ran real slowly. By opening up the tables and a few other changes, he was able to get the innermost loop to compile into a single 68000 instructions. He got an ovation when he showed timing charts where lex was just a shade slower than cat(1). Van's changes were never made available, but it doesn't matter: Use Flex. :-) /rich $alz -- Please send comp.sources.unix-related mail to rsalz@uunet.uu.net. -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
ken@cs.rochester.edu (Ken Yap) (02/09/89)
This is straying from the subject but it is worth noting that one way
flex gets its speed is by returning a pointer to the string comprising
the token in place (no string copying).  This can bite you if you are
using flex with yacc/bison and you don't copy the string into your own
variable right away. After the next terminal has been matched is too
late.
But I recommend flex heartily.
[From Ken Yap <ken@cs.rochester.edu>]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-requestmike@arizona.edu (Mike Coffin) (02/10/89)
> Vern is much too modest.  Flex is FAST.
I agree.  Not only are the scanners fast, but it generates them
very quickly --- much faster than LEX.  FLEX is well worth getting.
-- 
Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2}!arizona!mike
Tucson, AZ  85721			(602)621-2858
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-requesttower@bu-cs.BU.EDU (Leonard H. Tower Jr.) (02/11/89)
In article <3290@ima.ima.isc.com> "Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU> writes: |[Gnu C compilers use a hand coded lexer.] | |Doug Schmidt | |PS: They do use an LALR BISON (YACC) grammar, however. Doug is a 100% here, but he doesn't know the full scoop. GNU is distributing FLEX on it's tapes, and plans to include it as part of GNU, becasue it's the fastest freely re-distributable scanner that is coded well and cleanly, that we knew of. Lexers do have use outside of compilers ... ;-} enjoy -len [From tower@bu-cs.BU.EDU (Leonard H. Tower Jr.)] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
pardo@june.cs.washington.edu (David Keppel) (02/11/89)
>"Douglas C. Schmidt" <schmidt@ORION.CF.UCI.EDU> writes: >|[Gnu C compilers use a hand coded lexer.] tower@bu-cs.BU.EDU (Leonard H. Tower Jr.) writes: >[GNU is distributing FLEX] >Lexers do have use outside of compilers ... ;-} Since this is comp.compilers ... Compilers have uses outside of programming languages. Or put another way, compilers (and compiler pieces, and interpreters, ...) are useful for writing ``little languages''. A formal lexer and parser is often far more reliable than an ad-hoc parser. Remember, the keystrokes you type to your editor are both (a) a programing language and (b) get parsed and executed. Jeekers. -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
henry@zoo.toronto.edu (Henry Spencer) (02/11/89)
>...one way >flex gets its speed is by returning a pointer to the string comprising >the token in place (no string copying). This can bite you if you are >using flex with yacc/bison and you don't copy the string into your own >variable right away. After the next terminal has been matched is too >late. The stuff lex puts in yytext[] also changes for each terminal, and hence also must be saved immediately if you want to use it. I don't understand why the lack of copying makes a practical difference. Henry Spencer at U of Toronto Zoology uunet!attcan!utzoo!henry henry@zoo.toronto.edu -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
holt@turing.toronto.edu (Ric Holt) (02/14/89)
In reply to Henry's comments .... >The stuff lex puts in yytext[] also changes for each terminal, and hence >also must be saved immediately if you want to use it. I don't understand >why the lack of copying makes a practical difference. > > Henry Spencer at U of Toronto Zoology > uunet!attcan!utzoo!henry henry@zoo.toronto.edu In a FAST scanner, every machine instruction counts. I'd like to know how fast, on an instruction basis, FLEX and LEX are. The scanner for the new Turing compiler (written in Turing Plus) seems to execute about 129 machine instructions per token, including hashing identifiers and entering them into buckets (as well as skipping white space and comments). These are MC680x0 instructions. A token is something like a keyword, an operator, an identifier, etc. In a scanner like this, copying extra bytes appreciably slows down things. Ric Holt [From Ric Holt <holt@turing.toronto.edu>] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
henry@zoo.toronto.edu (02/16/89)
>>The stuff lex puts in yytext[] also changes for each terminal, and hence >>also must be saved immediately if you want to use it. I don't understand >>why the lack of copying makes a practical difference. > >In a FAST scanner, every machine instruction counts... Agreed, and I think FLEX does the right thing by not copying -- there never was a good reason for it, except that LEX was a quick hack that didn't ever get worked over properly for performance. The original discussion was about how it affected the programming semantics, not the performance, though, which is why I said what I did. Henry Spencer at U of Toronto Zoology uunet!attcan!utzoo!henry henry@zoo.toronto.edu -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
grosch@unido.irb.informatik.uni-dortmund.de (Josef Grosch) (02/18/89)
1. Van Jacobson's loop In <3299@ima.ima.isc.com> Rich Salz <rsalz@BBN.COM> writes: > Van Jacobson (also of LBL) spent a few some time tuning lex ... > he was able to get the innermost loop to compile into a single 68000 > instructions. ... > lex was just a shade slower than cat(1). Van's changes were never > made available, ... I would really like to see this one instruction loop! Otherwise it sounds like an unbelievable rumour, because the fastest hand-coded loop for scanning that I can imagine has 4 instructions. For a general automaton I as well as others need at least 5 instructions. 2. Rex: a fast scanner generator Rex is like flex an improved remake of Lex. Additional features are the automatic calculation of the line and column position of every token and a mechanism to handle include files. REJECT has been sacrificed. Rex generated scanners are always 20-30% faster than the fastest flex version. The tables are always as small as possible. Depending on the problem the table size can be only 5-77% and the generation time only 50-70% of the flex figures. For more info see below. Some figures: Lex Flex Rex ---------------------------------------------------------------- speed [lines/min.] without hashing 36,400 139,000 182,700 with hashing 34,700 118,000 141,400 table size [bytes] 39,200 57,300 4,400 scanner size [bytes] 43,800 64,100 11,200 generation time [sec.] 73.7 7.2 4.9 3. Lalr + Ell: fast parser generators Not only Lex is slow, Yacc also. We have an Yacc remake called Lalr that generates parsers that are 2 to 3 times faster. Moreover automatic error handling is provided which includes error reporting, recovery, and repair. Ell generated parsers beat Yacc by a factor of 4. Ell processes LL(1) grammars and generates conventional recursive descent parsers. For more info see below. Some figures: Bison Yacc PGS Lalr Ell ------------------------------------------------------------------- speed [lines/min.] 105,000 184,000 200,000 385,000 700,000 table size [bytes] 8,004 10,364 11,268 11,795 - parser size [bytes] 11,136 12,548 17,616 17,416 14,344 gen. time [sec.] 5.0 19.6 69.5 29.6 6.4 4. I'm wondering that nobody is talking about ALEX: %A H. M\\*:ossenb\\*:ock %T Alex - A Simple and Efficient Scanner Generator %J SIGPLAN %V 21 %N 12 %P 139-148 %D DEC 1986 5. Announcement Rex, Lalr and Ell - Compiler Construction Tools =============================================== Rex (Regular EXpression tool) is a scanner generator whose specifications are based on regular expressions and arbitrary semantic actions written in one of the target languages C or Modula-2. As scanners sometimes have to consider the context to unambiguously recognize a token the right context can be speci- fied by an additional regular expression and the left context can be handled by so-called start states. The generated scanners automatically compute the line and column position of the tokens and offer an efficient mechanism to normalize identifiers and keywords to upper or lower case letters. The scanners are table- driven and run at a speed of 180,000 to 195,000 lines per minute on a MC 68020 processor. Lalr is a LALR(1) parser generator accepting grammars writ- ten in extended BNF notation which may be augmented by semantic actions expressed by statements of the target language. The gen- erator provides a mechanism for S-attribution, that is syn- thesized attributes can be computed during parsing. In case of LR-conflicts unlike other tools Lalr provides not only informa- tion about an internal state consisting of a set of items but it prints a derivation tree which is much more useful to analyze the problem. Conflicts can be resolved by specifying precedence and associativity of operators and productions. The generated parsers include automatic error recovery, error messages, and error repair. The parsers are table-driven and run at a speed of 400,000 lines per minute. Currently parsers can be generated in the target languages C and Modula-2. Ell is a LL(1) parser generator accepting the same specifi- cation language as Lalr except that the grammars must obey the LL(1) property. It is possible to evaluate an L-attribution during parsing. The generated parsers include automatic error recovery, error messages, and error repair like Lalr. The parsers are implemented following the recursive descent method and reach a speed of 700,000 lines per minute. The possible tar- get languages are again C and Modula-2. All the tools are implemented in Modula-2. A comparison of the above tools with the corresponding UNIX tools shows that significant improvements have been achieved: Rex generated scanners are 4 times faster than those of LEX. Lalr generated parsers are 2-3 times faster than those of YACC. Ell generated parsers are 4 times faster than those of YACC. The tools are public domain. They have been developed in 1987. They have been tested by generating scanners and parsers for e. g. Pascal, Modula, Oberon, Ada and found stable. The tools have been developed using our own Modula-2 compiler called MOCKA on a MC 68020 based UNIX workstation. They have been ported to the SUN workstation and been compiled successfully using the SUN Modula-2 compiler. They also run on VAX/BSD UNIX and VAX/ULTRIX machines. This should assure a reasonable level of portability. I would be pleased to send the programs to everybody that is interested. To minimize the costs (at least for me) I suggest to send a blank SUN streamer tape. 1/2" magtapes are possible, too, but cause more work for me. I will return the tape with the sources organized as a UNIX file tree. (tar format) It would be nice if you could include a cheque of US $ 10 or equivalent to cover return postage. The tape will contain user manuals for each tool. Papers about Rex and Lalr will be published soon. I have submitted the sources to the usenet newsgroup comp.sources.unix. Josef Grosch GMD - Forschungsstelle an der Universitaet Karlsruhe Haid-und-Neu-Str. 7 D-7500 Karlsruhe 1 West Germany Phone: +721-662226 Email: grosch@gmdka.UUCP -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request