[comp.lang.c++] Lexer timings

schmidt@zola.ics.uci.edu (Doug Schmidt) (02/27/90)

In article <MEISSNER.90Feb26145638@curley.osf.org>, meissner@osf (Michael Meissner) writes:
>I have supported three compilers in the last 10 years (DG/L, Data
>General C, and now GNU C), and none of them use any automatic tools
>for lexing.

Actually, GNU C uses a reserved word recognizer function generated
automatically by my GPERF program.  But I understand your general
point.

>First of all, UNIX lex is too general purpose, and too slow to be used
>in a production compiler (yes I know flex fixes at least some of the
>speed problems).  You don't need backtracing, or even generalized
>regular expressions, and such for most computer languages to lex
>stuff, just simple one character lookahead.

Here are some interesting statistics that add empirical support to
Michael's claims:

Given an input file with 350,466 C++ reserved words and 624,156
user-defined identifiers, organized as 1 word per-line (created by
running the command tr -cs A-Za-z_ '\012' on the preprocessed source
code for ET++) the following timing results were obtained on an
otherwise idle Sun 4/260 with 32 meg of RAM:

/bin/time gperf.exe < ET++.input
       19.9 real        17.8 user         1.3 sys
/bin/time flex.exe < ET++.input
       49.2 real        32.2 user         2.5 sys
/bin/time comp-flex.exe < ET++.input
       92.9 real        62.4 user         2.7 sys

The gperf.exe file was a gperf-generated recognizer for the 47
standard C++ reserved words invoked by the following short driver:

----------------------------------------
int main (void)
{
  char buf[200];

    while (gets (buf))
      in_word_set (buf, strlen (buf));
}
----------------------------------------

The flex.exe file came from a flex 2.1 generated recognizer (also for
C++ reserved words) created with the -f (no table compaction) option,
in order to maximize speed.  comp-flex is the same input file
generated with the default settings (-cem, which gives maximal
compression).

Finally, here are the relative sizes (in bytes) for the object files
(all compiled gcc 1.37 using the -O -strength-reduce option):

----------------------------------------
text    data    bss     dec
1672    0       0       1672    gperf.o
6032    56      16440   22528   comp-flex.o
3616    56632   16440   76688   flex.o
----------------------------------------

My conclusions?  Even flex can be big and slow for certain lexing
purposes, though it beats the standard UNIX lex and its use is often
justifiable on other grounds (facilitates rapid-prototyping, may be
easier to maintain than a hand-generated lexer for certain languages,
etc.).  Finally, tool-generated code (like that produced by gperf)
doesn't necessarily have to be slow, even though it's produced
automatically.

  Doug

--
On a clear day, under blue skies, there is no need to seek.
And asking about Buddha                +------------------------+
Is like proclaiming innocence,         | schmidt@ics.uci.edu    |
With loot in your pocket.              | office: (714) 856-4043 |