[comp.lang.c++] FLEX/GPERF update

schmidt@crimee.ics.uci.edu (Doug Schmidt) (03/22/90)

Howdy,

  Several weeks ago I posted the result of an empirical comparison
between a FLEX-generated scanner and a GPERF-generated recognizer for
C++ reserved words (actually GNU G++ reserved words, which has 71
total keywords).

  Thanks to FLEX author Vern Paxson, who provided a more suitable flex
input specification, I've got some new results.  It turns out that
given the appropriate input spec and program option, FLEX generates
extremely fast scanners.  The empirical results are presented below,
followed by a shar file containing the FLEX input file, the GPERF
input file, and the GPERF-generated recognizer.

  First the relative object-code sizes in bytes for the respective .o
files:

----------------------------------------
Filename       Total    Text       Data      Bss      Options
gperf.o        2,576    2,576         0        0      -C -D -S1
comp-flex.o   24,320    7,824        56   16,440      -cem
mid-flex.o    41,744    3,040    22,264   16,440      -cfe
flex.o       117,360    2,816    98,104   16,440      -f
----------------------------------------

The input file contained a large number of words, both user-defined
identifiers and keywords, organized with one word per line.  The words
were created by running the UNIX command

tr -cs A-Za-z_ '\012'

on the preprocessed source code for ET++.  The input file
characteristic were:

Input File      Identifiers    Keywords     Total
ET++.input        624,156       350,466     974,622

All reserved word recognizer programs were tested on an otherwise idle
Sun 4/260 with 32 megabytes of RAM; they were compiled by the GNU G++
1.37.1 compiler with the:

-O -fstrength-reduce -finline-functions -fdelayed-branch

options enabled.  Here are the timing results:

% /bin/time flex.exe < ET++.input
       12.6 real         9.4 user         2.3 sys
% /bin/time mid-flex.exe < ET++.input
       15.8 real        12.7 user         2.5 sys
%/bin/time gperf.exe < ET++.input
       19.9 real        17.2 user         1.4 sys
% /bin/time comp-flex.exe < ET++.input
       40.0 real        34.8 user         2.4 sys

Tentative conclusion from all this are:

* The uncompacted table-driven flex implementation was both the fastest
  and the largest implementation, illustrating the time/space trade-off
  dichotomy.  Applications where saving time is more important than
  conserving space should benefit from this approach.

* The efficiency of FLEX-generated code is highly dependent on choosing
  the appropriate set of program options and specifying the input file
  correctly.  The new alpha-release of FLEX 2.2 includes documentation
  that explains how to specify the input description for best results.

* Despite having a load factor of only 0.45 percent the GPERF-generated
  perfect hash function recognizer is both space and time efficient.
  FLEX recognizers allow programmers to trade off space for
  time, perfect hashing does not overly penalize either.

* Hopefully, these results will help dispel the notion that
  automatically generated scanners are inherently slower than
  hand-coded scanners.  Vern has clearly done a great job of speeding up
  FLEX-generated code (certainly compared against that produced by the
  standard UNIX LEX utility!).

For a discussion of GPERF (and a more complete comparison of GPERF and
FLEX generated scanners against other common reserved word recognizer
implementations) please see my paper on GPERF appearing in the
upcoming USENIX C++ Conference proceedings.

Finally, here's a reprint of Vern's annoucement of release 2.2 of
flex:

     Flex, a lex replacement, is now available via anonymous ftp to
     svax.cs.cornell.edu (128.84.254.2, East coast) or ftp.ee.lbl.gov
     (128.3.254.68, West coast).  Retrieve flex-2.2.alpha.tar.Z, using
     binary mode.

Doug

----------------------------------------
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  c++.l c++.gperf gperf.c
# Wrapped by schmidt@crimee.ics.uci.edu on Wed Mar 21 16:38:44 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'c++.l' -a "${1}" != "-c" ; then
  echo shar: Will not clobber existing file \"'c++.l'\"
else
echo shar: Extracting \"'c++.l'\" \(916 characters\)
sed "s/^X//" >'c++.l' <<'END_OF_FILE'
X%{
X#ifdef __GNUG__
Xextern "C" int yylex (void);
X#endif
X%}
X%a 8000
X%p 3000
X%o 12000
X%%
X__alignof\n |
X__alignof__\n |
X__asm\n |
X__asm__\n |
X__attribute\n |
X__attribute__\n |
X__const\n |
X__const__\n |
X__inline\n |
X__inline__\n |
X__signed\n |
X__signed__\n |
X__typeof\n |
X__typeof__\n |
X__volatile\n |
X__volatile__\n |
Xall\n |
Xexcept\n |
Xexception\n |
Xraise\n |
Xraises\n |
Xreraise\n |
Xtry\n |
Xasm\n |
Xauto\n |
Xbreak\n |
Xcase\n |
Xcatch\n |
Xchar\n |
Xclass\n |
Xconst\n |
Xcontinue\n |
Xdefault\n |
Xdelete\n |
Xdo\n |
Xdouble\n |
Xdynamic\n |
Xelse\n |
Xenum\n |
Xextern\n |
Xfloat\n |
Xfor\n |
Xfriend\n |
Xgoto\n |
Xif\n |
Xinline\n |
Xint\n |
Xlong\n |
Xnew\n |
Xoperator\n |
Xoverload\n |
Xprivate\n |
Xprotected\n |
Xpublic\n |
Xregister\n |
Xreturn\n |
Xshort\n |
Xsigned\n |
Xsizeof\n |
Xstatic\n |
Xstruct\n |
Xswitch\n |
Xthis\n |
Xtypedef\n |
Xtypeof\n |
Xunion\n |
Xunsigned\n |
Xvirtual\n |
Xvoid\n |
Xvolatile\n |
Xwhile\n ;
X[_A-Z0-9a-z]+\n? ;
X.|\n ;
END_OF_FILE
if test 916 -ne `wc -c <'c++.l'`; then
    echo shar: \"'c++.l'\" unpacked with wrong size!
fi
# end of 'c++.l'
fi
if test -f 'c++.gperf' -a "${1}" != "-c" ; then
  echo shar: Will not clobber existing file \"'c++.gperf'\"
else
echo shar: Extracting \"'c++.gperf'\" \(520 characters\)
sed "s/^X//" >'c++.gperf' <<'END_OF_FILE'
X__alignof
X__alignof__
X__asm
X__asm__
X__attribute
X__attribute__
X__const
X__const__
X__inline
X__inline__
X__signed
X__signed__
X__typeof
X__typeof__
X__volatile
X__volatile__
Xall
Xexcept
Xexception
Xraise
Xraises
Xreraise
Xtry
Xasm
Xauto
Xbreak
Xcase
Xcatch
Xchar
Xclass
Xconst
Xcontinue
Xdefault
Xdelete
Xdo
Xdouble
Xdynamic
Xelse
Xenum
Xextern
Xfloat
Xfor
Xfriend
Xgoto
Xif
Xinline
Xint
Xlong
Xnew
Xoperator
Xoverload
Xprivate
Xprotected
Xpublic
Xregister
Xreturn
Xshort
Xsigned
Xsizeof
Xstatic
Xstruct
Xswitch
Xthis
Xtypedef
Xtypeof
Xunion
Xunsigned
Xvirtual
Xvoid
Xvolatile
Xwhile
END_OF_FILE
if test 520 -ne `wc -c <'c++.gperf'`; then
    echo shar: \"'c++.gperf'\" unpacked with wrong size!
fi
# end of 'c++.gperf'
fi
if test -f 'gperf.c' -a "${1}" != "-c" ; then
  echo shar: Will not clobber existing file \"'gperf.c'\"
else
echo shar: Extracting \"'gperf.c'\" \(5485 characters\)
sed "s/^X//" >'gperf.c' <<'END_OF_FILE'
X/* C code produced by gperf version 2.3 (GNU C++ version) */
X/* Command-line: gperf -C -D -S1 c++.gperf  */
X
X#define TOTAL_KEYWORDS 71
X#define MIN_WORD_LENGTH 2
X#define MAX_WORD_LENGTH 13
X#define MIN_HASH_VALUE 7
X#define MAX_HASH_VALUE 167
X/* maximum key range = 161, duplicates = 3 */
X
Xstatic unsigned int
Xhash (str, len)
X     register const char *str;
X     register int len;
X{
X  static const unsigned char asso_values[] =
X    {
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X     168, 168, 168, 168, 168,   0, 168,  40,  20,  45,
X      30,  10, 100,  30,  35,  65, 168,   0,  30,  40,
X     110,  60,   5, 168,  25,   0,  10,   5,  15,  20,
X     168,  15, 168, 168, 168, 168, 168, 168,
X    };
X  return len + asso_values[str[len - 1]] + asso_values[str[0]];
X}
X
Xconst char *
Xin_word_set (str, len)
X     register const char *str;
X     register int len;
X{
X  register int key = hash (str, len);
X  const char *resword;
X
X  switch (key)
X    {
X    case    7:
X      resword = "__asm__"; break;
X    case    9:
X      resword = "__const__"; break;
X    case   10:
X      resword = "__inline__";
X      if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X      resword = "__typeof__";
X      if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X      resword = "__signed__";
X      if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X      return 0;
X    case   11:
X      resword = "__alignof__"; break;
X    case   12:
X      resword = "__volatile__"; break;
X    case   13:
X      resword = "__attribute__"; break;
X    case   14:
X      resword = "this"; break;
X    case   15:
X      resword = "short"; break;
X    case   16:
X      resword = "struct"; break;
X    case   17:
X      resword = "__const"; break;
X    case   18:
X      resword = "__inline"; break;
X    case   20:
X      resword = "__volatile"; break;
X    case   21:
X      resword = "__attribute"; break;
X    case   22:
X      resword = "private"; break;
X    case   24:
X      resword = "else"; break;
X    case   25:
X      resword = "break"; break;
X    case   26:
X      resword = "except"; break;
X    case   28:
X      resword = "try"; break;
X    case   31:
X      resword = "raises"; break;
X    case   33:
X      resword = "volatile"; break;
X    case   35:
X      resword = "while"; break;
X    case   36:
X      resword = "signed"; break;
X    case   38:
X      resword = "__signed"; break;
X    case   40:
X      resword = "raise"; break;
X    case   41:
X      resword = "switch"; break;
X    case   42:
X      resword = "reraise"; break;
X    case   43:
X      resword = "unsigned"; break;
X    case   44:
X      resword = "protected"; break;
X    case   45:
X      resword = "__asm"; break;
X    case   46:
X      resword = "delete";
X      if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X      resword = "double";
X      if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X      return 0;
X    case   47:
X      resword = "default"; break;
X    case   49:
X      resword = "void"; break;
X    case   50:
X      resword = "class"; break;
X    case   51:
X      resword = "static"; break;
X    case   52:
X      resword = "virtual"; break;
X    case   54:
X      resword = "enum"; break;
X    case   56:
X      resword = "public"; break;
X    case   58:
X      resword = "register"; break;
X    case   59:
X      resword = "case"; break;
X    case   60:
X      resword = "const"; break;
X    case   63:
X      resword = "continue"; break;
X    case   64:
X      resword = "long"; break;
X    case   73:
X      resword = "all"; break;
X    case   74:
X      resword = "char"; break;
X    case   78:
X      resword = "int"; break;
X    case   81:
X      resword = "inline"; break;
X    case   82:
X      resword = "dynamic"; break;
X    case   83:
X      resword = "asm"; break;
X    case   85:
X      resword = "catch"; break;
X    case   92:
X      resword = "do"; break;
X    case   93:
X      resword = "operator"; break;
X    case   94:
X      resword = "goto"; break;
X    case   98:
X      resword = "overload"; break;
X    case  104:
X      resword = "auto"; break;
X    case  106:
X      resword = "sizeof"; break;
X    case  108:
X      resword = "__typeof"; break;
X    case  109:
X      resword = "__alignof"; break;
X    case  115:
X      resword = "float"; break;
X    case  116:
X      resword = "typeof"; break;
X    case  117:
X      resword = "typedef"; break;
X    case  120:
X      resword = "union"; break;
X    case  126:
X      resword = "extern"; break;
X    case  128:
X      resword = "for"; break;
X    case  129:
X      resword = "exception"; break;
X    case  133:
X      resword = "new"; break;
X    case  136:
X      resword = "friend"; break;
X    case  141:
X      resword = "return"; break;
X    case  167:
X      resword = "if"; break;
X    default: return 0;
X    }
X  if (*str == *resword && !strcmp (str + 1, resword + 1))
X    return resword;
X  return 0;
X}
X
X/* Tests the generated perfect hash function. */
X
X#include <stdio.h>
X
X#define MAX_LEN 80
X
Xint
Xmain ()
X{
X  char buf[MAX_LEN];
X
X  while (gets (buf))
X    in_word_set (buf, strlen (buf));
X
X  return 0;
X}
END_OF_FILE
if test 5485 -ne `wc -c <'gperf.c'`; then
    echo shar: \"'gperf.c'\" unpacked with wrong size!
fi
# end of 'gperf.c'
fi
echo shar: End of shell archive.
exit 0
--
Facilis descensus Averni:                       +---------------------------
Noctes atque dies patet atri ianua Ditis;       | schmidt@ics.uci.edu (ARPA)
Sed revocare gradum superasque evadere ad auras,| office: (714) 856-4043
Hoc opus, hic labor est.                        +---------------------------