schmidt@crimee.ics.uci.edu (Doug Schmidt) (03/22/90)
Howdy,
Several weeks ago I posted the result of an empirical comparison
between a FLEX-generated scanner and a GPERF-generated recognizer for
C++ reserved words (actually GNU G++ reserved words, which has 71
total keywords).
Thanks to FLEX author Vern Paxson, who provided a more suitable flex
input specification, I've got some new results. It turns out that
given the appropriate input spec and program option, FLEX generates
extremely fast scanners. The empirical results are presented below,
followed by a shar file containing the FLEX input file, the GPERF
input file, and the GPERF-generated recognizer.
First the relative object-code sizes in bytes for the respective .o
files:
----------------------------------------
Filename Total Text Data Bss Options
gperf.o 2,576 2,576 0 0 -C -D -S1
comp-flex.o 24,320 7,824 56 16,440 -cem
mid-flex.o 41,744 3,040 22,264 16,440 -cfe
flex.o 117,360 2,816 98,104 16,440 -f
----------------------------------------
The input file contained a large number of words, both user-defined
identifiers and keywords, organized with one word per line. The words
were created by running the UNIX command
tr -cs A-Za-z_ '\012'
on the preprocessed source code for ET++. The input file
characteristic were:
Input File Identifiers Keywords Total
ET++.input 624,156 350,466 974,622
All reserved word recognizer programs were tested on an otherwise idle
Sun 4/260 with 32 megabytes of RAM; they were compiled by the GNU G++
1.37.1 compiler with the:
-O -fstrength-reduce -finline-functions -fdelayed-branch
options enabled. Here are the timing results:
% /bin/time flex.exe < ET++.input
12.6 real 9.4 user 2.3 sys
% /bin/time mid-flex.exe < ET++.input
15.8 real 12.7 user 2.5 sys
%/bin/time gperf.exe < ET++.input
19.9 real 17.2 user 1.4 sys
% /bin/time comp-flex.exe < ET++.input
40.0 real 34.8 user 2.4 sys
Tentative conclusion from all this are:
* The uncompacted table-driven flex implementation was both the fastest
and the largest implementation, illustrating the time/space trade-off
dichotomy. Applications where saving time is more important than
conserving space should benefit from this approach.
* The efficiency of FLEX-generated code is highly dependent on choosing
the appropriate set of program options and specifying the input file
correctly. The new alpha-release of FLEX 2.2 includes documentation
that explains how to specify the input description for best results.
* Despite having a load factor of only 0.45 percent the GPERF-generated
perfect hash function recognizer is both space and time efficient.
FLEX recognizers allow programmers to trade off space for
time, perfect hashing does not overly penalize either.
* Hopefully, these results will help dispel the notion that
automatically generated scanners are inherently slower than
hand-coded scanners. Vern has clearly done a great job of speeding up
FLEX-generated code (certainly compared against that produced by the
standard UNIX LEX utility!).
For a discussion of GPERF (and a more complete comparison of GPERF and
FLEX generated scanners against other common reserved word recognizer
implementations) please see my paper on GPERF appearing in the
upcoming USENIX C++ Conference proceedings.
Finally, here's a reprint of Vern's annoucement of release 2.2 of
flex:
Flex, a lex replacement, is now available via anonymous ftp to
svax.cs.cornell.edu (128.84.254.2, East coast) or ftp.ee.lbl.gov
(128.3.254.68, West coast). Retrieve flex-2.2.alpha.tar.Z, using
binary mode.
Doug
----------------------------------------
#! /bin/sh
# This is a shell archive. Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file". To overwrite existing
# files, type "sh file -c". You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g.. If this archive is complete, you
# will see the following message at the end:
# "End of shell archive."
# Contents: c++.l c++.gperf gperf.c
# Wrapped by schmidt@crimee.ics.uci.edu on Wed Mar 21 16:38:44 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'c++.l' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'c++.l'\"
else
echo shar: Extracting \"'c++.l'\" \(916 characters\)
sed "s/^X//" >'c++.l' <<'END_OF_FILE'
X%{
X#ifdef __GNUG__
Xextern "C" int yylex (void);
X#endif
X%}
X%a 8000
X%p 3000
X%o 12000
X%%
X__alignof\n |
X__alignof__\n |
X__asm\n |
X__asm__\n |
X__attribute\n |
X__attribute__\n |
X__const\n |
X__const__\n |
X__inline\n |
X__inline__\n |
X__signed\n |
X__signed__\n |
X__typeof\n |
X__typeof__\n |
X__volatile\n |
X__volatile__\n |
Xall\n |
Xexcept\n |
Xexception\n |
Xraise\n |
Xraises\n |
Xreraise\n |
Xtry\n |
Xasm\n |
Xauto\n |
Xbreak\n |
Xcase\n |
Xcatch\n |
Xchar\n |
Xclass\n |
Xconst\n |
Xcontinue\n |
Xdefault\n |
Xdelete\n |
Xdo\n |
Xdouble\n |
Xdynamic\n |
Xelse\n |
Xenum\n |
Xextern\n |
Xfloat\n |
Xfor\n |
Xfriend\n |
Xgoto\n |
Xif\n |
Xinline\n |
Xint\n |
Xlong\n |
Xnew\n |
Xoperator\n |
Xoverload\n |
Xprivate\n |
Xprotected\n |
Xpublic\n |
Xregister\n |
Xreturn\n |
Xshort\n |
Xsigned\n |
Xsizeof\n |
Xstatic\n |
Xstruct\n |
Xswitch\n |
Xthis\n |
Xtypedef\n |
Xtypeof\n |
Xunion\n |
Xunsigned\n |
Xvirtual\n |
Xvoid\n |
Xvolatile\n |
Xwhile\n ;
X[_A-Z0-9a-z]+\n? ;
X.|\n ;
END_OF_FILE
if test 916 -ne `wc -c <'c++.l'`; then
echo shar: \"'c++.l'\" unpacked with wrong size!
fi
# end of 'c++.l'
fi
if test -f 'c++.gperf' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'c++.gperf'\"
else
echo shar: Extracting \"'c++.gperf'\" \(520 characters\)
sed "s/^X//" >'c++.gperf' <<'END_OF_FILE'
X__alignof
X__alignof__
X__asm
X__asm__
X__attribute
X__attribute__
X__const
X__const__
X__inline
X__inline__
X__signed
X__signed__
X__typeof
X__typeof__
X__volatile
X__volatile__
Xall
Xexcept
Xexception
Xraise
Xraises
Xreraise
Xtry
Xasm
Xauto
Xbreak
Xcase
Xcatch
Xchar
Xclass
Xconst
Xcontinue
Xdefault
Xdelete
Xdo
Xdouble
Xdynamic
Xelse
Xenum
Xextern
Xfloat
Xfor
Xfriend
Xgoto
Xif
Xinline
Xint
Xlong
Xnew
Xoperator
Xoverload
Xprivate
Xprotected
Xpublic
Xregister
Xreturn
Xshort
Xsigned
Xsizeof
Xstatic
Xstruct
Xswitch
Xthis
Xtypedef
Xtypeof
Xunion
Xunsigned
Xvirtual
Xvoid
Xvolatile
Xwhile
END_OF_FILE
if test 520 -ne `wc -c <'c++.gperf'`; then
echo shar: \"'c++.gperf'\" unpacked with wrong size!
fi
# end of 'c++.gperf'
fi
if test -f 'gperf.c' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'gperf.c'\"
else
echo shar: Extracting \"'gperf.c'\" \(5485 characters\)
sed "s/^X//" >'gperf.c' <<'END_OF_FILE'
X/* C code produced by gperf version 2.3 (GNU C++ version) */
X/* Command-line: gperf -C -D -S1 c++.gperf */
X
X#define TOTAL_KEYWORDS 71
X#define MIN_WORD_LENGTH 2
X#define MAX_WORD_LENGTH 13
X#define MIN_HASH_VALUE 7
X#define MAX_HASH_VALUE 167
X/* maximum key range = 161, duplicates = 3 */
X
Xstatic unsigned int
Xhash (str, len)
X register const char *str;
X register int len;
X{
X static const unsigned char asso_values[] =
X {
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168,
X 168, 168, 168, 168, 168, 0, 168, 40, 20, 45,
X 30, 10, 100, 30, 35, 65, 168, 0, 30, 40,
X 110, 60, 5, 168, 25, 0, 10, 5, 15, 20,
X 168, 15, 168, 168, 168, 168, 168, 168,
X };
X return len + asso_values[str[len - 1]] + asso_values[str[0]];
X}
X
Xconst char *
Xin_word_set (str, len)
X register const char *str;
X register int len;
X{
X register int key = hash (str, len);
X const char *resword;
X
X switch (key)
X {
X case 7:
X resword = "__asm__"; break;
X case 9:
X resword = "__const__"; break;
X case 10:
X resword = "__inline__";
X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X resword = "__typeof__";
X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X resword = "__signed__";
X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X return 0;
X case 11:
X resword = "__alignof__"; break;
X case 12:
X resword = "__volatile__"; break;
X case 13:
X resword = "__attribute__"; break;
X case 14:
X resword = "this"; break;
X case 15:
X resword = "short"; break;
X case 16:
X resword = "struct"; break;
X case 17:
X resword = "__const"; break;
X case 18:
X resword = "__inline"; break;
X case 20:
X resword = "__volatile"; break;
X case 21:
X resword = "__attribute"; break;
X case 22:
X resword = "private"; break;
X case 24:
X resword = "else"; break;
X case 25:
X resword = "break"; break;
X case 26:
X resword = "except"; break;
X case 28:
X resword = "try"; break;
X case 31:
X resword = "raises"; break;
X case 33:
X resword = "volatile"; break;
X case 35:
X resword = "while"; break;
X case 36:
X resword = "signed"; break;
X case 38:
X resword = "__signed"; break;
X case 40:
X resword = "raise"; break;
X case 41:
X resword = "switch"; break;
X case 42:
X resword = "reraise"; break;
X case 43:
X resword = "unsigned"; break;
X case 44:
X resword = "protected"; break;
X case 45:
X resword = "__asm"; break;
X case 46:
X resword = "delete";
X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X resword = "double";
X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword;
X return 0;
X case 47:
X resword = "default"; break;
X case 49:
X resword = "void"; break;
X case 50:
X resword = "class"; break;
X case 51:
X resword = "static"; break;
X case 52:
X resword = "virtual"; break;
X case 54:
X resword = "enum"; break;
X case 56:
X resword = "public"; break;
X case 58:
X resword = "register"; break;
X case 59:
X resword = "case"; break;
X case 60:
X resword = "const"; break;
X case 63:
X resword = "continue"; break;
X case 64:
X resword = "long"; break;
X case 73:
X resword = "all"; break;
X case 74:
X resword = "char"; break;
X case 78:
X resword = "int"; break;
X case 81:
X resword = "inline"; break;
X case 82:
X resword = "dynamic"; break;
X case 83:
X resword = "asm"; break;
X case 85:
X resword = "catch"; break;
X case 92:
X resword = "do"; break;
X case 93:
X resword = "operator"; break;
X case 94:
X resword = "goto"; break;
X case 98:
X resword = "overload"; break;
X case 104:
X resword = "auto"; break;
X case 106:
X resword = "sizeof"; break;
X case 108:
X resword = "__typeof"; break;
X case 109:
X resword = "__alignof"; break;
X case 115:
X resword = "float"; break;
X case 116:
X resword = "typeof"; break;
X case 117:
X resword = "typedef"; break;
X case 120:
X resword = "union"; break;
X case 126:
X resword = "extern"; break;
X case 128:
X resword = "for"; break;
X case 129:
X resword = "exception"; break;
X case 133:
X resword = "new"; break;
X case 136:
X resword = "friend"; break;
X case 141:
X resword = "return"; break;
X case 167:
X resword = "if"; break;
X default: return 0;
X }
X if (*str == *resword && !strcmp (str + 1, resword + 1))
X return resword;
X return 0;
X}
X
X/* Tests the generated perfect hash function. */
X
X#include <stdio.h>
X
X#define MAX_LEN 80
X
Xint
Xmain ()
X{
X char buf[MAX_LEN];
X
X while (gets (buf))
X in_word_set (buf, strlen (buf));
X
X return 0;
X}
END_OF_FILE
if test 5485 -ne `wc -c <'gperf.c'`; then
echo shar: \"'gperf.c'\" unpacked with wrong size!
fi
# end of 'gperf.c'
fi
echo shar: End of shell archive.
exit 0
--
Facilis descensus Averni: +---------------------------
Noctes atque dies patet atri ianua Ditis; | schmidt@ics.uci.edu (ARPA)
Sed revocare gradum superasque evadere ad auras,| office: (714) 856-4043
Hoc opus, hic labor est. +---------------------------