schmidt@crimee.ics.uci.edu (Doug Schmidt) (03/22/90)
Howdy, Several weeks ago I posted the result of an empirical comparison between a FLEX-generated scanner and a GPERF-generated recognizer for C++ reserved words (actually GNU G++ reserved words, which has 71 total keywords). Thanks to FLEX author Vern Paxson, who provided a more suitable flex input specification, I've got some new results. It turns out that given the appropriate input spec and program option, FLEX generates extremely fast scanners. The empirical results are presented below, followed by a shar file containing the FLEX input file, the GPERF input file, and the GPERF-generated recognizer. First the relative object-code sizes in bytes for the respective .o files: ---------------------------------------- Filename Total Text Data Bss Options gperf.o 2,576 2,576 0 0 -C -D -S1 comp-flex.o 24,320 7,824 56 16,440 -cem mid-flex.o 41,744 3,040 22,264 16,440 -cfe flex.o 117,360 2,816 98,104 16,440 -f ---------------------------------------- The input file contained a large number of words, both user-defined identifiers and keywords, organized with one word per line. The words were created by running the UNIX command tr -cs A-Za-z_ '\012' on the preprocessed source code for ET++. The input file characteristic were: Input File Identifiers Keywords Total ET++.input 624,156 350,466 974,622 All reserved word recognizer programs were tested on an otherwise idle Sun 4/260 with 32 megabytes of RAM; they were compiled by the GNU G++ 1.37.1 compiler with the: -O -fstrength-reduce -finline-functions -fdelayed-branch options enabled. Here are the timing results: % /bin/time flex.exe < ET++.input 12.6 real 9.4 user 2.3 sys % /bin/time mid-flex.exe < ET++.input 15.8 real 12.7 user 2.5 sys %/bin/time gperf.exe < ET++.input 19.9 real 17.2 user 1.4 sys % /bin/time comp-flex.exe < ET++.input 40.0 real 34.8 user 2.4 sys Tentative conclusion from all this are: * The uncompacted table-driven flex implementation was both the fastest and the largest implementation, illustrating the time/space trade-off dichotomy. Applications where saving time is more important than conserving space should benefit from this approach. * The efficiency of FLEX-generated code is highly dependent on choosing the appropriate set of program options and specifying the input file correctly. The new alpha-release of FLEX 2.2 includes documentation that explains how to specify the input description for best results. * Despite having a load factor of only 0.45 percent the GPERF-generated perfect hash function recognizer is both space and time efficient. FLEX recognizers allow programmers to trade off space for time, perfect hashing does not overly penalize either. * Hopefully, these results will help dispel the notion that automatically generated scanners are inherently slower than hand-coded scanners. Vern has clearly done a great job of speeding up FLEX-generated code (certainly compared against that produced by the standard UNIX LEX utility!). For a discussion of GPERF (and a more complete comparison of GPERF and FLEX generated scanners against other common reserved word recognizer implementations) please see my paper on GPERF appearing in the upcoming USENIX C++ Conference proceedings. Finally, here's a reprint of Vern's annoucement of release 2.2 of flex: Flex, a lex replacement, is now available via anonymous ftp to svax.cs.cornell.edu (128.84.254.2, East coast) or ftp.ee.lbl.gov (128.3.254.68, West coast). Retrieve flex-2.2.alpha.tar.Z, using binary mode. Doug ---------------------------------------- #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh <file", e.g.. If this archive is complete, you # will see the following message at the end: # "End of shell archive." # Contents: c++.l c++.gperf gperf.c # Wrapped by schmidt@crimee.ics.uci.edu on Wed Mar 21 16:38:44 1990 PATH=/bin:/usr/bin:/usr/ucb ; export PATH if test -f 'c++.l' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'c++.l'\" else echo shar: Extracting \"'c++.l'\" \(916 characters\) sed "s/^X//" >'c++.l' <<'END_OF_FILE' X%{ X#ifdef __GNUG__ Xextern "C" int yylex (void); X#endif X%} X%a 8000 X%p 3000 X%o 12000 X%% X__alignof\n | X__alignof__\n | X__asm\n | X__asm__\n | X__attribute\n | X__attribute__\n | X__const\n | X__const__\n | X__inline\n | X__inline__\n | X__signed\n | X__signed__\n | X__typeof\n | X__typeof__\n | X__volatile\n | X__volatile__\n | Xall\n | Xexcept\n | Xexception\n | Xraise\n | Xraises\n | Xreraise\n | Xtry\n | Xasm\n | Xauto\n | Xbreak\n | Xcase\n | Xcatch\n | Xchar\n | Xclass\n | Xconst\n | Xcontinue\n | Xdefault\n | Xdelete\n | Xdo\n | Xdouble\n | Xdynamic\n | Xelse\n | Xenum\n | Xextern\n | Xfloat\n | Xfor\n | Xfriend\n | Xgoto\n | Xif\n | Xinline\n | Xint\n | Xlong\n | Xnew\n | Xoperator\n | Xoverload\n | Xprivate\n | Xprotected\n | Xpublic\n | Xregister\n | Xreturn\n | Xshort\n | Xsigned\n | Xsizeof\n | Xstatic\n | Xstruct\n | Xswitch\n | Xthis\n | Xtypedef\n | Xtypeof\n | Xunion\n | Xunsigned\n | Xvirtual\n | Xvoid\n | Xvolatile\n | Xwhile\n ; X[_A-Z0-9a-z]+\n? ; X.|\n ; END_OF_FILE if test 916 -ne `wc -c <'c++.l'`; then echo shar: \"'c++.l'\" unpacked with wrong size! fi # end of 'c++.l' fi if test -f 'c++.gperf' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'c++.gperf'\" else echo shar: Extracting \"'c++.gperf'\" \(520 characters\) sed "s/^X//" >'c++.gperf' <<'END_OF_FILE' X__alignof X__alignof__ X__asm X__asm__ X__attribute X__attribute__ X__const X__const__ X__inline X__inline__ X__signed X__signed__ X__typeof X__typeof__ X__volatile X__volatile__ Xall Xexcept Xexception Xraise Xraises Xreraise Xtry Xasm Xauto Xbreak Xcase Xcatch Xchar Xclass Xconst Xcontinue Xdefault Xdelete Xdo Xdouble Xdynamic Xelse Xenum Xextern Xfloat Xfor Xfriend Xgoto Xif Xinline Xint Xlong Xnew Xoperator Xoverload Xprivate Xprotected Xpublic Xregister Xreturn Xshort Xsigned Xsizeof Xstatic Xstruct Xswitch Xthis Xtypedef Xtypeof Xunion Xunsigned Xvirtual Xvoid Xvolatile Xwhile END_OF_FILE if test 520 -ne `wc -c <'c++.gperf'`; then echo shar: \"'c++.gperf'\" unpacked with wrong size! fi # end of 'c++.gperf' fi if test -f 'gperf.c' -a "${1}" != "-c" ; then echo shar: Will not clobber existing file \"'gperf.c'\" else echo shar: Extracting \"'gperf.c'\" \(5485 characters\) sed "s/^X//" >'gperf.c' <<'END_OF_FILE' X/* C code produced by gperf version 2.3 (GNU C++ version) */ X/* Command-line: gperf -C -D -S1 c++.gperf */ X X#define TOTAL_KEYWORDS 71 X#define MIN_WORD_LENGTH 2 X#define MAX_WORD_LENGTH 13 X#define MIN_HASH_VALUE 7 X#define MAX_HASH_VALUE 167 X/* maximum key range = 161, duplicates = 3 */ X Xstatic unsigned int Xhash (str, len) X register const char *str; X register int len; X{ X static const unsigned char asso_values[] = X { X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 168, 168, 168, 168, 168, X 168, 168, 168, 168, 168, 0, 168, 40, 20, 45, X 30, 10, 100, 30, 35, 65, 168, 0, 30, 40, X 110, 60, 5, 168, 25, 0, 10, 5, 15, 20, X 168, 15, 168, 168, 168, 168, 168, 168, X }; X return len + asso_values[str[len - 1]] + asso_values[str[0]]; X} X Xconst char * Xin_word_set (str, len) X register const char *str; X register int len; X{ X register int key = hash (str, len); X const char *resword; X X switch (key) X { X case 7: X resword = "__asm__"; break; X case 9: X resword = "__const__"; break; X case 10: X resword = "__inline__"; X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword; X resword = "__typeof__"; X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword; X resword = "__signed__"; X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword; X return 0; X case 11: X resword = "__alignof__"; break; X case 12: X resword = "__volatile__"; break; X case 13: X resword = "__attribute__"; break; X case 14: X resword = "this"; break; X case 15: X resword = "short"; break; X case 16: X resword = "struct"; break; X case 17: X resword = "__const"; break; X case 18: X resword = "__inline"; break; X case 20: X resword = "__volatile"; break; X case 21: X resword = "__attribute"; break; X case 22: X resword = "private"; break; X case 24: X resword = "else"; break; X case 25: X resword = "break"; break; X case 26: X resword = "except"; break; X case 28: X resword = "try"; break; X case 31: X resword = "raises"; break; X case 33: X resword = "volatile"; break; X case 35: X resword = "while"; break; X case 36: X resword = "signed"; break; X case 38: X resword = "__signed"; break; X case 40: X resword = "raise"; break; X case 41: X resword = "switch"; break; X case 42: X resword = "reraise"; break; X case 43: X resword = "unsigned"; break; X case 44: X resword = "protected"; break; X case 45: X resword = "__asm"; break; X case 46: X resword = "delete"; X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword; X resword = "double"; X if (*str == *resword && !strcmp (str + 1, resword + 1)) return resword; X return 0; X case 47: X resword = "default"; break; X case 49: X resword = "void"; break; X case 50: X resword = "class"; break; X case 51: X resword = "static"; break; X case 52: X resword = "virtual"; break; X case 54: X resword = "enum"; break; X case 56: X resword = "public"; break; X case 58: X resword = "register"; break; X case 59: X resword = "case"; break; X case 60: X resword = "const"; break; X case 63: X resword = "continue"; break; X case 64: X resword = "long"; break; X case 73: X resword = "all"; break; X case 74: X resword = "char"; break; X case 78: X resword = "int"; break; X case 81: X resword = "inline"; break; X case 82: X resword = "dynamic"; break; X case 83: X resword = "asm"; break; X case 85: X resword = "catch"; break; X case 92: X resword = "do"; break; X case 93: X resword = "operator"; break; X case 94: X resword = "goto"; break; X case 98: X resword = "overload"; break; X case 104: X resword = "auto"; break; X case 106: X resword = "sizeof"; break; X case 108: X resword = "__typeof"; break; X case 109: X resword = "__alignof"; break; X case 115: X resword = "float"; break; X case 116: X resword = "typeof"; break; X case 117: X resword = "typedef"; break; X case 120: X resword = "union"; break; X case 126: X resword = "extern"; break; X case 128: X resword = "for"; break; X case 129: X resword = "exception"; break; X case 133: X resword = "new"; break; X case 136: X resword = "friend"; break; X case 141: X resword = "return"; break; X case 167: X resword = "if"; break; X default: return 0; X } X if (*str == *resword && !strcmp (str + 1, resword + 1)) X return resword; X return 0; X} X X/* Tests the generated perfect hash function. */ X X#include <stdio.h> X X#define MAX_LEN 80 X Xint Xmain () X{ X char buf[MAX_LEN]; X X while (gets (buf)) X in_word_set (buf, strlen (buf)); X X return 0; X} END_OF_FILE if test 5485 -ne `wc -c <'gperf.c'`; then echo shar: \"'gperf.c'\" unpacked with wrong size! fi # end of 'gperf.c' fi echo shar: End of shell archive. exit 0 -- Facilis descensus Averni: +--------------------------- Noctes atque dies patet atri ianua Ditis; | schmidt@ics.uci.edu (ARPA) Sed revocare gradum superasque evadere ad auras,| office: (714) 856-4043 Hoc opus, hic labor est. +---------------------------