[net.unix] lex

rcj@burl.UUCP (R. Curtis Jackson) (04/18/84)

Well, last year about this time I found a bug in yacc(1) [actually
in its documentation] that cost me three weeks of valuable time
to track down, since I assumed it was my program which was in error.

This year it is lex(1), same situation only I found this one quicker:

lex reserves a certain amount of space for its input buffer -- I haven't
taken the time to check how much -- that is left as an exercise for the
reader.  If you are trying to match some hellaciously long strings or,
like myself, you screw up and accidentally match a hellaciously long
string, you will get either a memory fault or a bus error in your
program.  This is because lex does not check whether it has filled its
input array and merrily keeps on writing into random areas of memory.
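
Purely as an illustration (this sketch is not part of the original report,
and TOK_STRING is a made-up token code):  a pattern like  \"[^"\n]*\"  swallows
an entire quoted string as a single match, so a long enough string walks right
off the end of that buffer.  One way around it is to match only the opening
quote and throw the body away with input(), so no single match ever grows
without bound:

	%{
	#define TOK_STRING 257	/* hypothetical token code, for illustration only */
	%}
	%%
	\"	{
		int c;
		/* Match just the opening quote, then discard the body with
		 * input(); the token that lex itself buffers never gets long.
		 * The test against 0 guards end of file, which old lex's
		 * input() signals by returning 0. */
		while ((c = input()) != '"' && c != '\n' && c != 0)
			;
		return( TOK_STRING );
	}

A real scanner would want to save those characters somewhere instead of
discarding them, but that is beside the point here.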

I am going to get all the specifics on this problem to submit to the
Unix Hotline (although it has been over a year since I told them about
the bug in the yacc(1) documentation and I haven't seen that corrected).
If anyone is interested in more particulars, drop me a line and I will
send it -- the bug fix should be trivial once you know the problem.

By the way, could some lucky person who has the System V Release 2
Programmer's (not User's) Manuals look at the yacc(1) documentation and
see if anything is mentioned about token numbers over 1000 not being
allowed?  Maybe they did process the documentation MR and I just haven't
seen the results.

Thanks in advance,
-- 

The MAD Programmer -- 919-228-3313 (Cornet 291)
alias: Curtis Jackson	...![ ihnp4 ulysses cbosgd clyde ]!burl!rcj

martin@noscvax.UUCP (Douglas W. Martin) (06/10/85)

    Several weeks ago I requested info explaining lex output
statistics, e.g., the meanings of the "%" parameters,
what sorts of rules caused which tables to increase in size,
etc.  I didn't get many responses, but those I received
were most helpful.
A summary of responses follows.
Thanks much.
Doug Martin
arpanet: martin@nosc
------------------------------------------------------
From: (Vern Paxson [rtsg]) vern@lbl-csam
I can tell you what some of the mysterious %{e,p,n,...}'s in the Lex output are:

    %e is the number of NFA states needed to recognize the input patterns.  It
       is a linear function of the number of characters and operators in
       the patterns.

    %n is the number of DFA states needed for the recognizer.  It is
       potentially an exponential function of the number of NFA states,
       but in practice it tends to be no greater than, and usually less
       than, the %e figure.

    %t is the number of state transitions in the NFA.  It is a function of
       the complexity of your patterns (especially the "*" and "+" operators)
       and above all of the occurrence of character classes inside closures;
       one character class inside a closure is enough to drive it up.  For
       example, "keyword recognizer" inputs like:

	   begin	return( TOK_BEGIN );
	   end		return( TOK_END );
	   procedure	return( TOK_PROCEDURE );
	   .
	   .
	   .
	   [a-z]+	return( TOK_IDENTIFIER );

       will result in a very high %t figure.  It's Lex's job to compress
       all of these transitions into something much smaller, but it doesn't
       do an especially good job.  We have written our own Lex program in
       Ratfor and our experience shows that you can halve the size of the
       tables for complicated sets of rules.

    %o is basically the compressed version of %t.  The total amount of
       memory the tables generated by Lex will take is

	   3*(%n) + 2*(%o) + X  integer words

       where X is roughly 2*(%n).  Most of this stuff could be declared
       as short but isn't.  In addition, there's a constant overhead of
       two arrays of 128 characters each.
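
       As a worked example with made-up figures (not from the original
       message): a scanner reporting %n = 300 and %o = 3000 would need
       roughly

	   3*300 + 2*3000 + 2*300 = 7500  integer words

       for its tables, plus the constant two 128-character arrays.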

I could make guesses as to what %k, %p, and %a are, but all they'd be
is guesses.

If your concern is in staying within Lex's default maximums for these
values, the way to do it is to eliminate ambiguities and similarities between
rules.  The above example of rules that recognize keywords and identifiers
will use a great deal less of Lex's resources (and have much smaller output
tables) if you get rid of either the keyword rules or the identifier rules.
The complexity in the set of rules arises from the ambiguity of each
keyword also matching the identifier rule.
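
As a concrete sketch of that suggestion (not from Vern's message -- the token
codes and the kwlookup() routine are invented for the illustration), you can
drop the keyword rules entirely, let the identifier rule match everything, and
sort the keywords out in plain C:

	%{
	/* hypothetical token codes, for illustration only */
	#define TOK_IDENTIFIER	300
	#define TOK_BEGIN	301
	#define TOK_END		302
	#define TOK_PROCEDURE	303
	%}
	%%
	[a-z]+		return( kwlookup(yytext) );
	%%
	/* keyword table and lookup -- one identifier rule replaces a pile of
	   keyword rules, so lex's own tables stay small */
	static struct kw { char *name; int tok; } kwtab[] = {
		{ "begin",	TOK_BEGIN },
		{ "end",	TOK_END },
		{ "procedure",	TOK_PROCEDURE },
		{ 0,		0 }
	};

	kwlookup(s)		/* return keyword token, or TOK_IDENTIFIER */
	char *s;
	{
		register struct kw *p;

		for (p = kwtab; p->name; p++)
			if (strcmp(p->name, s) == 0)
				return (p->tok);
		return (TOK_IDENTIFIER);
	}

Every keyword then costs a strcmp() or two at run time instead of a pile of
NFA transitions at lex time, which is usually a good trade.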

But if you want to keep your input rules intact, you can increase Lex's
maximum for a given value by putting, e.g.,

    %a 3000

in the declarations section of the input.  In picking how much to raise
the value, I just increase by 500 or 1000 and see if that's enough.  Judging
from the operation of our Lex, I don't think any of these maximums have
a geometric effect on the amount of memory Lex uses.  They may very easily
have a geometric effect on Lex's run-time, though.
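
For example, a declarations section that raises several of the limits at once
might start like this (the particular numbers are arbitrary, not
recommendations):

    %p 5000
    %n 1000
    %e 2000
    %a 4000
    %k 1000
    %o 5000
    %%

Each directive just overrides lex's built-in default for the corresponding
table; the rules follow the %% as usual.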

Hope this helps,

Vern Paxson
Real Time Systems Group
Lawrence Berkeley Laboratory

vern@lbl-csam.ARPA
------------------------------------------------------
From: <sdcsvax!sdcrdcf!hplabs!hao!seismo!maryland.ARPA!aplvax!ded>
I was confronted with this problem about one year ago.  I had to
increase the table parameters (I forget the name of the file, but
it was a .h file) and recompile lex.  I called my new lex "biglex" and
it worked just fine.  

How do you determine the correct size for the table parameters?  I didn't
have the time or inclination to analyze the operation of lex, so I just
used trial and error.  What can I say?  It worked!


-- 

					Don Davis
					JHU/APL
				...decvax!harpo!seismo!umcp-cs!aplvax!ded
				...rlgvax!cvl!umcp-cs!aplvax!ded
--------------------------------------------------