[comp.bugs.sys5] Lex bug

sandel@tuvalu.sw.mcc.com (Charles Sandel) (06/23/89)

One of our programmers found a bug in the 'lex' source.
The bug report follows.

    We get various error messages from Andrew's class.  I finally tracked
    down what was the problem: a bug in lex.  class uses lex for its
    lexical input, in particular to recognize the class keywords like
    InitializeObject and FinalizeObject.  InitializeObject was not being
    recognized.  Tracing through the state machine that lex generates and
    then uses to parse the input showed that the highest-numbered state
    in the machine is *exactly* 255.  During construction, the states are
    numbered 0 to 255.  In cmd/lex/sub2.c, line 797, lex checks whether
    it needs a byte (char) or a larger quantity to store the states:

    fprintf(fout,"# define YYTYPE %s\n",stnum+1 > NCH ? "int" : "char");

    where NCH is 256 (number of characters).  Notice that stnum+1 is
    compared rather than stnum (the highest state number).  This is because
    the zero state is used as an error state, and the states 0..255 are
    shifted up by 1 to 1..256.  stnum is 255, so stnum+1 is 256.  Since
    stnum+1 equals NCH (256) rather than exceeding it, a char is used for
    YYTYPE, which holds the state number.  As a result, lex creates tables
    that store state 256 (old state 255+1) in a char.  With an 8-bit char
    this is, of course, truncated to zero, and the lexical token ending in
    that state is not recognized.  I will submit a bug report on this today.
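
A minimal sketch of the off-by-one described above, assuming an 8-bit
char.  The function names are hypothetical and do not appear in the lex
source; only the comparison itself is taken from the quoted line.

```c
#define NCH 256   /* value quoted in the report */

/* Mirrors the comparison quoted from sub2.c and returns the type
   name that would be written out for YYTYPE. */
const char *yytype_as_shipped(int stnum)
{
    return stnum + 1 > NCH ? "int" : "char";
}

/* Models storing a shifted state number in an 8-bit table entry:
   converting to unsigned char reduces the value modulo 256. */
unsigned char store_in_byte(int shifted_state)
{
    return (unsigned char)shifted_state;
}
```

With stnum equal to 255, yytype_as_shipped() still selects "char", and
store_in_byte(256) yields 0, which collides with the error state.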
    
    In fact, this code is badly written in several respects.  First, the
    comparison should not be against NCH.  The value of NCH varies according
    to whether NLS (with an 8-bit character code) is used or not.  However,
    the largest number that can be stored in a char depends on the size of
    the char, not on whether a 7-bit or 8-bit character code is used.
    Further, we want the smallest storage unit that will hold all the
    state values.  Thus the code should be:

    fprintf(fout,"# define YYTYPE %s\n",
               (stnum+1 <= 0xFF) ? "unsigned char"
               :  (stnum+1 <= 0xFFFF) ? "unsigned short"
               :   "unsigned long");
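
To check the boundaries of the proposed selection, the same expression
can be wrapped in a small function.  This is only a sketch: the helper
name is hypothetical and is not part of lex, and the parameter is a long
here just so the larger test values are safe on any platform.

```c
/* Hypothetical helper: the proposed YYTYPE selection as a function. */
const char *yytype_proposed(long stnum)
{
    long top = stnum + 1;   /* largest value stored after the shift by 1 */

    return (top <= 0xFF)   ? "unsigned char"
         : (top <= 0xFFFF) ? "unsigned short"
         :                   "unsigned long";
}
```

For the failing case, yytype_proposed(255) returns "unsigned short", so
state 256 is no longer truncated; yytype_proposed(254) still returns
"unsigned char".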