[comp.lang.c] Lex man page flame

martin@mwtech.UUCP (Martin Weitzel) (06/02/90)
In article <116@bohra.cpg.oz> ejp@bohra.cpg.oz.au (Esmond Pitt) writes:
>In article <1990May30.174745.1161@csrd.uiuc.edu> pommu@iis.ethz.ch (Claude Pommerell) writes:
>>
[about changing start conditions at entry to yylex()]
>
>There are two even simpler ways.

`Simpler' mostly depends on your view and expectations ...
>
>Instead of effectively changing the initial condition to <Text>, either:
>
>1. Ensure each start-state is equipped with enough rules to handle any
>possible input, and, as the documentation does state, place all the
>unlabelled rules after all the labelled rules, and/or

That doesn't sound simpler to me.

>
>2. Label all the rules you only want applied in the INITIAL state with
><INITIAL>, so they won't be applied as defaults in other states.

This trades off one undocumented feature (stuff *after* the first '%%'
line and *before* the first rule) against another undocumented feature.
But to be fair: if you strictly followed some of the man pages for
lex, nearly every nontrivial lex application would use some
undocumented features.

For the purpose of writing this, I just looked through:

	1.) SVID (1986)
	2.) XPG3 (1989)
	3.) ISC Programmers Reference Manual
	4.) ISC Programmers Guide (1988)

.FLAME ON

There is not a single mention of start conditions in (4) at all
(neither the syntax in rules, nor the special action BEGIN).

Worse, in one example the advice is for a lex program to

	#define BEGIN 1

(believe it or not) as `good programming style' for returning
tokens. This finally reveals that the author can never have heard
anything about start conditions, since BEGIN is the very name lex
uses for the action that switches them. The example lex program
has the line:

	begin	return (BEGIN);

Please ISC, could you send the person who wrote this guide
to a lex+yacc course (BTW: I'm teaching some :-)) before a revised
version is produced. Well, it is mentioned that yacc has
a feature to supply the token defines, but it's bad practice to give
advice that turns out to be not only unnecessary, but dangerous
too. The only advice in this guide that isn't close to worthless
is to look into the paper about Lex written by Mike Lesk. (You would
have been better advised to print that paper in the guide instead
of the section that's in there now.)

Another piece of advice there is to check the reference manual (3).

Be aware: IF YOU TRY TO USE LEX WITH THIS REFERENCE, YOU WILL
	  BE ABSOLUTELY LOST. (Better to save yourself the time:
	  stop work early, go out and have a nice evening
	  - or, again, look for the Lesk paper.)

Ehhm, we were talking about start conditions. The reference manual
is *very* silent about them - in fact it doesn't mention them at
all. On the other hand, the author was quite careful to mention that
the "-r option is not yet fully operational". (What this option
does is tell lex to produce RATFOR source instead of C.
Oh, how many times I needed that and wondered why it just
didn't "fully" work - but good news, not "yet", that is, the
day will come when I can finally switch from C to RATFOR :-).)

But before we beat up on ISC too much: I suppose they took what they
got from somewhere (AT&T?) and only made it a little worse.
Let's look at (1) - and don't say a reference dating from 1986 is
too old today: the stuff we are talking about has been in lex for
*much* longer. SVID does a good job of printing a table which
shows the regular expression syntax for lex rules (it's quite
similar to the "extended regular expressions" of egrep and
awk, but there are some differences). In this table you'll
find the syntax of start conditions, but not the slightest
mention of them or of BEGIN in the rest of the text. So, if
you don't know lex and you read the table, you'll probably think
you must be stupid because you don't understand what

	<s>r	the occurrence of the regular expression
		r only when the program is in start
		condition (state) s

is supposed to tell you. (Again, start conditions or states and the
special action BEGIN are not mentioned anywhere else in the
section.)

Finally to (2), which in other areas seems to be a not-so-bad rework
of the SVID. A quick scan through the lex section reveals that it is
quite similar to (1), but the table with the regular expression
syntax is omitted in favor of a list of differences from extended
regular expressions. (BTW: the difference list is not complete.) The
same sentence concerning the <s>r syntax as in (1) appears, but again
nothing about start conditions, states, and BEGIN in the rest of
the section.

.FLAME OFF

Hello AT&T, anybody listening: If you haven't revised the manuals
recently, please do a complete rewrite of the lex section, but find
an author who has sufficient experience with lex+yacc in
non-trivial applications *and* who can explain things in a way
mortals can understand. (From many publications I know such people
are working at AT&T - I volunteer to do a proof-read.)

>
>Placing non-labelled rules before labelled rules is probably the single
>most common error in writing LEX scripts, even after 15 years.
>
>I don't know why.

After reading the above, you probably know why many novices struggle
with lex. Concerning your problem, here are the

	THREE BIG DISAMBIGUATING RULES

		1) leftmost
		2) longer
		3) first in source

which tell us: Take the input stream and write it down in one line
from left to right. Then, in case two rules might match some part of
the input stream, lex chooses the rule whose match starts further to
the left, then the rule that matches the longer string, and then the
rule whose regular expression appears first in the (lex) source, with
(1) taking priority over (2) and (2) taking priority over (3).

This is well chosen, because it enables us to do the following:

	"if"	{ ... action for keyword if ... }
	[a-z]+	{ ... action for identifier ... }

which triggers the second rule for " fif " (because of 1), as well
as for " iff " (because of 2), but the first for " if " (because
of 3). Start conditions are no exception to this rule!
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83