[comp.sources.d] flex/lex and '\0' input

jdc@naucse.UUCP (John Campbell) (11/30/88)

I spent a long time today learning that flex and lex won't deal
with '\0' input.  These nulls are used as padding after line feeds
on the VMS system, or at least they show up in the log files I
wanted to scan.

Can anyone explain why a scanning tool like lex or flex would be
designed to choke on one of the byte values in the range 0-255?  If
I run into input sources that contain such bytes, I'd sure like to
be able to deal with those streams.

I solved the problem, BTW, by replacing the input routine for lex
and "squeezing" out any '\0's that appear.  This means, however,
that I had to scan the input one extra time before letting the
scanner do its job.
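
Something along these lines gives the flavor of that extra pass (a
rough sketch only, not the actual code, which did the squeezing in
the replaced input routine): a tiny filter that copies the stream
and drops every '\0' before the scanner ever sees it.

    #include <stdio.h>

    /* Copy stdin to stdout, squeezing out any '\0' padding bytes,
     * so the cleaned stream can then be handed to the scanner. */
    int main(void)
    {
        int c;

        while ((c = getchar()) != EOF)
            if (c != '\0')
                putchar(c);
        return 0;
    }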

If you have a thought on how to change flex (I don't have a source
license to lex) so that it can handle '\0', I'd love to know.  If you
have a rationale regarding the current behavior I'd also like to know.
-- 
	John Campbell               ...!arizona!naucse!jdc
                                    CAMPBELL@NAUVAX.bitnet
	unix?  Sure send me a dozen, all different colors.

vern@sequoia.ee.lbl.gov (Vern Paxson) (12/01/88)

In article <1047@naucse.UUCP> jdc@naucse.UUCP (John Campbell) writes:
>I spent a long time today learning that flex and lex won't deal
>with '\0' input....
>....
>If you have a thought on how to change flex (I don't have a source
>license to lex) so that it can handle '\0', I'd love to know.  If you
>have a rationale regarding the current behavior I'd also like to know.

Rationale: there are two reasons why flex can't deal with nulls in its
input.  The first is historical: flex was originally a Ratfor program
running under Software Tools, and that combination made nulls
problematic.
The second is performance: for fast scanning you want to eliminate the
check for "are we at the end of the current input buffer" from the inner
loop.  The way flex does this is to mark the end of the input buffer with a
null, and then each DFA state has a transition on null into an accepting
state which reads in the next buffer's worth of data and restarts the
scan.  This method requires that one character value be preempted to serve
as the end-of-buffer marker.  Null was chosen because it's already burdened
with an extra meaning as a C end-of-string, making it in general difficult
to treat properly.
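
The effect of the trick can be sketched in a few lines (an
illustration only, not flex's generated code): because the
end-of-buffer marker is itself one of the byte values the inner loop
already classifies, the loop never needs a separate "have we reached
the end of the buffer?" comparison.

    #include <stdio.h>

    #define SENTINEL '\0'   /* the one reserved byte value */

    /* Toy scanner: count the leading decimal digits in a buffer whose
     * logical end is marked by SENTINEL.  A real scanner would refill
     * the buffer and resume instead of simply stopping. */
    static int count_digits(const char *buf)
    {
        const char *p = buf;

        for (;;) {
            unsigned char c = (unsigned char)*p;
            if (c == SENTINEL)          /* end-of-buffer marker */
                break;
            if (c < '0' || c > '9')     /* no transition on this byte */
                break;
            p++;
        }
        return (int)(p - buf);
    }

    int main(void)
    {
        char buf[] = "1234abc";         /* trailing '\0' is the marker */
        printf("%d leading digits\n", count_digits(buf));
        return 0;
    }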

Fixing it: it could be done but with a fair amount of work.  The problem is
that the internals of flex are sloppy and assume that 0 can be used for
marking unset values.  Finding and eliminating these would be tedious since
they aren't apparent unless you inspect the code line-by-line.  The scanner
skeleton already detects real nulls versus fake end-of-buffer ones, but it
does so after it has already accepted an input pattern.  Continuing the
state machine where it left off requires enough bookkeeping that it's a
slow process compared with inner-loop scanning, so if you have a lot of
nulls in your input, the performance degradation would probably be
comparable to preprocessing the input to directly remove the nulls.
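
In outline, that detection amounts to remembering where the real data
ends and checking where the scan stopped.  The names below are made
up for illustration and are not flex's actual identifiers.

    #include <stdio.h>
    #include <stddef.h>

    struct scan_buffer {
        const char *base;       /* start of buffered data            */
        size_t      len;        /* bytes of real input in the buffer */
    };                          /* base[len] holds the '\0' marker   */

    /* A '\0' the scanner stopped on is genuine input only if it lies
     * before the recorded end of the real data. */
    static int is_real_nul(const struct scan_buffer *b, const char *p)
    {
        return p < b->base + b->len;
    }

    int main(void)
    {
        char data[] = { 'a', 'b', '\0', 'c', '\0' };  /* last is marker */
        struct scan_buffer b = { data, 4 };

        printf("%d %d\n", is_real_nul(&b, &data[2]),  /* 1: real NUL   */
                          is_real_nul(&b, &data[4])); /* 0: the marker */
        return 0;
    }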

I hope you didn't lose too much time discovering this deficiency - it's
documented in the flex manual entry.

>I solved the problem, BTW, by replacing the input routine for lex
>and "squeezing" out any '\0's that appear.  This means, however,
>that I had to scan the input one extra time before letting the
>scanner do its job.

Yep, that's pretty much what you have to do.  Sorry.

		Vern

	Vern Paxson				vern@lbl-csam.arpa
	Real Time Systems			ucbvax!lbl-csam.arpa!vern
	Lawrence Berkeley Laboratory		(415) 486-6411