jdc@naucse.UUCP (John Campbell) (11/30/88)
I spent a long time today learning that flex and lex won't deal with
'\0' input.  These nulls are used as padding after line feeds on the
VMS system, or at least they show up in the log files I wanted to
scan.  Can anyone explain why a scanning tool like lex or flex would
be designed to choke on any byte value in the range 0-255?  If I find
input sources that contain such bytes, I'd sure like to be able to
deal with those input streams.

I solved the problem, BTW, by replacing the input routine to lex and
"squeezing" out any '\0's that appear.  This means, however, that I
had to scan the input one extra time before letting the scanner do
its job.

If you have a thought on how to change flex (I don't have a source
license to lex) so that it can handle '\0', I'd love to know.  If
you have a rationale regarding the current behavior, I'd also like
to know.
--
John Campbell               ...!arizona!naucse!jdc
                            CAMPBELL@NAUVAX.bitnet
unix?  Sure send me a dozen, all different colors.
vern@sequoia.ee.lbl.gov (Vern Paxson) (12/01/88)
In article <1047@naucse.UUCP> jdc@naucse.UUCP (John Campbell) writes:
>I spent a long time today learning that flex and lex won't deal
>with '\0' input....
>....
>If you have a thought on how to change flex (I don't have a source
>license to lex) so that it can handle '\0', I'd love to know.  If you
>have a rationale regarding the current behavior I'd also like to know.

Rationale: there are two reasons why flex can't deal with nulls in
its input.  The first is historical: flex was originally a Ratfor
program running under Software Tools, and that combination made nulls
problematic.  The second is performance: for fast scanning you want
to eliminate the check for "are we at the end of the current input
buffer?" from the inner loop.  The way flex does this is to mark the
end of the input buffer with a null; each DFA state then has a
transition on null into an accepting state which reads in the next
buffer's worth of data and restarts the scan.  This method requires
that one character value be preempted to serve as the end-of-buffer
marker.  Null was chosen because it is already burdened with an extra
meaning as the C end-of-string terminator, making it in general
difficult to treat properly.

Fixing it: it could be done, but with a fair amount of work.  The
problem is that the internals of flex are sloppy and assume that 0
can be used for marking unset values.  Finding and eliminating these
assumptions would be tedious, since they aren't apparent unless you
inspect the code line by line.  The scanner skeleton already
distinguishes real nulls from fake end-of-buffer ones, but it does so
only after it has accepted an input pattern.  Continuing the state
machine where it left off requires enough bookkeeping that it's slow
compared with inner-loop scanning, so if you have a lot of nulls in
your input, the performance degradation would probably be comparable
to preprocessing the input to remove the nulls directly.
I hope you didn't lose too much time discovering this deficiency;
it's documented in the flex manual entry.

>I solved the problem, BTW, by replacing the input routine to lex
>and "squeezing" out any '\0's that appear.  This means, however,
>that I had to scan the input one extra time before letting the
>scanner do its job.

Yep, that's pretty much what you have to do.  Sorry.

		Vern

Vern Paxson                          vern@lbl-csam.arpa
Real Time Systems                    ucbvax!lbl-csam.arpa!vern
Lawrence Berkeley Laboratory         (415) 486-6411