[comp.lang.c] LEX with all eight bits?

joel@techunix.BITNET (Yossi (Joel) Hoffman) (03/22/90)

Hi folks!  I was trying to use LEX to process a text (yes, text) file
that happens to use all eight bits (the 8th bit signifies Hebrew text).
I just inserted the 8-bit letters in the usual way, but LEX choked on
it.  (It didn't produce any C output at all.)  This couldn't just be
a coincidence; is there any way I can tell LEX that I'm going to use
all 8 bits?
Any help will be much appreciated.

-Joel
(joel@techunix.technion.ac.il -or- joel@techunix.BITNET)



--

martin@mwtech.UUCP (Martin Weitzel) (03/23/90)

In article <9463@discus.technion.ac.il> joel%techunix.bitnet@jade.berkeley.edu (Yossi (Joel) Hoffman) writes:
>Hi folks!  I was trying to use LEX to process a text (yes, text) file
>that happens to use all eight bits (the 8th bit signifies Hebrew text).
>I just inserted the 8-bit letters in the usual way, but LEX choked on
>it.  (It didn't produce any C output at all.)  This couldn't just be
>a coincidence; is there any way I can tell LEX that I'm going to use
>all 8 bits?
>Any help will be much appreciated.

Though there are some efforts to make U*IX '8-bit clean', I have not
yet seen an implementation of 'lex' which supports 8-bit chars.
The major problem is that 'lex' uses the 8th bit for its own
purposes in the compiled representation of the regular expressions
(and it seems that no one at AT&T or the software companies which
port U*IX is willing to dig into the sources of 'lex' ... :-()

SO BE AWARE: Even if 'lex' produces a compilable 'lex.yy.c', the
behaviour may be strange if you feed it input with the 8th bit set!
(This specific problem hit me some time ago and I spent hours
tracking down the root of the behaviour: The pity is that only
a *few* characters trigger the erratic behaviour. So if SOME test
input seems to be processed correctly under SOME circumstances, you
have no guarantee that ALL input will be processed correctly under
ALL circumstances!)

Whether there are workarounds or not depends on your problem:
If you only want to process all chars with the high bit set in
some more or less uniform way, you may roll your own version
of the 'input' macro and translate the 8-bit chars to some
other representation. E.g. you can establish a buffer which
parallels 'yytext' where you store the 'real' input, but let
the macro return some common representation for all characters
that you treat in the same way anyhow. [To the poster: If you
need any further hints, mail me a little more about your problem]
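
The workaround above could be sketched roughly as follows. This is only
an illustration in plain C, not code from any lex distribution: the names
translate_input, reset_real_text, real_text, PLACEHOLDER and REAL_MAX are
made up for the example, and returning 0 at end of input is an assumption
about how the replaced 'input' is expected to behave.

```c
#include <assert.h>
#include <stddef.h>

#define PLACEHOLDER 'X'      /* hypothetical 7-bit stand-in for every 8-bit char */
#define REAL_MAX    1024     /* hypothetical capacity of the parallel buffer */

static unsigned char real_text[REAL_MAX];  /* the bytes exactly as read */
static size_t real_len = 0;

/* Call when a new token starts, so real_text lines up with yytext. */
static void reset_real_text(void)
{
    real_len = 0;
}

/* What a replacement input() could do with the value getc() returned:
   remember the real byte, hand the scanner a 7-bit character. */
static int translate_input(int raw)
{
    if (raw < 0)
        return 0;            /* assumption: end of input is reported as 0 */
    assert(real_len < REAL_MAX);
    real_text[real_len++] = (unsigned char)raw;
    return (raw & 0x80) ? PLACEHOLDER : raw;
}
```

In a lex specification one would presumably redefine the 'input' macro to
call something like translate_input(getc(yyin)), and adjust 'unput' so that
pushed-back characters are also removed from real_text. The rules then
match PLACEHOLDER, and the action looks up the original bytes in real_text.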

As a general rule, avoid characters outside the range 1 .. 127
in your input as well as in the regular expression specification!
(BTW: Who knows how the PD Version FLEX handles this?)
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83

terry@pride386.UUCP (Terry Lyons) (03/23/90)

In article <9463@discus.technion.ac.il>, joel@techunix.BITNET (Yossi (Joel) Hoffman) writes:
>  is there any way I can tell LEX that I'm going to use
> all 8 bits?
> Any help will be much appreciated.
>
 yes

declare all chars as unsigned
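
A small sketch of what this advice is presumably getting at, assuming the
trouble is sign extension: where plain char is signed, a byte with the 8th
bit set becomes negative and is unusable as a table index. The function
names table_index and has_high_bit are made up for the example. Note that
this does not address the internal use of the 8th bit by 'lex' that Martin
Weitzel describes earlier in the thread.

```c
#include <assert.h>

/* Turn a byte held in a plain char into a safe table index.
   The cast gives 0..255 whether or not plain char is signed. */
static int table_index(char c)
{
    return (unsigned char)c;
}

/* Nonzero if the 8th bit is set (Hebrew text in the original poster's file). */
static int has_high_bit(char c)
{
    return (table_index(c) & 0x80) != 0;
}
```

With these, a byte such as 0xE0 indexes entry 224 of a 256-entry table
instead of reaching before the start of the table.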
 

terry


-- 
**************************************************************************
*  UUNET	...!pride386!terry       *  FAX	(714) 739 - 2203         *
*  Pern is a dragons best freind                                         *
**************************************************************************

knighten@pinocchio (Bob Knighten) (03/26/90)

A recent posting on comp.compilers ---

From: vern@cs.cornell.edu (Vern Paxson)
Newsgroups: comp.sources.d,comp.compilers
Subject: flex 2.2 alpha release available
Summary: anonymous ftp to svax.cs.cornell.edu or ftp.ee.lbl.gov
Keywords: flex, lex, scanner
Message-ID: <1990Mar21.153942.3237@esegue.segue.boston.ma.us>
Date: 21 Mar 90 15:39:42 GMT
Reply-To: vern@cs.cornell.edu (Vern Paxson)
Followup-To: comp.sources.d
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 46

Release 2.2 of flex, a lex replacement, is now available.  You can
get it via anonymous ftp to svax.cs.cornell.edu (128.84.254.2, East
coast) or ftp.ee.lbl.gov (128.3.254.68, West coast).  Retrieve
flex-2.2.alpha.tar.Z, using binary mode.

The more interesting changes between 2.2 and the previous 2.1 release are:

    - Full user documentation.

    - Support for 8-bit scanners.

    - Scanners now accept NUL's.

    - A facility has been added for dealing with multiple input buffers.

    - A number of changes to bring flex closer into compliance
      with the latest POSIX lex draft.

    - C++ support; generated scanners can be compiled with C++ compiler.

    - Support for MS-DOS, VMS, and Turbo-C integrated.

This is an alpha release.  There are a number of new features which may
not work quite right and which may have broken previous functionality.
Because of this, I'd like to keep the distribution of this release limited
to folks who don't mind that the software may be buggy and who are
willing to report bugs back to me so I can fix them.

Once the number of new bugs being found drops off sufficiently, a beta
release will be made and posted to the Usenet, probably to
comp.sources.unix.  If the alpha release proves particularly stable, the
beta will be skipped and 2.3 will instead be a full release.  The intent is
that in either case, 2.3 will come out by the end of May.

If you don't have anonymous ftp access, let me know and I'll mail
you the uuencoded tar file.

		Vern

	Vern Paxson			      vern@cs.cornell.edu
	Computer Science Dept.		      decvax!cornell!vern
	Cornell University		      vern@LBL (bitnet)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

rsalz@bbn.com (Rich Salz) (03/26/90)

In article <9463@discus.technion.ac.il>
joel%techunix.bitnet@jade.berkeley.edu (Yossi (Joel) Hoffman) writes:
> is there any way I can tell LEX that I'm going to use
>all 8 bits?


In <691@mwtech.UUCP> martin@mwtech.UUCP (Martin Weitzel) writes:
>(BTW: Who knows how the PD Version FLEX handles this?)

The latest version of FLEX, that just entered beta-test, handles
eight-bit input.  See comp.compilers for the test announcement.

It will appear in comp.sources.unix sometime after the beta-test is done.
	/rich $alz
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.