[comp.lang.c] Using Lex

john@basho.uucp (John Lacey) (08/10/90)

Normally, of course, one wants a scanner (and a parser) to work from 
a file, perhaps stdin.  Sigh.  Well, I want one that works from a string.

I am using Flex 2.3, and Bison 1.11.  I tried the following few #define's:

#undef  YY_INPUT
#define YY_INPUT(buf,result,max_size) \
{ \
for ( result = 0;  *ch_this && result < max_size; result ++ ) \
   buf[result] = *ch_this++; \
}

#define YY_USER_INIT \
   if ( scan_init ) { \
      if ( yy_flex_debug ) \
         printf ( "-- initializing for scan %d\n", scan_init ); \
      ch_this = inbuffer; \
      scan_init = 0; }

with the following couple of definitions and declarations in the scanner:

static char * ch_this;
extern char * inbuffer;
extern int scan_init;

and with inbuffer and scan_init defined in the code that calls yyparse().
This didn't work.  Well, actually, it works the first time yyparse() is 
called, but not again.  Now, YY_USER_INIT is used inside an if statement
that checks yy_init, so I moved it out of there in the scanner skeleton
so that YY_USER_INIT is seen every time the scanner is called.  Still 
no go.
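
For reference, the refill logic that the YY_INPUT macro above performs can be exercised outside flex. What follows is a minimal sketch in plain C, not flex's own code: `string_source`, `source_reset` and `source_fill` are hypothetical names introduced for illustration. A return value of 0 plays the role of end of input, which is what the macro reports once the string is exhausted. The sketch says nothing about the state flex itself keeps after it has seen end of input.

```c
#include <stddef.h>

/* Hypothetical string source standing in for ch_this/inbuffer above. */
struct string_source {
    const char *start;  /* beginning of the text to scan */
    const char *next;   /* next unread character */
};

/* Rewind to a (possibly new) string before each scan. */
static void source_reset(struct string_source *src, const char *text)
{
    src->start = text;
    src->next = text;
}

/* Copy up to max_size characters into buf and return how many were copied.
   A return value of 0 means the string is exhausted (end of input). */
static size_t source_fill(struct string_source *src, char *buf, size_t max_size)
{
    size_t n = 0;

    while (*src->next != '\0' && n < max_size)
        buf[n++] = *src->next++;
    return n;
}
```

Calling source_reset() before every parse corresponds to what the YY_USER_INIT block above is trying to do with scan_init.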

Has anyone done this, or see a way to do it, or know a way to do it, or ....

Thanks.

-- 
John Lacey, 
   E-mail:  ...!osu-cis!n8emr!uncle!basho!john  (coming soon: john@basho.uucp)
   V-mail:  (614) 436--3773, or 487--8570
"What was the name of the dog on Rin-tin-tin?"  --Mickey Rivers, ex-Yankee CF

ptb@ittc.wec.com (Pat Broderick) (08/10/90)

In article <1990Aug10.012927.5558@basho.uucp>, john@basho.uucp (John Lacey) writes:
> Normally, of course, one wants a scanner (and a parser) to work from 
> a file, perhaps stdin.  Sigh.  Well, I want one that works from a string.
> ...

Recently I had occasion to do something similar.  What we did was
roughly as follows:

   - strings to be parsed are maintained in memory 
   - to parse a string a global pointer known to lex is set to point at
     the beginning of the string
   - the input() macro was redefined in terms of this pointer (the
     standard definition uses getc(yyin))

The things needed might look something like:

LEX:

/* standard defn from lex */
# define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):getc(yyin))==10?(yylineno++,yytchar):yytchar)==EOF?0:yytchar)

/* modified defn to use string: getc(yyin) is replaced by (*yynyy++) */
# define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):(*yynyy++))==10?(yylineno++,yytchar):yytchar)==EOF?0:yytchar)

extern char *yynyy;			/* will pt to start of string */


Function invoking parser:

char *yynyy;			/* globally visible */

....

    yynyy = start_of_string;
    yyparse();
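
The one-line macro is hard to read, so here is a rough function-style sketch of the same idea: take pushed-back characters first, otherwise read from the string pointer, and count newlines. It is an illustration only, not the code lex generates. `scan_string`, `next_char`, `push_char`, `src_ptr`, `push_top`, `line_no` and PUSHBACK_MAX are hypothetical names, and unlike the macro this version deliberately stays on the terminating '\0' instead of stepping past it.

```c
#define PUSHBACK_MAX 200            /* hypothetical size for the pushback stack */

static const char *src_ptr = "";    /* plays the role of yynyy */
static char pushback[PUSHBACK_MAX]; /* plays the role of yysbuf */
static char *push_top = pushback;   /* plays the role of yysptr */
static int line_no = 1;             /* plays the role of yylineno */

/* Point the reader at a new string and forget any pushed-back characters. */
static void scan_string(const char *text)
{
    src_ptr = text;
    push_top = pushback;
    line_no = 1;
}

/* Return the next character, preferring pushed-back ones; 0 means end. */
static int next_char(void)
{
    int c;

    if (push_top > pushback)
        c = (unsigned char)*--push_top;
    else if (*src_ptr != '\0')
        c = (unsigned char)*src_ptr++;
    else
        return 0;                   /* stay on the terminator */
    if (c == '\n')
        line_no++;
    return c;
}

/* Push one character back; the next next_char() call returns it. */
static void push_char(int c)
{
    if (push_top < pushback + PUSHBACK_MAX) {
        if (c == '\n')
            line_no--;
        *push_top++ = (char)c;
    }
}
```

A caller would do scan_string(start_of_string) where the post sets yynyy, then run the parser.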


This works fine for us, hope it helps.
-- 
Patrick T. Broderick           |ptb@ittc.wec.com |
                               |uunet!ittc!ptb   |
                               |(412)733-6265    |

bogatko@lzga.ATT.COM (George Bogatko) (08/10/90)

In article <1990Aug10.012927.5558@basho.uucp>, john@basho.uucp (John Lacey) writes:
> Has anyone done this, or see a way to do it, or know a way to do it, or ....

Put these lines in your lex file after the #include lines

%{
#include <stdio.h>
#include <y.tab.h>

extern char *mis_ptr;

#undef input
#undef unput
# define input() (*mis_ptr=='\n'?0:*mis_ptr++)
# define unput(c) (*--mis_ptr=(c) )
%}


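Now have a char buffer called myinputstring

char myinputstring[100];

do the following in main:

char *mis_ptr;
main()
{
	for(;;)
	{
		/* fgets keeps the trailing '\n', which input() treats as end of input */
		if (fgets(myinputstring, sizeof myinputstring, stdin) == NULL)
			break;
		mis_ptr = myinputstring;
		yylex();
	}
}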

I think you get the picture now?


GB

jal@valha1.ATT.COM (Joseph A. Leggio) (08/12/90)

From article <1990Aug10.012927.5558@basho.uucp>, by john@basho.uucp (John Lacey):
> Normally, of course, one wants a scanner (and a parser) to work from 
> a file, perhaps stdin.  Sigh.  Well, I want one that works from a string.
> 

> Has anyone done this, or see a way to do it, or know a way to do it, or ....

I have used these "input" and "unput" routines in
many programs where I wanted complete control of the
input stream.  The example here uses fgets to fill
a character array from stdin, but you could fill it
from any source you wish.  You only need to point the pointer "p"
at the start of the array each time you read a new line.

Only restriction: unput cannot back up past the start of a line.
(I have not found this to be a problem as I do not usually try
to match patterns which span multiple lines.)

I use System V Release 3 AT&T lex.  "flex" might work the same way; look
for the #defines for "input" and "unput" in your code.
==================================================
%%
	Lex reg-expr's go here
%%
#define BUFFER_SIZE 1024

char *p;
char buf[BUFFER_SIZE];

main(){
    p = buf;        /* point "p" to start of buf for first line     */
    while( fgets(buf, sizeof(buf), stdin) != NULL ) { /* read line  */
        yylex();                                      /* parse line */
        p = buf;    /* point "p" back to start of buf for next line */
    }
    exit(0);
}

#ifdef input
#undef input
#endif
#ifdef unput
#undef unput
#endif

/* replacement "input" routine for lex, uses char array "buf"  */
char input()
{
    if ( p < buf + ( BUFFER_SIZE - 1 ) )
            return(*p++);
    else
            return((char)0);
}

/* replacement "unput" routine for lex, uses char array "buf"  */
unput(c)
char c;
{
    if ( p > buf )
        *(--p) = c;     
}

=============================================================
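
As a variation on the routines above, here is a minimal sketch that stops at the end of the text in the buffer rather than at the end of the array, and that refuses to back up past the start of the line. It assumes the line is a NUL-terminated string. `load_line`, `line_input` and `line_unput` are hypothetical names chosen so they do not clash with lex's own input and unput, and the line is copied from a string here only to keep the example free of file I/O.

```c
#include <string.h>

#define BUFFER_SIZE 1024

static char line_buf[BUFFER_SIZE];
static char *cursor = line_buf;

/* Copy one line into the buffer and point the cursor at its start. */
static void load_line(const char *text)
{
    strncpy(line_buf, text, BUFFER_SIZE - 1);
    line_buf[BUFFER_SIZE - 1] = '\0';
    cursor = line_buf;
}

/* Next character of the current line, or 0 once the line is used up.
   The cursor stays on the terminator, so repeated calls keep returning 0. */
static int line_input(void)
{
    if (*cursor == '\0')
        return 0;
    return (unsigned char)*cursor++;
}

/* Back up one character; silently ignored at the start of the line. */
static void line_unput(int c)
{
    if (cursor > line_buf)
        *--cursor = (char)c;
}
```

With fgets as the source, load_line would be replaced by the fgets call followed by resetting the cursor, as in main() above.
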
Joe Leggio WB2HOL
AT&T Customer Software Services
Valhalla, NY
att!valha1!jal

chris@mimsy.umd.edu (Chris Torek) (08/13/90)

(This topic probably belongs elsewhere; perhaps comp.lang.misc or
comp.unix.questions....  Ah well.)

In article <174@ittc.wec.com> ptb@ittc.wec.com (Pat Broderick) writes:
>The things needed might look something like:
># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):getc(yyin))==10?
>  (yylineno++,yytchar):yytchar)==EOF?0:yytchar)   /* standard defn from lex */
>
># define input() (((yytchar=yysptr>yysbuf?U(*--yysptr):(*yynyy++))==10?
>  (yylineno++,yytchar):yytchar)==EOF?0:yytchar)   /* modified defn to use string */

>This works fine for us, hope it helps.

This should work, but is overkill.  (It also does not address a question
whose answer I myself am unsure about.)

Here is what is going on:

Lex (or Flex or other similar lexer of your choice) implements a DFSA
(Deterministic Finite State Automaton) that is simply an optimized (in
space if not time) variant on

	int table[NSTATES][128] = { huge amounts of junk };

	yylex() {
		int c;		/* the character just read */
		int state = 0;	/* ignoring BEGINs that is */
		yyleng = 0;
		for (;;) {
			c = input();		/* eat the next char */
			yytext[yyleng++] = c;	/* store it in yytext */
			state = table[state][c];/* find the next state */
			switch (state) {	/* and see what to do */
			... cases that exactly match something ...
				do actions from C code;
			... cases indicating we ate too much ...
				unput(some of the things we ate);
				do actions from C code;
			... cases indicating `no match' ...
				output(the things we ate);
				break;
			}
		}
	}
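
To make the outline concrete, here is a small self-contained sketch of a table-driven scanner in the same spirit. It is a toy and not what lex emits: it knows only numbers and identifiers, indexes its table by character class instead of by raw character, and reads from a string pointer. All names (`next_state`, `char_class`, `scan_token`, the TOK_ and S_ constants) are hypothetical. The "ate too much" case is handled by backing the pointer up one position, which stands in for unput().

```c
#include <ctype.h>
#include <stddef.h>

enum { CLASS_DIGIT, CLASS_LETTER, CLASS_OTHER, NCLASSES };
enum { S_START, S_NUM, S_ID, NSTATES, S_DEAD = -1 };
enum { TOK_EOF, TOK_NUM, TOK_ID, TOK_OTHER };

/* next_state[state][class]; S_DEAD means "no transition". */
static const int next_state[NSTATES][NCLASSES] = {
    /* S_START */ { S_NUM, S_ID,   S_DEAD },
    /* S_NUM   */ { S_NUM, S_DEAD, S_DEAD },
    /* S_ID    */ { S_ID,  S_ID,   S_DEAD },
};

static int char_class(int c)
{
    if (isdigit(c)) return CLASS_DIGIT;
    if (isalpha(c) || c == '_') return CLASS_LETTER;
    return CLASS_OTHER;
}

/* Scan one token from *input into text (cap must be at least 2). */
static int scan_token(const char **input, char *text, size_t cap)
{
    int state = S_START;
    size_t len = 0;

    for (;;) {
        int c = (unsigned char)*(*input)++;        /* eat the next char */
        int next = c ? next_state[state][char_class(c)] : S_DEAD;
        if (next == S_DEAD) {
            (*input)--;                            /* ate too much: put it back */
            break;
        }
        if (len + 1 < cap)
            text[len++] = (char)c;                 /* store it in text */
        state = next;
    }
    text[len] = '\0';
    if (state == S_NUM) return TOK_NUM;
    if (state == S_ID) return TOK_ID;
    if (**input == '\0') return TOK_EOF;
    text[0] = *(*input)++;                         /* no match: pass one char through */
    text[1] = '\0';
    return TOK_OTHER;
}
```

Note that the scanner only ever backs up over a character it has just read, which is the property the simplified unput() relies on.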

One noteworthy thing about this is that lex can never unput() something
it has not input() `from the same place' (unless you put your own unput()
actions into your lexer: a dangerous practice).  Thus, if you are reading
from a string in a buffer, your `unput' action can be much simpler, and
likewise your input() macro can be simplified:

#define	input() ((yytchar = *mystring++) == '\n' ? (yylineno++, yytchar) : \
		 yytchar)
#define	unput(c) (mystring--)

These two also take advantage of the fact that Lex wants `EOF' to be
the value 0, rather than the (implementation-defined but usually) -1
that stdio returns.  The end of a C string is the character '\0' which
has the value 0.

The question I have is whether lex might call input() again after reading
EOF once.  Since the end of a real file tends to remain the end of the
file no matter how many times it is read per second, it seems possible
that the implementation might invoke input() again after input() returns
0 but without an intervening unput(0)---i.e., it may depend on EOF being
`sticky'.  In this case the input macro must be more careful:

#define input() ((yytchar = *mystr++) == 0 ? (mystr--, 0) : \
	(yytchar == '\n' ? (yylineno++, '\n') : yytchar))

or as a GCC inline function:

	static inline int input(void) {
		int c = *mystr++;
		if (c == 0) mystr--;
		else if (c == '\n') yylineno++;
		return c;
	}

This requires a corresponding change to unput(), however, since now
unput(0) should do nothing:

#define unput(c) ((c) ? mystr-- : 0)
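
The difference between the simple and the careful forms can be seen in a small stand-alone sketch. The function names `simple_input`, `sticky_input` and `sticky_unput` are hypothetical, the read position is passed explicitly instead of living in a global, and line counting is left out to keep the point visible.

```c
/* Simple form: always advances, even past the terminating '\0'. */
static int simple_input(const char **pos)
{
    return (unsigned char)*(*pos)++;
}

/* Careful form: backs up after reading the terminator, so the end
   marker 0 is returned again on every later call. */
static int sticky_input(const char **pos)
{
    int c = (unsigned char)*(*pos)++;

    if (c == 0)
        (*pos)--;
    return c;
}

/* Matching unput: pushing back the end marker does nothing, because
   sticky_input never moved past it in the first place. */
static void sticky_unput(const char **pos, int c)
{
    if (c != 0)
        (*pos)--;
}
```

Given an array such as "ab\0cd", a fourth call to simple_input() walks on into the 'c' after the terminator, while sticky_input() keeps returning 0.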

All of these eliminate the need for yysbuf (an array of size YYLMAX that
holds unput() characters since ungetc() only guarantees one character of
pushback).

Lex could be considerably more efficient by avoiding all this copying of
text from one place to another; I believe flex does this.  This usually
means bypassing stdio, of course....
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

chris@mimsy.umd.edu (Chris Torek) (08/14/90)

In article <25996@mimsy.umd.edu> I suggested:
>#define unput(c) (mystring--)

(and then eventually)

>#define unput(c) ((c) ? mystr-- : 0)

The first of these will fail because lex uses unput as

	unput(*--yylastch)

and

	unput(*yylastch--)

Thanks to Brad White for noticing this error.  The first one should
be `#define unput(c) ((c), mystring--)'.
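
The failure is easy to reproduce in isolation. The sketch below is a hypothetical test harness, not lex code: `unput_broken`, `unput_fixed` and the two helper functions are made-up names, and the (void) cast in the fixed macro is an addition to keep compilers from warning about an unused value. Each helper imitates the call unput(*--yylastch) and reports how far yylastch actually moved.

```c
static char text[] = "abc";
static char *mystring;          /* read position, as in the macros above */

#define unput_broken(c) (mystring--)
#define unput_fixed(c)  ((void)(c), mystring--)

/* With the broken macro the argument text is discarded by the
   preprocessor, so the --yylastch inside it never happens. */
static int yylastch_moves_broken(void)
{
    char *yylastch = text + 3;

    mystring = text + 3;
    unput_broken(*--yylastch);
    return (int)((text + 3) - yylastch);
}

/* With the fixed macro the argument is evaluated once for its side
   effect, and then mystring backs up. */
static int yylastch_moves_fixed(void)
{
    char *yylastch = text + 3;

    mystring = text + 3;
    unput_fixed(*--yylastch);
    return (int)((text + 3) - yylastch);
}
```

In both cases mystring backs up by one; only the fixed macro also lets the caller's pointer move.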

(This is what I get for making changes to lex things based on theoretical
arguments without checking to see whether lex uses good programming
practices :-) .)