[mod.compilers] YACC for dynamic grammars

johnl@ima.UUCP (01/26/87)

Is it possible to use YACC to write a parser for a simple language
whose gramatic may dynamically change, namely defining new operators?
In this interpreter, new operators (unary or binary) can be added
by the user dynamically, e.g. define the binary operator ++
with a certain precedence and then the system can recognize a ++ b.
I have no experiences with YACC, I see it in the following way:
YACC generates some tables which correspond to the grammar.
If a new operator is added to the grammar, it should be possible
to dynamically update the tables in order to recognize the new item.
If this is not possible, is there any other way?

--Micha

ARPA:		micha%ecrcvax.UUCP@seismo.CSS.gov
		micha%ecrcvax.UUCP%Germany.CSNET@CSNET-RELAY
CSNET:		micha%ecrcvax.UUCP@Germany.CSNET
UUCP:		unido!ecrcvax!micha
[I doubt it -- the calculation that produces the parsing tables from the input
grammar is pretty complicated.  If you only allow people to define new
operators that have the same precedence as existing ones, your job is pretty
easy since you haven't really changed the syntax, just added new flavors of
existing tokens from the point of view of the parser.  If you get more general
than that, you can add ambiguties and you probably need the full extremely slow
glory of Earley's algorithm, which is a long way from the efficient LALR
approach of yacc.  -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (01/27/87)

It looks hard.  One problem is that the tables that Yacc produces are
the result of global analysis of the grammar--single entries in the
tables aren't direct reflections of single grammar rules.  But
dynamically adding operators is easier than dynamic grammars in
general, since only a small and fairly well-defined set of parser
table entries deal with this particular situation.  Unfortunately,
after Yacc is done constructing the "theoretical" tables, it
compresses them to take up much less room.  Compression is likely to
make things considerably harder.

It sounds to me like a very interesting project.  I would suggest that
you read the appropriate sections of Aho, Sethi, and Ullman's
"Compiler Design" (second edition of the Dragon Book) to find out as
much as you can about the theory, and then get a copy of the source
for the parser driver that Yacc uses (we've got a copy, I can send it
to you), so you can see how the tables are actually stored.  This
should give you some ideas about what is involved in dynamically
modifying the parser.

Good luck!  Please keep me posted.

Dale
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (02/05/87)

For a simple dynamic grammar in YACC:

Define operator tokens as follows:

%left OL1
%right OR1
%left OL2
%right OR2
%left OL3
%right OR3
	etc.

The lexer for 'C' already has to lookup names in the symbol
table to see if they are typedef_names or identifiers.

You will now lookup operator names also to see what class (OL1, etc.) they
are currently.  The symbol table is loaded initially with
standard operators and precedences.  The currently defined class
is returned as the lexical token for yacc, and the actual
operator name is returned as the value (for semantic analysis).

For even more generality, the lexer should recognize three character
classes: alpha, whitespace and operator.  Name tokens are alpha characters
seperated by whitespace or operators.  Operator strings are sequences
of operators seperated by alpha.  Operator strings are divided into
tokens by finding the largest left substring which is currently in
the symbol table (repeat as required).  All tokens are looked up in
the symbol table to decide if they are 

	a) reserved names (e.g. register) or operators (e.g. '{');
		return appropriate token.
	b) type names;  return TYPENAME token.
	c) identifiers; return IDENTIFIER token.
	d) operators (both names and operators!); return class as token.

The symbol table is preloaded (from a control file, perhaps) with the
standard 'C' reserved words, types, and operator hierarchy.

The user can #include custom grammars for various applications:

	complex arithmetic
	complex polynomial arithmetic
	string processing
	etc.

Even reserved words can be redefined!

The compiler need only hardwire a minimal subset of reserved words.
Operators can be bound to functions or 'C-macros' based on operator
name and operand types.  Standard types are symbolic and used only for
binding purposes.  C-macros are machine dependent functions
expanded inline by the code-generator, all other operators are
function calls (or inline expansions).  A function call can be
replaced by a C-macro in a specific implementation for efficiency purposes.
The first cut of a code generator can do everything except function
calls with function calls!  The various operators can then be added
as C-macros as time permits.  (Notice that argument passing for function
calls needs to recognize types.)

A similar approach will allow the user to define his own control
structures.

The key performance issue here is the symbol table - it has got to
be fast, because it gets consulted by the lexer for every token!

Unless the programmer excerises some restraint, a language of this
nature will result in cryptic code!

Although having the lexer check the symbol table for every token is more
expensive than a fixed table driven lexer, notice that the preprocessor
has to look up every token anyway.  By combining the 'C' preprocessor
with lexical analysis, you can provide a dynamic grammar and speed up
compilation at the same time!

Every token would be looked up by the lexer to determine if it is

	a) a preprocessor symbol
	b) a typedef name
	c) a reserved word
	d) an identifier
	e) an operator

and apropriate tokens passed to YACC.
-- 
Stuart D. Gathman	<..!seismo!dgis!bms-at!stuart>
[It's certainly true that you can set up a yacc-based parser quite easily
that lets you add new tokens that behave syntactically just like other
existing tokens using this scheme.  I believe that the intent of the
original question, though, was to add whole new productions to the grammar on
the fly as is possible with a parser using Earley's algorithm.  The
consensus seems to be that since the yacc grammar tables are constructed
only after an extensive global analysis of the grammar, incremental changes
are unlikely to be possible without remaking the tables from scratch.  -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request