[comp.lang.prolog] Yacc in Prolog, Parsing and Dali

mccool@watvlsi.waterloo.edu (Michael McCool) (02/16/88)

	Does anyone know if there's anything like yacc written for Prolog?
I.e., not just grammar rules, but left-factoring, precedence parsing of
expressions, etc.; what I'd like is some kind of preprocessor that takes
simple non-deterministic grammar rules, with some extra bits thrown in
to indicate associativity, precedence, etc., and converts them into a
grammar-rule or Prolog program that will do an efficient parse.

	It seems like a useful thing to have so I'm *sure* somebody's already
written it :-) 

	My wish list is:

	1) Public domain, in a common syntax --> Quintus pref
	2) can accept a token-production function instead of a list
		this is to save memory.  WHY does everybody assume
		that the ENTIRE token list is in memory???  The token
		function would return a structure that could be recognized
		as a terminal, and the parser would poke it every time
		it needed a new token.  
			The parser might have to maintain a list
		to back up into (unput-style).  (Of course, if we had 
		list-streams this wouldn't be a problem... but that's
		a different wish list)
	3) has a simple way to handle syntax errors --> 
		a way to resynch, e.g. skip tokens until you see
		a token.  This should be done in a way similar to
		yacc's "error ;" so I don't have to fool around with
		writing tail-recursive loops to scan for these kinds of
		constructs.
	4) precedence/associativity parser         
		for operations, I would like to give only one rule like
			expr(binary(Op,E1,E2)) --> 
				$left,expr(E1),['+'],expr(E2),{Op = plus} | 
		 	        $left,expr(E1),['-'],expr(E2),{Op = minus} | 
			        $left,expr(E1),['*'],expr(E2),{Op = times} | 
				$left,expr(E1),['/'],expr(E2),{Op = divide} |...
		with "$left" indicating left associativity, and the order 
		indicating precedence.

	I'm working on a front end for a "silicon compiler", and have already
written a parser for a Pascal-like language, but the grammar gets pretty
horrible after you add syntax checking and other fuzz.  In the interests of
software engineering, I'd much rather have a high-level description that can
be changed easily (the operator stuff is *particularly* important) than
some grotesque Dali-like Byzantine code-cathedral.

	advTHANKSance for any help/pointers.  

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        O_o          Michael McCool <-- mccool@watvlsi.waterloo.edu
       ={  }= oOp! aAK VLSI! gaG Symple C snARf! foo when bar LIthP! forkKKK...
       // U         __ 
     /{ }\ __     /__/|   WARNING DANGER WARNING DANGER WARNING DANGER WARNING
     \[]}/\ /===_|   ||_  ==>==>==>==>==>  Hacking Kills <==<==<==<==<==<==<==
    _^{_/  /_% /_|___|/|  DANGER    * Kick the Habit, logout NOW *      DANGER
   %   /__/   |_______|   DANGER        BEFORE IT'S TOO LATE            DANGER
 

ok@quintus.UUCP (Richard A. O'Keefe) (02/18/88)

In article <3833@watvlsi.waterloo.edu>,
mccool@watvlsi.waterloo.edu (Michael McCool) writes:
> Does anyone know if there's anything like YACC written for Prolog?
There are several largish grammar-handling systems written in/for Prolog,
but all the ones I know of are intended for processing natural languages.
Since the "native" method of parsing in Prolog is recursive descent, the
techniques developed for LL(k) parser generators should be relevant.
I would be interested to hear of any such programs too.

> 	My wish list is:
> 	2) can accept a token-production function instead of a list
> 		this is to save memory.  WHY does everybody assume
> 		that the ENTIRE token list is in memory???
The underlying formalism assumes no such thing, and provided you do
not use explicit lists of terminals in your rules, you should be able
to use ordinary grammar rules without ever having a list.  For
example, suppose you have a table
	token(Index, TheToken)
where Index is an integer ranging from 1 to some N with no gaps.
Then you could define

	token(Token, I0, I) :-
		token(I0, Token),
		I is I0+1.

and you could write grammar rules like

	if_statement -->
		token(if),	% would normally be [if]
		expression,
		token(then),	% would normally be [then]
		statement,
		(   token(else) ->	% " " " [else]
		    statement
		|   []		% this doesn't imply lists either
		).

This particular version would of course cost a LOT more memory than
a list, but an implicit representation might well cost very little
space.  So what?  If you are building any sort of parse tree (or
pushing attributes up and down even a virtual tree), the token list
is unlikely to be the major space consumer.
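
As a concrete (and entirely made-up) illustration, suppose the scanner
had asserted this table for "if x then skip else skip":

	token(1, if).    token(2, x).     token(3, then).
	token(4, skip).  token(5, else).  token(6, skip).

Then, given suitable definitions of expression//0 and statement//0, the
"difference pair" threaded through the rules is just a pair of integers:

	| ?- if_statement(1, End).

	End = 7

and no token list ever exists anywhere.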

Historically, the main use of Prolog grammar rules has been for
processing natural language statements and Prolog clauses, both
of which tend to be short.  (Counting "id(" as a single token, the
clause above would have only 23 tokens in its list.)  The availability
of the tokens in SOME sort of data structure has a major beneficial
effect: it makes backtracking easy and cheap.  For example, DEC-10
Prolog syntax requires unbounded lookahead.

It's interesting that one of the nicest Algol compilers it has ever
been my pleasure to use (the Burroughs B6700 Algol compiler) built
an array of tokens.  This enabled its writers to do things like trying to
compile the subscripts of an array backwards, and then doing it in
the usual order if that didn't work.  {It was more economical to
keep a sequence of tokens and "backtrack" over that than it would
have been to build a parse tree and "backtrack" over that.}

> 	3) has a simple way to handle syntax errors --> 
> 		a way to resynch, e.g. skip tokens until you see
> 		a token.  This should be done in a way similar to
> 		YACC's "error ;" so I don't have to fool around with
> 		writing tail-recursive loops to scan for these kind of
> 		constructs.
I wouldn't call YACC's method "simple": a combination of falling back
and skipping "until 3 tokens have been correctly parsed", with a special
hack for stopping early.  Even the System V Programmer's Guide calls it
"crude".  I've never managed to get it to do anything more complicated
than skip to the next semicolon.  Does anyone know what the BSD EYACC
does and how to use it?  The promise of better error recovery in YACC
has been tantalising me since 4.1BSD!

If anyone is interested in adding some sort of error-handling method
to Prolog-coded parsers, I heartily recommend chapter 6 of
	Syntax of Programming Languages -- Theory and Practice,
	by Roland C. Backhouse
	Prentice-Hall, 1979
	ISBN 0-13-879999-7
	UK 12.95 pounds
There are more recent papers by him.

There is a conflict between Prolog style and the demands of error
repair:  in order to do error repair you need to KNOW that there is
an error, but failure in a Prolog grammar rule usually means no more
than "oops, wrong guess, try again".  (This is one of the reasons why
read/1 gives such poor diagnostics:  while there are some local
errors which *must* be wrong--such as two variables with no
intervening operator--many errors involving operators cannot be
locally detected.)  It would be interesting to have a tool which
checked whether the context-free skeleton of a collection of Prolog
grammar rules defined an LL(1) language--program 3.2 in chapter 3 of
the reference above would be a good place to start.
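
In the meantime, the poor man's "skip to the next semicolon" can at
least be packaged once as a nonterminal, so that one needn't keep
writing the tail-recursive loop by hand.  A sketch only (nothing here
is tested, and recover//0 is my name, not anything standard):

	%  Discard tokens up to and including the next ';', so
	%  that parsing can resume at a statement boundary.
	recover -->
		[ Token ],
		(   { Token == (';') } ->
		    []
		|   recover
		).
	recover -->		% ran off the end of the input
		[].

A caller that fails to parse a statement can then try
recover, statement instead; how the error is *reported* is of course
the hard part.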

> 	4) precedence/associativity parser         
> 		for operations, I would like to give only one rule like
> 			expr(binary(Op,E1,E2)) --> 
> 				$left,expr(E1),['+'],expr(E2),{Op = plus} | 
> 		 	        $left,expr(E1),['-'],expr(E2),{Op = minus} | 
> 			        $left,expr(E1),['*'],expr(E2),{Op = times} | 
> 				$left,expr(E1),['/'],expr(E2),{Op = divide} |...
> 		with "$left" indicating left associativity, and the order 
> 		indicating precedence.
I do hope that you don't put the vertical bars at the right like that;
they are very hard to see there.

The usual way of doing this in Prolog is to have a table such as

	operator(Token, Priority, Type, Op).

e.g.
	operator(+, 3, left,  plus).
	operator(-, 3, left,  minus).
	operator(*, 2, left,  times).
	operator(/, 2, left,  divide).
	operator(^, 1, right, power).

then one has rules such as

	expr(Expr) -->
		expr(9, Expr).

	expr(MaxLevel, Expr) -->
		primary(Expr0),
		rest_expr(MaxLevel, Expr0, Expr).

	rest_expr(MaxLevel, Expr0, Expr) -->
		[ Token ],
		{ operator(Token, Priority, Type, Op) },
		{ Priority < MaxLevel },
		!,
		{ argument_priority(Type, Priority, ArgumentPriority) },
		expr(ArgumentPriority, Expr1),
		rest_expr(MaxLevel, binary(Op,Expr0,Expr1), Expr).
	rest_expr(_, Expr, Expr) -->
		/* no next token, not an operator, or wrong priority */
		[].

	argument_priority(left, X, X).
	argument_priority(right, X, Y) :- Y is X+1.

Testing this (with primary//1 recognising integers):
	| ?- expr(X, [2,^,3,^,4,*,5,*,6], []).
	X = binary(times,binary(times,binary(power,2,binary(power,3,4)),5),6)
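
(The primary//1 used there was not shown; my guess at a minimal
version, recognising integers and parenthesised subexpressions, is

	primary(N) -->
		[ N ],
		{ integer(N) }.
	primary(E) -->
		['('],
		expr(E),
		[')'].

which is enough for the test above.)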


The really nice thing about this is that it separates the operators
themselves out, making it crystal clear what they have in common
(the *shape* of the rule is the same for all the operators) and
where they are different (what they look like and their priority).
It also makes it easier to add new operators, even dynamically.
The fact that YACC forces you to write a separate rule for each
priority level (you can fold operators of the same priority together)
is one of the reasons why I don't use it.