mccool@watvlsi.waterloo.edu (Michael McCool) (02/16/88)
Does anyone know if there's anything like yacc written for Prolog?  I.e.,
not just grammar rules, but left-factoring, precedence parsing of
expressions, etc.  What I'd like is some kind of preprocessor that takes
simple non-deterministic grammar rules, with some extra bits thrown in to
indicate associativity, precedence, etc., and converts them to a
grammar-rule or Prolog program that will do an efficient parse.  It seems
like a useful thing to have, so I'm *sure* somebody's already written
it :-)

My wish list is:

1)  Public domain, in a common syntax --> Quintus preferred.

2)  Can accept a token-production function instead of a list; this is
    to save memory.  WHY does everybody assume that the ENTIRE token
    list is in memory???  The token function would return a structure
    that could be recognized as a terminal, and the parser would poke
    it every time it needed a new token.  The parser might have to
    maintain a list to back up into (unput-style).  (Of course, if we
    had list-streams this wouldn't be a problem... but that's a
    different wish list.)

3)  Has a simple way to handle syntax errors --> a way to resynch,
    e.g. skip tokens until you see a given token.  This should be done
    in a way similar to yacc's "error ;" so I don't have to fool
    around with writing tail-recursive loops to scan for these kinds
    of constructs.

4)  Precedence/associativity parsing for operations.  I would like to
    give only one rule like

	expr(binary(Op,E1,E2)) -->
	    $left,expr(E1),['+'],expr(E2),{Op = plus} |
	    $left,expr(E1),['-'],expr(E2),{Op = minus} |
	    $left,expr(E1),['*'],expr(E2),{Op = times} |
	    $left,expr(E1),['/'],expr(E2),{Op = divide} |
	    ...

    with "$left" indicating left associativity, and the order
    indicating precedence.

I'm working on a front end for a "silicon compiler", and have already
written a parser for a Pascal-like language, but the grammar gets pretty
horrible after you add syntax checking and other fuzz.
In the interests of software engineering, I'd much rather have a
high-level description that can be changed easily (the operator stuff is
*particularly* important) than some grotesque Dali-like byzantine
code-cathedral.

advTHANKSance for any help/pointers.

Michael McCool <-- mccool@watvlsi.waterloo.edu
ok@quintus.UUCP (Richard A. O'Keefe) (02/18/88)
In article <3833@watvlsi.waterloo.edu>, mccool@watvlsi.waterloo.edu
(Michael McCool) writes:
> Does anyone know if there's anything like YACC written for Prolog?

There are several largish grammar-handling systems written in/for Prolog,
but all the ones I know of are intended for processing natural languages.
Since the "native" method of parsing in Prolog is recursive descent, the
techniques developed for LL(k) parser generators should be relevant.  I
would be interested to hear of any such programs too.

> My wish list is:
> 2)  Can accept a token-production function instead of a list; this is
>     to save memory.  WHY does everybody assume that the ENTIRE token
>     list is in memory???

The underlying formalism assumes no such thing, and provided you do not
use explicit lists of terminals in your rules, you should be able to use
ordinary grammar rules without ever having a list.  For example, suppose
you have a table

	token(Index, TheToken)

where Index is an integer ranging from 1 to some N with no gaps.  Then
you could define

	token(Token, I0, I) :-
	    token(I0, Token),
	    I is I0+1.

and you could write grammar rules like

	if_statement -->
	    token(if),			% would normally be [if]
	    expression,
	    token(then),		% would normally be [then]
	    statement,
	    (   token(else) ->		% "      "        "  [else]
		statement
	    |   []			% this doesn't imply lists either
	    ).

This particular version would of course cost a LOT more memory than a
list, but an implicit representation might well cost very little space.
So what?  If you are building any sort of parse tree (or pushing
attributes up and down even a virtual tree), the token list is unlikely
to be the major space consumer.  Historically, the main use of Prolog
grammar rules has been for processing natural language statements and
Prolog clauses, both of which tend to be short.  (Counting "id(" as a
single token, the clause above would have only 23 tokens in its list.)
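[To make the index-threading trick above concrete, here is a minimal,
self-contained sketch.  The token/2 facts and the little grammar are
invented for illustration -- they stand in for whatever your tokeniser
would assert -- and I use ";" rather than "|" for portability.]

```prolog
% Hypothetical token table for the input:  if x then y else z
token(1, if).    token(2, id(x)).
token(3, then).  token(4, id(y)).
token(5, else).  token(6, id(z)).

% Fetch the next token from the table, threading an integer
% index through the grammar instead of a list.
token(Token, I0, I) :-
    token(I0, Token),
    I is I0+1.

expression --> token(id(_)).
statement  --> token(id(_)).

if_statement -->
    token(if),
    expression,
    token(then),
    statement,
    (   token(else) ->
        statement
    ;   []                  % [] just unifies the states: no list needed
    ).

% | ?- if_statement(1, I).
% I = 7
```

The only thing the grammar-rule translator must not see is an explicit
list of terminals, since those expand into list-consing goals; everything
else threads whatever state you give it.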
The availability of the tokens in SOME sort of data structure has a major
beneficial effect: it makes backtracking easy and cheap.  For example,
DEC-10 Prolog syntax requires unbounded lookahead.  It's interesting that
one of the nicest Algol compilers it has ever been my pleasure to use
(the Burroughs B6700 Algol compiler) built an array of tokens.  This
enabled them to do things like trying to compile the subscripts of an
array backwards, and then doing it in the usual order if that didn't
work.  {It was more economical to keep a sequence of tokens and
"backtrack" over that than it would have been to build a parse tree and
"backtrack" over that.}

> 3)  Has a simple way to handle syntax errors --> a way to resynch,
>     e.g. skip tokens until you see a given token.  This should be done
>     in a way similar to YACC's "error ;" so I don't have to fool
>     around with writing tail-recursive loops to scan for these kinds
>     of constructs.

I wouldn't call YACC's method "simple": a combination of falling back and
skipping "until 3 tokens have been correctly parsed", with a special hack
for stopping early.  Even the System V Programmer's Guide calls it
"crude".  I've never managed to get it to do anything more complicated
than skip to the next semicolon.  Does anyone know what the BSD EYACC
does and how to use it?  The promise of better error recovery in YACC has
been tantalising me since 4.1BSD!

If anyone is interested in adding some sort of error-handling method to
Prolog-coded parsers, I heartily recommend chapter 6 of

	Syntax of Programming Languages -- Theory and Practice
	by Roland C. Backhouse
	Prentice-Hall, 1979
	ISBN 0-13-879999-7
	UK 12.95 pounds

There are more recent papers by him.

There is a conflict between Prolog style and the demands of error repair:
in order to do error repair you need to KNOW that there is an error, but
failure in a Prolog grammar rule usually means no more than "oops, wrong
guess, try again".
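[A naive "skip to the next semicolon" recovery, of the kind mentioned
above, can at least be written once as a grammar rule rather than as ad
hoc tail-recursive loops.  The names here are invented for illustration;
note that the cut commits to "this really is an error", which is exactly
the Prolog-style conflict described above -- it is only safe if
real_statement//0 cannot fail on a merely-ambiguous prefix.]

```prolog
% real_statement//0 stands for the genuine grammar rule; here a toy
% assignment, just so the sketch is self-contained.
real_statement --> [id(_)], ['='], [id(_)].

statement --> real_statement, !.
statement --> skip_to_semicolon.        % error: resynchronise

% Discard tokens up to and including the next ';'
% (yacc's "error ;" in miniature).
skip_to_semicolon --> [';'], !.
skip_to_semicolon --> [_], skip_to_semicolon.

% | ?- statement([id(x),'=','=',';',id(y)], Rest).
% Rest = [id(y)]
```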
(This is one of the reasons why read/1 gives such poor diagnostics: while
there are some local errors which *must* be wrong--such as two variables
with no intervening operator--many errors involving operators cannot be
locally detected.)

It would be interesting to have a tool which checked whether the
context-free skeleton of a collection of Prolog grammar rules defined an
LL(1) language--program 3.2 in chapter 3 of the reference above would be
a good place to start.

> 4)  Precedence/associativity parsing for operations.  I would like to
>     give only one rule like
>	expr(binary(Op,E1,E2)) -->
>	    $left,expr(E1),['+'],expr(E2),{Op = plus} |
>	    $left,expr(E1),['-'],expr(E2),{Op = minus} |
>	    $left,expr(E1),['*'],expr(E2),{Op = times} |
>	    $left,expr(E1),['/'],expr(E2),{Op = divide} |
>	    ...
>     with "$left" indicating left associativity, and the order
>     indicating precedence.

I do hope that you don't put the vertical bars at the right like that;
they are very hard to see there.

The usual way of doing this in Prolog is to have a table such as

	operator(Token, Priority, Type, Op).

e.g.

	operator(+, 3, left,  plus).
	operator(-, 3, left,  minus).
	operator(*, 2, left,  times).
	operator(/, 2, left,  divide).
	operator(^, 1, right, power).

then one has rules such as

	expr(Expr) -->
	    expr(9, Expr).

	expr(MaxLevel, Expr) -->
	    primary(Expr0),
	    rest_expr(MaxLevel, Expr0, Expr).

	rest_expr(MaxLevel, Expr0, Expr) -->
	    [ Token ],
	    { operator(Token, Priority, Type, Op) },
	    { Priority < MaxLevel },
	    !,
	    { argument_priority(Type, Priority, ArgumentPriority) },
	    expr(ArgumentPriority, Expr1),
	    rest_expr(MaxLevel, binary(Op,Expr0,Expr1), Expr).
	rest_expr(_, Expr, Expr) -->
	    /* no next token, not an operator, or wrong priority */
	    [].

	argument_priority(left,  X, X).
	argument_priority(right, X, Y) :- Y is X+1.

Testing this (with primary//1 recognising integers):

	| ?- expr(X, [2,^,3,^,4,*,5,*,6], []).
	X = binary(times,binary(times,binary(power,2,binary(power,3,4)),5),6)

The really nice thing about this is that it separates the operators
themselves out, making it crystal clear what they have in common (the
*shape* of the rule is the same for all the operators) and where they
differ (what they look like and their priority).  It also makes it easier
to add new operators, even dynamically.  The fact that YACC forces you to
write a separate rule for each priority level (though you can fold
operators of the same priority together) is one of the reasons why I
don't use it.
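[For completeness, here is one plausible primary//1 of the kind assumed
in the test above: an integer token, plus parenthesised subexpressions
as a common extension of my own.  With the operator table and rules
quoted above, this makes the whole example loadable as-is.]

```prolog
% A possible primary//1 for the precedence parser above.
primary(N)    --> [N], { integer(N) }.
primary(Expr) --> ['('], expr(Expr), [')'].

% | ?- expr(X, ['(',2,+,3,')',*,4], []).
% X = binary(times,binary(plus,2,3),4)
```

The parenthesis clause simply restarts the parser at the top priority
level, which is all bracketing ever needs to do in this scheme.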