worley@compass.com (Dale Worley) (08/04/89)
The idea of adding a 'yyexpected' function might sound good at first, but there are a number of problems with it. As was pointed out before, the number of expected terminals will probably be quite large, as this will include terminals on which a shift is possible and terminals on which a reduction is possble. For a grammar for Pascal, for example, this could be 30 or more terminals in some spots. Also, you would probably have to modify the Yacc source code to do this because Yacc inserts default reduction actions. These default reduction actions eliminate the lookahead information associated with reductions. I've used similar techniques, and a few simple fixes can solve a lot of these problems. The first is to define "token groups", which are groups of tokens the user thinks of as being similar. For instance, '+', '-', '*', etc. can be included in the group 'operator'; '(', 'IDENTIFIER', '-', etc., can be included in the group 'expression'. Now, when a syntax error is found, the error-reporter assembles a list of acceptable tokens. Then it checks each token group, and if the group is contained in the list of acceptable tokens, all those tokens are removed from the list, and the name of the token group is added. (In practice, only about three members of the group need to be checked for membership, and if one token group overlaps another, you have to be careful about the order in which the groups are tested.) The result is that you can usually reduce the list of allowed tokens to about three items, and produce error messages like: 34 i = j k; ^ Syntax error. Found: IDENTIFIER 'k' Expecting: OPERATOR ; END The trick of only checking a few members of the group can be used to defeat the problem of losing information because of default reductions. The way to do this is to only test members whose validity will not be eliminated by the default reductions. In the above example, 'j' may be reduced to 'expression' before the error is found, but if the 'operator' group is represented only by '+', then the fact that '*' cannot follow 'expression' will not stop the 'operator' group from showing in the list of expected tokens. Dale -- Dale Worley, Compass, Inc. worley@compass.com No one hates so bitterly as the lazy and poor contemplating the energetic and prosperous. -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
djones@megatest.uucp (Dave Jones) (08/06/89)
Mr. Worley has generously taken the time to post a long article about getting yacc to give up a list of token-categories which would be legal when a syntax error is found. The idea of grouping tokens into categories is a good one. But the technique in no way solves the "default reduction" problem, as he claims it does. The technique he describes suffers from the exact same problem that all the other attempts have done so far. I will demonstrate that even the small example he gives will not work, by inspecting the actual output from yacc. From article <1989Aug4.032053.753@esegue.uucp>, by worley@compass.com (Dale Worley): > ... [ concerning how to get yacc to generate lists of legal tokens when a syntax error is encounterd ... ] > > I've used similar techniques, and a few simple fixes can solve a lot > of these problems. The first is to define "token groups", which are > groups of tokens the user thinks of as being similar. For instance, > '+', '-', '*', etc. can be included in the group 'operator'; '(', > 'IDENTIFIER', '-', etc., can be included in the group 'expression'. > Now, when a syntax error is found, the error-reporter assembles a list > of acceptable tokens. Then it checks each token group, and if the > group is contained in the list of acceptable tokens, all those tokens > are removed from the list, and the name of the token group is added. > (In practice, only about three members of the group need to be checked > for membership, and if one token group overlaps another, you have to > be careful about the order in which the groups are tested.) The > result is that you can usually reduce the list of allowed tokens to > about three items, and produce error messages like: > > 34 i = j k; > ^ > Syntax error. > Found: IDENTIFIER 'k' > Expecting: OPERATOR ; END > > The trick of only checking a few members of the group can be used to > defeat the problem of losing information because of default > reductions. > The way to do this is to only test members whose validity > will not be eliminated by the default reductions. > In the above > example, 'j' may be reduced to 'expression' before the error is found, > but if the 'operator' group is represented only by '+', then the fact > that '*' cannot follow 'expression' will not stop the 'operator' group > from showing in the list of expected tokens. > In the above example, when the yacc-generated compier discovers the syntax error, the lookahead set under consideration would contain only ';' and END. It would not contain either '+' or '*'. Having done the default reduction, the compiler would have discarded the state or states which could shift '+' or '*'. That's the whole point. Nothing in the algorithm described above retains or recovers the lost information. To make it more concrete, let's actually try to parse the example fragment: BEGIN i = j k END I've written a grammar for us to consider. In the interest of brevity, and "doing it right", the grammar employs precedence-rules, but the result would be the same if we used separate "expr", "term", and "factor" productions. (If you are skeptical, try it.) %token BEGIN END ID %left '+' %left '*' %% prog : BEGIN list END list : stat | list ';' stat ; stat : /* empty */ | ID '=' expr ; expr : ID | expr '+' expr | expr '*' expr ; %% Say, "yacc -vd gram.y" From the resulting y.output file, we observe the following: state 10 stat : ID = expr_ (5) expr : expr_+ expr expr : expr_* expr + shift 12 * shift 13 . reduce 5 We see that when the parser, in state 10, does not see a '+' or a '*', it will make the default reduction (5): stat : ID = expr Doing so, it will discard three states from the top of the stack, including state 10, which is the one which "knows about" the '+' and the '*'. Next it will do another default reduction, producing a "list". Then it will be in state 3, as listed in the y.output file: state 3 prog : BEGIN list_END list : list_; stat END shift 6 ; shift 7 . error Only now will it discover that the next symbol, another ID, is not legal. Note that the only symbols which it now considers to be valid are ';' and END, although '+' and '*' would have been valid before the default reductions were made. How are we to infer that a class of symbols which we choose to call "operators" would also be valid? The information is gone. In this example, state 3 can only be entered after having done the default reductions which we saw. Thus the LALR(1) state 3 only stands for one LR(1) state, which additionally has the lookahead tokens '+' and '*'. But in general, a given LALR(1) state can stand in for several LR(1) states with different LR(1)-lookahead sets. I have demonstrated in previous postings how the entire LR(1)-lookahead set can be calculated at runtime, as the file is parsed. But any such calculation must take into account the input (or the states resulting from the input). There is not enough information in the LALR(1) state alone. [From djones@megatest.uucp (Dave Jones)] -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
lai@mips.com (David Lai) (08/08/89)
In <1989Aug6.024931.10014@esegue.uucp>, Dave Jones gives a rebuttal stating how a default reduction in yacc has eliminated acceptable tokens from the token list when an error occurs. The problem can be solved by generating the list of acceptable tokens *before* returning the error token. This method requires that you test the tokens before returning them to yacc (in the scanner). In the case given 'i = j k', the scanner upon reading 'k' checks the parser state to see whether an identifier is acceptable in this state. This will be the state prior to the default reduction, where + and * are still acceptable. The list of acceptable tokens is remembered and the erroneous token is returned. The 'yyerror' function then prints out the (perhaps decoded) list of acceptable tokens. -- David Lai (lai@mips.com || {ames,prls,pyramid,decwrl}!mips!lai) -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
heirich@cs.ucsd.edu (Alan Heirich) (08/08/89)
This is the first of two postings. This posting describes a workaround for the default reduction problem with yacc. The second posting describes code modifications to DECUS yacc which provides meaningful diagnostic messages both for describing a parse (debugging a grammar) and for an end-user at run time. In article <1989Aug6.024931.10014@esegue.uucp> djones@megatest.uucp (Dave Jones) writes: >Mr. Worley has generously taken the time to post a long article about getting >yacc to give up a list of token-categories which would be legal when a syntax >error is found. The idea of grouping tokens into categories is a good one. >But the technique in no way solves the "default reduction" problem, as >he claims it does. Mr. Jones is correct. However, the default reduction problem is not hard to work around. The problem: yacc violates correct shift-reduce parsing rules by assigning "default reductions". For the following item: A --> B C . the corresponding rule should be reduced only when the lookahead token is in the follow set of A. A given automaton state might contain many such items, and also many items on which a shift is anticipated, e.g. Q --> R S . T (in which case a shift would be expected on when the lookahead token is in the first set of T). The yacc parser should report an error whenever the lookahead token is neither in a follow set of a rule like "A" above, nor in a first set for a rule like "Q" above. But yacc builds its automaton in such a way that this is not the case. Instead, a rule like "A" would get reduced whenever the lookahead token can't be handled in any other way. In other words, a reduction happens when an error should be reported. These "default reductions" continue until the parser is in a state with no reductions, at which point it reports an error. Any diagnostic message based on the state where the error is reported are thus potentially incorrect, since the lookahead set might be different from the lookahead set of the state where the error was actually encountered. Consequences of the problem: there are three types of states: shift only; reduce only; and combined shift & reduce states. It turns out that this "problem" in irrelevant to two of the kinds of states; and is easily worked around for the case of combined shift & reduce states. case 1: a state contains only shift items. Then there is no problem, because the error is reported instantly. case 2: a state contains only reduce items. The default reductions will change the state before the error gets reported. But this state will have the same lookahead set as the state where the error was encountered. This is because the reduction will return the parser to a state where the lookahead set contains the first set of the reduced nonterminal. The reductions don't change the lookahead set. case 3: combined shift & reduce items. This is the potentially problematic state, which Mr. Jones illustrated in his posting. As he explained, the default reductions move the parser to a state in which the lookahead set is not the same as when the error was encountered. Specifically, the new lookahead set will contain the union of the first sets of the reduced nonterminal(s), but will not contain the tokens which could have been shifted in the state where the error was encountered. Case 3 has an easy fix. Modify the yacc parser to save the current state *after* it obtains a new token from yylex. (This occurs near the beginning of the parse code, right after the debug code that prints out the state number). Then, when an error is encountered, yyerror traps the current state and compares it to this saved state. The set of expected tokens is the union of the lookahead sets of these two states. (In cases 1 and 2 above, these will be the same sets of tokens). The next posting describes a set of mods I have made to DECUS yacc to implement automatic diagnostics based on this approach. ------------------------- Alan Heirich Comp. Sci. & Eng., Cognitive Science C-014 University of California, San Diego 92093 heirich@cs.ucsd.edu aheirich@ucsd.bitnet -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
heirich@cs.ucsd.edu (Alan Heirich) (08/08/89)
This is the second of two postings. The first posting dealt with objections to automatic error recovery in yacc. This posting describes modifications to DECUS yacc to permit automatic diagnostic generation. In an earlier message I mentioned that I have modified yacc to provide useful diagnostic messages to describe the parse (for debugging a grammar) and to describe the expected tokens when a syntax error occurs (to debug a user). I received many requests for the source code. I have been advised that the version of DECUS yacc which I modified is in fact pirate AT&T code, so I don't think I can redistribute it. But I will describe the mods, which are easy to make. [If you have AT&T source, these changes should be straightforward to apply. -John] The general goal is to encode the information from the parser description file, describing states, goto sets, and lookahead sets, in a compact form that can be accessed at run time. Once you have done this you can use any number of schemes of varying complexity to generate messages describing the parse process and the expectations of the parser. I will leave most of this to the ingenuity of the reader, and simply describe how to get the essential information out of the yacc data structures. The changes are nearly all in the routine "output". This routine writes out the parser description to the description file. You will want to modify it to write out five pieces of information: (1) a set of strings containing token names (2) a set of strings containing nonterminal names (3) a set of states containing items sets (4) a set of states containing lookahead sets (5) a set of states containing goto sets (1) and (2) are easy. The token names are in an array of records. tokset[i].name is the string containing the name of token #i. Use the TLOOP macro on i to print out all the token names, i.e. TLOOP(i) { printf ("\"%s\",\n", tokset [i].name); } Since TLOOP starts at 1, print a null string before you start this loop to account for token #0 (which apparently is nonexistent as far as yacc is concerned). The nonterminal names are in another array of records, nontrst. Use the NTLOOP macro, which starts at 0, as follows: NTLOOP(i) { printf ("\"%s\",\n", nontrst [i].name); } The item sets are useful so that the run time parser can show you a description of the current state, rather than just emit a message "in state x". You guessed it, the SLOOP macro steps through all of the states: SLOOP(i) { writeState (i); } the writeState function (which you must create) is a bit trickier than simply printing out a string. Copy the code from the beginning of output() which writes out the state description to the description file, then selectively delete to get what you want. You should end up with approximately the following: SLOOP(i) { nolook = !(tystate [i] == MUSTLOOKAHEAD); closure(i); nolook = 1; ITMLOOP(i,pp,qq) { writeItem (pp-> pitem); printf ("-99999,\n"); } if (tystate [i] == MUSTLOOKAHEAD) WSLOOP (wsets + (pstate[i+1]-pstate[i]), u) if (*(u->pitem) < 0) { writeItem (u->pitem); printf ("-99999,\n"); } /* if */ What is happening here? Well, we first recompute closure for the state. That sets important flags in the tystate array. We then loop through all the normal items in the state, thanks to ITMLOOP, and write them out (more on this below). Then, if the state requires lookahead we write out another group of items (exactly what they are escapes me at the moment. I think they are items that reduce on the empty string, or something like that). What must writeItem(pp) do? Approximately this (again, adapted from the code that writes out the state descriptions to the description file): writeItem (pp) int *pp; { int *p; for (p = pp; *p > 0; ++p) { } /* (skip ahead until *p > 0) */ p = prdptr [-*p]; printf ("%d,", *p); while (1) { p++; if (p == pp) printf ("-88888,"); if (*p <= 0) break; else printf ("%d,", *p); } /* while */ if (*pp < 0) { printf ("%d,", *pp); } } /* writeItem */ This writes out the elements of a single item, encoded as integers. -88888 represents the location of the . (or underscore) in the item, i.e. the current position of the parse. A positive number is the number of a token or nonterminal name; if the number is > NTBASE (a yacc constant) then it represents a nonterminal (and NTBASE must be subtracted to get the number of the nonterminal); if the number is < NTBASE then it can be used directly as an index into the token name table (described above). A negative number indicates a rule that is the be reduced; it always appears at the end of an item, and is negated (to get the actual rule number, just negate the value printed out). Notice that after writeItem is called, the value -99999 is printed out. This is simply an end-of-item marker. For (4), the lookahead sets: use SLOOP again, and for each state i, compute closure(i). Well, here is the code, including some important details: SLOOP(i) { nolook = !(tystate [i] == MUSTLOOKAHEAD); closure (i); nolook = 1; aryfil (temp1, ntokens + nnonter + 1, 0); WSLOOP (wsets,u) { c = *(u -> pitem); if (c > 1 && c < NTBASE && temp1 [c] == 0) { printf ("%d,", c); temp1 [c] = 1; } /* if */ } /* WSLOOP */ } /* SLOOP */ Finally, (5), the goto sets: SLOOP(i) { nolook = !(tystate [i] == NOLOOKAHEAD); closure (i); nolook = 1; aryfil (temp1, ntokens + nnonter + 1, 0); WSLOOP (wsets, u) { c = *(u -> pitem); if (c > NTBASE && temp1 [(c -= NTBASE) + ntokens] == 0) { temp1 [c + ntokens] = 1; printf ("%d,", c); } /* if */ } /* WSLOOP */ } /* SLOOP */ This basically does it. What you get from all this is a lot of printed information. Write this to a text file and use it to initialize a set of arrays which you compile into your parser. You can then access the information while your parser runs. This allows it to describe states in the same form as in the description file; describe the expected lookahead tokens or goto items (i.e. the expected nonterminals, which are often much more explanatory than the lookahead tokens); and describe the tokens as they are obtained from yylex (very helpful). This method certainly isn't perfect -- the biggest problem is that an end user can be overwhelmed with information, since a programming language like C or Pascal can generate states with 30 valid lookahead tokens. (Using the nonterminal names from the goto sets can help in this situation, although only partially). Still, it's a useful start, and does provide helpful information. Also, be aware that program errors tend to be static semantic errors rather than syntax errors, so the gravity of the situation is somewhat less than one might think... But it is true, what we all *really* want is a son-of-yacc (yayacc?) which allows us to specify textual error messages from within the grammar... and which doesn't generate stupid reduce/reduce conflicts when we put lots of error productions into the grammar! ------------------------- Alan Heirich Comp. Sci. & Eng., Cognitive Science C-014 University of California, San Diego 92093 heirich@cs.ucsd.edu aheirich@ucsd.bitnet -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
rusty@garnet.Berkeley.EDU (Rusty C. Wright) (08/10/89)
So with all this thrashing about yacc's error reporting, does anybody know if bison does the "right thing"? -- rusty c. wright, rusty@violet.berkeley.edu ucbvax!violet!rusty -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
djones@megatest.com (Dave Jones) (08/11/89)
From article <1989Aug8.130702.957@esegue.uucp), by lai@mips.com (David Lai): ) In <1989Aug6.024931.10014@esegue.uucp), Dave Jones gives a rebuttal stating ) how a default reduction in yacc has eliminated acceptable tokens from the ) token list when an error occurs. The problem can be solved by generating ) the list of acceptable tokens *before* returning the error token. This ) method requires that you test the tokens before returning them to yacc (in ) the scanner). ) ) In the case given 'i = j k', the scanner upon reading 'k' checks the parser ) state to see whether an identifier is acceptable in this state. This will ) be the state prior to the default reduction, where + and * are still ) acceptable. The list of acceptable tokens is remembered and the erroneous ) token is returned. The 'yyerror' function then prints out the (perhaps ) decoded) list of acceptable tokens. As I (and one or two other people) pointed out the last time this subject came up, it is not necessary for the parser to keep a set of acceptable tokens. It can keep a set of states which are discarded by default reductions, (which should be much quicker to calculate), then calculate the tokens only if they are needed when an error is detected. Didn't I point this out recently also? Dave Jones [From djones@megatest.com (Dave Jones)] -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
djones@megatest.com (Dave Jones) (08/11/89)
From article <1989Aug8.131112.1081@esegue.uucp>, by heirich@cs.ucsd.edu (Alan Heirich): > This posting describes modifications to > DECUS yacc to permit automatic diagnostic generation. ... > > The changes are nearly all in the routine "output". This routine writes out > the parser description to the description file. You will want to modify it > to write out five pieces of information: > > (1) a set of strings containing token names > (2) a set of strings containing nonterminal names > (3) a set of states containing items sets > (4) a set of states containing lookahead sets > (5) a set of states containing goto sets > All this info is in the y.output file from standard yacc. There's no need to (ahem) hack yacc. If there's a demand, and I can find the time, I'll package up and post some a nawk (new awk) scripts and so forth that use the y.output file to generate a procedural, rather than table-based, parser. They could be modified easily enough to write the above info in any format you might want. But be warned, however you obtain this info, you still have to calculate the legal-token-sets dynamically. There is not enough info in the LALR(1) item-sets to calculate them from the state-number alone. You have to keep up with the default-reduction states as they are popped. I hope I have said this often enough and loudly enough now. I did the scripts partly as an exercise in learning nawk, partly to get a faster parser, but mostly to aid in debugging compilers. It's impossible to pick through the coded tables in a debugger and make any kind of sense, but it is easy to single step through code that looks like the following, which is cut and pasted from a compiler I'm in the process of writing: switch(state) { ... case 3:switch(lookahead){ /* * file : $$2 _ declarations * declaration_list : _ (28) */ case EOF: YREDUCE(28,0,NT_declaration_list); case error: YSHIFT(7); case CONST: YREDUCE(28,0,NT_declaration_list); case INSERT: YREDUCE(28,0,NT_declaration_list); case FUNCTION: YREDUCE(28,0,NT_declaration_list); case LABEL: YREDUCE(28,0,NT_declaration_list); case PROCEDURE: YREDUCE(28,0,NT_declaration_list); case TYPE: YREDUCE(28,0,NT_declaration_list); case VAR: YREDUCE(28,0,NT_declaration_list); case WITH: YREDUCE(28,0,NT_declaration_list); default: YERROR(); }; The scripts "wrote" the above code from inspection of the y.output file. [From djones@megatest.com (Dave Jones)] -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
djones@megatest.com (Dave Jones) (08/11/89)
The errors in the following article all seem to be due to the mistaken impression that tokens which can legally follow the production of a non-terminal symbol are independant of the context of that non-terminal. From article <1989Aug8.130922.1019@esegue.uucp), by heirich@cs.ucsd.edu (Alan Heirich): ... ) ) The problem: yacc violates correct shift-reduce parsing rules by assigning ) "default reductions". For the following item: ) ) A --) B C . ) ) the corresponding rule should be reduced only when the lookahead token is in ) the follow set of A. I've got a little trouble with the terminology. I'm going to assume "Follow set of A" means all the terminals which can follow A in any sentential form. (This is the definition is most applicable to the theory of SLR parsers, but I can't think of any alternative meaning.) "... should be reduced" is a little trickier. Except for an obscure bug related to the synthesized error token, yacc does what it "should do". I am assuming that you are talking about what a full LR(1) parser should do. But in that case you should say that it should reduce the rule A <= BC if and only if the LR-stack and lookahead togther form a valid prefix with this form: x B C t, and x A t is a valid prefix. What you actually describe is what an SLR parser "should do". ) A given automaton state might contain many such items, ) and also many items on which a shift is anticipated, e.g. ) ) Q --) R S . T ) ) (in which case a shift would be expected on when the lookahead token is in ) the first set of T). ) ) The yacc parser should report an error whenever the lookahead token is neither ) in a follow set of a rule like "A" above, nor in a first set for a rule like ) "Q" above. The above, too, is a little vague. But consider this: G : '[' A ']' | A ';' ; A : BC ; [Before puzzling over this too much, read on. A similar grammar comes into play later.] ) ... ) ) Consequences of the problem: there are three types of states: shift only; ) reduce only; and combined shift & reduce states. It turns out that this ) "problem" in irrelevant to two of the kinds of states; and is easily worked ) around for the case of combined shift & reduce states. ... ) ) case 2: a state contains only reduce items. ) The default reductions will change the state before the error gets reported. ) But this state will have the same lookahead set as the state where the error ) was encountered. Counter-example: G : '(' A ')' | A ';' ; A : 'a' ; Run yacc on this and you'll see this state: State 4: A : a _ (3) . reduce 3 In state 4, the only one which produces A, the only action is a default reduction. The state has no follow-set coded for it by yacc, but in the sense of the above comments, its follow set contains ')' and ';' because it is used in both of these rules: G : '(' A ')' | A ';' ; Using either sense of the term "follow set", we can, by means of the default reduction, uncover a state with a different set of legal tokens. Consider the input '(' A x After doing the default reduction in state 4, we reach state 5: state 5 G : ( A_) ) shift 7 . error Its only legal next token is ')'. ) ) case 3: combined shift & reduce items. ) This is the potentially problematic state, which Mr. Jones illustrated in ) his posting. As he explained, the default reductions move the parser to a ) state in which the lookahead set is not the same as when the error was ) encountered. Specifically, the new lookahead set will contain the union of ) the first sets of the reduced nonterminal(s), but will not contain the tokens ) which could have been shifted in the state where the error was encountered. Ahem. G: '(' Q ')' ; Q: P | P ';' ; P : 'p' | 'p' '!' ; Need I say more? ) ) Case 3 has an easy fix. Modify the yacc parser to save the current state ) *after* it obtains a new token from yylex. (This occurs near the beginning of ) the parse code, right after the debug code that prints out the state number). ) Then, when an error is encountered, yyerror traps the current state and ) compares it to this saved state. The set of expected tokens is the union of ) the lookahead sets of these two states. Almost. Make the parser save the intire set of intermediate states popped by default reductions, not just the first one, and use the union of all their shiftable tokens and the shiftable tokens from the error state. This is, effectively, the only workable algorithm I've seen so far, although I and others have presented variations on it. [From djones@megatest.com (Dave Jones)] -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
markg@well.sf.ca.us (Mark Grand) (08/13/89)
Generating a list of acceptable tokens before allowing YACC to perform a default reduction is expensive. A cheaper way (assumimg a fast implementation of memcpy) is to take a snapshot of YACC's state stack every time it gets a new token. That way you can generate a list of the expected tokens from the snapshot and only have to do it when actually needed. -- Mark Grand markg@well.sf.ca.usa GeoMaker Software {apple,pacbell,hplabs,ucbvax}!well!markg P.O. Box 273124 415-680-1964 Concord CA 94527-3124 -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.
eachus@mbunix.mitre.org (Robert Eachus) (08/15/89)
In article <1989Aug12.201931.4857@esegue.uucp> Mark Grand writes: >Generating a list of acceptable tokens before allowing YACC to perform a >default reduction is expensive. A cheaper way (assumimg a fast >implementation of memcpy) is to take a snapshot of YACC's state stack every >time it gets a new token. That way you can generate a list of the expected >tokens from the snapshot and only have to do it when actually needed. There is an even faster way. Pat Prange and I used it in a parser generator and driver {LALR} which is available on Multics. If the next action to be taken by the compiler is a reduction (other than into an accepting state) keep a list of states but do nothing. If it is the first read (shift) following an accepting state hold it to one side also. When a succeeding legal token has been read and is ready to shift onto the stack (or an accepting state is reached), do the pending read (if any) then do all pending reductions up to the current read. With this scheme, error correction can start from the point just before the last legal token was read. This allows all sorts of possible error recovery strategies to be tried. We had twelve combinations that we tried including deleting the previously read token and swaping the previous and current tokens. All this was "free" in that it required a few bytes of storage in the drivers stack, and about 3 or 4 extra instructions per successful shift or reduction. (Failures are another story, we allowed ourselves about a tenth of a CPU second per correction, but it was worth it since the compilers could often continue through several difficult syntax errors and do a meaningful error scan thorugh an entire program. Two of my favorite corrections came from the Ada Compiler Validation Suite: legal-statement; IFB THEN ... was corrected to: legal-statement; if IFB then ... and X: INTEGER ran ge 1..10; was corrected to: X: INTEGER range 1..10; All without special error tables, and without looking at the spellings of non-terminals! In fact the parser tables for the Ada grammar above were about 3500 36-bit words, including 36 words for error recovery (closures for constructs to be deleted if panic mode was used to throw a sequence of invalid tokens). As far as I know this tool is still used to maintain the parsers for several Honeywell compilers (especially for the DPS6 and DPS6+, but not the Ada compiler from DDC), and the source (highly Multics specific PL/1) is on every existing machine running Multics. (There just aren't too many of them any more...sigh!) Robert I. Eachus eachus@mbunix.mitre.org -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.