meyer@fuhainf2.uucp (Bernd Meyer) (10/18/90)
Reply-To: meyer@fernuni-hagen.de

Julian Smart writes:
> I have recently started to use Prolog DCGs (Definite Clause Grammars)
> to implement a command parser, but have found it very hard to give
> effective error messages besides the uninformative 'Parse Error'.
> The trouble seems to be that grammar rules are _expected_ to fail, to allow
> backtracking to work, but then Prolog can't tell when a rule was _nearly_
> fulfilled, after all rules have failed. For example a sentence may be
> fine except a noun was not in the dictionary. Surely, we should be able
> to tell the user we expected to find a noun.
>
> Does anyone have any ideas about this or perhaps pointers to the
> literature? I've tried the usual Prolog introductory books and
> 'Prolog and Natural Language Analysis' (Pereira and Shieber),
> but error handling doesn't seem to get mentioned, unless I've missed
> something.
>
> Could the problem be partly due to the top-down nature of Prolog DCGs,
> or perhaps it requires a chart parser which remembers parse
> constituents?

we had nearly the same problem when implementing an entire compiler in
prolog about a year ago. of course, the problems are a little different
when dealing with natural language. there are only a few references in
the literature; they are given below in bibtex format.

we wanted more than a plain "yes" or "no": we wanted the parser to
perform error detection and error *recovery*, i.e. we wanted a dcg (or
a general implementation of dcgs) that is able to skip errors and
continue parsing as near to the error position as possible. basically
we pursued two different ideas, semi-automatic grammar transformation
and meta interpretation, which we explain roughly below.

1. semi-automatic grammar transformation

In contrast to the second method, in this approach we have to modify a
given DCG in certain places. Unfortunately, this cannot be done fully
automatically, but we developed some tools supporting this task.
The basic idea is to insert an error production for every nonterminal
symbol. It is inserted as the last production (or clause) so that it is
applied only if all other productions have failed before it (and they
are *expected* to fail in the error case). These error productions
start the error recovery phase, which is in some respects a "panic-mode
recovery" (in the classification of [Aho, Sethi, Ullman: Compilers,
1987]).

For this approach it is necessary to compute the first and follow sets
of the nonterminals, in order to decide when the recovery phase has
finished and parsing can be continued. However, these sets are defined
for CFGs, not for things like DCGs, which can handle context-sensitive
information and can have arbitrary Prolog subgoals on the rhs of a
production, to name only two important features that cause difficulties
in our case. So we have to get from a DCG to something similar to a
CFG, which we call a "skeleton grammar" (SkG). Roughly speaking, it is
the original DCG without "unimportant" arguments and without any Prolog
subgoals. (DCGs with pushback productions of the form p, [t] --> ...
are not permitted.)

The "important" parameters have to be specified, much like mode
declarations. An example of such a parameter is the number of a noun or
verb, which restricts the applicable productions for verb phrases once
e.g. a plural form of a noun has been found.

Example for the conversion of a DCG production into an SkG production:
given the production

    a(X, Y) --> [t], b(Y), { p(X) }

and the "mode declaration"

    important(a(X, Y), a(X))

the generated SkG production is

    a(X) --> [t], b.

This conversion is done by a program. The remaining task is to add
conditions on the parameters of the nonterminals ('X' in the example)
in order to obtain a finite grammar. In the example this could be the
additional subgoal member(X, [sing, plur]). Now the required sets can
be computed.
(Doing this manually is very difficult, boring, and prone to errors in
non-toy grammars!) The generated facts of the running example are

    first(a(sing), [t]).     follow(a(sing), []).
    first(a(plur), [t]).     follow(a(plur), []).

together with the error production

    a(X, Y) --> panic(a(X, Y))

which is inserted into the original DCG as described above. The DCG has
to be modified for the epsilon productions, too. These do not fail, but
they should only be applied if the next input token is an element of
the follow set of the current nonterminal. Therefore, an epsilon
production p --> [] is replaced by something like

    p, [X] --> [X], { can_follow(X, p) }.

Now the 'panic' predicate can start the recovery phase. We check for
three different kinds of errors:

(1) A possibly misspelled keyword can be replaced by the most similar
    correct one (similarity computed by a weighted Levenshtein
    distance).

(2) Otherwise, one or more tokens could be missing from the input
    stream. This is assumed if the next input token is an element of
    the follow set of the current nonterminal. The current nonterminal
    is then regarded as fulfilled without applying any of its
    productions, and a message like "missing 'verbal_phrase' inserted"
    is printed. In addition, a special "status" token (e.g. '@@@') is
    inserted behind the current input token, which serves to suppress
    further error messages caused by the same error. Only when the
    current input token is read successfully by another production is
    the "status" token deleted and the parsing state regarded as stable
    again.

(3) If the next input token is not in the follow set of the current
    nonterminal, we assume it is superfluous and delete it (and
    possibly more tokens) from the input stream until an appropriate
    synchronizing token is found. (Hence the name of the game: "panic
    mode".) An improvement is to regard not only the elements of the
    follow sets as synchronizing, but to stop skipping the input when
    e.g. a "strong" keyword is found which is unlikely to be there by
    mistake.
Otherwise you run the risk of skipping very large parts of the input.

2. meta interpretation

this second way was pursued with the idea in mind that no modification
of the grammar should be necessary. it should be possible to enter
arbitrary dcgs (or at least a big subclass of dcgs) and have them
executed with automatic error handling, without any further
intervention by the programmer. sounds strange, but it is possible. of
course, only basic error messages like "missing then" or the like can
be produced.

in our context, meta interpreters are prolog programs that execute
other prolog programs, i.e. the clauses of the original program are
viewed as input to another program, which has the same (or, hopefully,
a somehow enhanced) functionality as the original program. a good
reference on meta interpreters is \cite{Ne88}. such meta interpreters
have successfully been used for explanation generation in reasoning
tools and for debugging of programs, which brings us to our case.

as you wrote, productions are *expected* to fail if parsing cannot be
continued. that would not be a problem if the dcg were strictly LL(k)
and written such that there is a unique production for every input
prefix (or no production at all). then all we would have to do is
memorize the last production that was successfully executed. if parsing
cannot be continued in the next step, we back up to that production and
try to continue after discarding one input token. if that does not
solve the problem, we restart again, discarding the next token. this
process can be continued until parsing succeeds or a number of input
tokens have been discarded, let us say five of them (use a heuristic or
insert your lucky number). in the latter case, discarding does not seem
to solve the problem, so perhaps there were tokens missing rather than
extra tokens.
in that case we restart, this time inserting variable tokens (unbound
variables) into the input stream to simulate the possibly missing
tokens. we preferred another way to test for missing tokens: the next
subgoal (production) to be proved is assumed to succeed without
actually executing it. those tokens that should have been consumed by
the skipped production are handled by the deletion procedure described
above. of course, empty productions (p --> []) may never be skipped,
because they are applicable everywhere. if we cannot recover this way
either, we have a third chance: to test for misspelt tokens by
simultaneously deleting and inserting.

there must be a way to decide when we consider parsing to have
recovered. we chose to consider the parser in a stable state again when
the next production is applicable without changing the input token
stream. this procedure is likely to recover from errors correctly if
the dcg is strictly LL(k) and uses only unification to determine the
next applicable production, i.e. if there are no extra subgoals (those
enclosed in curly braces) that can make a production fail. the
corrections leading to recovery are stored by assert and retract in
order to generate error messages like "extra token xxx" or "yyy
missing".

problems are likely to emerge if the dcg contains extra subgoals or
cuts. we did not allow extra subgoals to make a production fail; other
subgoals do not cause problems. cuts are treated by the ancestor cut
!(X). this sounds easy, but causes some difficulties. besides the
"normal" difficulties in interpreting cuts \cite{oK85}, we must not
execute cuts while parsing continues and has not yet reached a stable
state again. but we do have to execute those cuts which we ignored, as
soon as a stable state is reached, pruning the proof tree at the nodes
where those cuts appeared. hence, the nodes where the cuts appeared
have to be stored away.
as soon as the parser has successfully recovered, they are retrieved
and the proof tree is pruned at those nodes using the ancestor cut
before parsing is continued.

the last complication to be mentioned is the case of dcgs containing
non-deterministic productions. here errors may be detected far too
late, after a number of wrong productions have already been applied.
then it is not possible to determine the last correctly applied
production with a simple scheme like this one.

the final version of our compiler was implemented using the concept of
semi-automatic grammar transformation, due to the time and space
complexity of meta interpretation (the final version of the meta
interpreter looked really *horrible*, because we tried every trick to
save time and space; in fact, in the end it did not look much like a
*logic* program). grammar transformation seems to be sufficient for
acceptable error handling and should not be a problem if the grammar is
fixed once and for all.

quite another way to embed error handling in prolog parsers would be to
parse bottom-up instead of top-down, or to generate table-driven
parsers from dcgs automatically. with such parsers it would be possible
to use well-known conventional error handling techniques. possibilities
to implement such parsers are discussed in \cite{MTHMY83} and
\cite{Ni86}, but error handling is not covered there.

we would greatly appreciate hearing of any other ideas...

bernd meyer & gerhard d. westerman

p.s. please do *not* use the email address in the From: line of the
header. due to some problems with our mailer it is incorrect. please
use the address given in the signature. tnx.
@ARTICLE{MTHMY83,
  AUTHOR  = {Yuji Matsumoto and Hozumi Tanaka and Hideki Hirakawa and
             Hideo Miyoshi and Hideki Yasukawa},
  TITLE   = {{BUP}: A Bottom-Up Parser Embedded in {P}rolog},
  JOURNAL = {New Generation Computing},
  VOLUME  = 1,
  PAGES   = {145 -- 158},
  YEAR    = 1983
}

@ARTICLE{Ni86,
  AUTHOR  = {Ulf Nilsson},
  TITLE   = {{AID}: An Alternative Implementation of {DCGs}},
  JOURNAL = {New Generation Computing},
  VOLUME  = 4,
  PAGES   = {383 -- 399},
  YEAR    = 1986
}

@ARTICLE{Wa80,
  AUTHOR  = {David H. D. Warren},
  TITLE   = {Logic Programming and Compiler Writing},
  JOURNAL = {Software --- Practice and Experience},
  VOLUME  = 10,
  PAGES   = {97 -- 125},
  YEAR    = 1980
}

@ARTICLE{PW80,
  AUTHOR  = {F. C. N. Pereira and D. H. D. Warren},
  TITLE   = {Definite Clause Grammars for Language Analysis},
  JOURNAL = {Artificial Intelligence},
  VOLUME  = 13,
  PAGES   = {231 -- 278},
  YEAR    = 1980
}

@ARTICLE{CH87,
  AUTHOR  = {Jacques Cohen and Timothy J. Hickey},
  TITLE   = {Parsing and Compiling Using {P}rolog},
  JOURNAL = {ACM Transactions on Programming Languages and Systems},
  VOLUME  = 9,
  MONTH   = {April},
  PAGES   = {125 -- 163},
  YEAR    = 1987
}

@BOOK{Ne88,
  AUTHOR    = {Gustaf Neumann},
  TITLE     = {{M}eta-{P}rogrammierung und {P}rolog},
  PUBLISHER = {Addison Wesley},
  ADDRESS   = {Bonn},
  YEAR      = 1988
}

@INPROCEEDINGS{oK85,
  AUTHOR    = {Richard A. O'Keefe},
  TITLE     = {On the Treatment of Cuts in {P}rolog Source-Level Tools},
  BOOKTITLE = {Proc. of the Int. IEEE Symposium on Logic Programming},
  PAGES     = {68 -- 72},
  YEAR      = 1985
}

--
Bernd Meyer, LG Praktische Informatik IV, FernUniversitaet Hagen,
Bahnhofstrasse 46/48, D-5800 Hagen, FRG, Phone: +49 2331 181677
** db6ag@db0sgl **   meyer@fernuni-hagen.de  or  meyer@dhafeu61.bitnet
s=meyer; ou=vax1; ou=informatik; p=fernuni-hagen; a=dbp; c=de