[comp.lang.prolog] DCGs and parse error handling

meyer@fuhainf2.uucp (Bernd Meyer) (10/18/90)

Reply-To: meyer@fernuni-hagen.de


Julian Smart writes:

> I have recently started to use Prolog DCGs (Definite Clause Grammars)
> to implement a command parser, but have found it very hard to give
> effective error messages besides the uninformative 'Parse Error'.
> The trouble seems to be that grammar rules are _expected_ to fail, to allow
> backtracking to work, but then Prolog can't tell when a rule was _nearly_
> fulfilled, after all rules have failed.  For example a sentence may be
> fine except a noun was not in the dictionary.  Surely, we should be able
> to tell the user we expected to find a noun.
> 
> Does anyone have any ideas about this or perhaps pointers to the
> literature?  I've tried the usual Prolog introductory books and
> 'Prolog and Natural Language Analysis' (Pereira and Shieber),
> but error handling doesn't seem to get mentioned, unless I've missed
> something.
> 
> Could the problem be partly due to the top-down nature of Prolog DCGs,
> or perhaps it requires a chart parser which remembers parse
> constituents?


We had nearly the same problem when implementing an entire compiler in Prolog
about a year ago. Of course, the problems will be a little
different when dealing with natural language. There are only a few references
in the literature; they are given below in BibTeX format.

We wanted a little more than just "yes" or "no": we wanted the
parser to perform error detection and error *recovery*, i.e. we wanted a DCG,
or a general implementation of DCGs, that was able to skip errors and continue
parsing as near to the error position as possible.
Basically we pursued two different ideas, semi-automatic grammar transformation
and meta-interpretation, which we explain roughly below.

1. Semi-automatic grammar transformation

In this approach, in contrast to the second method, we have to modify a
given DCG in certain places. Unfortunately, this cannot be done fully
automatically, but we developed some tools to support the task.

The basic idea is to insert an error production for every nonterminal
symbol. It is inserted as the last production (or clause) of the
nonterminal, so that it is applied only if all the other productions have
failed (and they are *expected* to fail in the error case). These error
productions start the error recovery phase, which is in some respects a
"panic-mode recovery" (in the classification of [Aho, Sethi, Ullman:
Compilers, 1987]).
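
As an illustration, a nonterminal with its error production appended as the
last clause might look like this (noun_phrase, determiner, noun, and pronoun
are hypothetical names just for the example; panic/1 is sketched further
below):

	noun_phrase(N) --> determiner(N), noun(N).
	noun_phrase(N) --> pronoun(N).
	noun_phrase(N) --> panic(noun_phrase(N)).   % error production, tried last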

For this approach it is necessary to compute the first and follow sets
of the nonterminals, to be able to say when the recovery phase has finished
and parsing can be continued. However, these sets are defined for CFGs but
not for things like DCGs, which are capable of handling context-sensitive
information, and which can have arbitrary Prolog subgoals on the rhs of
the productions, to name only two important features that cause
difficulties in our case. So we have to get from a DCG to something
similar to a CFG, which we call a "skeleton grammar" (SkG). Roughly
speaking, it is the original DCG without "unimportant" arguments and
without any Prolog subgoals. (DCGs with pushback productions of the form
	p, [t] --> ...
are not permitted.)

The "important" parameters have to be specified like mode declarations. 
An example of such a parameter would be the number of a noun and a verb, 
that restricts the applicable productions for verb phrases having found 
e.g. a plural form of a noun.

Example of the conversion of a DCG production to an SkG production:
	Given the production
		a(X, Y) --> [t], b(Y), { p(X) }
	and the "mode declaration"
		important(a(X, Y), a(X))
	the generated SkG production is
		a(X) --> [t], b.

This conversion is done by a program.
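
A minimal sketch of what such a converter could look like, assuming the DCG
rules are available as (Head --> Body) terms and the mode declarations as
important/2 facts (skeleton_rule/2, skeleton_body/2, and join/3 are our own
names for this sketch):

	% Mode declarations for the running example.
	important(a(X, _Y), a(X)).
	important(b(_Y), b).

	% skeleton_rule(+DcgRule, -SkgRule): strip the unimportant
	% arguments and the Prolog subgoals from one production.
	skeleton_rule((Head --> Body), (SkHead --> SkBody)) :-
		important(Head, SkHead),
		skeleton_body(Body, SkBody).

	skeleton_body((G1, G2), Sk) :- !,
		skeleton_body(G1, Sk1),
		skeleton_body(G2, Sk2),
		join(Sk1, Sk2, Sk).
	skeleton_body({_}, true) :- !.		% drop Prolog subgoals
	skeleton_body([], []) :- !.		% keep terminal lists
	skeleton_body([T|Ts], [T|Ts]) :- !.
	skeleton_body(NT, SkNT) :- important(NT, SkNT).

	join(true, Sk, Sk) :- !.		% a dropped subgoal leaves
	join(Sk, true, Sk) :- !.		% no trace in the skeleton
	join(Sk1, Sk2, (Sk1, Sk2)).

With these clauses, skeleton_rule((a(X, Y) --> [t], b(Y), {p(X)}), R)
yields R = (a(X) --> [t], b), as in the example above.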

The remaining task is to add some conditions on the parameters of the
nonterminals ('X' in the example) in order to obtain a finite grammar. In
the example this could be the additional subgoal
	member(X, [sing, plur]).
Now the required sets can be computed. (Doing this manually is very
difficult, boring, and error-prone for non-toy grammars!)

The generated facts for the running example are
	first(a(sing), [t]).	follow(a(sing), []).
	first(a(plur), [t]).	follow(a(plur), []).
and the error production
	a(X, Y) --> panic(a(X, Y))
which has to be inserted into the original DCG as described above.

The DCG has to be modified for the epsilon productions, too. These do not
fail, but they should only be applied if the next input token is an
element of the follow set of the current nonterminal. Therefore, an epsilon
production
	p --> []
is replaced by something like the pushback rule
	p, [X] --> [X], { can_follow(X, p) }.
which reads the next token, checks it, and puts it back.
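
Given the generated follow/2 facts, the check itself is straightforward
(a minimal sketch):

	% can_follow(+Token, +Nonterm): Token may legally follow Nonterm
	% according to the computed follow sets.
	can_follow(Token, Nonterm) :-
		follow(Nonterm, FollowSet),
		member(Token, FollowSet).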

Now the 'panic' predicate can start the recovery phase. We check for three
different kinds of errors (a sketch of a possible panic/1 follows the list):

(1)	A possibly misspelled keyword can be replaced by the most similar
	correct one (similarity computed by a weighted Levenshtein distance).

(2)	If (1) is not the case, one or more tokens could be missing from the
	input stream. This is assumed if the next input token is an element of
	the follow set of the current nonterminal. Then the current nonterminal
	is regarded as fulfilled without applying any of its productions. A
	message like "missing 'verbal_phrase' inserted" is printed.

	In addition, a special "status" token (e.g. '@@@') is inserted behind
	the current input token, which serves to suppress further error
	messages due to the same error. Only if the current input token can
	be read successfully by another production is the "status" token
	deleted and the parsing status regarded as stable.
	
(3)	If the next token of the input stream is not in the follow set of the
	current nonterminal, we assume it to be superfluous and delete it (and
	possibly more) from the input stream until an appropriate synchronizing
	token is found. (Hence the name of the game: "panic mode".)

	An improvement is to regard not only the elements of the follow sets as
	synchronizing, but to stop skipping the input when e.g. a "strong"
	keyword is found which is unlikely to have been forgotten. Otherwise
	you run the risk of skipping very large parts of the input.
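
A rough sketch of what panic/1 might look like as pushback rules, under the
conventions above (keyword/1, similar/2, synchronizing/1, and report/1 are
assumed helpers; similar/2 would use the weighted Levenshtein distance):

	panic(_NT), [Good] -->		% case (1): a misspelled keyword is
		[Bad],			% replaced by the most similar one;
		{ keyword(Good),	% the corrected token is pushed back
		  similar(Bad, Good) }, !,
		{ report(replaced(Bad, Good)) }.
	panic(NT), [Tok, '@@@'] -->	% case (2): tokens are missing; NT
		[Tok],			% counts as fulfilled, '@@@' marks
		{ can_follow(Tok, NT) }, !,	% the still unstable status
		{ report(missing(NT)) }.
	panic(NT) -->			% case (3): skip superfluous tokens
		[Tok],			% until a synchronizing one appears
		{ \+ synchronizing(Tok) }, !,
		{ report(deleted(Tok)) },
		panic(NT).
	panic(_NT) --> [].		% next token synchronizes: stop

In case (1) the real system must of course retry the nonterminal on the
corrected input; the sketch only shows how the three cases are distinguished.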


2. Meta-interpretation

This second way was pursued with the idea in mind that no modification
of the grammar should be necessary. It should be possible to enter
arbitrary DCGs (or at least a big subclass of DCGs) and have them executed
with automatic error handling, without any further intervention by the
programmer. Sounds strange, but it is possible.
Of course, only basic error messages like "missing then" or the like can be
produced.

In our context, meta-interpreters are Prolog programs that
execute other Prolog programs, i.e. the clauses of the original program
are viewed as input to another program, which has the same (or, hopefully,
a somehow enhanced) functionality as the original program.
A good reference on meta-interpreters is \cite{Ne88}.
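
The basis is the well-known "vanilla" meta-interpreter; an error-handling
version extends a scheme like this one:

	% solve/1 proves a goal using the clauses of the interpreted
	% program, retrieved via clause/2.
	solve(true) :- !.
	solve((A, B)) :- !,
		solve(A),
		solve(B).
	solve(Goal) :-
		clause(Goal, Body),
		solve(Body).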

Such meta-interpreters have successfully been implemented for
explanation generation in reasoning tools and for debugging of programs,
which brings us to our case.

As you wrote, productions are *expected* to fail if parsing cannot be
continued. That would not be a problem if the DCG were strictly LL(k) and
written such that there is a unique production for every input prefix (or no
production at all). Then all we would have to do is memorize the
last production that was successfully executed. If parsing cannot
be continued in the next step, we back up to that production and try to
continue after discarding one input token. If that does not solve
the problem, we restart again, discarding the next token as well. This
process can be continued until parsing succeeds or a certain number of
input tokens have been discarded, let us say five of them (use a heuristic
or insert your lucky number). In the latter case discarding does not seem
to solve the problem, so perhaps there were tokens missing rather than
extra tokens.
When discarding fails, we restart, this time inserting variable
tokens (unbound variables) into the input stream to simulate the possibly
missing tokens. We preferred another way to test for missing tokens: the
next subgoal (production) to be proved is assumed to succeed without
actually executing it. Those tokens that should have been consumed by the
skipped production are handled by the deletion procedure described above.
Of course, empty productions (p --> []) may never be skipped, because they
are applicable everywhere.

If we cannot recover this way either, we have a third chance: to test
for misspelt tokens by simultaneously deleting and inserting.

There must be a way to decide when we consider parsing to have recovered.
We chose to consider the parser to be in a stable state again when the next
production is applicable without changing the input token stream. This
procedure is likely to recover from errors correctly if the DCG is strictly
LL(k) and uses only unification to determine the next applicable production,
i.e. if there are no extra subgoals (those enclosed in curly braces)
that can make a production fail.

The corrections leading to recovery are stored by assert and retract
in order to generate error messages like "extra token xxx" or "yyy missing".
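
For instance (correction/1 and the message forms are hypothetical):

	% Record a correction during recovery; report them all afterwards.
	note_correction(C) :- assertz(correction(C)).

	report_corrections :-
		retract(correction(C)),
		write_message(C), nl,
		fail.
	report_corrections.

	write_message(deleted(T)) :-
		write('extra token '), write(T).
	write_message(missing(T)) :-
		write(T), write(' missing').
	write_message(replaced(B, G)) :-
		write(B), write(' replaced by '), write(G).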

Problems are likely to emerge if the DCG contains extra subgoals or cuts.
We did not allow extra subgoals to make a production fail; other
subgoals do not cause problems. Cuts are treated by the ancestor cut !(X).
This sounds easy, but causes some difficulties. Besides the "normal"
difficulties in interpreting cuts \cite{oK85}, we are not allowed to
execute cuts while parsing continues and has not yet reached a stable state
again. But we have to execute those cuts which we ignored, as soon as a
stable state is encountered, pruning the proof tree at the nodes where those
cuts appeared. Hence, the nodes where the cuts appeared have to be stored
away. As soon as the parser has successfully recovered, they are
retrieved and the proof tree is pruned at those nodes using the ancestor cut,
before parsing is continued.

The last complication to be mentioned is the case of DCGs containing
non-deterministic productions. In this case errors may be detected far too
late, after a number of wrong productions have already been applied.
Then it is not possible to determine the last correctly applied production
with a simple scheme like the one above.



The final version of our compiler was implemented using the concept of
semi-automatic grammar transformation, due to the time and space complexity
of meta-interpretation (the final version of the meta-interpreter
looked really *horrible*, because we tried every trick to save time and
space - in fact, it did not look much like a *logic* program in the end).
Grammar transformation seems to be sufficient to perform acceptable
error handling and should not be a problem if the grammar is fixed
once and for all.

Quite another way to embed error handling in Prolog parsers would be to
parse bottom-up instead of top-down, or to automatically generate
table-driven parsers from DCGs. With such parsers it would be possible
to use well-known conventional error handling techniques. Possibilities
for implementing such parsers are discussed in \cite{MTHMY83} and \cite{Ni86},
but error handling is not discussed there.


We would greatly appreciate hearing of any other ideas...

Bernd Meyer & Gerhard D. Westerman

P.S. Please do *not* use the email address in the From: line of the header.
     Due to some problems with our mailer it is incorrect. Please use the
     address given in the signature. Thanks.


@ARTICLE{MTHMY83,
	AUTHOR		= {Yuji Matsumoto and Hozumi Tanaka and Hideki Hirakawa
			   and Hideo Miyoshi and Hideki Yasukawa},
	TITLE		= {{BUP}: A Bottom-Up Parser Embedded in {P}rolog},
	JOURNAL		= {New Generation Computing},
	VOLUME		= 1,
	PAGES		= {145 -- 158},
	YEAR		= 1983
      }

@ARTICLE{Ni86,
	AUTHOR		= {Ulf Nilsson},
	TITLE		= {{AID}: An Alternative Implementation of {DCGs}},
	JOURNAL		= {New Generation Computing},
	VOLUME		= 4,
	PAGES		= {383 -- 399},
	YEAR		= 1986
      }

@ARTICLE{Wa80,
	AUTHOR		= {David H.D. Warren},
	TITLE		= {Logic Programming and Compiler Writing},
	JOURNAL		= {Software --- Practice and Experience},
	VOLUME		= 10,
	PAGES		= {97 -- 125},
	YEAR		= 1980
      }

@ARTICLE{PW80,
	AUTHOR		= {F.C.N. Pereira and D.H.D. Warren},
	TITLE		= {Definite Clause Grammars for Language Analysis ---
			   A Survey of the Formalism and a Comparison with
			   Augmented Transition Networks},
	JOURNAL		= {Artificial Intelligence},
	VOLUME		= 13,
	PAGES		= {231 -- 278},
	YEAR		= 1980
      }

@ARTICLE{CH87,
	AUTHOR		= {Jacques Cohen and Timothy J. Hickey},
	TITLE		= {Parsing and Compiling Using {P}rolog},
	JOURNAL		= {ACM Transactions on Programming Languages and Systems},
	VOLUME		= 9,
	MONTH		= {April},
	PAGES		= {125 -- 163},
	YEAR		= 1987
      }

@BOOK{Ne88,
	AUTHOR		= {Gustaf Neumann},
	TITLE		= {{M}eta-{P}rogrammierung und {P}rolog},
	PUBLISHER	= {Addison Wesley},
	ADDRESS		= {Bonn},
	YEAR		= 1988
      }

@INPROCEEDINGS{oK85,
	AUTHOR		= {Richard A. O'Keefe},
	TITLE		= {On the Treatment of Cuts in {P}rolog Source-Level
			   Tools},
	BOOKTITLE	= {Proc. of the Int. IEEE Symposium
			   on Logic Programming},
	PAGES		= {68 -- 72},
	YEAR		= 1985
      }

Bernd Meyer, LG Praktische Informatik IV, FernUniversitaet Hagen, D-5800 Hagen
Bahnhofstrasse 46/48, FRG, Phone: +49 2331 181677,    ** db6ag@db0sgl **
meyer@fernuni-hagen.de     or    meyer@dhafeu61.bitnet 
s=meyer; ou=vax1; ou=informatik; p=fernuni-hagen; a=dbp; c=de