[comp.lang.prolog] Grammar rule translator.

ok@quintus (07/27/88)

In his recent posting about the Oxford Prolog Library, Jocelyn Paine
said that one of the components was a grammar rule translator.  The
description says 'The translator is essentially the same as that
published in "Programming in Prolog", by Clocksin & Mellish'.

The grammar rule translator which appeared in the 1st and 2nd editions
of Clocksin & Mellish had the following flaws:

-- there was a variable name spelling mistake in one of the clauses,
   which broke the translation of rules with pushback
   (fixed in the 3rd edition, I think);

-- the translator did not expand (If->Then;Else)s
   (still not handled in the 3rd edition);

-- the translator did not use the 'C'/3 expansion, but tried to do
   the terminal matching in the translator, which meant that code with
   cuts and ``meta-logical'' operations did not work as expected
   (still a problem in the 3rd edition);

-- the translator does nothing useful with variables appearing as
   non-terminals (in the 3rd edition it goes into a loop);

-- the definition of phrase/2 is wrong--it only accepts nonterminals
   but should accept grammar rule bodies (i.e. phrase((p,q), S) will
   call ','(p,q,S,[]) when it should call (p(S,S1),q(S1,[])))
   (still the case in the 3rd edition).

This is reasonable because Clocksin & Mellish do not purport to provide
a specification of the Prolog language, only instruction in how to use
it.  Their grammar rule translator is presented only as the answer to
an exercise.

The lack of a reliable specification means that some vendors have left
grammar rules out entirely, and others have botched the job.  Others
have added "features".  For example, ALS decided to make calls to certain
built-in predicates are quietly translated as themselves.  The trouble
with this is that this is done _quietly_ (so you never know when it has
happened without looking at the code in the data base) and that there is
no apparent pattern to the choice of which builtins are left alone and
which are changed (so that you, or at any rate I, can never recall which
are which). In the absence of an agreed standard, however, nobody can say
that this was wrong.

There is a group of people who purport to be developing a Prolog
standard at the taxpayer's expense.  You would expect that group to
have addressed the grammar rule translation issue, ideally by
distributing a draft specification in the form of usable Prolog code.
The BSI Prolog substandard committee, however, in their great wisdom,
have preferred to do other things.

There's only one thing to do, then, and that's to publish source code.
The code which follows in this message was rewritten from scratch for
this purpose.  The purpose of this code is to serve as a workable
specification of
	phrase/3
	phrase/2
	'C'/3
and the translation from Lhs-->Rhs to Head:-Body form.  Error checking
and reporting has quite deliberately been left out, and a couple of cases
have been over-generalised to accomodate this.  I have supplied the
dcg rule translation as a separate predicate 'dcg rule'/2 to decouple
the question of what DCG rules are translated to from the question of
how DCG rule translation is integrated into the rest of a Prolog system;
I am not proposing that 'dcg rule'/2 should be directly available under
that name.   phrase/[2,3] and 'C'/3 _should_ be directly available.

There may be mistakes in the code that follows.  There may be things it
does that it shouldn't do.  There may be things it doesn't do that you
think it should.  Let's hear about them.  There are only 66 lines of
Prolog code; if we can't get something this size satisfactory, we'll
_deserve_ the BSI substandard.

%   File   : DCG.PL
%   Author : Richard A. O'Keefe
%   Updated: Tuesday July 26th, 1988.
%   Purpose: Definite Clause Grammar rule to Prolog clause translator.

/*  This file is written in the ISO 8859/1 character set.  The "Copyright"
    line contains after the right parenthesis a character which when
    transmitted was character 169, the international copyright symbol.

    Copyright (C)) 1988, Quintus Computer Systems, Inc.

    This file is distributed in the hope that it will be useful,
    but without any warrantee.  You are expressly warned that it is meant
    to serve as a model, not for direct use:  all error checking and
    reporting has been omitted, and mistakes are almost surely present.
    Permission is granted to anyone to distribute verbatim copies of
    this source code as received, in any medium, provided that the
    copyright notice, the nonwarrantee warning, and this permission
    notice are preserved.  Permission is granted to distribute modified
    versions of this source code, or of portions of it, under the above
    conditions, plus the conditions that all changed files carry
    prominent notices stating who last changed them and that the derived
    material is subject to this same permission notice.  Permission is
    granted to include this material in products for which money is
    charged, provided that the customer is given written notice that the
    code is (or is derived from) material provided by Quintus Computer
    Systems, Inc., and that the customer is given this source code on
    request.

	----------------------------------------------------------------

    Now that we've got that (adapted from the GNU copyright notice)
    out of the way, here are the technical comments.

    The predicates are all named 'dcg <something>'/<some arity> in order
    to keep out of the way, with the exception of phrase/2 and phrase/3
    which bear their proper names.  Only phrase/[2,3] and 'dcg rule'/2
    are meant to be called directly, and 'dcg rule'/2 is meant to be called
    from expand_term/2.  You need to keep 'dcg body'/4 and its dependents
    around at run time so that variables as nonterminals in DCG rule bodies
    will work correctly.

    So that Quintus have _something_ left to sell, this code has been
    rewritten from scratch with no error checking or reporting code at
    all, and a couple of places accept general grammar rule bodies where
    they are really supposed to demand lists of terminals.  However, any
    rule which is valid according to the Quintus Prolog manual will be
    translated correctly, except that this code makes no attempt to handle
    module: prefixes.  (The change is trivial.)	

    Note that 'dcg rule'/2 and phrase/[2,3] are steadfast.
*/

%   dcg rule(+Grammar Rule, -Equivalent Clause)

'dcg rule'(-->(Head0,Body0), Clause) :-
	'dcg head'(Head0, Head, PushBack, S0, S),
	'dcg body'(Body0, Body1, S0, S),
	'dcg conj'(Body1, PushBack, Body),
	Clause = :-(Head,Body).


%   dcg head(+Head0, -Head, -PushBack, -S0, -S)
%   recognises both
%	NonTerminal, [PushBackList] -->
%   and
%	NonTerminal -->
%   It returns the difference pair S0\S which the body is to parse.
%   To avoid error checking, it will accept an arbitrary body in place
%   of a pushback list, but it should demand a proper list.

'dcg head'((Head0,PushBack0), Head, PushBack, S0, S1) :- !,
	'dcg goal'(Head0, Head, S0, S),
	'dcg body'(PushBack0, PushBack, S, S1).
'dcg head'(Head0, Head, true, S0, S) :-
	'dcg goal'(Head0, Head, S0, S).


%   dcg goal(+Goal0, -Goal, +S0, +S)
%   adds the arguments S0, S at the end of Goal0, giving Goal.
%   It should check that Goal0 is a callable term.

'dcg goal'(Goal0, Goal, S0, S) :-
	functor(Goal0, F, N),
	N1 is N+1,
	N2 is N+2,
	functor(Goal, F, N2),
	arg(N2, Goal, S),
	arg(N1, Goal, S0),
	'dcg args'(N, Goal0, Goal).


%   dcg args(+N, +Goal0, +Goal)
%   copies the first N arguments of Goal0 to Goal.

'dcg args'(N, Goal0, Goal) :-
	(   N =:= 0 -> true
	;   arg(N, Goal0, Arg),
	    arg(N, Goal,  Arg),
	    M is N-1,
	    'dcg args'(M, Goal0, Goal)
	).


%   dcg body(+Body0, -Body, +S0, +S)
%   translates Body0 to Body, adding arguments as needed to parse S0\S.
%   It should complain about bodies (such as 2) which are not callable
%   terms, and about lists of terminals which are not proper lists.
%   To avoid error checking, [a|foo] is accepted as [a],foo, but it
%   really should complain.  ONLY the forms lists here should be treated;
%   other non-terminals which look like calls to built-ins could well be
%   commented on (no error reporting here) but should be expanded even
%   so.  Thus X=Y as a nonterminal is to be rewritten as =(X,Y,S0,S),
%   perhaps with a warning.  If you want the translation X=Y, use {X=Y}.

'dcg body'(Var, Body, S0, S) :- var(Var), !,
	Body = phrase(Var,S0,S).
'dcg body'((A0,B0), Body, S0, S) :- !,
	'dcg body'(A0, A, S0, S1),
	'dcg body'(B0, B, S1, S),
	'dcg conj'(A, B, Body).
'dcg body'(->(A0,B0), ->(A,B), S0, S) :- !,
	'dcg body'(A0, A, S0, S1),
	'dcg body'(B0, B, S1, S).
'dcg body'(;(A0,B0), ;(A,B), S0, S) :- !,
	'dcg disj'(A0, A, S0, S),
	'dcg disj'(B0, B, S0, S).
'dcg body'({}(A), A, S, S) :- !.
'dcg body'(!, !, S, S) :- !.
'dcg body'([], true, S, S) :- !.
'dcg body'([H0|T0], Body, S0, S) :- !,
	'dcg term'(H0, H, S0, S1),
	'dcg body'(T0, T, S1, S),
	'dcg conj'(H, T, Body).
'dcg body'(NT0, NT, S0, S) :-
	'dcg goal'(NT0, NT, S0, S).


%   dcg term(+T0, -T, +S0, +S)
%   generates code (T) which succeeds when there is a terminal T0
%   between S0 and S.  This version uses the DEC-10 Prolog predicate
%   'C'/3 for compatibility with DEC-10 Prolog, C Prolog, Quintus Prolog.
%   This is the only place that knows how terminals are translated, so
%   you could supply instead the definition
%	'dcg term'(T0, S0=[T0|S], S0, S).
%   and reap the same benefits.  The one thing you must not do is
%   NO! 'dcg term'(T0, true, [T0|S], S). DON'T DO THAT!

'dcg term'(T0, 'C'(S0,T0,S), S0, S).


%  To see why dcg disj/4 is needed, consider the translation of
%  ( [] | [a] ).  We have to insert S1=S0 somewhere, but we do it at
%  "compile"-time if we can.

'dcg disj'(Body0, Body, S0, S) :-
	'dcg body'(Body0, Body1, S1, S),
	(   S1 == S -> 'dcg conj'(S1=S0, Body1, Body)
	;   S1 = S0, Body = Body1
	).


%   dcg conj(+A, +B, -C)
%   combines two conjunctions A, B, giving C.  Basically, we want to
%   ensure that there aren't any excess 'true's kicking around (in a
%   compiled system, that shouldn't matter).  There is room for some
%   leeway here: I have chosen to flatten A completely.

'dcg conj'(A, true, A) :- !.
'dcg conj'(A, B, C) :-
	'dcg CONJ'(A, B, C).

'dcg CONJ'(true, C, C) :- !.
'dcg CONJ'((A,As), C0, (A,C)) :- !,
	'dcg CONJ'(As, C0, C).
'dcg CONJ'(A, C, (A,C)).


%   'C'(S0, T, S)
%   is true when the terminal T "Connects" the "string positions" S0 and S.

'C'([T|S], T, S).


%   phrase(+NT0, ?S0)
%   is true when the list S0 is in the language defined by the
%   grammar rule body NT0.  E.g. phrase(([a],[b]), [a,b]).

phrase(NT0, S0) :-
	phrase(NT0, S0, []).


%   phrase(+NT0, ?S0, ?S)
%   is true when the list S0\S is in the language defined by the
%   grammar rule body NT0.  E.g. phrase(([a],[b]), [a,b|X], X).

phrase(NT0, S0, S) :-
	'dcg body'(NT0, NT, T0, T),
	T0 = S0, T = S,
	call(NT).

evan@csli.STANFORD.EDU (Evan Kirshenbaum) (07/28/88)

In article <196@quintus.UUCP> ok@quintus () writes:

>-- the translator did not use the 'C'/3 expansion, but tried to do
>   the terminal matching in the translator, which meant that code with
>   cuts and ``meta-logical'' operations did not work as expected
>   (still a problem in the 3rd edition);

A couple of years ago, I was working on a project which involved a
lot of dcg's.  The prolog I was working on (a DEC-10 prolog) used
'C'/3 for terminals, and I was going crazy debugging the parser (some
of the nonterms had twenty or so rules, almost all of them
distinguishable by the first terminal, which I had to step through
each time).  I found that the prolog on another DEC-20 (an earlier
version of DEC-10 prolog) did what I considered "the right thing" and
folded terminals into the head.  However one of my parsers, which had
rules like
	def(pure_funct(F))-->[id(pure)],!,
                             [id(function)],funct_id(F).
no longer worked (it still worked when run under the other system).
What this dcg-translator was doing, of course was folding the
terminals even over a cut, so this was translated to

	def(pure_funct(F),[id(pure),id(function)|S0],S):-!,
		funct_id(F,S0,S).

(The string "pure procedure" should match this clause and fail.  It
didn't match and so matched a default case and did the wrong thing.)

I figured there had to be a middle ground.  I wrote my own translator
which runs rougly as follows (this is from memory):

    'dcg body'(V,B,S0,S):- var(V),!,
	    B = phrase(V,S0,S).
    'dcg body'((A0,B0),(A,B),S0,S):-!,
	    'dcg body'(A0,A,S0,S1),
	    'dcg body'(B0,B,S1,S).
    'dcg body'([],true,S,S):-!.
    'dcg body'([T|Ts],B,[T|S0],S):-!,
	    'dcg body'(Ts,B,S0,S).
    'dcg body'(!,(!,S0=S),S0,S):-!.
    'dcg body'({A},(A,S0=S),S0,S):-!.
    'dcg body'(T,B,S0,S):-
	    T=..L0,
	    append(L0,[S0,S],L1),
	    B=..L1.

(roughly; it pulled out true's and used functor rather than =.., but
this is the gist).  This allows the terminal folding into the head,
but treats cut (and anything which can contain cut) as explicitly
blocking such folding.  With this translator, the above code becomes

	def(pure_funct(F),[id(pure)|S0],S):- !,S0=[id(function)|S1],
		funct_id(F,S1,S).

which seems closer to what is wanted.  

This scheme has worked quite well.  I'm kind of curious as to whether
there's a good reason not to do it this way.

---
Evan Kirshenbaum
Stanford University
  evan@csli.Stanford.EDU
  ...!ucbvax!csli.stanford.edu!evan

If you think my opinions represent this university, 
you haven't been on campus recently!

ok@quintus.uucp (Richard A. O'Keefe) (07/28/88)

In article <4751@csli.STANFORD.EDU> evan@csli.UUCP (Evan Kirshenbaum) writes:
>In article <196@quintus.UUCP> ok@quintus () writes:
>>-- the translator did not use the 'C'/3 expansion, but tried to do
>>   the terminal matching in the translator, which meant that code with
>>   cuts and ``meta-logical'' operations did not work as expected
>>   (still a problem in the 3rd edition);

>A couple of years ago, I was working on a project which involved a
>lot of dcg's.   ...
>I figured there had to be a middle ground.  I wrote my own translator
>which runs roughly as follows (this is from memory):
	...
>    'dcg body'([],true,S,S):-!.
>    'dcg body'([T|Ts],B,[T|S0],S):-!,
>	    'dcg body'(Ts,B,S0,S).
>    'dcg body'(!,(!,S0=S),S0,S):-!.
...
>This allows the terminal folding into the head,
>but treats cut (and anything which can contain cut) as explicitly
>blocking such folding.
...
>This scheme has worked quite well.  I'm kind of curious as to whether
>there's a good reason not to do it this way.

When compiled with the current Quintus Prolog compiler,
	p(..., S0, S) :- 'C'(S0, X, S1), ...
and
	p(..., [X|S1], S) :- ...
run at the same speed.  I do not know whether this was the case in
DEC-10 Prolog.  Anyrate, moving the terminals into the head should
not be neceesary for efficiency.

Kirshenbaum's problem wasn't really with the translation as such, but
with what the debugger showed him.  Speaking for myself, I _like_ seeing
the calls to 'C'/3, and wish I could set a spy-point on it.  There is a
small set of built-in operations (such as ','/2) which the debugger does
not display:  if the debugger would let Kirshenbaum say "don't show me
'C'/3" he would have got pretty much the behaviour that he wanted.  

The trouble with Kirshenbaum's translator is that cut is not the only
thing which can go wrong.  Consider a stupid example:
	p --> {var(X)}, [X].
His translator will turn that into
	p(S0, S) :- var(X), S0=[X|S].
which is correct.  But now suppose it is
	p --> var(X), [X].
	var(X) --> {var(X)}.
which his translator will turn into
	p([X|S0], S) :- var(X, S0, S).
	var(X, S0, S) :- var(X), S0 = S.
This is not a correct translation.  It is only safe to move the
unifications back over pure predicates (and pure nonterminals).

I think that the grammar rule translator is not the place for
optimisations like unfolding 'C'/3 and pushing it back into the head,
but that transformations like that belong in some other tool which
can apply them to all sorts of clauses, not just former grammar rules.
NU Prolog :- pure declarations would be important for such a tool.

lee@mulga.oz (Lee Naish) (07/29/88)

In article <202@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>When compiled with the current Quintus Prolog compiler,
>	p(..., S0, S) :- 'C'(S0, X, S1), ...
>and
>	p(..., [X|S1], S) :- ...
>run at the same speed.

That is because (correct me if I am wrong) Quintus currently indexes
only on the top level functor of the first argument.

This has several consequences:
	1) there is no point is pushing back 'C'
	2) the most naturally written grammars are not efficient
	3) grammars are rewritten so that simple indexing is effective
		p --> "+"...
		p --> "-"...
	   is rewritten
		p --> [X], p1(X).
		p1(0'+) --> ...
		p1(0'-) --> ...
	4) it is best to put S0 (and S) after other arguments

In NU-Prolog 'C' is pushed back (when its definitely safe to do so) and
it is possible to do indexing on subterms of any argument:
	?- p(..., [X|S1], S) when X.
will cause indexing on X (it also delays calls to p until X is
instantiated, so the grammar cannot be used as a generator).  It would
seem like a good idea to do that kind of indexing on DCG rules by
default.

Anyway, the fact that most Prolog systems currently don't have good
indexing and some people write INELEGANT grammars (:-) is not a good
reason for a standard DGC translator to convert decent grammars into
code which decent compilers can't take advantage of, though, if you can
understand this sentence without backtracking, you probably don't need
the advantage of simpler grammars so you can write grammars which don't
backtrack using Quintus Prolog DGCs.

	lee

ok@quintus.uucp (Richard A. O'Keefe) (07/30/88)

In article <2924@mulga.oz> lee@mulga.UUCP (Lee Naish) writes:
>In article <202@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>>When compiled with the current Quintus Prolog compiler,
>>	p(..., S0, S) :- 'C'(S0, X, S1), ...
>>and
>>	p(..., [X|S1], S) :- ...
>>run at the same speed.
>
>That is because (correct me if I am wrong) Quintus currently indexes
>only on the top level functor of the first argument.

Lee Naish is saying
1)  Quintus Prolog currently indexes on the principal functor of
    the first argument (this is true)
2)  My statement above is true (which is so)
3)  1) explains 2).  This is absolutely wrong.

The reason that the two fragments I sketched run at the same speed is that
they turn into identical code, and will still turn into identical code if/when
better indexing is provided.  This applies to *everything* that uses 'C'/3,
not just things that happen to use it in the head of a grammar rule.

That was my point:  the grammar rule translator should not know how smart
the compiler is, neat hacks for speeding things up belong in optimising
tools that apply to *every* sort of code.

jonasn@ttds.UUCP (Jonas Nygren) (08/18/88)

In article <202@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>In article <4751@csli.STANFORD.EDU> evan@csli.UUCP (Evan Kirshenbaum) writes:
>When compiled with the current Quintus Prolog compiler,
>	p(..., S0, S) :- 'C'(S0, X, S1), ...
>and
>	p(..., [X|S1], S) :- ...
>run at the same speed.  I do not know whether this was the case in
>DEC-10 Prolog.  Anyrate, moving the terminals into the head should
>not be neceesary for efficiency.
>

I tried Kirshenbaum's translator with a lexical analyzer for C (sans preprocessor)
and compared with AAIS DCG translator which uses 'C'/3. The time for breaking
sieve.c into tokens where as follows:

	AAIS DCG		17.0 s
	Kirshenbaum's translator10.6 s

which amounts to a 38% speedup which seems worthwhile to me.

Jonas Nygren