[comp.ai] How to parse ambiguity: a parsing algorithm

markh@csd4.milw.wisc.edu (Mark William Hopkins) (12/23/88)

     I would like to illustrate, by way of a case study, a new parsing 
algorithm which is well suited to handling ambiguity efficiently.  This 
algorithm is essentially based on the algorithm developed by Masaru Tomita at
CMU several years back, as part of his Japanese Language Interface project.

     The essential idea is to use breadth-first search, and to avoid the
situation in which the parser backtracks only to re-parse a portion of what it
backtracked across.  This situation occurs whenever there is "local ambiguity".
Tomita's algorithm is capable of handling exponential ambiguity in
polynomial space, on account of several localizing optimizations.

     The output of this algorithm, as for Tomita's algorithm, is a parse
forest: a graph consisting of all the possible parse trees of the input,
in which there is a high degree of sharing of identical nodes.
     It features an optimization, which Tomita calls "local ambiguity
packing", in which input such as:

		     the old english teacher

would be represented as the graph:

			      NP
                            /   \
                          D      [N'   N']
                          |      /  \ /  \
                         the    /    X    \
                               /   A' \    \
                              /   / |  N'   \
                             A'  /  |  | \   N'
                             |  /   | A'  \  |
                              A     | |    \ |
                              |      A       N
                             old     |       |
                                  english  teacher

The two N' nodes are packed into one, so that higher level nodes will only
make single references to the locally ambiguous phrase [old english teacher].

     The graph is represented in the following fashion:

     (0) D ["the"]           (4) A' [1]        (8) N' [2 3]
     (1) A ["old"]           (5) A' [2]        (9) N' [4 8][6 7]
     (2) A ["english"]       (6) A' [1 2]     (10) NP [0 9]
     (3) N ["teacher"]       (7) N' [3]

The components inside the square brackets are either lexical items, such as
"the", or pointers to other nodes in the graph, such as the 2 in [2 3].
The packed node in the graph is represented as node (9), which carries two
strings, one for each distinct parse of the phrase [old english teacher].
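In modern terms, the node table above might be sketched as follows.  The
encoding (a dictionary mapping node numbers to a category and a list of
alternatives) is my own illustration, not part of the original post:

```python
# A minimal sketch of the parse-forest node table.  Each node maps to
# (category, alternatives); an alternative is either a list containing a
# lexeme string or a list of child-node indices.  Node (9) carries two
# alternatives -- this is the packed local ambiguity.
forest = {
    0:  ("D",  [["the"]]),
    1:  ("A",  [["old"]]),
    2:  ("A",  [["english"]]),
    3:  ("N",  [["teacher"]]),
    4:  ("A'", [[1]]),
    5:  ("A'", [[2]]),
    6:  ("A'", [[1, 2]]),
    7:  ("N'", [[3]]),
    8:  ("N'", [[2, 3]]),
    9:  ("N'", [[4, 8], [6, 7]]),   # packed node: two distinct parses
    10: ("NP", [[0, 9]]),
}
```

Because node (10) points at the single packed node (9), the NP level is
unaffected by how many readings the inner phrase has.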

     In the example above, we are using the grammar:

     NP --> D N'     N' --> A' N' | N     A' --> A | A A

which is obviously ambiguous.  The parser works in two stages.  In stage
I, Lexical Analysis is performed.  The result of Lexical Analysis in the
input phrase above is the following:

             D:                      A:                         N:
      (0)  (0, 1) ["the"]    (1)  (1, 2) ["old"]       (3)  (2, 3) ["english"]
                             (2)  (2, 3) ["english"]   (4)  (3, 4) ["teacher"]

The numbers inside the round brackets give the boundaries of the parsed
constituent, as follows:

                         0   1   2       3       4 
		          the old english teacher

This representation is essentially an *inverted file*.  Note also that the word
"english" has been parsed as both a noun (N) and an adjective (A).
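Stage I can be sketched with a toy lexicon; the lexicon, function name, and
item layout here are illustrative assumptions, not from the original:

```python
# Sketch of stage I (lexical analysis): number every reading of every
# word and record its (start, end) span.  An ambiguous word such as
# "english" yields one item per category.
LEXICON = {
    "the":     ["D"],
    "old":     ["A"],
    "english": ["A", "N"],   # ambiguous: adjective or noun
    "teacher": ["N"],
}

def lex(words):
    """Return items of the form (id, category, (start, end), word)."""
    items, next_id = [], 0
    for i, w in enumerate(words):
        for cat in LEXICON[w]:
            items.append((next_id, cat, (i, i + 1), w))
            next_id += 1
    return items

items = lex("the old english teacher".split())
```

With the category order chosen above, the item numbering comes out the same
as in the table: items (2) and (3) are the two readings of "english".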

     The parsing algorithm has already taken the grammar and ordered its
rules into the following procedure:

            (a)       REDUCE ALL (N) TO N';
            (b)       REDUCE ALL (A) TO A';
            (c)       REDUCE ALL (A A) TO A';
                      repeat
              (d)        REDUCE ALL (A' N') TO N'
                      until EXHAUSTED;
            (e)       REDUCE ALL (D N') TO NP

This part of the algorithm has not yet been fully worked out.  A graph
algorithm (to find the strong components of a directed graph generated by the
grammar) is used to order the rules.
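One simple way to see which rules need a repeat-until loop: a nonterminal
whose rules can (directly or transitively) mention itself lies on a cycle of
the grammar's dependency graph, so its reduction must repeat until exhausted.
The grammar encoding and the `reaches` helper are my own sketch, not the
author's strong-components algorithm:

```python
# Which nonterminals need a repeat-until reduction loop?  Here: only N',
# because of the self-recursive rule N' -> A' N'.
GRAMMAR = {
    "NP": [["D", "N'"]],
    "N'": [["A'", "N'"], ["N"]],
    "A'": [["A"], ["A", "A"]],
}

def reaches(nt, target, seen=None):
    """True if expanding nt can eventually mention target."""
    seen = seen or set()
    for rhs in GRAMMAR.get(nt, []):
        for sym in rhs:
            if sym == target:
                return True
            if sym in GRAMMAR and sym not in seen:
                if reaches(sym, target, seen | {sym}):
                    return True
    return False

looping = {nt for nt in GRAMMAR if reaches(nt, nt)}
```

A full implementation would instead compute the strong components of this
dependency graph (e.g. with Tarjan's algorithm) and emit one loop per
nontrivial component, in topological order.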

     After reduction (a) is carried out, we get two new N' nodes:

		     N': (5) (2, 3) [3]
			 (6) (3, 4) [4]
With (b), we get:
		     A': (7) (1, 2) [1]
			 (8) (2, 3) [2]
and with (c) we get:
		         (9) (1, 3) [1 2]

in addition to the other two A' nodes. 
   Each of the non-terminals in the grammar is represented by a sparse 
matrix.  For example, N' is represented as the matrix:

		       N'(2, 3) = (5): [3]
		       N'(3, 4) = (6): [4]

Each reduction rule can then be represented by matrix multiplication, so
that REDUCE (A A) TO A' becomes A' := A' + A*A.  The meaning of the addition
operator will be detailed below; it figures heavily in dealing with ambiguity.
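The two matrix operations can be sketched directly; the representation (a
dictionary from (start, end) spans to lists of child-id alternatives) and the
function names are my own illustration:

```python
# Sketch of reduce-as-matrix-multiply.  A category's matrix maps spans
# (start, end) to a list of alternatives, each a list of child node ids.
def matmul(left, right):
    """(left * right)[i, k] collects every split left[i, j] * right[j, k]."""
    out = {}
    for (i, j), l_alts in left.items():
        for (j2, k), r_alts in right.items():
            if j == j2:
                for la in l_alts:
                    for ra in r_alts:
                        out.setdefault((i, k), []).append(la + ra)
    return out

def matadd(a, b):
    """Component-wise union of alternative lists -- the '+' operator."""
    out = {span: list(alts) for span, alts in a.items()}
    for span, alts in b.items():
        for alt in alts:
            if alt not in out.setdefault(span, []):
                out[span].append(alt)
    return out

# REDUCE ALL (A A) TO A'  becomes  A' := A' + A * A
A  = {(1, 2): [[1]], (2, 3): [[2]]}   # lexical items "old", "english"
Ap = {(1, 2): [[1]], (2, 3): [[2]]}   # A' entries from the rule A' -> A
Ap = matadd(Ap, matmul(A, A))
```

After the reduction, A' gains the span (1, 3) covering [old english], just as
node (9) does in the worked example.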

     Reduction (d) is carried out until no further (x, y) subranges can be
produced.  Further reductions may occur in the very last step, but they
will not cause a new step to be generated if they do not produce a new subrange.
This optimization is nothing but the "local ambiguity packing optimization"
referred to above. After one step, we get 3 new N' nodes:

		       (10) (1, 3) [7 5]
                       (11) (2, 4) [8 6]
		       (12) (1, 4) [9 6]

After a second step, we get one new addition:

		       (12) (1, 4) [9 6][7 11]

This second derivation, [7 11], is appended to the list attached to the N'
node which spans from 1 to 4.
At this stage no further steps are taken: no new subranges were generated
in the previous step, therefore the EXHAUSTED predicate becomes true.
     Each component in the sparse matrix is a list of strings.  We see here,
that INSERTING a new element into this list is what the "+" operator means.
Two sparse matrices are added component-wise, and two components are added
by forming the UNION of their lists.
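Reduction (d), run to exhaustion with this union semantics, can be sketched
self-containedly.  Here alternatives are nested tuples standing in for the
post's node pointers; the labels and function names are my own:

```python
# N' := N' + A' * N', repeated until a pass creates no new (start, end)
# subrange.  New derivations of an EXISTING subrange are packed into that
# span's alternative list instead of triggering another pass.
def mul(left, right):
    out = {}
    for (i, j), las in left.items():
        for (jj, k), ras in right.items():
            if j == jj:
                out.setdefault((i, k), []).extend(
                    (la, ra) for la in las for ra in ras)
    return out

def add(a, b):
    out = {s: list(v) for s, v in a.items()}
    for s, alts in b.items():
        for alt in alts:
            if alt not in out.setdefault(s, []):
                out[s].append(alt)   # pack: grow the list, no new node
    return out

Ap = {(1, 2): ["A'(old)"], (2, 3): ["A'(english)"],
      (1, 3): ["A'(old english)"]}
Np = {(2, 3): ["N'(english)"], (3, 4): ["N'(teacher)"]}

while True:
    old_spans = set(Np)
    Np = add(Np, mul(Ap, Np))
    if set(Np) == old_spans:   # EXHAUSTED: no new subrange appeared
        break
```

The first pass creates the spans (1, 3), (2, 4), and (1, 4); the second pass
creates no new span but packs a second derivation into (1, 4), mirroring the
[9 6][7 11] node above, and the loop stops.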

     The final stage of the parser produces one new NP node:

	     NP: (13) (0, 4) [0 12]

The resulting graph is represented as:

      (13): NP [0 12]        (8): A' [2]          (2): A ["english"]
      (12): N' [9 6][7 11]   (7): A' [1]          (1): A ["old"]
      (11): N' [8 6]         (6): N' [4]          (0): D ["the"]
      (9) : A' [1 2]         (4): N  ["teacher"]

Nodes (10), (5), and (3) represent the parser's failed attempt that came from 
assuming the word "english" was a noun.  It correctly recognized "english" as
an adjective in this context.  This graph is the same as the one presented
previously, though the nodes are numbered differently here.
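Finally, the two readings packed into the forest can be multiplied out on
demand.  The node table below transcribes the final graph (with node (8)
pointing at child [2], the adjective "english"); the `trees` generator is my
own sketch of forest unpacking:

```python
# Enumerate every parse tree packed into the final graph.  A packed node
# (here node 12, with two alternatives) multiplies out into one tree per
# alternative.
forest = {
    0:  ("D",  [["the"]]),
    1:  ("A",  [["old"]]),
    2:  ("A",  [["english"]]),
    4:  ("N",  [["teacher"]]),
    6:  ("N'", [[4]]),
    7:  ("A'", [[1]]),
    8:  ("A'", [[2]]),
    9:  ("A'", [[1, 2]]),
    11: ("N'", [[8, 6]]),
    12: ("N'", [[9, 6], [7, 11]]),   # the packed node
    13: ("NP", [[0, 12]]),
}

def trees(node):
    """Yield every tree rooted at node as a nested tuple."""
    cat, alternatives = forest[node]
    for alt in alternatives:
        if all(isinstance(c, str) for c in alt):
            yield (cat,) + tuple(alt)          # lexical node
        else:
            def expand(children):              # cartesian product of
                if not children:               # the children's readings
                    yield ()
                    return
                for head in trees(children[0]):
                    for rest in expand(children[1:]):
                        yield (head,) + rest
            for kids in expand(alt):
                yield (cat,) + kids

parses = list(trees(13))
```

This yields exactly two NP trees, one grouping [[old english] teacher] and
one grouping [old [english teacher]].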