markh@csd4.milw.wisc.edu (Mark William Hopkins) (12/23/88)
I would like to illustrate, by way of a case study, a new parsing algorithm which is well suited to efficiently handle ambiguity. This alrogithm is essentially based on the algorithm developed by Masaru Tomita at CMU several years back, as part of his Japanese Language Interface project. The essential idea is to use breadth first search, and to avoid that situation in which the parser backtracks only to re-parse a portion of what it backtracked across. This situation occurs whevener there is "local ambiguity". Tomita's algorithm is capable of handling an exponential ambiguity in polynomial space, on account of several localizing optimizations. The output of this algorithm, as for Tomita's algorithm, is a parse forest: a graph consisting of all the possible parse trees of the input, in which there is a high degree of sharing of identical nodes. It features an optimization, which Tomita calls "local ambiguity packing" in which input such as: the old english teacher would be represented as the graph: NP / \ D [N' N'] | / \ / \ the / X \ / A' \ \ / / | N' \ A' / | | \ N' | / | A' \ | A | | \ | | A N old | | english teacher The two N2 nodes are packed into one, so that higher level nodes will only make single references to the locally ambiguous phrase [old english teacher]. The graph is represented in the following fashion: (0) D ["the"] (4) A' [1] (8) N' [2 3] (1) A ["old"] (5) A' [2] (9) N' [4 8][6 7] (2) A ["english"] (6) A' [1 2] (10) NP [0 9] (3) N ["teacher"] (7) N' [3] The components inside the square brackets are either lexical items, such as "the", or pointers to other nodes in the graph, such as the 2 in [2 3]. The packed node in the graph is represented as node (9), consisting of 2 strings, to represent each distinct parse of the phrase [old english teacher]. In the example above, we are using the grammar: NP --> D N' N' --> A' N' | N A' --> A | A A which is obviously ambiguous. The parser works in two stages. In stage I, Lexical Analysis is performed. The result of Lexical Analysis in the input phrase above is the following: D: A: N: (0) (0, 1) ["the"] (1) (1, 2) ["old"] (3) (2, 3) ["english"] (2) (2, 3) ["english"] (4) (3, 4) ["teacher"] The numberes inside the round brackets determine the boundaries of the parsed constituent, as follows: 0 1 2 3 4 the old english teacher This representation is essentially an *inverse file*. Note, also, the word "english" has been parsed as both a noun (N) and adjective (A). The parsing algorithm has already taken the grammar and has ordered the rules into the following proceedure: (a) REDUCE ALL (N) TO N'; (b) REDUCE ALL (A) TO A'; (c) REDUCE ALL (A A) TO A'; repeat (d) REDUCE ALL (A' N') TO N' until EXHAUSTED; (e) REDUCE ALL (D N') TO NP This part of the algorithm has not yet been fully worked out. A graph algorithm (to find strong components of a directed graph generated by the grammar) is used. After reduction (a) is carried out, we get two new N' nodes: N': (5) (2, 3) [3] (6) (3, 4) [4] With (b), we get: A': (7) (1, 2) [1] (8) (2, 3) [2] and with (c) we get: (9) (1, 3) [1 2] in addition to the other two A' nodes. Each of the non-terminals in the grammar is represented by a sparse matrix. For example, N' is represented as the matrix: N'(2, 3) = (5): [3] N'(3, 4) = (6): [4] Each reduction rule can then be represented by matrix multiplication, so that REDUCE (A A) TO A' becomes A' := A' + A*A. The meaning of the addition operator will be detailed below; it figures heavily in dealing with ambiguity. Reduction (d) is carried out until no further (x, y) subranges can be produced. Further reductions may occur in the very last step, but they will not cause a new step to be generated if they do not produce a new subrange. This optimization is nothing but the "local ambiguity packing optimization" referred to above. After one step, we get 3 new N' nodes: (10) (1, 3) [7 5] (11) (2, 4) [8 6] (12) (1, 4) [9 6] After a secons step, we get one new addition: (12) (1, 4) [9 6][7 11] This second node is appended to the list of N' nodes which span from 1 to 4. At this stage no further steps are taken. No new subranges had been generated in the previous step, therefore the EXHAUSTED predicate becomes true. Each component in the sparse matrix is a list of strings. We see here, that INSERTING a new element into this list is what the "+" operator means. Two sparse matrices are added component-wise, and two components are added by forming the UNION of their lists. The final stage of the parser produces one new NP node: NP: (13) (0, 4) [0 12] The resulting graph is represented as: (13): NP [0 12] (8): A' [3] (2): A ["english"] (12): N' [9 6][7 11] (7): A' [1] (1): A ["old"] (11): N' [8 6] (6): N' [4] (0): D ["the"] (9) : A' [1 2] (4): N ["teacher"] Nodes (10), (5), and (3) represent the parser's failed attempt from assuming that the word "english" was a noun. It correctly recognized "english" as an adjective in this context. This graph is the same as the one presented previously.