markh@csd4.milw.wisc.edu (Mark William Hopkins) (12/23/88)
Presented below is a outline for a simple algorithm that performs a restricted kind of syntatic pattern recognition. In this algorithm, the follwing assumptions are made: (1) We have available a context free grammar, (2) The input is represented as a graph of strings, each string contains lexical items *AND* pointers to other strings as components. For example, with a simplified LISP grammar (with outer brackets ignored): List --> (Atom | '(' List ')')* ( X* denotes zero or more repititions of X.), the input string (COND ((NULL a) nil) (T (CONS (CAR a) (f (CDR a))) ) ) would be represented as the string (0), in the following graph: vertex string vertex string (0) COND (1) (3) (4) CONS (5) (6) (1) (2) nil (5) CAR a (2) NULL a (6) f (7) (3) T (4) (7) CDR a The components (n) in the strings represent pointers to the string with the same label. The graph look like this: COND @ @ | | @ nil <------------* *----------> T @ | | V *----> CONS @ @ NULL a | | CAR a <------* *--> f @ | CDR a <----------* In general, given a general context-free grammar, each string will be labelled by a non-terminal and will consist of components, which will either be lexical items or pointers to other (labelled) strings. The algorithm consists of two parts. The first part will determine which of those string nodes in the input graph are identical. The second part will then merge identical string nodes in the graph. This graph structure (which is actually a fancy tree structure) and convert it into a smaller graph, by unifying identical strings. Two strings a = a1 ... am and b = b1 ... bn are considered identical if (1) m = n (2) For all i = 1 .. m: if ai is a string pointer then bi is a pointer to an identical string else if ai is an lexical item, then ai = bi. Call this relation Eq*. The algorithm consists of forming a nested sequence of equivalence relations, Eq0, Eq1, Eq2, ..., Eqx, defined over the nodes of the graph, where x is the number of nodes in the graph. They are recursively defined as follows: Eq0(a, b) := true Under Eq0, all strings are identical. Eq<n+1>(a, b) if and only if (1) a, b have the same length (m), with a = a1 ... am, b = b1 ... bm (2) For all i = 1 .. m: if ai is a lexical item then ai = bi else if ai is a pointer to a string node then ai and bi point to strings equivalent under the relation Eq<n>. I pose the following theorem without proof, which provides the basis for the algorithm: Eq<x> is the relation Eq*, where x is the number of nodes in the graph. This, then, is the first part of the algorithm: to determine the equivalence classes of "identical" nodes. The second part will merge these strings and thereby eliminate duplications. Define a function f(x) to map the range 1..x to 1..x, with: f(i) = least label of the string equivalent of the ith string. This is a "choice function" that squashes each class of identical strings down to a single string. For each string a = a1 ... am, in the graph define a* to be a* = [a1] ... [am] where if ai is a pointer to a string Sj, then [ai] is a pointer to the stirng S<f(j)>, else if ai is a lexical item, then [ai] = ai [ai] = ai if ai is a lexical item Consider the following example, using the "LISP grammar" above: INPUT: (CONS (CAR a) (CONS (CAR a) (f (CDR a)))) GRAPH: (0) CONS (1) (2) (1) CAR a (2) CONS (3) (4) (3) CAR a (4) f (5) (5) CDR a Under the relation Eq0, the equivalence class is [0, 1, 2, 3, 4, 5]. Under Eq1, strings (1) and (3) will remain identical, as will (0) and (2). The equivalence classes are Eq1: [0, 2], [1, 3], [4], [5] Under Eq2, strings (0) and (2) will become distinct. Eq3, Eq4, Eq5 will all be identical to Eq2: Eq2: [0], [1, 3], [2], [4], [5] In the second part of the algorithm, the function f will be defined as: f(0) = 1, f(1) = f(3) = 1, f(2) = 2, f(4) = 4, f(5) = 5, thus the graph will become: GRAPH: (0) CONS (1) (2) (1) CAR a (2) CONS (1) (4) (4) f (5) (5) CDR a Nodes (1) and (3) have been merged. At this point, we can see the limits to this algorithm's ability. It did not recognize the similarity between strings (0) and (2), due to the fact that their last components (the pointers to strings (2) and (4)) were different. The algorithm can only recognize *exact* matches, but refuses to do anything with near matches. A related problem is that it is not able to form "parametrized" patterns. These considerations will lead us to an extended algorithm which will be capable of handling *inexact* matches. In order to do this, we will need to use the Lambda Calculus.