[comp.ai] A pattern recognition algorithm, I

markh@csd4.milw.wisc.edu (Mark William Hopkins) (12/23/88)
     Presented below is a outline for a simple algorithm that performs a
restricted kind of syntatic pattern recognition.  In this algorithm, the
follwing assumptions are made:

	 (1) We have available a context free grammar,
	 (2) The input is represented as a graph of strings,
	     each string contains lexical items *AND* pointers to
	     other strings as components.

For example, with a simplified LISP grammar (with outer brackets ignored):

		 List --> (Atom | '(' List ')')*

( X* denotes zero or more repititions of X.), the input string

	     (COND
		((NULL a) nil)
                (T (CONS (CAR a) (f (CDR a))) )
             )

would be represented as the string (0), in the following graph:

	  vertex   string              vertex    string
          (0)      COND (1) (3)        (4)       CONS (5) (6)
	  (1)      (2) nil             (5)       CAR a
	  (2)      NULL a              (6)       f (7)
	  (3)      T (4)               (7)       CDR a

The components (n) in the strings represent pointers to the string with the
same label.  The graph look like this:

		       COND  @  @
                             |  |
        @  nil  <------------*  *----------> T @
        |                                      |
        V                                      *---->  CONS  @  @
     NULL a                                                  |  |
                                                CAR a <------*  *--> f @
                                                                       |
                                                      CDR a <----------*

     In general, given a general context-free grammar, each string will be
labelled by a non-terminal and will consist of components, which will either
be lexical items or pointers to other (labelled) strings.

     The algorithm consists of two parts.  The first part will determine
which of those string nodes in the input graph are identical.  The second
part will then merge identical string nodes in the graph.

     This graph structure (which is actually a fancy tree structure) and
convert it into a smaller graph, by unifying identical strings.  Two strings

                    a = a1 ... am  and b = b1 ... bn

are considered identical if

	     (1) m = n
	     (2) For all i = 1 .. m:
		    if ai is a string pointer then bi is a pointer to an
		    identical string
		    else if ai is an lexical item, then ai = bi.

Call this relation Eq*.

     The algorithm consists of forming a nested sequence of equivalence
relations, Eq0, Eq1, Eq2, ..., Eqx, defined over the nodes of the graph,
where x is the number of nodes in the graph.  They are recursively defined
as follows:

		    Eq0(a, b) := true

Under Eq0, all strings are identical.

		Eq<n+1>(a, b) if and only if

		(1) a, b have the same length (m),
                    with
		       a = a1 ... am, b = b1 ... bm
                (2) For all i = 1 .. m:
		       if ai is a lexical item 
			  then ai = bi
		       else if ai is a pointer to a string node
		          then ai and bi point to strings
			  equivalent under the relation Eq<n>.

I pose the following theorem without proof, which provides the basis for the
algorithm:

     Eq<x> is the relation Eq*, where x is the number of nodes in the graph.

     This, then, is the first part of the algorithm: to determine the 
equivalence classes of "identical" nodes.  The second part will merge these
strings and thereby eliminate duplications.

     Define a function f(x) to map the range 1..x to 1..x, with:

	  f(i) = least label of the string equivalent of the ith string.

This is a "choice function" that squashes each class of identical strings
down to a single string.  For each string a = a1 ... am, in the graph define
a* to be

		       a* = [a1] ... [am]
                       where
                          if ai is a pointer to a string Sj,
			     then [ai] is a pointer to the stirng S<f(j)>,
                          else if ai is a lexical item,
			     then [ai] = ai
			  [ai] = ai if ai is a lexical item

   Consider the following example, using the "LISP grammar" above:

	       INPUT:  (CONS (CAR a) (CONS (CAR a) (f (CDR a))))

	       GRAPH:  (0) CONS (1) (2)
		       (1) CAR a
		       (2) CONS (3) (4)
		       (3) CAR a
		       (4) f (5)
		       (5) CDR a

Under the relation Eq0, the equivalence class is [0, 1, 2, 3, 4, 5].  Under
Eq1, strings (1) and (3) will remain identical, as will (0) and (2).  The
equivalence classes are

	       Eq1:    [0, 2], [1, 3], [4], [5]

Under Eq2, strings (0) and (2) will become distinct.  Eq3, Eq4, Eq5 will 
all be identical to Eq2:

	       Eq2:   [0], [1, 3], [2], [4], [5]

In the second part of the algorithm, the function f will be defined as:

	     f(0) = 1, f(1) = f(3) = 1, f(2) = 2, f(4) = 4, f(5) = 5,

thus the graph will become:

	       GRAPH:  (0) CONS (1) (2)
		       (1) CAR a
		       (2) CONS (1) (4)
		       (4) f (5)
		       (5) CDR a

Nodes (1) and (3) have been merged.

   At this point, we can see the limits to this algorithm's ability.  It did
not recognize the similarity between strings (0) and (2), due to the fact that
their last components (the pointers to strings (2) and (4)) were different.

   The algorithm can only recognize *exact* matches, but refuses to do anything
with near matches.  A related problem is that it is not able to form 
"parametrized" patterns.

   These considerations will lead us to an extended algorithm which will be
capable of handling *inexact* matches.  In order to do this, we will need to
use the Lambda Calculus.