[comp.ai] Parsing Natural Language Using Generalized Mutual Information

finin@prc.unisys.com (Tim Finin) (04/12/90)

The standard approach to parsing natural language relies on grammar-based
algorithms.  While effective at characterizing and classifying sentences
using toy grammars, grammar-based parsing techniques are only as robust as
the grammars they use, and concisely characterizing the entire grammar of
a natural language is an extremely difficult task.

In this talk, I will present an alternative to the grammar-based approach,
a stochastic parsing method based on finding constituent boundaries, or
distituents, using a generalized mutual information statistic.  This
method, called distituent parsing, is based on the hypothesis that
constituent boundaries can be extracted from a given part-of-speech n-gram
by analyzing the mutual information values within the n-gram.  This
hypothesis is supported by the performance of an implementation of this
parsing algorithm which determines all levels of sentence structure from a
variety of English text with a relatively low error rate.  During this
talk, I will derive the generalized mutual information statistic, describe
the parsing algorithm, and present results and sample output from the
parser.  I will then discuss the potential applications of this approach
in conjunction with traditional grammar-based techniques.
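
To make the idea of reading boundaries off mutual information values
concrete, here is a rough sketch.  It is not the algorithm from the talk:
the generalized mutual information statistic is not defined in this
announcement, so the sketch substitutes ordinary pointwise mutual
information over adjacent part-of-speech tags, and it splits greedily at
the weakest link.  The tag names, the toy corpus, the add-one smoothing,
and the function names (bigram_pmi, weakest_boundary, bracket) are all
assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tag_sequences):
    """Estimate pointwise mutual information for adjacent tag pairs.

    Stand-in statistic only: plain bigram PMI with add-one smoothing,
    not the generalized mutual information described in the talk.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in tag_sequences:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    vocab = len(unigrams)

    def pmi(x, y):
        # Add-one smoothing keeps unseen tags and pairs finite.
        p_xy = (bigrams[(x, y)] + 1) / (n_bi + vocab * vocab)
        p_x = (unigrams[x] + 1) / (n_uni + vocab)
        p_y = (unigrams[y] + 1) / (n_uni + vocab)
        return math.log2(p_xy / (p_x * p_y))

    return pmi


def weakest_boundary(tags, pmi):
    """Return i such that the lowest-PMI link is between tags[i] and tags[i+1]."""
    scores = [pmi(a, b) for a, b in zip(tags, tags[1:])]
    return min(range(len(scores)), key=scores.__getitem__)


def bracket(tags, pmi):
    """Recursively split at the weakest link, yielding a nested binary bracketing."""
    if len(tags) == 1:
        return tags[0]
    i = weakest_boundary(tags, pmi)
    return [bracket(tags[:i + 1], pmi), bracket(tags[i + 1:], pmi)]


# Hypothetical toy corpus of part-of-speech sequences.
CORPUS = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "PRON"],
]
SENTENCE = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

pmi = bigram_pmi(CORPUS)
tree = bracket(SENTENCE, pmi)
print(tree)
```

Tag pairs that co-occur often relative to chance score high and tend to
stay inside one bracket; pairs that rarely co-occur score low and are
treated as likely boundaries.  A realistic version would need a large
tagged corpus and, per the abstract, a statistic defined over whole
n-grams rather than adjacent pairs alone.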
				     
		     11:00 am Tuesday, April 17, 1990
			   CAIT Conference Room
	    Unisys Center for Advanced Information Technology
		       Great Valley Laboratories #1
			  70 E. Swedesford Road
			      Paoli PA 19301
				     
     -- non-Unisys visitors who are interested in attending should --
     --   send email to finin@prc.unisys.com or call 215-648-2480  --
-- 
 Tim Finin                                   finin@prc.unisys.com
 Center for Advanced Information Technology  215-648-2840, 215-648-2288 (fax)
 Unisys, PO Box 517, Paoli, PA 19301         215-386-1749 (home)