finin@prc.unisys.com (Tim Finin) (04/12/90)
The standard approach to parsing natural language relies on grammar-based
algorithms. While effective at characterizing and classifying sentences
using toy grammars, grammar-based parsing techniques are only as robust as
the grammars they use, and concisely characterizing the entire grammar of
a natural language is an extremely difficult task.
In this talk, I will present an alternative to the grammar-based approach,
a stochastic parsing method based on finding constituent boundaries, or
distituents, using a generalized mutual information statistic. This
method, called distituent parsing, is based on the hypothesis that
constituent boundaries can be extracted from a given part-of-speech n-gram
by analyzing the mutual information values within the n-gram. This
hypothesis is supported by the performance of an implementation of this
parsing algorithm, which determines all levels of sentence structure in a
variety of English text with a relatively low error rate. During this
talk, I will derive the generalized mutual information statistic, describe
the parsing algorithm, and present results and sample output from the
parser. I will then discuss the potential applications of this approach
in conjunction with traditional grammar-based techniques.
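
To give a flavor of the idea, here is a minimal sketch. It is not the
parser described above: it assumes ordinary pointwise mutual information
over adjacent part-of-speech tags in place of the generalized statistic,
and it simply splits a tag sequence at the weakest-scoring boundary and
recurses. The function names (bigram_pmi, weakest_boundary, bracket) are
hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def bigram_pmi(tagged_corpus):
    """Estimate pointwise mutual information for adjacent tag pairs.

    ASSUMPTION: plain bigram PMI stands in for the generalized mutual
    information statistic, which the abstract does not define.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tags in tagged_corpus:
        unigrams.update(tags)
        bigrams.update(zip(tags, tags[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        joint = bigrams[(a, b)]
        if joint == 0:
            # Tags never seen side by side: weakest possible association.
            return float("-inf")
        p_ab = joint / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))

    return pmi


def weakest_boundary(tags, pmi):
    """Return i such that the gap between tags[i] and tags[i+1] scores lowest."""
    scores = [pmi(a, b) for a, b in zip(tags, tags[1:])]
    return min(range(len(scores)), key=scores.__getitem__)


def bracket(tags, pmi):
    """Recursively split a tag sequence at its weakest boundary.

    Returns a nested tuple of tags (a binary bracketing).
    """
    if len(tags) == 1:
        return tags[0]
    i = weakest_boundary(tags, pmi)
    return (bracket(tags[: i + 1], pmi), bracket(tags[i + 1 :], pmi))
```

Given a list of tag sequences as training data, bracket(tags, bigram_pmi(corpus))
yields a nested tuple whose splits fall where adjacent tags are least
strongly associated. The binary, greedy splitting is a simplification
chosen for brevity.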
11:00 am Tuesday, April 17, 1990
CAIT Conference Room
Unisys Center for Advanced Information Technology
Great Valley Laboratories #1
70 E. Swedesford Road
Paoli PA 19301
-- non-Unisys visitors who are interested in attending should --
-- send email to finin@prc.unisys.com or call 215-648-2480 --
--
Tim Finin finin@prc.unisys.com
Center for Advanced Information Technology 215-648-2840, 215-648-2288 (fax)
Unisys, PO Box 517, Paoli, PA 19301 215-386-1749 (home)