finin@prc.unisys.com (Tim Finin) (04/12/90)
The standard approach to parsing natural language is to use grammar-based algorithms. While effective at characterizing and classifying sentences using toy grammars, grammar-based parsing techniques are only as robust as the grammars they use, and concisely characterizing the entire grammar of a natural language is an extremely difficult task.

In this talk, I will present an alternative to the grammar-based approach: a stochastic parsing method based on finding constituent boundaries, or distituents, using a generalized mutual information statistic. This method, called distituent parsing, is based on the hypothesis that constituent boundaries can be extracted from a given part-of-speech n-gram by analyzing the mutual information values within the n-gram. This hypothesis is supported by the performance of an implementation of the parsing algorithm, which determines all levels of sentence structure from a variety of English text with a relatively low error rate.

During this talk, I will derive the generalized mutual information statistic, describe the parsing algorithm, and present results and sample output from the parser. I will then discuss the potential applications of this approach in conjunction with traditional grammar-based techniques.

11:00 am Tuesday, April 17, 1990
CAIT Conference Room
Unisys Center for Advanced Information Technology
Great Valley Laboratories #1
70 E. Swedesford Road
Paoli PA 19301

-- non-Unisys visitors who are interested in attending should --
-- send email to finin@prc.unisys.com or call 215-648-2480 --

--
Tim Finin, finin@prc.unisys.com
Center for Advanced Information Technology
215-648-2840, 215-648-2288 (fax)
Unisys, PO Box 517, Paoli, PA 19301
215-386-1749 (home)