Michael.Mauldin@CAD.CS.CMU.EDU (05/19/86)
What:  Thesis Proposal: Information Retrieval By Text Skimming
Who:   Michael L. Mauldin (MLM@CAD)
When:  May 29, 1986, at 3pm
Where: Wean Hall 5409

Most information retrieval systems today are word-based, but simple word searches and frequency distributions do not give these systems any understanding of their texts. Full natural language parsers are capable of deep understanding within limited domains, but are too brittle and slow for general information retrieval. The proposed dissertation attempts to bridge this gap by using a text skimming parser as the basis for an information retrieval system that partially understands the texts stored in it. The objective is to develop a system that retrieves a significantly greater fraction of relevant documents than a keyword-based approach can, without retrieving a larger fraction of irrelevant documents.

As part of my dissertation, I will implement a full-text information retrieval system called FERRET (Flexible Expert Retrieval of Relevant English Texts). FERRET will provide information retrieval for the UseNet News system, a collection of 247 news groups covering a wide variety of topics. Initially FERRET will cover NET.ASTRO, the astronomy news group, and part of my investigation will be to demonstrate the addition of new domains with only minimal hand coding of domain knowledge; FERRET will acquire the details of a domain automatically using a script learning component.

FERRET will consist of a text skimming parser (based on DeJong's FRUMP program), a case frame matcher that compares the parse of the user's query with the parses of the texts stored in the retrieval system, and a user interface. The parser relies on two knowledge sources: the sketchy script database, which encodes domain knowledge, and the lexicon. The lexicon from FRUMP will be extended both by hand and automatically, with syntax and synonym information drawn from an on-line English dictionary.
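The idea behind the case frame matcher can be sketched as follows. The frame representation, role names, and matching rule below are illustrative assumptions for exposition, not FERRET's actual design:

```python
# Hypothetical sketch of case-frame matching: the query and each stored
# text are reduced to frames whose slots name semantic roles, and a text
# is retrieved only when it fills every role the query specifies.

def frame_match(query, text):
    """Return True if each role/filler pair in the query frame is
    satisfied by the text's frame (synonym folding assumed done earlier)."""
    return all(text.get(role) == filler for role, filler in query.items())

# A keyword search for "launch" would return both texts below; matching
# on semantic roles distinguishes who did the launching.
text_a = {"event": "launch", "actor": "NASA", "object": "shuttle"}
text_b = {"event": "launch", "actor": "ESA",  "object": "Ariane"}
query  = {"event": "launch", "actor": "NASA"}

hits = [t for t in (text_a, text_b) if frame_match(query, t)]
# hits contains only text_a
```

The point of the example is the retrieval distinction itself: both texts share the keyword, but only one fills the query's actor role.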
The script database from FRUMP will likewise be extended both by hand and automatically, by a learning component that generates new scripts from texts that have already been parsed. The learning component will evaluate the new scripts using feedback from the user and retain the best performers for future use.

The resulting information retrieval system will be evaluated on queries of the UseNet database, measuring both the relevant texts it fails to retrieve and the irrelevant texts it retrieves. Over six million characters appear on UseNet each week, so there should be enough data to study performance on a large database.

The main contribution of the work will be a demonstration that a text skimming retrieval system can make distinctions, based on semantic roles and other information, that word-based systems cannot. The script learning and dictionary access components are new capabilities that should be widely useful in other natural language applications.
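The two evaluation quantities described above correspond to the standard recall and precision measures; a minimal sketch of the computation, with made-up document IDs:

```python
# Illustrative evaluation: recall measures how many relevant texts were
# retrieved (its complement is the "relevant texts not retrieved"
# criterion), precision measures how many retrieved texts were relevant
# (its complement is the "irrelevant texts retrieved" criterion).

def evaluate(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    return len(found) / len(relevant), len(found) / len(retrieved)

# Hypothetical query result: 4 texts retrieved, 3 were actually relevant.
recall, precision = evaluate(retrieved={"a1", "a2", "a3", "a4"},
                             relevant={"a1", "a2", "a5"})
# recall = 2/3 (a5 was missed), precision = 2/4 (a3 and a4 are irrelevant)
```

A text skimming system improves on a keyword baseline if it raises recall without lowering precision, which is exactly the stated objective.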