Michael.Mauldin@CAD.CS.CMU.EDU (05/19/86)
What:  Thesis Proposal: Information Retrieval By Text Skimming
Who:   Michael L. Mauldin (MLM@CAD)
When:  May 29, 1986, at 3pm
Where: Wean Hall 5409

Most information retrieval systems today are word-based, but simple word searches and frequency distributions do not give these systems any understanding of their texts. Full natural language parsers are capable of deep understanding within limited domains, but are too brittle and slow for general information retrieval. The proposed dissertation attempts to bridge this gap by using a text skimming parser as the basis for an information retrieval system that partially understands the texts stored in it. The objective is to develop a system that retrieves a significantly greater fraction of relevant documents than a keyword-based approach can, without retrieving a larger fraction of irrelevant documents.

As part of my dissertation, I will implement a full-text information retrieval system called FERRET (Flexible Expert Retrieval of Relevant English Texts). FERRET will provide information retrieval for the UseNet News system, a collection of 247 news groups covering a wide variety of topics. Initially FERRET will cover NET.ASTRO, the astronomy news group, and part of my investigation will be to demonstrate the addition of new domains with only minimal hand coding of domain knowledge; FERRET will acquire the details of a domain automatically using a script learning component.

FERRET will consist of a text skimming parser (based on DeJong's FRUMP program), a case frame matcher that compares the parse of the user's query with the parses of the texts stored in the retrieval system, and a user interface. The parser relies on two knowledge sources: the sketchy script database, which encodes domain knowledge, and the lexicon. The lexicon from FRUMP will be extended both by hand and automatically, with syntax and synonym information drawn from an on-line English dictionary.
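The idea behind the case frame matcher can be sketched as follows. The frame representation, role names, and matching rule below are illustrative assumptions for exposition, not FERRET's actual design:

```python
# Hypothetical sketch of case-frame matching: the query and each stored
# text are reduced to frames whose slots name semantic roles, and a text
# is retrieved only when it fills every role the query specifies.

def frame_match(query, text):
    """Return True if each role/filler pair in the query frame is
    satisfied by the text's frame (synonym folding assumed done earlier)."""
    return all(text.get(role) == filler for role, filler in query.items())

# A keyword search for "launch" would return both texts below; matching
# on semantic roles distinguishes who did the launching.
text_a = {"event": "launch", "actor": "NASA", "object": "shuttle"}
text_b = {"event": "launch", "actor": "ESA",  "object": "Ariane"}
query  = {"event": "launch", "actor": "NASA"}

hits = [t for t in (text_a, text_b) if frame_match(query, t)]
# hits contains only text_a
```

The point of the example is the retrieval distinction itself: both texts share the keyword, but only one fills the query's actor role.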
The script database from FRUMP will likewise be extended both by hand and automatically, by a learning component that generates new scripts from texts that have already been parsed. The learning component will evaluate the new scripts using feedback from the user and retain the best performers for future use.

The resulting information retrieval system will be evaluated on queries of the UseNet database, measuring both the relevant texts it fails to retrieve and the irrelevant texts it retrieves. Over six million characters appear on UseNet each week, so there should be enough data to study performance on a large database.

The main contribution of the work will be a demonstration that a text skimming retrieval system can make distinctions, based on semantic roles and other information, that word-based systems cannot. The script learning and dictionary access components are new capabilities that should be widely useful in other natural language applications.
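The two evaluation quantities described above correspond to the standard recall and precision measures; a minimal sketch of the computation, with made-up document IDs:

```python
# Illustrative evaluation: recall measures how many relevant texts were
# retrieved (its complement is the "relevant texts not retrieved"
# criterion), precision measures how many retrieved texts were relevant
# (its complement is the "irrelevant texts retrieved" criterion).

def evaluate(retrieved, relevant):
    """Return (recall, precision) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    return len(found) / len(relevant), len(found) / len(retrieved)

# Hypothetical query result: 4 texts retrieved, 3 were actually relevant.
recall, precision = evaluate(retrieved={"a1", "a2", "a3", "a4"},
                             relevant={"a1", "a2", "a5"})
# recall = 2/3 (a5 was missed), precision = 2/4 (a3 and a4 are irrelevant)
```

A text skimming system improves on a keyword baseline if it raises recall without lowering precision, which is exactly the stated objective.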