finin@prc.unisys.com (Tim Finin) (09/27/88)
                      CALL FOR PARTICIPATION

     Workshop on Evaluation of Natural Language Processing Systems

                       December 8-9, 1988
          Wayne Hotel, Wayne, PA (Suburban Philadelphia)

There has been much recent interest in the difficult problem of evaluating natural language systems. With the exception of natural language interfaces, there are few working systems in existence, and they tend to be concerned with very different tasks and to use equally different techniques. There has been little agreement in the field about training sets and test sets, or about clearly defined subsets of problems that constitute standards for different levels of performance. Even those groups that have attempted a measure of self-evaluation have often been reduced to discussing a system's performance in isolation, comparing its current performance to its previous performance rather than to that of another system. As this technology begins to move slowly into the marketplace, the need for useful evaluation techniques is becoming more and more obvious. The speech community has made some recent progress toward developing new methods of evaluation, and it is time that the natural language community followed suit. This is much more easily said than done and will require a concentrated effort on the part of the field.

Certain premises should underlie any discussion of evaluation of natural language processing systems:

o It should be possible to discuss system evaluation in general without having to state whether the purpose of the system is "question-answering" or "text processing." Evaluating a system requires the definition of an application task in terms of I/O pairs, which are equally applicable to question-answering, text processing, or generation.

o There are two basic types of evaluation: a) "black box evaluation," which measures system performance on a given task in terms of well-defined I/O pairs; and b) "glass box evaluation," which examines the internal workings of the system.
For example, glass box performance evaluation for a system that is supposed to perform semantic and pragmatic analysis should include the examination of predicate-argument relations, referents, and temporal and causal relations.

Given these premises, the workshop will be structured around the following three sessions:

(1) Defining "glass box evaluation" and "black box evaluation."

(2) Defining criteria for "black box evaluation": a proposal for establishing task-oriented benchmarks for NLP systems. (Session Chair: Beth Sundheim)

(3) Defining criteria for "glass box evaluation." (Session Chair: Jerry Hobbs)

Several different types of systems will be discussed, including question-answering systems, text processing systems, and generation systems.

Researchers interested in participating should submit a short (250-500 word) description of their experience, interests, and expected contributions to the workshop. In particular, those who have been involved in any evaluation efforts that they would like to report on should also include a short abstract (500-1000 words). The number of participants at the workshop must be restricted due to limited room size. The descriptions and abstracts will be reviewed by the following committee: Martha Palmer (Unisys), Beth Sundheim (NOSC), Ed Hovy (ISI), Tim Finin (Unisys), and Lynn Bates (BBN). This material should arrive at the address given below no later than October 1st. Responses will be sent to all who submit abstracts or descriptions by November 1st.

Martha Palmer
Unisys Paoli Research Center
PO Box 517
Paoli, PA 19301
palmer@prc.unisys.com
215-648-7228