finin@prc.unisys.com (Tim Finin) (09/27/88)
                      CALL FOR PARTICIPATION

     Workshop on Evaluation of Natural Language Processing Systems

                       December 8-9, 1988
          Wayne Hotel, Wayne, PA (Suburban Philadelphia)

There has been much recent interest in the difficult problem of evaluating natural language systems. With the exception of natural language interfaces, there are few working systems in existence, and they tend to be concerned with very different tasks and to use equally different techniques. There has been little agreement in the field about training sets and test sets, or about clearly defined subsets of problems that constitute standards for different levels of performance. Even those groups that have attempted a measure of self-evaluation have often been reduced to discussing a system's performance in isolation, comparing its current performance to its previous performance rather than to that of another system. As this technology begins to move slowly into the marketplace, the need for useful evaluation techniques is becoming more and more obvious. The speech community has made some recent progress toward developing new methods of evaluation, and it is time that the natural language community followed suit. This is much more easily said than done and will require a concentrated effort on the part of the field.

Certain premises should underlie any discussion of evaluation of natural language processing systems:

o It should be possible to discuss system evaluation in general without having to state whether the purpose of the system is "question-answering" or "text processing." Evaluating a system requires the definition of an application task in terms of I/O pairs, which are equally applicable to question-answering, text processing, or generation.

o There are two basic types of evaluation: a) "black box evaluation," which measures system performance on a given task in terms of well-defined I/O pairs; and b) "glass box evaluation," which examines the internal workings of the system.
For example, glass box performance evaluation for a system that is supposed to perform semantic and pragmatic analysis should include the examination of predicate-argument relations, referents, and temporal and causal relations.

Given these premises, the workshop will be structured around the following three sessions:

(1) Defining "glass box evaluation" and "black box evaluation."

(2) Defining criteria for "black box evaluation": a proposal for establishing task-oriented benchmarks for NLP systems. (Session Chair: Beth Sundheim)

(3) Defining criteria for "glass box evaluation." (Session Chair: Jerry Hobbs)

Several different types of systems will be discussed, including question-answering systems, text processing systems, and generation systems.

Researchers interested in participating should submit a short (250-500 word) description of their experience, interests, and expected contributions to the workshop. In particular, those who have been involved in any evaluation efforts that they would like to report on should also include a short abstract (500-1000 words). The number of participants at the workshop must be restricted due to limited room size. The descriptions and abstracts will be reviewed by the following committee: Martha Palmer (Unisys), Beth Sundheim (NOSC), Ed Hovy (ISI), Tim Finin (Unisys), and Lynn Bates (BBN). This material should arrive at the address given below no later than October 1st. Responses will be sent to all who submit abstracts or descriptions by November 1st.

Martha Palmer
Unisys Paoli Research Center
PO Box 517
Paoli, PA 19301
palmer@prc.unisys.com
215-648-7228