palmer@PRC.UNISYS.COM (09/02/88)
CALL FOR PARTICIPATION
Workshop on
Evaluation of Natural Language Processing Systems
Dec 8-9
Wayne Hotel, Wayne, PA (Philadelphia)
There has been much recent interest in the difficult problem of evaluating natural language systems. With the exception of natural language interfaces, there are few working systems in existence, and those that exist tend to be concerned with very different tasks and to use equally different techniques. There has been little agreement in the field about training sets and test sets, or about clearly defined subsets of problems that constitute standards for different levels of performance. Even those groups that have attempted a measure of self-evaluation have often been reduced to discussing a system's performance in isolation, comparing its current performance to its previous performance rather than to another system. As this technology begins to move slowly into the marketplace, the need for useful evaluation techniques is becoming more and more obvious. The speech community has made some recent progress toward developing new methods of evaluation, and it is time that the natural language community followed suit. This is much more easily said than done and will require a concentrated effort on the part of the field.
There are certain premises that should underlie any discussion of the evaluation of natural language processing systems:
(1) It should be possible to discuss system evaluation in general without having to state whether the purpose of the system is "question-answering" or "text processing." Evaluating a system requires the definition of an application task in terms of I/O pairs that are equally applicable to question-answering, text processing, or generation.

(2) There are two basic types of evaluation: a) "black box evaluation," which measures system performance on a given task in terms of well-defined I/O pairs; and b) "glass box evaluation," which examines the internal workings of the system. For example, glass box performance evaluation for a system that is supposed to perform semantic and pragmatic analysis should include the examination of predicate-argument relations, referents, and temporal and causal relations. (A minimal illustrative sketch of black box scoring over I/O pairs follows below.)
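To make the notion of black box evaluation over I/O pairs concrete, here is a minimal sketch, not part of the original announcement, of a harness that scores a system purely on input/output pairs. The toy test set, the nlp_system callable, and the exact-match metric are illustrative assumptions, not a proposed benchmark.

    # Minimal sketch of black box evaluation: the system is scored only on
    # well-defined I/O pairs, with no reference to its internal workings.
    # The task, test set, and exact-match metric here are assumptions
    # chosen for illustration only.

    def black_box_score(nlp_system, io_pairs):
        """Return the fraction of inputs whose output exactly matches the reference."""
        correct = sum(1 for text, expected in io_pairs if nlp_system(text) == expected)
        return correct / len(io_pairs)

    if __name__ == "__main__":
        # Toy benchmark: defined entirely by I/O pairs, never by the
        # system's internal design (that would be glass box territory).
        test_set = [
            ("Who chairs the black box session?", "Beth Sundheim"),
            ("Who chairs the glass box session?", "Jerry Hobbs"),
        ]
        toy_system = lambda q: "Beth Sundheim" if "black" in q else "Jerry Hobbs"
        print("Black-box accuracy:", black_box_score(toy_system, test_set))

A glass box evaluation of the same system would instead inspect intermediate representations, such as the predicate-argument relations or referents it computes, rather than only the final outputs.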
Given these premises, the workshop will be structured around the following three sessions:

1) Defining "glass box evaluation" and "black box evaluation."
2) Defining criteria for "black box evaluation": A Proposal for Establishing Task-Oriented Benchmarks for NLP Systems (Session Chair - Beth Sundheim)
3) Defining criteria for "glass box evaluation." (Session Chair - Jerry Hobbs)

Several different types of systems will be discussed, including question-answering systems, text processing systems, and generation systems.
Researchers interested in participating are requested to submit a short (250-500 word) description of their experience and interests, and of what they could contribute to the workshop. In particular, if they have been involved in any evaluation efforts that they would like to report on, they should include a short abstract (500-1000 words) as well. The number of participants at the workshop must be restricted due to limited room size. The descriptions and abstracts will be reviewed by the following committee: Martha Palmer (Unisys), Mitch Marcus (University of Pennsylvania), Beth Sundheim (NOSC), Ed Hovy (ISI), Tim Finin (Unisys), Lynn Bates (BBN). Submissions should arrive at the address given below no later than October 1st. Responses to all who submit abstracts or descriptions will be sent by November 1st.
Martha Palmer
Unisys
Research & Development
PO Box 517
Paoli, PA 19301
palmer@prc.unisys.com
(215) 648-7228