palmer@PRC.UNISYS.COM (09/02/88)
CALL FOR PARTICIPATION

Workshop on Evaluation of Natural Language Processing Systems
Dec 8-9, Wayne Hotel, Wayne, PA (Philadelphia)

There has been much recent interest in the difficult problem of evaluating natural language systems. With the exception of natural language interfaces there are few working systems in existence, and they tend to be concerned with very different tasks and use equally different techniques. There has been little agreement in the field about training sets and test sets, or about clearly defined subsets of problems that constitute standards for different levels of performance. Even those groups that have attempted a measure of self-evaluation have often been reduced to discussing a system's performance in isolation - comparing its current performance to its previous performance rather than to that of another system. As this technology begins to move slowly into the marketplace, the need for useful evaluation techniques is becoming more and more obvious. The speech community has made some recent progress toward developing new methods of evaluation, and it is time that the natural language community followed suit. This is much more easily said than done and will require a concentrated effort on the part of the field.

There are certain premises that should underlie any discussion of evaluation of natural language processing systems:

(1) It should be possible to discuss system evaluation in general without having to state whether the purpose of the system is "question-answering" or "text processing." Evaluating a system requires the definition of an application task in terms of I/O pairs, which are equally applicable to question-answering, text processing, or generation.

(2) There are two basic types of evaluation: a) "black box evaluation," which measures system performance on a given task in terms of well-defined I/O pairs; and b) "glass box evaluation," which examines the internal workings of the system.
For example, glass box performance evaluation for a system that is supposed to perform semantic and pragmatic analysis should include the examination of predicate-argument relations, referents, and temporal and causal relations.

Given these premises, the workshop will be structured around the following three sessions:

1) Defining "glass box evaluation" and "black box evaluation."

2) Defining criteria for "black box evaluation": "A Proposal for Establishing Task-Oriented Benchmarks for NLP Systems." (Session Chair - Beth Sundheim)

3) Defining criteria for "glass box evaluation." (Session Chair - Jerry Hobbs)

Several different types of systems will be discussed, including question-answering systems, text processing systems, and generation systems.

Researchers interested in participating are requested to submit a short (250-500 word) description of their experience and interests, and what they could contribute to the workshop. In particular, if they have been involved in any evaluation efforts that they would like to report on, they should include a short abstract (500-1000 words) as well. The number of participants at the workshop must be restricted due to limited room size. The descriptions and abstracts will be reviewed by the following committee: Martha Palmer (Unisys), Mitch Marcus (University of Pennsylvania), Beth Sundheim (NOSC), Ed Hovy (ISI), Tim Finin (Unisys), Lynn Bates (BBN). Submissions should arrive at the address given below no later than October 1st. Responses to all who submit abstracts or descriptions will be sent by November 1st.

Martha Palmer
Unisys Research & Development
PO Box 517
Paoli, PA 19301
palmer@prc.unisys.com
(215) 648-7228