hemphill@csc000.csc.ti.com (Charles Hemphill) (05/13/89)
The speech research branch at Texas Instruments would like information about public domain databases. A database is needed to support research toward combining speech and natural language systems to produce spoken language systems (SLSs). For more details, continue reading.

Our primary goal is to promote research on SLS algorithms across sites through objective evaluation using a recorded corpus of speech. Ideally, evaluation would proceed by comparing the logical form produced by the SLS under test to a predetermined logical form corresponding to the utterance. Unfortunately, comparison of logical forms is an undecidable problem (Boolos and Jeffrey, 1980). Instead, we propose to compare the answers returned by the queries (a set of tuples, yes/no, or numbers). This allows us to eliminate all variables, quantifiers, and logical connectives, substantially simplifying the comparison problem.

The underlying database should support a number of experimental scenarios for a wide class of novice users. Two examples are personnel databases and airline travel databases. Personnel databases are readily available, but do not easily sustain the interest of subjects who are not personnel administrators. Airline travel databases could provide many scenarios for a wide class of users, but the data and database schema are normally proprietary. Since the sponsor is DARPA, databases from government applications are especially desirable. The ideal database should contain on the order of a few thousand records, at least five tables (with interesting relationships between them), and at least 30 attributes.

Corpus collection will proceed using an SLS simulation to elicit natural speech and syntax. During the simulation, a human expert will translate spoken input into the appropriate DB queries. This should allow limited collection of dialog phenomena where anaphora and ellipsis refer to previous queries and answers.
Queries referring to graphics or report generation will be eliminated. The corpus collected will include orthographic transcriptions and should be useful to both speech and natural language researchers separately.

A previously collected speech corpus, containing read speech recorded under studio-quality conditions from several hundred speakers of various dialects, is now available from the National Institute of Standards and Technology (NIST, formerly NBS). This corpus supports research on continuous speech recognition with either speaker-dependent or speaker-independent algorithms.
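
To make the proposed answer comparison concrete, here is a minimal sketch in Python. It is only an illustration, not part of any existing evaluation software: the function names (normalize, answers_match), the tolerance used for numbers, and the choice to ignore row order and duplicate rows are all assumptions of the sketch.

```python
import math
from numbers import Number


def normalize(answer):
    """Map an answer to a canonical form: bool, float, or frozenset of tuples.

    Hypothetical helper; the three answer kinds follow the post
    (yes/no, numbers, a set of tuples).
    """
    # Check bool first, because Python treats bool as a kind of number.
    if isinstance(answer, bool):
        return answer
    if isinstance(answer, Number):
        return float(answer)
    # Otherwise assume an iterable of rows; row order and duplicates are ignored.
    return frozenset(tuple(row) for row in answer)


def answers_match(reference, hypothesis, tol=1e-9):
    """Return True if the answer from the system under test matches the reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # A yes/no answer never matches a number or a set of tuples, and so on.
    if type(ref) is not type(hyp):
        return False
    if isinstance(ref, float):
        # Assumed tolerance so that 3 and 3.0 (or rounding noise) compare equal.
        return math.isclose(ref, hyp, rel_tol=tol, abs_tol=tol)
    return ref == hyp
```

With this sketch, two queries that return the same rows in a different order are scored as equal, while a yes/no answer is never confused with the number 1. Whether duplicate rows and row order should count is a design choice a real evaluation would have to settle.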