[comp.databases] public domain database needed

hemphill@csc000.csc.ti.com (Charles Hemphill) (05/13/89)

The speech research branch at Texas Instruments would like information
about public domain databases.  A database is needed to support
research toward combining speech and natural language systems to
produce spoken language systems (SLSs).  Details follow.

Our primary goal is to promote research on SLS algorithms across sites
through objective evaluation using a recorded corpus of speech.
Ideally, evaluation would proceed by comparing the logical form
produced by the SLS under test to a predetermined logical form
corresponding to the utterance.  Unfortunately, deciding whether two
logical forms are equivalent is, in general, an undecidable problem
(Boolos and Jeffrey, 1980).  Instead, we propose to compare the answers
that the resulting database queries return (a set of tuples, yes/no,
or a number).  Answers contain no variables, quantifiers, or logical
connectives, which substantially simplifies the comparison problem.
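
To make the answer-based comparison concrete, here is a minimal sketch
in Python.  It is only an illustration under stated assumptions, not
the evaluation procedure itself: the function names `normalize` and
`answers_match` are hypothetical, row order and duplicate rows are
assumed not to matter (answers are treated as sets of tuples), and
numbers are compared exactly with no tolerance.

```python
from numbers import Number


def normalize(answer):
    """Map an answer to a canonical form tagged with its kind.

    Assumed answer kinds (from the post): yes/no, a number, or a set of tuples.
    """
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(answer, bool):
        return ("yes/no", answer)
    if isinstance(answer, Number):
        return ("number", float(answer))
    # Anything else is treated as an iterable of rows; a frozenset makes
    # the comparison independent of row order and duplicates (assumption).
    return ("tuples", frozenset(tuple(row) for row in answer))


def answers_match(produced, reference):
    """Return True when the system's answer equals the reference answer."""
    return normalize(produced) == normalize(reference)


if __name__ == "__main__":
    # Hypothetical rows, invented purely for illustration.
    reference = [("AA", 101), ("UA", 7)]
    produced = [("UA", 7), ("AA", 101)]
    print(answers_match(produced, reference))
    print(answers_match(True, False))
```

Because each answer is tagged with its kind, a yes/no answer never
matches a numeric answer, even though Python would otherwise treat
`True` and `1` as equal.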

The underlying database should support a number of experimental
scenarios for a wide class of novice users.  Two candidates are
personnel databases and airline travel databases.  Personnel
databases are readily available, but do not easily hold the interest
of subjects who are not personnel administrators.  Airline travel
databases could provide many scenarios for a wide class of users, but
the data and database scheme are normally proprietary.  Since the
sponsor is DARPA, databases from government applications are
especially desirable.  The ideal database should contain on the order
of a few thousand records, at least five tables (with interesting
relationships between them) and at least 30 attributes.
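
The size criteria above could be checked mechanically against a
candidate database.  The following is a rough sketch, assuming the
candidate has been loaded into SQLite and is accessed through Python's
standard `sqlite3` module.  The helper names `summarize` and
`meets_requirements` are hypothetical, and the threshold of 1000
records is one assumed reading of "on the order of a few thousand".
The sketch does not attempt to judge whether the relationships between
tables are interesting.

```python
import sqlite3


def summarize(conn):
    """Count user tables, attributes (columns), and records (rows)."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    # Skip SQLite's internal bookkeeping tables.
    tables = [n for n in names if not n.startswith("sqlite_")]
    attributes = 0
    records = 0
    for table in tables:
        # Table names come from the catalog itself, so quoting them is enough here.
        attributes += len(conn.execute(f'PRAGMA table_info("{table}")').fetchall())
        records += conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return {"tables": len(tables), "attributes": attributes, "records": records}


def meets_requirements(summary, min_tables=5, min_attributes=30, min_records=1000):
    """Check a summary against the stated minimums (min_records is an assumption)."""
    return (summary["tables"] >= min_tables
            and summary["attributes"] >= min_attributes
            and summary["records"] >= min_records)


if __name__ == "__main__":
    # Tiny in-memory example with made-up tables, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flight (id INTEGER, origin TEXT, dest TEXT)")
    conn.execute("CREATE TABLE fare (flight_id INTEGER, price REAL)")
    conn.execute("INSERT INTO flight VALUES (1, 'DFW', 'BOS')")
    summary = summarize(conn)
    print(summary, meets_requirements(summary))
```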

Corpus collection will proceed using an SLS simulation to elicit
natural speech and syntax.  During the simulation, a human expert will
translate spoken input into the appropriate DB queries.  This should
allow limited collection of dialog phenomena where anaphora and
ellipsis refer to previous queries and answers.  Queries referring to
graphics or report generation will be eliminated.  The corpus
collected will include orthographic transcriptions and should be
useful to both speech and natural language researchers separately.

A previously collected speech corpus, containing read speech recorded
under studio-quality conditions from several hundred speakers of
various dialects, is now available from the National Institute of
Standards and Technology (NIST, formerly NBS).  This corpus supports
research on continuous speech recognition using either
speaker-dependent or speaker-independent algorithms.