sidney%MIT-OZ@MIT-MC.ARPA (08/25/84)
From: Sidney Markowitz <sidney%MIT-OZ@MIT-MC.ARPA>
Date: 22 Aug 1984 22:05:18-PDT

    From: doshi%umn-cs.csnet@csnet-relay.arpa
    Subject: Question about HEARSAY-II.

    I have a question about the HEARSAY-II system [Erman et al. 1980]. What exactly is the HEARSAY system required/supposed to do? That is, what is the meaning of the phrase "speech understanding system"?

I am not familiar with the HEARSAY-II system; however, I am answering your question based on the following lines from the quotes you provided, and on some comments of yours that indicate you are not familiar with certain points of view common among natural language researchers. The quotes:

    (1) page 213: "The HEARSAY-II reconstructs an intention ...."
    (2) on the strong syntactic/semantic/task constraints
    (3) - with a slightly artificial syntax and highly constrained task
    (4) - tolerating less than 10% semantic error

Researchers pretty much agree that in order to understand natural language, we need an understanding of the meaning and context of the communication. It is not enough simply to look up words in a dictionary and/or apply rules of grammar to sentences. A classic example is the pair of sentences "Time flies like an arrow." and "Fruit flies like a banana."

The problem with speech is even worse -- it turns out that even to separate the syllables in continuous speech you need to have some understanding of what the speaker is talking about! You can discover this for yourself by trying to hear the sounds of the words when someone is speaking a foreign language: you can't even repeat them correctly as nonsense syllables. What this implies is an approach to speech recognition that goes beyond pattern recognition to include understanding of utterances. This in turn implies that the system has some understanding of the "world view" of the speaker, i.e., common-sense knowledge and the probable intentions of the speaker.

AI researchers have attempted to make the problem tractable by restricting the "domain" of a system. A famous example is the "blocks world" used by Terry Winograd in his doctoral thesis on a natural language understanding system, SHRDLU. All SHRDLU knew about was its little world of blocks of various shapes and colors, its robot arm, and the possible actions and interactions of those elements. Given those limitations, and the additional assumption that anything said to it was either a question about the state of its world or else a command, Winograd was able to devise a system in which syntax, semantics, and task performance all interacted. For example, an ambiguity in syntax could be resolved if only one grammatical interpretation made semantic sense. You can see how this approach is implied by the four quotes above.

With this as background, let's proceed to your questions...

    Let me explain my confusion with examples. Does the system do one of the following:
    1) Accept speech as input, then try to output what(ever) was spoken or might have been spoken?
    2) Or, accept speech as input and UNDERSTAND it?
    Now, 1) above is, I think, speech RECOGNITION. DARPA did not want just that. Then, what is (are) the meaning(s) of UNDERSTAND?
    - If I say "Alligators can fly", should the system repeat this and also tell me that it is "not true"? Is this called UNDERSTANDING?
    - If I say "I go house", should the system repeat this and also add that there is a "grammatical error"? Is this called UNDERSTANDING?
    - Or, if HAYES-ROTH claims "I am ERMAN", should the system say "No, you are not ERMAN"? I don't think that HEARSAY was supposed to do this (it does not have vision, etc.), but you will agree that that is also UNDERSTANDING. Note that the above claim by HAYES-ROTH would be true if:
      - he had changed his last name
      - he was merely QUOTING what ERMAN might have said somewhere
      - etc.
    In light of the above examples, what does it mean to say that HEARSAY-II understands speech?

The references to "tasks" in the quotes you provided are a clue that the authors are thinking of "understanding" in terms of the ability to perform a task that is requested by the speaker. The examples in your questions are statements that would need to be reframed as tasks. It is possible that the system could be set up so that a statement like "Alligators can fly" is an implied command to add that fact to the knowledge base, perhaps first checking for contradictions. But you probably ought to think of an example of a restricted task domain first, and then think about what "understanding" would mean in that context. For example, given a blocks-world domain the system might respond to a statement such as "Place a blue cube on the red pyramid" by saying "I can't place anything on top of a pyramid". There is much that can be done with modelling the speaker's intentions and assumptions, which would affect the sophistication of the resulting system, but that's the general idea.

  -- Sidney Markowitz <sidney%mit-oz@mit-mc.ARPA>
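The blocks-world idea above can be made concrete with a small sketch. The following is only an illustration, not code from SHRDLU or HEARSAY-II: the toy world model, the pre-parsed command format, and names such as candidates() and respond() are assumptions invented for this example. It shows the two mechanisms the posting describes: keeping only the interpretations that make semantic sense in the domain, and treating "understanding" as performing or refusing the requested task.

    # A minimal illustrative sketch (not SHRDLU or HEARSAY-II code) of
    # "understanding as task performance" in a toy blocks world.
    from dataclasses import dataclass, field

    @dataclass
    class Block:
        name: str
        shape: str                    # "cube" or "pyramid"
        color: str
        supports: list = field(default_factory=list)  # blocks resting on this one

    WORLD = {
        "b1": Block("b1", "cube", "blue"),
        "r1": Block("r1", "pyramid", "red"),
        "g1": Block("g1", "cube", "green"),
    }

    def candidates(description):
        """All blocks in the world matching a (color, shape) description."""
        color, shape = description
        return [b for b in WORLD.values() if b.color == color and b.shape == shape]

    def respond(command):
        """command = ("place", (color, shape), (color, shape)) -> reply string."""
        _, obj_desc, dest_desc = command

        # "Semantic" disambiguation: keep only interpretations that denote
        # something actually present in this little world.
        objs, dests = candidates(obj_desc), candidates(dest_desc)
        if not objs or not dests:
            return "I don't know of any such object."
        if len(objs) > 1 or len(dests) > 1:
            return "Which one do you mean?"
        obj, dest = objs[0], dests[0]

        # Task constraint of this toy domain: nothing can rest on a pyramid.
        if dest.shape == "pyramid":
            return "I can't place anything on top of a pyramid."

        dest.supports.append(obj.name)
        return "OK."

    if __name__ == "__main__":
        # "Place a blue cube on the red pyramid."
        print(respond(("place", ("blue", "cube"), ("red", "pyramid"))))
        # "Place a blue cube on the green cube."
        print(respond(("place", ("blue", "cube"), ("green", "cube"))))

The point is only that, once the domain is restricted enough, "understanding" can be cashed out as choosing the interpretation that makes sense in the domain and then performing or refusing the requested task.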
mmt@dciem.UUCP (Martin Taylor) (08/31/84)
================
It turns out that even to separate the syllables in continuous speech you need to have some understanding of what the speaker is talking about! You can discover this for yourself by trying to hear the sounds of the words when someone is speaking a foreign language. You can't even repeat them correctly as nonsense syllables.
================

I used to believe this myth myself, but my various short visits to Europe (mostly 1-3 week trips) have convinced me otherwise. There is no point in trying to repeat the syllables as nonsense, partly because the sounds are not in your phonetic vocabulary. More to the point, syllable separation definitely preceded understanding: I HAD to learn to separate the syllables of German long before I could understand anything. (I still understand only a tiny fraction, but I can now parse most sentences into kernel and bound morphemes, because I now know most of the common bound ones.) My understanding of written German is a little better, and when I do understand a German sentence, it is because I can transcribe it into a visual representation with some blanks.

(Incidentally, I also do some research in speech recognition, so I am well aware of the syllable segmentation problem. There do exist segmentation algorithms that correctly segment over 95% of the syllables in connected speech without any attempt to identify phonemes, let alone words or the "meaning" of speech. Mermelstein, now in Montreal, and Mangold in Ulm, Germany, are names that come to mind.)
--
Martin Taylor
{allegra,linus,ihnp4,floyd,ubc-vision}!utzoo!dciem!mmt
{uw-beaver,qucis,watmath}!utcsrgv!dciem!mmt
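The segmentation algorithms mentioned in the parenthetical above work on acoustic evidence alone. The sketch below is not Mermelstein's or Mangold's published method; it is only a simplified illustration of the general idea of treating peaks in a smoothed energy envelope as syllable nuclei and sufficiently deep dips between them as boundaries. The window sizes, the dip threshold, and the function names are all assumptions chosen for illustration.

    # Simplified, purely acoustic syllable segmentation: no phonemes, no
    # words, no "meaning" -- just an energy envelope and its dips.
    import numpy as np

    def energy_envelope(signal, rate, win_ms=25.0, hop_ms=10.0):
        """Short-time energy of `signal`, one value per hop."""
        win = int(rate * win_ms / 1000.0)
        hop = int(rate * hop_ms / 1000.0)
        starts = range(0, max(len(signal) - win, 1), hop)
        return np.array([np.sum(signal[i:i + win] ** 2) for i in starts])

    def syllable_boundaries(signal, rate, dip_ratio=0.5):
        """Frame indices of energy dips deep enough to separate syllables."""
        env = energy_envelope(signal, rate)
        env = np.convolve(env, np.ones(3) / 3.0, mode="same")  # light smoothing

        boundaries = []
        last_peak = env[0]
        for t in range(1, len(env) - 1):
            if env[t] > env[t - 1] and env[t] >= env[t + 1]:
                last_peak = max(last_peak, env[t])      # a syllable nucleus
            elif env[t] < env[t - 1] and env[t] <= env[t + 1]:
                if env[t] < dip_ratio * last_peak:      # a deep enough dip
                    boundaries.append(t)
                    last_peak = env[t]
        return boundaries

    if __name__ == "__main__":
        # Two synthetic "syllables": noise bursts separated by near-silence.
        rate = 8000
        burst = np.random.randn(2000) * np.hanning(2000)
        speech = np.concatenate([burst, np.zeros(800), burst])
        print(syllable_boundaries(speech, rate))  # about one boundary, in the gap

Whether a dip counts as a syllable boundary is decided entirely from the signal, which is consistent with Taylor's point that segmentation can precede any understanding of what is being said.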