AFB%MIT-OZ@MIT-MC.ARPA (08/29/84)
From: Aaron F. Bobick <AFB%MIT-OZ@MIT-MC.ARPA>

About speech recognition:

    From: Sidney Markowitz <sidney%MIT-OZ@MIT-MC.ARPA>
    It turns out that even to separate the syllables in continuous speech
    you need to have some understanding of what the speaker is talking
    about!  You can discover this for yourself by trying to hear the
    sounds of the words when someone is speaking a foreign language.  You
    can't even repeat them correctly as nonsense syllables.  What this
    implies is an approach to speech recognition that goes beyond pattern
    recognition to include understanding of utterances.  This in turn
    implies that the system has some understanding of the "world view" of
    the speaker, i.e., common sense knowledge and the probable intentions
    of the speaker.....

Many psycho-linguists would dispute this.  The problem with the foreign
language example is that you don't recognize WORDS, not that you don't
understand the utterance (for now let us define understanding as building
some sort of SEMANTIC model; the details don't matter).  Consider the
classic "Colorless green ideas sleep furiously."  I doubt one can
"understand" this in any plausible way, yet its encoding is easy.  Even if
one removes grammar and is listening to a randomized reading of Webster's
dictionary, one can easily parse the string into syllables and words.

In fact, *except under noise conditions much worse than normal
conversation*, there is psycho-linguistic evidence that context does not
greatly affect word recognition by humans in terms of the parsing of the
input signal.  (I am oversimplifying a little; there is also evidence that
context can help you make judgements about incoming words and syllables.
However, this may be a post-access phenomenon, a sort of surprise effect
when an anomalous word or syllable is encountered; the jury is still out.
Regardless, it is certainly reasonable to consider a context-independent
word recognition system.)

Therefore, it is clearly possible to consider speech *recognition* as
separate from understanding.  Hearsay (I or II) does not; some
psychologists (and, by the way, many AI speech hackers) do.

Stuck in the middle again ......

aaron bobick  (afb%mit-oz@mit-mc)
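As an illustration of the point about context-independent segmentation, here is a
toy sketch.  It is not from the post above: the lexicon, function name, and input
string are invented for the example, and a real recognizer works on acoustic
features rather than text.  It only illustrates the architectural claim that
carving a stream into known words can be posed without any semantic model.

# Toy sketch: segment a symbol stream into lexicon words by dynamic
# programming, with no model of meaning or context.  Lexicon and input
# are invented stand-ins for illustration only.

LEXICON = {"colorless", "green", "ideas", "sleep", "furiously"}

def segment(stream, lexicon=LEXICON):
    """Return one segmentation of `stream` into lexicon words, or None."""
    n = len(stream)
    # best[i] holds a word list covering stream[:i], or None if unreachable
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and stream[j:i] in lexicon:
                best[i] = best[j] + [stream[j:i]]
                break
    return best[n]

if __name__ == "__main__":
    # Semantically anomalous, yet it segments cleanly with no semantics:
    print(segment("colorlessgreenideassleepfuriously"))
    # -> ['colorless', 'green', 'ideas', 'sleep', 'furiously']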
polard@fortune.UUCP (Henry Polard) (09/05/84)
<fowniymz for dh6 layn iyt6r>

    Which hip was burned?
    Which ship was burned?
    Which chip was burned?
and
    Which Chip was spurned?

all sound the same when spoken at the speed of conversational speech.
This is evidence that in order to recognize words in continuous speech you
(and presumably a speech-recognition apparatus) need to understand what
the speaker is talking about.

There seem to be two reasons why understanding is necessary for word
recognition in continuous speech:

1. The existence of homonyms.  This is why "It's a good read." sounds the
   same as "It's a good reed," and why the two sentences could not be
   distinguished without knowledge of the context.

2. Sandhi, or sound changes at word boundaries.  In conversation the
   sounds at the end of a word tend to blend into the sounds at the
   beginning of the next, so that words run together and sound different
   from the way they would when said in isolation.  The resulting
   ambiguities are usually resolved by context.

Speech rarely occurs without some sort of context, and even when it does,
the first thing that usually happens is that a context is established for
what is to follow.

To paraphrase Edsger Dijkstra: "Asking whether computers will understand
speech is like asking whether submarines swim."
-- 
Henry Polard (You bring the flames - I'll bring the marshmallows.)
{ihnp4,cbosgd,amd}!fortune!polard
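To make the homonym point concrete, here is a toy sketch of picking between
homophones such as "read"/"reed" from surrounding words.  It is not from the
post above: the association table, function name, and scores are invented, and
a real system would use a statistical language model; the point it illustrates
is only that the acoustics cannot decide and the context must.

# Toy sketch: resolve a homophone by scoring each candidate against the
# words around it.  Association counts are invented stand-ins.

ASSOCIATIONS = {
    "read": {"book": 5, "novel": 4, "library": 3},
    "reed": {"marsh": 5, "clarinet": 4, "river": 3},
}

def disambiguate(candidates, context_words):
    """Pick the candidate whose associations best match the context."""
    def score(word):
        table = ASSOCIATIONS.get(word, {})
        return sum(table.get(c, 0) for c in context_words)
    return max(candidates, key=score)

if __name__ == "__main__":
    # "It's a good ___" after talk of a novel vs. after talk of a marsh:
    print(disambiguate(["read", "reed"], ["novel", "book"]))   # -> read
    print(disambiguate(["read", "reed"], ["marsh", "river"]))  # -> reed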