waibel@atr-la.atr.junet (Alex Waibel) (11/09/87)
A few weeks ago there was a discussion on AI-list about connectionist (neural) networks being afflicted by an inability to handle shifted patterns. Indeed, shift-invariance is of critical importance to applications such as speech recognition. Without it, a speech recognition system has to rely on precise segmentation, and in practice reliable, error-free segmentation cannot be achieved. For this reason, methods such as dynamic time warping and now Hidden Markov Models have been very successful and have achieved high recognition performance. Standard neural nets have done well in speech so far, but due to this lack of shift-invariance (as discussed on AI-list) a number of these nets have been limping along in comparison to these other techniques.

Recently, we have implemented a time-delay neural network (TDNN) here at ATR, Japan, and demonstrated that it is shift-invariant. We have applied it to speech and compared it to the best of our Hidden Markov Models. The results show that its error rate is four times lower than that of the best of our Hidden Markov Models. The abstract of our report follows:

    Phoneme Recognition Using Time-Delay Neural Networks

    A. Waibel, T. Hanazawa, G. Hinton^, K. Shikano, K. Lang*
    ATR Interpreting Telephony Research Laboratories

Abstract

In this paper we present a Time-Delay Neural Network (TDNN) approach to phoneme recognition which is characterized by two important properties: 1.) Using a 3-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces. The TDNN learns these decision surfaces automatically using error backpropagation. 2.) The time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time, and hence not blurred by temporal shifts in the input.
As a recognition task, the speaker-dependent recognition of the phonemes "B", "D", and "G" in varying phonetic contexts was chosen. For comparison, several discrete Hidden Markov Models (HMMs) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct, while the rate obtained by the best of our HMMs was only 93.7%. Closer inspection reveals that the network "invented" well-known acoustic-phonetic features (e.g., F2-rise, F2-fall, vowel-onset) as useful abstractions. It also developed alternate internal representations to link different acoustic realizations to the same concept.

^ University of Toronto
* Carnegie-Mellon University

For copies please write or contact:

    Dr. Alex Waibel
    ATR Interpreting Telephony Research Laboratories
    Twin 21 MID Tower, 2-1-61 Shiromi, Higashi-ku
    Osaka 540, Japan
    phone: +81-6-949-1830

Please send email to my net-address at Carnegie-Mellon University: ahw@CAD.CS.CMU.EDU
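To illustrate the time-delay idea from the abstract: a minimal NumPy sketch, not the authors' implementation. The layer applies one shared weight matrix at every time step (the "time-delay arrangement"), so the same acoustic feature produces the same response wherever it occurs; collapsing each unit's activations over time then discards position entirely. Layer sizes, the tanh nonlinearity, and the time-collapsing sum are illustrative assumptions.

```python
import numpy as np

def tdnn_layer(x, w, b):
    """One time-delay layer (illustrative sketch).

    The same weights w (delays x features x units) are replicated at
    every time step, so learned features are position-independent.
    """
    delays, n_feat, n_units = w.shape
    T = x.shape[0]
    out = np.empty((T - delays + 1, n_units))
    for t in range(T - delays + 1):
        window = x[t:t + delays]  # (delays, n_feat) local temporal context
        out[t] = np.tanh(np.tensordot(window, w, axes=([0, 1], [0, 1])) + b)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 4, 5))        # 3 delays, 4 input features, 5 units
b = rng.normal(size=5)
pattern = rng.normal(size=(6, 4))     # a short "acoustic" pattern

# Embed the same pattern at two different temporal positions (zero elsewhere).
x1 = np.zeros((20, 4)); x1[2:8] = pattern
x2 = np.zeros((20, 4)); x2[9:15] = pattern

# Frame-by-frame outputs are shifted copies of each other; summing each
# unit's activation over time collapses the position information, so the
# shifted inputs yield identical evidence for the classifier.
s1 = tdnn_layer(x1, w, b).sum(axis=0)
s2 = tdnn_layer(x2, w, b).sum(axis=0)
print(np.allclose(s1, s2))  # True: shift-invariant
```

The per-frame outputs of the two inputs differ (they are time-shifted versions of one another); only the time-collapsed sums coincide, which is the property that removes the need for precise segmentation.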