[comp.ai.digest] Shift-Invariant Neural Nets for Speech Recognition

waibel@atr-la.atr.junet (Alex Waibel) (11/09/87)

A few weeks ago there was a discussion on AI-list about connectionist
(neural) networks being afflicted by an inability to handle shifted patterns.
Indeed, shift-invariance is of critical importance to applications such as
speech recognition.  Without it, a speech recognition system has to rely
on precise segmentation, and in practice reliable, error-free segmentation
cannot be achieved.  For this reason, methods such as dynamic time warping
and, more recently, Hidden Markov Models have been very successful and have
achieved high recognition performance.  Standard neural nets have done well
in speech so far, but due to this lack of shift-invariance (as discussed on
AI-list) a number of these nets have been limping along in comparison to
these other techniques.

Recently, we have implemented a time-delay neural network (TDNN) here at
ATR in Japan and demonstrated that it is shift-invariant.  We have applied
it to speech and compared it to the best of our Hidden Markov Models.  The
results show that its error rate (1.5% vs. 6.3%) is about four times lower
than that of the best of our Hidden Markov Models.
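To make the time-delay idea concrete, here is a minimal sketch (in modern
Python with NumPy; this is not our actual implementation, and all names are
illustrative): one layer of units, each looking at a short window of input
frames, with the SAME weights replicated at every time position.  Because
the weights are shared, shifting the input in time merely shifts the
activations, so the evidence collected over time is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdnn_layer(x, w, b):
    """One weight-shared time-delay layer (illustrative sketch).
    x: (frames, coeffs) spectrogram-like input
    w: (window, coeffs) weights, shared across all time positions
    b: scalar bias
    Returns one activation per valid window position."""
    window = w.shape[0]
    acts = []
    for t in range(x.shape[0] - window + 1):
        acts.append(np.tanh(np.sum(x[t:t + window] * w) + b))
    return np.array(acts)

# Toy "spectrogram": 10 frames x 4 coefficients, with a pattern at frames 2-4
x = np.zeros((10, 4))
x[2:5] = rng.normal(size=(3, 4))

# The same pattern shifted 3 frames later in time
x_shift = np.roll(x, 3, axis=0)

w = rng.normal(size=(3, 4))
a1 = tdnn_layer(x, w, 0.1)
a2 = tdnn_layer(x_shift, w, 0.1)

# Shift-invariance: the set of activations is the same either way,
# merely displaced in time, so integrating over time loses nothing.
print(np.allclose(sorted(a1), sorted(a2)))  # prints True
```

A standard fully-connected net, by contrast, would assign different
weights to each frame position, so the shifted input would produce a
completely different activation pattern.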
The abstract of our report follows:

	      Phoneme Recognition  Using Time-Delay Neural Networks

	      A. Waibel, T. Hanazawa, G. Hinton^, K. Shikano, K. Lang*
		 ATR Interpreting Telephony Research Laboratories

				Abstract

	In this paper we present a Time Delay Neural Network (TDNN) approach
	to phoneme recognition which is characterized by two important
	properties: 1.) Using a 3-layer arrangement of simple computing
	units, a hierarchy can be constructed that allows for the formation
	of arbitrary nonlinear decision surfaces.  The TDNN learns these
	decision surfaces automatically using error backpropagation.
	2.) The time-delay arrangement enables the network to discover
	acoustic-phonetic features and the temporal relationships between
	them independent of position in time and hence not blurred by
	temporal shifts in the input.

	As a recognition task, the speaker-dependent recognition of the
	phonemes "B", "D", and "G" in varying phonetic contexts was chosen.
	For comparison, several discrete Hidden Markov Models (HMM) were
	trained to perform the same task.  Performance evaluation over 1946
	testing tokens from three speakers showed that the TDNN achieves a
	recognition rate of 98.5% correct while the rate obtained by the
	best of our HMMs was only 93.7%.  Closer inspection reveals that
	the network "invented" well-known acoustic-phonetic features (e.g.,
	F2-rise, F2-fall, vowel-onset) as useful abstractions.  It also
	developed alternate internal representations to link different
	acoustic realizations to the same concept.

^ University of Toronto
* Carnegie-Mellon University

For copies please write or contact:
Dr. Alex Waibel
ATR Interpreting Telephony Research Laboratories
Twin 21 MID Tower, 2-1-61 Shiromi, Higashi-ku
Osaka, 540, Japan
phone: +81-6-949-1830
Please send Email to my net-address at Carnegie-Mellon University:
						  ahw@CAD.CS.CMU.EDU