AIList-REQUEST@SRI-AI.ARPA (AIList Moderator Kenneth Laws) (01/21/86)
AIList Digest            Monday, 20 Jan 1986      Volume 4 : Issue 10

Today's Topics:
  Machine Learning - Connectionist Speech Machine

----------------------------------------------------------------------

Date: Wed, 15 Jan 86 23:06 EST
From: Tim Finin <Tim%upenn.csnet@CSNET-RELAY.ARPA>
Subject: nettalk

Several people inquired about the work of Terrence Sejnowski (of Johns
Hopkins) which was reported on the Today show recently.  This abstract is
from a talk given by Sejnowski here at Penn in October '85:

        NETTALK: TEACHING A MASSIVELY-PARALLEL NETWORK TO TALK

                        TERRENCE J. SEJNOWSKI
                        BIOPHYSICS DEPARTMENT
                       JOHNS HOPKINS UNIVERSITY
                         BALTIMORE, MARYLAND

Text to speech is a difficult problem for rule-based systems because
English pronunciation is highly context dependent and there are many
exceptions to phonological rules.  A more suitable knowledge
representation for correspondences between letters and phonemes will be
described in which rules and exceptions are treated uniformly and can be
determined with a learning algorithm.  The architecture is a layered
network of several hundred simple processing units with several thousand
weights on the connections between the units.  The training corpus is
continuous informal speech transcribed from tape recordings.  Following
training on 1000 words from this corpus the network can generalize to
novel text.  Even though this network was not designed to mimic human
learning, the development of the network in some respects resembles the
early stages in human language acquisition.  It is conjectured that the
parallel architecture and learning algorithm will also be effective on
other problems which depend on evidential reasoning from previous
experience.

(No - I don't have his net address.  Tim.)

------------------------------

Date: 16 Jan 86 1225 PST
From: Richard Vistnes <RV@SU-AI.ARPA>
Subject: Johns Hopkins learning machine: info

See AIList Digest V3 #183 (10 Dec 1985) for a talk given at Stanford a
little while ago that sounds very similar.
The person is:

        Terrence J. Sejnowski
        Biophysics Department
        Johns Hopkins University
        Baltimore, MD  21218

(I didn't attend the talk.)

        -Richard Vistnes

------------------------------

Date: Sun, 19 Jan 86 0:19:10 EST
From: Terry Sejnowski <terry@hopkins-eecs-bravo.ARPA>
Subject: Reply to Inquiries

NBC ran a short segment last Monday, January 13, on the Today Show about
my research on a connectionist model of text-to-speech.  The segment was
meant for a general audience (waking up) and all the details were left
out, so here is an abstract for those who have asked for more
information.  A technical report is available (Johns Hopkins Electrical
Engineering and Computer Science Technical Report EECS-8601) upon
request.

         NETtalk: A Parallel Network that Learns to Read Aloud

                         Terrence Sejnowski
                       Department of Biophysics
                       Johns Hopkins University
                         Baltimore, MD 21218

                        Charles R. Rosenberg
                       Department of Psychology
                         Princeton University
                         Princeton, NJ 08540

Unrestricted English text can be converted to speech by applying
phonological rules and handling exceptions with a look-up table.
However, this approach is highly labor intensive since each entry and
rule must be hand-crafted.  NETtalk is an alternative approach that is
based on an automated learning procedure for a parallel network of
deterministic processing units.  After training on a corpus of informal
continuous speech, it achieves good performance and generalizes to novel
words.  The distributed representations discovered by the network are
damage resistant, and recovery from damage is about ten times faster
than the original learning starting from the same level of performance.

        Terry Sejnowski

------------------------------

Date: Thu, 16 Jan 86 12:53 EST
From: Mark Beutnagel <Beutnagel%upenn.csnet@CSNET-RELAY.ARPA>
Subject: speech learning machine

The speech learning machine referred to in a recent AIList is almost
certainly a connection machine built by Terry Sejnowski.
The system consists of maybe 200 processing elements (or simulations of
such) and weighted connections between them.  Input is a small window of
text (5 letters?) and output is phonemes.  The system learns (i.e.,
modifies weights) based on a comparison of the predicted phoneme with
the "correct" phoneme.  After running overnight the output was
recognizable speech -- good, but still slightly mechanical.  Neat stuff,
but nothing mystical.

        -- Mark Beutnagel

(The above is my recollection of Terry's talk here at UPenn last fall,
so don't quote me.)

------------------------------

Date: Sun 19 Jan 86 12:31:31-PST
From: Ken Laws <Laws@SRI-AI.ARPA>
Subject: Speech Learning

I'll have a try at summarizing Terry's talk at Stanford/CSLI:

The speech learning machine is a three-layer "perceptron-like" network.
The bottom layer of 189 "processing units" simply encodes a 7-character
window of input text: each character (or space) activates one of 27
output lines and suppresses the other 26 lines.

The top, or output, layer represents a "coarse coding" of the phoneme
(or silence) which should be output for the character at the center of
the 7-character window.  Each bit, or output line, of the top layer
represents some phoneme characteristic: vowel/consonant, voiced,
fricative, etc.  Each legal phoneme is thus represented by a particular
output pattern, but some output patterns might not correspond to legal
phonemes.  (I think they were mapped to silence in the recording.)  The
output was used for two purposes: to compute a feedback error signal
used in training the machine, and to feed the output stage of a DECtalk
speech synthesizer so that the output could be judged subjectively.

The heart of the system is a "hidden layer" of about 200 processing
units, together with several thousand interconnections and their
weights.  These connect the 189 first-level outputs to the small number
of output processing units.  It is the setting of the weight
coefficients for this network that is the central problem.
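[The 27-lines-per-character input encoding described above can be sketched in a few lines of code.  This is an editorial illustration based only on this summary, not Sejnowski's implementation; the alphabet ordering and function names are assumptions.  -- Ed.]

```python
# One-hot encoding of a 7-character text window into 7 * 27 = 189 input
# units: each character activates one of its 27 lines (26 letters plus
# space) and suppresses the other 26.  Illustrative sketch only.

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # 27 symbols; space marks word breaks

def encode_window(window):
    """Return a 189-element 0/1 vector for a 7-character window."""
    assert len(window) == 7
    units = []
    for ch in window.lower():
        line = [0] * len(ALPHABET)
        line[ALPHABET.index(ch)] = 1       # one line active per position
        units.extend(line)
    return units
```

[The network's target phoneme is the one for the character at the center of the window, i.e. `window[3]`.]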
Input to the system was a page of a child's speech that had been
transcribed in phonetic notation by a professional.  Correspondence had
been established between each input letter and the corresponding phoneme
(or silence), and the coarse coding of the phonemes was known.  For any
possible output of the machine it was thus possible to determine which
bits were correct and which were incorrect.  This provided the error
signal.

Unlike the Boltzmann Machine or the Hopfield networks, Sejnowski's
algorithm does not require symmetric excitatory/inhibitory connections
between the processing units -- the output computation is strictly
feed-forward.  Neither did this project require simulated annealing,
although some form of stochastic training or of "inverse training" on
wrong inputs might be helpful in avoiding local minima in the weight
space.

What makes this algorithm work, and what makes it different from
multilayer perceptrons, is that the processing nodes do not perform a
threshold binarization.  Instead, the output of each unit is a sigmoid
function of the weighted sum of its inputs.  The sigmoid function, an
inverse exponential, is essentially the same one used in the Boltzmann
Machine's stochastic annealing; it also resembles the response curve of
neurons.  Its advantage over a threshold function is that it is
differentiable.  This permits the error signal to be propagated back
through each processing unit so that appropriate "blame" can be
attributed to each of the hidden units and to each of the connections
feeding the hidden units.  The back-propagated error signals are exactly
the partial derivatives needed for steepest-descent optimization of the
network.

Subjective results: The output of the system for the page of text was
originally just a few random phonemes with no information content.
After sufficient training on the correct outputs the machine learned to
"babble" with alternating vowels or vowel/consonants.
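[The differentiable sigmoid and the back-propagated "blame" described above can be illustrated with a toy two-layer network.  This is an editorial sketch following the description in this digest, not the NETtalk code; the function names and learning rate are assumptions.  -- Ed.]

```python
# One steepest-descent training step for a tiny feed-forward network of
# sigmoid units.  Because sigmoid'(net) = y * (1 - y), the output error
# can be propagated back to assign blame to each hidden unit and weight.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hid, w_out, rate=0.5):
    # Forward pass: strictly feed-forward, no symmetric connections needed.
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hid]
    y = sigmoid(sum(wi * hi for wi, hi in zip(w_out, h)))
    # Backward pass: the deltas are the partial derivatives of the
    # squared error with respect to each unit's net input.
    d_out = (y - target) * y * (1 - y)
    for j, hj in enumerate(h):
        d_hid = d_out * w_out[j] * hj * (1 - hj)   # blame for hidden unit j
        w_out[j] -= rate * d_out * hj              # update output weight
        for i, xi in enumerate(x):
            w_hid[j][i] -= rate * d_hid * xi       # update hidden weights
    return y
```

[Repeating such steps drives the output toward its target -- the steepest-descent optimization that the back-propagated error signals make possible.]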
After further training it discovered word divisions and then began to be
intelligible.  It could eventually read the page quite well, with a
distinctly childish accent but with mechanical pacing of the phonemes.
It was then presented with a second page of text and was able to read
that quite well also.

I have seen some papers by Sejnowski, Kienker, Hinton, Schumacher,
Rumelhart, and Williams exploring variations of this machine learning
architecture.  Most of the work has concerned very simple, but
difficult, problems, such as learning to compute exclusive OR or the sum
of two two-bit numbers.  More complex tasks involved detecting
symmetries in binary matrices and computing figure/ground (or
segmentation) relationships in noisy images with an associated focus of
attention.  I find the work promising and even exciting.

                                        -- Ken Laws

------------------------------

End of AIList Digest
********************