tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)
By approximation results, I assume you are referring to various proofs
that multi-layer feedforward networks can, with sufficient numbers of
hidden units, approximate any function arbitrarily closely. Some people
have jumped from these results to the conclusion that neural networks
can learn any function. This is true, but only if there is no bound on
the amount of training data presented to the learning system (and, of
course, no bound on the number of hidden units).

In real applications, the important question is "Can learning algorithm
X learn my unknown function given that I have only M training
examples?" In other words, how effectively does the learning algorithm
exploit its training data? The approximation results provide no insight
into this question, which is why they are not very useful.

For further details, see "Limitations on Inductive Learning",
Proceedings of the Sixth International Conference on Machine Learning
(available from Morgan-Kaufmann Publishers, San Mateo, CA).

--Tom Dietterich
tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)
Your accuracy claims for NETtalk are greatly exaggerated. I have
replicated the NETtalk study using the same training data, in this
case training on 1000 words chosen at random from the 20000-word
dictionary provided by Sejnowski. After running back propagation for
30 epochs using the parameters given in Sejnowski and Rosenberg (1986),
I obtain the following results. Testing is performed on a randomly
chosen test set of 1000 words.

                WORDS  LETTERS  (PHON / STRESS)  BITS
------------------------------------------------------------------
BP TRAIN:        65.3     94.0    97.0     96.4  99.5
   TEST :        14.9     71.6    81.8     81.4  96.7

Numbers give percentage of correct performance.

Explanation:
  TRAIN:    performance on the training set
  TEST:     performance on the test set
  BITS:     average performance on the 26 output bits of the network
  STRESS:   performance on the 5 stress bits
  PHONEME:  performance on the 21 phoneme bits
  LETTERS:  performance on all 26 bits
  WORDS:    performance on whole words (i.e., each letter must be correct)

The NETtalk network has 120 hidden units, 203 input units (that code,
very sparsely, a 7-letter window), and 26 output units (that code in a
distributed fashion the 54 phonemes and 6 stresses). The 26 output
bits are mapped to the nearest phoneme/stress combination that was
observed in the training data. (That is, a pass was made over the
training data to find all phoneme/stress pairs appearing in the data;
decoding considers only those pairs, and ties are broken in favor of
the phoneme/stress pair that appeared more frequently.) This decoding
scheme is superior to decoding to the nearest syntactically legal
phoneme/stress pair.

--Tom Dietterich
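[Moderator's note: the decoding scheme Tom describes can be sketched in a few
lines. This is only an illustration, not the actual replication code; the data
layout and function names here are invented. Given the phoneme/stress pairs
observed in the training data (with their 26-bit target codes and frequencies),
a real-valued output vector is mapped to the nearest observed pair, with ties
broken in favor of the more frequent pair.]

```python
from collections import Counter

def build_codebook(training_pairs):
    """training_pairs: iterable of (label, code), where label is a
    (phoneme, stress) pair and code is its fixed bit-vector target.
    Returns {label: (code, count)} for every pair seen in training."""
    counts = Counter(label for label, _ in training_pairs)
    codes = dict(training_pairs)  # each label always has the same code
    return {lab: (codes[lab], counts[lab]) for lab in counts}

def decode(output, codebook):
    """Map a real-valued network output vector to the nearest observed
    phoneme/stress pair; ties go to the more frequent pair."""
    def distance(code):
        return sum((o - c) ** 2 for o, c in zip(output, code))
    # Minimize (distance, -count): closer codes win, and among equally
    # close codes the more frequently observed pair wins.
    return min(codebook, key=lambda lab: (distance(codebook[lab][0]),
                                          -codebook[lab][1]))
```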
tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)
> furthermore, workers at los alamos used the same training set as a toy
> problem for one of their very early non-linear interpolation codes.
> after a _single_ pass through the training set, their program
> performed perfectly on the training material and had lower than a 5%
> error rate on the novel material. they didn't publish this work
> because they thought that sejnowski's work was over-sensationalized
> and too trivially replicable by conventional means.

If this is true, I'd be very interested in seeing the results. One
often hears rumors of great things happening at Los Alamos. I'd like
to see the work peer-reviewed and published. Until then it is just a
rumor.

--Tom Dietterich
Editor, Machine Learning Journal
bwk@mbunix.mitre.org (Kort) (11/09/89)
In article <16790@bcsaic.UUCP> bmartin@bcsaic.UUCP (Brett Martin) writes:

> Is _Apprentices of Wonder: Inside the Neural Network Revolution_ the
> correct title? Our library could only find _Apprentices of Wonder:
> Reinventing the Mind_.

That's what it says on the book jacket. I have loaned out my copy, so
I can't double-check the title page. But it's probably one and the
same book.

--Barry Kort
heck@Sunburn.Stanford.EDU (Stefan P. Heck) (11/10/89)
According to Rumelhart in his ANN/PDP class here, NETtalk was trained
on a set of the 1000 most common words rather than a random set. This
run took overnight to learn. They later also did a second test using
10,000 words. I don't know which run the accuracy figures are for, but
supposedly it got 87% right except on words which were irregular. The
best competitor at the time was about 89% accurate. Human capability
was estimated at 96%.

Stefan
CSD
hougen@umn-cs.CS.UMN.EDU (Dean Hougen) (11/10/89)
In article <13659@orstcs.CS.ORST.EDU> tgd@orstcs.CS.ORST.EDU (Tom Dietterich) writes:

>Your accuracy claims for NETtalk are greatly exaggerated. I have
>replicated the NETtalk study using the same training data. In this
>case, training on 1000 words chosen at random from the 20000-word
                                        ^^^^^^
>dictionary provided by Sejnowski.
>Testing is performed on a randomly chosen test set of 1000 words.
                           ^^^^^^^^

I was under the impression that Sejnowski had NETtalk read real
sentences in real paragraphs, not randomly ordered words. Right?

BTW, did you present the input as one long string of characters with
the words separated by a single space, or did you present the words
one at a time (i.e., as a long string of characters with the words
separated by three or more spaces), or did you do something else
(what?)? I'll leave you to determine what effect any of this could
have on NETtalk's performance.

Dean Hougen
--
"Stop making sense. Stop making sense. Stop making sense, making
sense." - Talking Heads, "Stop Making Sense," _Stop Making Sense_
tgd@aramis.rutgers.edu (Tom Dietterich) (11/13/89)
From: heck@Sunburn.Stanford.EDU (Stefan P. Heck) writes:

> According to Rumelhart in his ANN/PDP class here, NETtalk was trained
> on a set of the 1000 most common words rather than a random set.
> This run took overnight to learn. They later also did a second test
> using 10,000 words. I don't know which run the accuracy figures are
> for, but supposedly it got 87% right except on words which were
> irregular. The best competitor at the time was about 89% accurate.
> Human capability was estimated at 96%.

I have also run the algorithm on the 1000 most common words. The
results are quite similar to those I reported for 1000 randomly
selected words. Testing is performed on the remaining 19000 words in
the dictionary.

                WORDS  LETTERS  (PHON / STRESS)  BITS
-------------------------------------------------------------------
BP TRAIN:        76.6     94.8    97.1     97.3  99.6   120 hidden units
   TEST :        13.4     68.1    78.7     80.0  96.0

Sejnowski and Rosenberg also trained and tested NETtalk on a corpus of
connected conversational speech. I don't have access to that data, so
I haven't replicated that part of their study.

In my work (and in the S&R original), the 1000 most common words are
presented one at a time, surrounded by blanks.

Thomas G. Dietterich
Department of Computer Science
Computer Science Bldg, Room 100
Oregon State University
Corvallis, OR 97331-3902