[comp.ai] Backpropagation applications

tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)

By approximation results, I assume you are referring to various proofs
that multi-layer feedforward networks can, with sufficient numbers of
hidden units, approximate any function arbitrarily closely.

Some people have jumped from these results to the conclusion that
neural networks can learn any function.  This is true, but only if
there is no bound on the amount of training data presented to the
learning system (and of course, no bound on the number of hidden
units).  In real applications, the important question is "Can learning
algorithm X learn my unknown function given that I have only M
training examples?"  In other words, how effectively does the learning
algorithm exploit its training data?  The approximation results
provide no insight into this question, which is why they are not very
useful.
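One way to make the concern concrete (my illustration, not from the article): over n-bit boolean inputs there are 2^(2^n) distinct functions, and M distinct training examples only pin down M of the output values, so an enormous number of functions remain that fit the training data perfectly:

```python
# Back-of-the-envelope count of how many boolean functions on n input
# bits remain consistent with m distinct training examples.
# (An illustrative calculation, not from the article.)
def consistent_functions(n, m):
    # Each of the 2**n possible inputs maps independently to 0 or 1;
    # m distinct examples fix m of those outputs.
    return 2 ** (2 ** n - m)

# With 10 input bits and 1000 training examples, 2**24 (over sixteen
# million) functions still fit the training data exactly.
print(consistent_functions(10, 1000))
```

The training data alone cannot distinguish among these candidates; whatever the learner does beyond the examples comes from its inductive bias, which is why approximation results say nothing about how well a given algorithm exploits M examples.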

For further details, see "Limitations on Inductive Learning",
Proceedings of the Sixth International Conference on Machine Learning
(available from Morgan-Kaufmann Publishers, San Mateo, CA).

--Tom Dietterich

tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)

Your accuracy claims for NETtalk are greatly exaggerated.  I have
replicated the NETtalk study using the same training data.  In this
case, I trained on 1000 words chosen at random from the 20000-word
dictionary provided by Sejnowski.

After running back propagation for 30 epochs using the parameters
given in Sejnowski and Rosenberg (1986), I obtain the following
results.  Testing is performed on a randomly chosen test set of 1000
words.


                              WORDS  LETTERS (PHON/STRESS)  BITS 
------------------------------------------------------------------
BP                    TRAIN:  65.3    94.0     97.0  96.4    99.5
                      TEST :  14.9    71.6     81.8  81.4    96.7

Numbers give percentage of correct performance:

  TRAIN: performance on the training set
  TEST: performance on the test set
  WORDS: performance on whole words (i.e., every letter in the word
    must be correct)
  LETTERS: performance on whole letters (both phoneme and stress
    correctly decoded)
  PHONEME: performance on the 21 phoneme bits
  STRESS: performance on the 5 stress bits
  BITS: average performance on the 26 output bits of the network
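The gap between the WORDS and LETTERS figures follows from the scoring rule above; a sketch of the word-level tally (my code, not from the study):

```python
# Word-level accuracy from per-letter correctness, as described above:
# a word counts as correct only if every one of its letters is correct.
# (Illustrative sketch; the function name and data layout are mine.)
def word_accuracy(per_letter_results):
    # per_letter_results: one list of booleans per word,
    # True = that letter's phoneme/stress was decoded correctly.
    correct = sum(all(letters) for letters in per_letter_results)
    return 100.0 * correct / len(per_letter_results)

# Two of three words have every letter right:
print(word_accuracy([[True, True], [True, False], [True]]))
```

Even 71.6% letter accuracy compounds quickly: if each letter of a seven-letter word were independently right 71.6% of the time, the whole word would be right only about 0.716**7, roughly 10%, of the time (letters are not independent, so the observed 14.9% is a bit higher).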

The NETtalk network has 120 hidden units, 203 input units (which code,
very sparsely, a 7-letter window), and 26 output units (which code, in
a distributed fashion, the 54 phonemes and 6 stresses).  The 26 output
bits are mapped to the nearest phoneme/stress combination that was
observed in the training data.  (That is, a pass is made over the
training data to find all phoneme/stress pairs appearing in the data;
decoding considers only those pairs, and ties are broken in favor of
the phoneme/stress pair that appeared more frequently.)  This decoding
scheme is superior to decoding to the nearest syntactically legal
phoneme/stress pair.
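A sketch of that decoding step (my reconstruction; the distance metric and data layout are assumptions, not the original code):

```python
from collections import Counter

def build_codebook(training_pairs):
    # training_pairs: ((phoneme, stress), bit-code) tuples -- one entry
    # per occurrence in the training data, so frequencies can be counted.
    freq = Counter(pair for pair, _ in training_pairs)   # tie-breaking counts
    codes = dict(training_pairs)                         # one code per pair
    return codes, freq

def decode(output, codes, freq):
    # Map the network's output vector to the nearest phoneme/stress
    # pair observed in training; ties favor the more frequent pair.
    # (Squared Euclidean distance is an assumption on my part.)
    def sq_dist(code):
        return sum((o - c) ** 2 for o, c in zip(output, code))
    return min(codes, key=lambda pair: (sq_dist(codes[pair]), -freq[pair]))
```

Restricting the candidates to pairs actually seen in training is the point: combinations that never occur in the data are never proposed, which is what lets this beat decoding to the nearest syntactically legal pair.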


--Tom Dietterich

tgd@orstcs.CS.ORST.EDU (Tom Dietterich) (11/09/89)

> furthermore, workers at los alamos used the same training set as a toy
> problem for one of their very early non-linear interpolation codes.
> after a _single_ pass through the training set, their program
> performed perfectly on the training material and had lower than a 5%
> error rate on the novel material.  they didn't publish this work
> because they thought that sejnowski's work was over-sensationalized
> and too trivially replicable by conventional means.

If this is true, I'd be very interested in seeing the results.
One often hears rumors of great things happening at Los Alamos.
I'd like to see the work peer-reviewed and published.  Until then
it is just a rumor.

--Tom Dietterich
Editor, Machine Learning Journal

bwk@mbunix.mitre.org (Kort) (11/09/89)

In article <16790@bcsaic.UUCP> bmartin@bcsaic.UUCP (Brett Martin) writes:

 > Is _Apprentices of Wonder: Inside the Neural Network Revolution_ the
 > correct title?  Our library could only find _Apprentices of Wonder:
 > Reinventing the Mind_.

That's what it says on the book jacket.  I have loaned out my copy,
so I can't double-check the title page.  But it's probably one
and the same book.

--Barry Kort

heck@Sunburn.Stanford.EDU (Stefan P. Heck) (11/10/89)

According to Rumelhart in his ANN/PDP class here, NETtalk was trained on a
set of the 1000 most common words rather than a random set.  This run took
overnight to train.  They later also did a second test using 10,000 words.
I don't know which run the accuracy figures are for, but supposedly it got
87% right except on irregular words.  The best competitor at the time was
about 89% accurate.  Human capability was estimated at 96%.

Stefan
CSD

hougen@umn-cs.CS.UMN.EDU (Dean Hougen) (11/10/89)

In article <13659@orstcs.CS.ORST.EDU> tgd@orstcs.CS.ORST.EDU (Tom Dietterich) writes:
>Your accuracy claims for NETtalk are greatly exaggerated.  I have
>replicated the NETtalk study using the same training data.  In this
>case, training on 1000 words chosen at random from the 20000-word
                                        ^^^^^^
>dictionary provided by Sejnowski.
>Testing is performed on a randomly chosen test set of 1000 words.
                           ^^^^^^^^

I was under the impression that Sejnowski had NETtalk read real sentences in
real paragraphs, not randomly ordered words.  Right?

BTW, did you present the input as one long string of characters with the
words separated by a single space, did you present the words one at a time
(i.e. as a long string of characters with the words separated by three or
more spaces), or did you do something else (what?)?

I'll leave you to determine what effect any of this could have on NETtalk's
performance.

Dean Hougen
--
"Stop making sense.  Stop making sense.  Stop making sense, making sense."
    - Talking Heads, "Stop Making Sense," _Stop Making Sense_  

tgd@aramis.rutgers.edu (Tom Dietterich) (11/13/89)

  heck@Sunburn.Stanford.EDU (Stefan P. Heck) writes:

  According to Rumelhart in his ANN/PDP class here, NETtalk was trained on a
  set of the 1000 most common words rather than a random set.  This run took
  overnight to train.  They later also did a second test using 10,000 words.
  I don't know which run the accuracy figures are for, but supposedly it got
  87% right except on irregular words.  The best competitor at the time was
  about 89% accurate.  Human capability was estimated at 96%.

I have also run the algorithm on the 1000 most common words.  The
results are quite similar to those I reported for 1000 randomly
selected words.  Testing is performed on the remaining 19000 words in
the dictionary.


                               WORDS  LETTERS (PHON/STRESS)  BITS  
-------------------------------------------------------------------
BP                     TRAIN:    76.6    94.8   97.1   97.3   99.6
120 hidden units       TEST :    13.4    68.1   78.7   80.0   96.0


Sejnowski and Rosenberg also trained and tested NETtalk on a corpus
of connected conversational speech.  I don't have access to that data,
so I haven't replicated that part of their study.

In my work (and in the S&R original), the 1000 most common words are
presented one-at-a-time surrounded by blanks.
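The presentation scheme can be sketched as follows (an illustrative reconstruction; the 7-letter window and blank padding are from the description above, but the code itself is mine):

```python
# Present a word one letter at a time through a 7-letter window,
# with blanks filling positions that fall outside the word.
# (Illustrative sketch; the function name and defaults are mine.)
def windows(word, size=7, pad=' '):
    # One size-letter window per letter, target letter centered.
    half = size // 2
    padded = pad * half + word + pad * half
    for i in range(len(word)):
        yield padded[i:i + size]

# 'cat' presented surrounded by blanks, one window per letter:
for w in windows('cat'):
    print(repr(w))
```

Each window position is then coded sparsely on the input units; 203 input units over 7 positions works out to 29 codes per position (the 26 letters plus punctuation, in Sejnowski and Rosenberg's encoding).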


Thomas G. Dietterich
Department of Computer Science
Computer Science Bldg, Room 100
Oregon State University
Corvallis, OR 97331-3902