[mod.ai] AIList Digest V4 #10

AIList-REQUEST@SRI-AI.ARPA (AIList Moderator Kenneth Laws) (01/21/86)

AIList Digest            Monday, 20 Jan 1986       Volume 4 : Issue 10

Today's Topics:
  Machine Learning - Connectionist Speech Machine

----------------------------------------------------------------------

Date: Wed, 15 Jan 86 23:06 EST
From: Tim Finin <Tim%upenn.csnet@CSNET-RELAY.ARPA>
Subject: nettalk


Several people inquired about the work of Terrence Sejnowski (of Johns
Hopkins) which was reported on the Today show recently.  This abstract is
for a talk given by Sejnowski here at Penn in October '85:

             NETTALK: TEACHING A MASSIVELY-PARALLEL NETWORK TO TALK

                             TERRENCE J. SEJNOWSKI
                             BIOPHYSICS DEPARTMENT
                           JOHNS HOPKINS UNIVERSITY
                              BALTIMORE, MARYLAND

Text to speech is a difficult problem for rule-based systems because
English pronunciation is highly context dependent and there are many
exceptions to phonological rules.  A more suitable knowledge
representation for correspondences between letters and phonemes will be
described in which rules and exceptions are treated uniformly and can
be determined with a learning algorithm.  The architecture is a layered
network of several hundred simple processing units with several
thousand weights on the connections between the units.  The training
corpus is continuous informal speech transcribed from tape recordings.
Following training on 1000 words from this corpus, the network can
generalize to novel text.  Even though this network was not designed to
mimic human learning, the development of the network in some respects
resembles the early stages in human language acquisition.  It is
conjectured that the parallel architecture and learning algorithm will
also be effective on other problems which depend on evidential
reasoning from previous experience.

(No - I don't have his net address.  Tim.)

------------------------------

Date: 16 Jan 86  1225 PST
From: Richard Vistnes <RV@SU-AI.ARPA>
Subject: Johns Hopkins learning machine: info

See AIList Digest V3 #183 (10 Dec 1985) for a talk given at Stanford
a little while ago that sounds very similar.  The person is:

    Terrence J. Sejnowski
    Biophysics Department
    Johns Hopkins University
    Baltimore, MD 21218

(I didn't attend the talk).
        -Richard Vistnes

------------------------------

Date: Sun, 19 Jan 86 0:19:10 EST
From: Terry Sejnowski <terry@hopkins-eecs-bravo.ARPA>
Subject: Reply to Inquiries

        NBC ran a short segment last Monday, January 13, on the
Today Show about my research on a connectionist model of text-to-speech.
The segment was meant for a general audience (waking up)
and all the details were left out, so here is an abstract for
those who have asked for more information.  A technical report is
available (Johns Hopkins Electrical Engineering and Computer Science
Technical Report EECS-8601) upon request.

        NETtalk: A Parallel Network that Learns to Read Aloud

                Terrence Sejnowski
                Department of Biophysics
                Johns Hopkins University
                Baltimore, MD 21218

                Charles R. Rosenberg
                Department of Psychology
                Princeton University
                Princeton, NJ 08540

Unrestricted English text can be converted to speech by applying
phonological rules and handling exceptions with a look-up table.
However, this approach is highly labor intensive since each entry
and rule must be hand-crafted.  NETtalk is an alternative approach
that is based on an automated learning procedure for a parallel
network of deterministic processing units.  After training on a
corpus of informal continuous speech, it achieves good performance
and generalizes to novel words.  The distributed representations
discovered by the network are damage resistant and recovery from
damage is about ten times faster than the original learning
starting from the same level of performance.


Terry Sejnowski

------------------------------

Date: Thu, 16 Jan 86 12:53 EST
From: Mark Beutnagel <Beutnagel%upenn.csnet@CSNET-RELAY.ARPA>
Subject: speech learning machine

The speech learning machine referred to in a recent AIList is almost
certainly a connectionist machine built by Terry Sejnowski.  The system
consists of maybe 200 processing elements (or simulations of such)
and weighted connections between them.  Input is a small window of
text (5 letters?) and output is phonemes.  The system learns (i.e.
modifies weights) based on a comparison of the predicted phoneme with
the "correct" phoneme.  After running overnight the output was
recognizable speech--good but still slightly mechanical.  Neat stuff
but nothing mystical.

-- Mark Beutnagel  (The above is my recollection of Terry's talk here
                    at UPenn last fall so don't quote me.)

------------------------------

Date: Sun 19 Jan 86 12:31:31-PST
From: Ken Laws <Laws@SRI-AI.ARPA>
Subject: Speech Learning

I'll have a try at summarizing Terry's talk at Stanford/CSLI:

The speech learning machine is a three-layer "perceptron-like"
network.  The bottom layer of 189 "processing units" simply encodes a
7-character window of input text: each character (or space) activates
one of 27 output lines and suppresses 26 other lines.
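
A minimal sketch of that encoding in Python (the alphabet string and
all names below are my own illustration; the talk gave no code):

    # Hypothetical sketch of the one-hot input encoding described above.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 26 letters plus space = 27

    def encode_window(window):
        """Turn a 7-character window into 7 * 27 = 189 binary input lines."""
        assert len(window) == 7
        lines = []
        for ch in window:
            one_hot = [0] * len(ALPHABET)
            one_hot[ALPHABET.index(ch)] = 1    # one line active, the rest off
            lines.extend(one_hot)
        return lines    # 189 values, exactly 7 of them equal to 1

For example, encode_window(" hello ") yields a 189-element 0/1 vector
with one active line per window position.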

The top, or output, layer represents a "coarse coding" of the phoneme
(or silence) which should be output for the character at the center
of the 7-character window.  Each bit, or output line, of the top layer
represents some phoneme characteristic: vowel/consonant, voiced,
fricative, etc.  Each legal phoneme is thus represented by a particular
output pattern, but some output patterns might not correspond to legal
phonemes.  (I think they were mapped to silence in the recording.)
The output was used for two purposes: to compute a feedback error signal
used in training the machine, and to feed the output stage of a DECtalk
speech synthesizer so that the output could be judged subjectively.
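
To illustrate the coarse coding, here is a toy feature table; the
feature names and phoneme entries are invented for the example and are
not Sejnowski's actual set:

    # Each output line stands for one articulatory feature; a phoneme is
    # the pattern of features it turns on.  Entries below are made up.
    FEATURES = ["vowel", "voiced", "fricative", "nasal", "stop"]

    PHONEME_CODES = {
        "aa": {"vowel", "voiced"},     # as in "father"
        "s":  {"fricative"},
        "z":  {"fricative", "voiced"},
        "-":  set(),                   # silence: no features active
    }

    def decode_output(bits):
        """Map an output bit pattern back to a legal phoneme, if any."""
        active = {f for f, b in zip(FEATURES, bits) if b}
        for phoneme, code in PHONEME_CODES.items():
            if code == active:
                return phoneme
        return "-"    # illegal patterns fall back to silence, as guessed above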

The heart of the system is a "hidden layer" of about 200 processing
units, together with several thousand interconnections and their weights.
These connect the 189 first-level outputs to the small number of output
processing units.  It is the setting of the weight coefficients for this
network that is the central problem.
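
In code, the forward computation might look like the following sketch
(bias terms omitted for brevity; the sigmoid is discussed further
below):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))    # squashing function, see below

    def layer(inputs, weights):
        """One bank of processing units: weighted sum, then sigmoid."""
        return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
                for row in weights]

    def forward(input_lines, w_hidden, w_out):
        """Strictly feed-forward: 189 input lines -> hidden units -> outputs."""
        hidden = layer(input_lines, w_hidden)
        return hidden, layer(hidden, w_out)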

Input to the system was a page of a child's speech that had been transcribed
in phonetic notation by a professional.  Correspondence had been established
between each input letter and the corresponding phoneme (or silence), and
the coarse coding of the phonemes was known.  For any possible output of the
machine it was thus possible to determine which bits were correct and which
were incorrect.  This provided the error signal.
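
The error signal is thus just the per-bit difference between the target
coarse coding and the network's output; a one-line sketch:

    def error_signal(output, target):
        """Per-bit error against the transcribed phoneme's coarse coding."""
        return [t - o for t, o in zip(target, output)]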

Unlike the Boltzmann Machine or the Hopfield networks, Sejnowski's algorithm
does not require symmetric excitatory/inhibitory connections between the
processing units -- the output computation is strictly feed-forward.
Neither did this project require simulated annealing, although some form
of stochastic training or of "inverse training" on wrong inputs might be
helpful in avoiding local minima in the weight space.

What makes this algorithm work, and what makes it different from multilayer
perceptrons, is that the processing nodes do not perform a threshold
binarization.  Instead, the output of each unit is a sigmoid function of
the weighted sum of its inputs.  The sigmoid function, an inverse
exponential, is essentially the same one used in the Boltzmann Machine's
stochastic annealing; it also resembles the response curve of neurons.
Its advantage over a threshold function is that it is differentiable.
This permits the error signal to be propagated back through each
processing unit so that appropriate "blame" can be attributed to each
of the hidden units and to each of the connections feeding the hidden
units.  The back-propagated error signals are exactly the partial
derivatives needed for steepest-descent optimization of the network.
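
A bare-bones sketch of that back-propagation step, reconstructed from
the description above (not the exact update in the technical report):

    def sigmoid_prime(y):
        # If y = sigmoid(x), the derivative dy/dx is y * (1 - y), so the
        # slope is available directly from each unit's output.
        return y * (1.0 - y)

    def backprop_step(inputs, hidden, output, target, w_out, rate):
        """One steepest-descent step; returns weight changes for both layers."""
        # Output-layer blame: error times the local sigmoid slope.
        d_out = [(t - o) * sigmoid_prime(o) for t, o in zip(target, output)]
        # Propagate blame back through the output weights to each hidden unit.
        d_hid = [sigmoid_prime(h) *
                 sum(w_out[k][j] * d_out[k] for k in range(len(d_out)))
                 for j, h in enumerate(hidden)]
        # Weight changes are blame times upstream activity, times the rate.
        dw_out = [[rate * d * h for h in hidden] for d in d_out]
        dw_hid = [[rate * d * x for x in inputs] for d in d_hid]
        return dw_hid, dw_out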

Subjective results: The output of the system for the page of text was
originally just a few random phonemes with no information content.  After
sufficient training on the correct outputs the machine learned to "babble"
with alternating vowels or vowel/consonant pairs.  After further training it
discovered word divisions and then began to be intelligible.  It could
eventually read the page quite well, with a distinctly childish accent
but with mechanical pacing of the phonemes.  It was then presented with
a second page of text and was able to read that quite well also.

I have seen some papers by Sejnowski, Kienker, Hinton, Schumacher,
Rumelhart, and Williams exploring variations of this machine learning
architecture.  Most of the work has concerned very simple, but
difficult, problems, such as learning to compute exclusive OR or the
sum of two two-bit numbers.  More complex tasks involved detecting
symmetries in binary matrices and computing figure/ground (or
segmentation) relationships in noisy images with an associated focus
of attention.  I find the work promising and even exciting.

                                        -- Ken Laws

------------------------------

End of AIList Digest
********************