[comp.sys.amiga.tech] Better Speech.

page@swan.ulowell.edu (Bob Page) (01/18/89)

sparks@corpane.UUCP (John Sparks) wrote:
>The synthetic voice Amy has now sux.

You're probably using the translator.library (the text-to-phoneme
library) which has a lot of rough spots, especially stress and
inter-phoneme pauses.  If you pick your own phonemes (or adjust what
comes from the translator.library) you'll be surprised at how much
better you can do with the narrator.device (the phoneme-to-speech
part).  Of course you need to spend some time with the parameters of
the narrator.device too (rate, pitch, sex, and so on).
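
Here's a rough sketch of the two pieces (my own, not compiled; includes
and error checking abbreviated, and the parameter values are just
illustrative defaults -- see the RKM: Devices chapter for the rest):

/* speak.c -- text to phonemes to speech (sketch only) */
#include <exec/types.h>
#include <exec/io.h>
#include <devices/narrator.h>
#include <string.h>

struct Library *TranslatorBase;
UBYTE chans[] = { 3, 5, 10, 12 };     /* legal audio channel pairs */

void speak(char *text)
{
    struct MsgPort *port;
    struct narrator_rb voice;
    UBYTE phon[256];

    TranslatorBase = OpenLibrary("translator.library", 0L);
    if (TranslatorBase == NULL) return;
    Translate((STRPTR)text, (LONG)strlen(text), phon, 256L);
    CloseLibrary(TranslatorBase);

    /* ...adjust phon[] here: stress digits, pauses, etc... */

    memset((char *)&voice, 0, sizeof(voice));
    port = CreatePort(NULL, 0L);
    voice.message.io_Message.mn_ReplyPort = port;
    voice.ch_masks = chans;
    voice.nm_masks = sizeof(chans);
    if (OpenDevice("narrator.device", 0L,
                   (struct IORequest *)&voice, 0L) == 0) {
        voice.rate  = 150;            /* words per minute */
        voice.pitch = 110;            /* baseline pitch   */
        voice.message.io_Command = CMD_WRITE;
        voice.message.io_Data    = (APTR)phon;
        voice.message.io_Length  = (LONG)strlen((char *)phon);
        DoIO((struct IORequest *)&voice);   /* speaks synchronously */
        CloseDevice((struct IORequest *)&voice);
    }
    DeletePort(port);
}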

>It sounds like a transvestite swedish chef.

Some of my best friends are transvestite Swedish chefs.	:-)  It also
sounds like it's holding its nose.

>The Amiga has enough potential power to have a really nice synthesized 
>...
>digital samples of his voice to make the phonemes. It needs a lot of 

Well, if it's sampled, it isn't synthesized.

>should be able to be as good as say, DECTALK, with a little bit of work. 

And a lot of custom DSP hardware, a ton of memory, and lots of CPU
time.  DECTALK uses two pronunciation dictionaries, a library of
letter-to-sound rules, and a large rules database for intonation,
duration, and word stress, as well as such stuff as "breathiness", "head
size" and "laryngealization".  It converts the phonemes to control
messages every 6.4ms, and sends this information through the D/A at
10k samples/sec.

You need a synthesizer to do all this; a digitized phoneme library
won't cut it, as you lose all the stresses and other nuances of
speech.

I'm not defending the current Amiga's voice capabilities, just
pointing out that building a great generalized text-to-voice
capability in software is very hard.

..Bob
-- 
Bob Page, U of Lowell CS Dept.  page@swan.ulowell.edu  ulowell!page
Have five nice days.

jesup@cbmvax.UUCP (Randell Jesup) (01/18/89)

In article <11263@swan.ulowell.edu> page@swan.ulowell.edu (Bob Page) writes:
>>It sounds like a transvestite swedish chef.
>
>Some of my best friends are transvestite Swedish chefs.	:-)  It also
>sounds like it's holding its nose.

	Try it with a 4-bitplane 640 screen up front: it sounds like it's
holding its nose AND is under water.  :-)  Plus it tends to skip words, just
for extra amusement (only in 640, 4 bitplane).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

mjw@f.gp.cs.cmu.edu (Michael Witbrock) (01/19/89)

Using digitised phones to generate Amiga's speech would make things worse,
not better. Guaranteed. 

However, there may be another way to skin this cat. 

In a recent tech report, Terry Sejnowski at JHU described an experiment with 
an artificial 'neural' network which learns to pronounce English words. 
The paper is:

@techreport{SEJNOWSKI86A,
  key         = "Sejnowski",
  author      = "Terrence Sejnowski and Charles Rosenberg",
  title       = "NETtalk: A Parallel Network that Learns to Read Aloud",
  institution = "Johns Hopkins University",
  number      = "JHU-EECS-86-01",
  year        = "1986",
  keywords    = "applications, NETtalk, speech generation",
  annote      = "Backprop was used to teach a 3-layer network how to
                 pronounce textual input.  Input layer = 7-character
                 window of local chars; output layer = distributed
                 phoneme features.  Impressive performance; improved
                 by more hidden units, an extra hidden layer, and/or
                 using pure dictionary words (rather than inconsistent
                 real-life data).  Damage-resistant, since
                 distributed.  [jmt/***]",
  bibdate     = "Wed Mar  2 11:10:30 1988",
}

- The network takes a sliding window over the text and outputs a description
of the phone (voiced/unvoiced, dental/palatal, etc.) required to utter it.
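
For concreteness, here is a toy sketch (mine, in C) of the forward pass
that bullet describes.  The layer sizes follow the TR (29 symbols x
7-character window = 203 inputs, 80 hidden units, 26 feature outputs);
the weight arrays are assumed to have been filled in by backprop
training elsewhere:

#include <math.h>

#define WIN   7
#define SYMS  29                /* 26 letters plus punctuation/space */
#define NIN   (WIN * SYMS)      /* 203 input units   */
#define NHID  80                /* hidden units      */
#define NOUT  26                /* phoneme feature units */

double w1[NHID][NIN], b1[NHID];   /* filled in by training (backprop) */
double w2[NOUT][NHID], b2[NOUT];

static double squash(double x) { return 1.0 / (1.0 + exp(-x)); }

/* window[] holds 7 symbol codes (0..28); out[] gets the feature
 * activations (voiced, dental, ...) for the phone at the center */
void nettalk_forward(int window[], double out[])
{
    double in[NIN], hid[NHID], s;
    int i, j;

    for (i = 0; i < NIN; i++) in[i] = 0.0;
    for (i = 0; i < WIN; i++)             /* one-hot encoding */
        in[i * SYMS + window[i]] = 1.0;

    for (j = 0; j < NHID; j++) {          /* input -> hidden */
        s = b1[j];
        for (i = 0; i < NIN; i++) s += w1[j][i] * in[i];
        hid[j] = squash(s);
    }
    for (j = 0; j < NOUT; j++) {          /* hidden -> output */
        s = b2[j];
        for (i = 0; i < NHID; i++) s += w2[j][i] * hid[i];
        out[j] = squash(s);
    }
}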

From my experience working with connectionist speech recognition, I have
little doubt that the network could be extended to learn to produce the
actual waveform (or LPC coefficients or whatever) for the speech.
Perhaps C-A would like to offer me a summer job trying :-).

In any case, if C-A is thinking of improving the speech on the Amiga, it
would certainly be worth their while to get a copy of this TR.

Michael

-- 
Michael.Witbrock@cs.cmu.edu mjw@cs.cmu.edu                          \
US Mail: Michael Witbrock/ Dept of Computer Science                  \
         Carnegie Mellon University/ Pittsburgh PA 15213/ USA        /\
Telephone : (412) 268 3621 [Office]  (412) 441 1724 [Home]          /  \

dykimber@phoenix.Princeton.EDU (Daniel Yaron Kimberg) (01/19/89)

In article <4050@pt.cs.cmu.edu> mjw@f.gp.cs.cmu.edu (Michael Witbrock) writes:
>Using digitised phones to generate Amiga's speech would make things worse,
>not better. Guaranteed. 
>
>However, there may be another way to skin this cat. 
>
>In a recent tech report, Terry Sejnowski at JHU described an experiment with 
>an artificial 'neural' network which learns to pronounce English words. 
[description of NETTalk techreport ref deleted]

When I heard NETtalk, it wasn't a heck of a lot better sounding than the
Amiga in terms of intonation.  Its main asset is that it's good at learning
what sounds to make based on a window of text (though it doesn't always
sound like a native speaker).  Perhaps a better way to apply backprop
learning to getting the Amiga to talk would be to train it on a corpus of
phonetic transcriptions, to learn intonation.  Or the two could be used in
tandem.  The question is really whether or not it'll work in real time.  I
don't think it will.

                                                  -Dan

rusty@hocpa.UUCP (M.W.HADDOCK) (01/20/89)

Is there a switch on SPEAK:  or say that will turn off the expansion of
abbreviations?  For example, the sentence

	I am the Wizard of Oz.

comes out sounding like 

	I am the wizard of ounces.

I know you're saying that the need for Oz at the end of a sentence is
minuscule, but there is at least Dr. (read as doctor when I wanted drive).
There are other abbreviations, but you catch my drift.  Right now, the only
way I've come up with to defeat this is to put a space before the
terminating period.  Ah, but this means I have to "edit" my incoming text.
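
If I'm going to edit the text anyway, the trick is at least easy to
automate.  A minimal sketch (purely hypothetical, and it blindly
defeats *every* sentence-final abbreviation, wanted or not):

/* spacedot.c -- insert a space before sentence-final periods so
 * the translator's abbreviation expansion never fires */
#include <stdio.h>

int main(void)
{
    int c, next;

    while ((c = getchar()) != EOF) {
        if (c != '.') { putchar(c); continue; }
        next = getchar();
        if (next == EOF || next == ' ' || next == '\n')
            putchar(' ');           /* "Oz." becomes "Oz ."   */
        putchar('.');
        if (next == EOF) break;
        putchar(next);              /* "3.14" passes through  */
    }
    return 0;
}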

				-Rusty-
				RAH4STIY /HAE4DAAK
-- 
Rusty Haddock		{uunet!likewise,att,arpa}!hocpa!rusty
AT&T Consumer Products Laboratories - Human Factors Laboratory
Holmdel, New Joyzey 07733   (201) 834-1023  rusty@hocpa.att.com
** Genius may have its limitations but stupidity is not thus handicapped.

lee@uhccux.uhcc.hawaii.edu (Greg Lee) (01/20/89)

[How to get better sounding speech on the Amiga ...]

FM synthesis.  That's the way to go.  Here's the reasoning:

(1) In FM music synthesis, real instruments are mimicked in a way
that is very economical in terms of the parameter data that has
to be stored or transferred.  The trick is not to try to
reproduce the wave form of the real instrument, nor to try
to reproduce the steady-state frequency spectrum, but to
reproduce the rapid changes in harmonic complexity (among other
aspects of the note envelope).  Why does this work?  It must
be that human perception of instrumental sounds in fact has
more to do with transitions than with steady states or
frequency spectra per se.
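
For the curious, the core of the trick (after Chowning) fits in a few
lines of C; note how a single decaying modulation index sweeps the
harmonic complexity over the course of the note, which is just the
kind of transition I mean.  The numbers are illustrative, not tuned
for anything:

/* fm.c -- minimal FM sketch; writes one sample value per line */
#include <stdio.h>
#include <math.h>

#define RATE 22050.0               /* output sample rate, Hz */
#define PI   3.14159265358979

int main(void)
{
    double fc  = 440.0;            /* carrier frequency, Hz   */
    double fm  = 220.0;            /* modulator frequency, Hz */
    double dur = 0.5;              /* note length, seconds    */
    long   n, total = (long)(dur * RATE);
    double t, index, s;

    for (n = 0; n < total; n++) {
        t = n / RATE;
        /* modulation index decays: bright attack, mellow tail */
        index = 5.0 * exp(-4.0 * t);
        s = sin(2.0*PI*fc*t + index * sin(2.0*PI*fm*t));
        printf("%d\n", (int)(s * 32767.0));  /* 16-bit sample */
    }
    return 0;
}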

(2) There is some reason to think that in another area of
perception we really perceive changes in frequency spectrum
rather than spectra.  That is color perception.  Some
may recall Land's dramatic experiment reported in Scientific
American a few years ago.  He projected, in two frequencies of
red light, a picture of a *natural* scene photographed through
two corresponding red filters.  People saw it in full color.
Why does this work?  Color perception must concern the edges
in a scene between different spectra.

(3) Back to speech.  Conventional speech synthesis that is
not based on sampled speech works by reproducing the steady
state frequency spectra of vowels -- the formant structure --
in human speech, then stitching various vowel sounds together
with appropriate formant transitions, a little noise, etc.
If this corresponds to the way humans perceive speech, we
must have to change gears, so to speak, when we have been
listening to a man speak and we begin listening to a woman
or a child.  This is because the spectra of vowel sounds
are quite different due to differing sizes of the oral cavities
that produce the formants.  So, there should be a certain
adaptation time in these circumstances that we could measure.
In the book Three Areas of Experimental Phonetics, Peter Ladefoged
reports the results of an attempt to measure this adaptation
time.  Guess what he found.  I'll tell you -- zero.  Maybe
there's no adaptation time because there's no adaptation
involved in the first place. (Ladefoged does not draw this
conclusion, however.)
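
For reference, the standard building block of this conventional
approach is a two-pole digital resonator per formant (as in
Klatt-style synthesizers).  A minimal sketch, with illustrative
numbers:

/* resonator.c -- one formant as a two-pole IIR filter */
#include <math.h>

#define PI 3.14159265358979

typedef struct {
    double a, b, c;        /* filter coefficients   */
    double y1, y2;         /* previous two outputs  */
} Resonator;

/* tune to formant frequency f and bandwidth bw (both Hz) */
void res_init(Resonator *r, double f, double bw, double rate)
{
    r->c = -exp(-2.0 * PI * bw / rate);
    r->b = 2.0 * exp(-PI * bw / rate) * cos(2.0 * PI * f / rate);
    r->a = 1.0 - r->b - r->c;
    r->y1 = r->y2 = 0.0;
}

double res_filter(Resonator *r, double x)
{
    double y = r->a * x + r->b * r->y1 + r->c * r->y2;
    r->y2 = r->y1;
    r->y1 = y;
    return y;
}

/* A vowel is then a glottal pulse train pushed through three or
 * four of these in cascade -- roughly F1=700, F2=1200, F3=2600 Hz
 * for "ah" -- with the formant targets interpolated over time to
 * get the transitions mentioned above. */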

(4) So far as I know, FM speech synthesis has never been tried.
The upside of trying it is that you'd be breaking new
theoretical ground if you could get it to work.  I can think
of a couple of downsides.  One is that it cannot work -- the
above reasoning is all a priori, and may be wrong at that.
Another is that there is probably no practical way to do it
a step at a time.  You couldn't expect to get recognizable
speech until you got fairly close to the "right" parameter
values for a considerable stretch of speech, judging by the
analogy to Land's experiment, because there the effect of
color perception appears only for natural scenes.  And I
don't think you could find appropriate measurements of
human speech reported in the journal literature.  It would
be cut and try.

		Greg, lee@uhccux.uhcc.hawaii.edu

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (01/22/89)

In <506@hocpa.UUCP>, rusty@hocpa.UUCP (M.W.HADDOCK) writes:
>Is there a switch on SPEAK:  or say that will turn off the expansion of
>abbreviations?  For example, the sentence
>
>	I am the Wizard of Oz.
>
>comes out sounding like 
>
>	I am the wizard of ounces.
>
>I know you're saying that the need for Oz at the end of a sentence is
>minuscule, but there is at least Dr. (read as doctor when I wanted drive).
>There are other abbreviations, but you catch my drift.  Right now, the only
>way I've come up with to defeat this is to put a space before the
>terminating period.  Ah, but this means I have to "edit" my incoming text.

Chip Orange has written a rather nifty little utility called BetterSpeech. It
intercepts the output of the 'say' command and accesses a translation file
that stores your translator exceptions in pairs. Typical entries might look
like:

ascii
asky
copied
copeed
ieee
i triple e
ibm
brain dead pile of silicon scrap
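
The scheme is easy enough to approximate if you can't find the real
thing.  Here's a minimal sketch (not Chip's actual code, and the
exceptions file name is made up) that reads such pairs and substitutes
whole words on the way through:

/* pairfix.c -- sketch of pair-file word substitution */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAXPAIRS 64
#define MAXLEN   80

char from[MAXPAIRS][MAXLEN], to[MAXPAIRS][MAXLEN];
int  npairs = 0;

int same(char *a, char *b)                  /* case-insensitive */
{
    while (*a && *b && tolower(*a) == tolower(*b)) { a++; b++; }
    return *a == '\0' && *b == '\0';
}

void put_word(char *w)        /* emit a word, translated if listed */
{
    int i;
    for (i = 0; i < npairs; i++)
        if (same(w, from[i])) { fputs(to[i], stdout); return; }
    fputs(w, stdout);
}

int main(void)
{
    FILE *f = fopen("speech.exceptions", "r");  /* name is made up */
    char word[MAXLEN];
    int  c, n = 0;

    while (f && npairs < MAXPAIRS &&            /* read line pairs */
           fgets(from[npairs], MAXLEN, f) &&
           fgets(to[npairs], MAXLEN, f)) {
        from[npairs][strcspn(from[npairs], "\n")] = '\0';
        to[npairs][strcspn(to[npairs], "\n")] = '\0';
        npairs++;
    }
    if (f) fclose(f);

    while ((c = getchar()) != EOF) {
        if (isalpha(c) && n < MAXLEN - 1)
            word[n++] = c;                  /* build up a word */
        else {
            word[n] = '\0'; n = 0;
            put_word(word);
            putchar(c);                     /* the delimiter   */
        }
    }
    word[n] = '\0';
    put_word(word);                         /* trailing word   */
    return 0;
}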

I know it's on Compuserve, either in AmigaTech or AmigaArts, but don't know
where else it might be available.

-larry

--
Frisbeetarianism: The belief that when you die, your soul goes up on
                  the roof and gets stuck.
+----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                |
| \X/    lphillips@lpami.wimsey.bc.ca or uunet!van-bc!lpami!lphillips  |
|        COMPUSERVE: 76703,4322                                        |
+----------------------------------------------------------------------+