page@swan.ulowell.edu (Bob Page) (01/18/89)
sparks@corpane.UUCP (John Sparks) wrote:
>The synthetic voice Amy has now sux.

You're probably using the translator.library (the text-to-phoneme library),
which has a lot of rough spots, especially stress and inter-phoneme pauses.
If you pick your own phonemes (or adjust what comes from the
translator.library) you'll be surprised at how much better you can do with
the narrator.device (the phoneme-to-speech part).  Of course you need to
spend some time with the params of the narrator.device too.

>It sounds like a transvestite swedish chef.

Some of my best friends are transvestite Swedish chefs. :-)  It also sounds
like it's holding its nose.

>The Amiga has enough potential power to have a really nice synthesized
>...
>digital samples of his voice to make the phonemes. It needs a lot of

Well, if it's sampled, it isn't synthesized.

>should be able to be as good as say, DECTALK, with a little bit of work.

And a lot of custom DSP hardware, a ton of memory, and lots of CPU time.
DECTALK uses two pronunciation dictionaries, a library of letter-to-sound
rules, and a large rules database for intonation, duration, and word
stress, as well as such stuff as "breathiness", "head size" and
"laryngealization".  It converts the phonemes to control messages every
6.4ms, and sends this information through the D/A at 10k samples/sec.

You need a synthesizer to do all this; a digitized phoneme library won't
cut it, as you lose all the stresses and other nuances of speech.  I'm not
defending the current Amiga's voice capabilities, just pointing out that
building a great generalized text-to-voice capability in software is very
hard.

..Bob
--
Bob Page, U of Lowell CS Dept.  page@swan.ulowell.edu  ulowell!page
Have five nice days.
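[Editorial sketch.] Bob's suggestion — take what the translator produces and
hand-tune it before the phoneme-to-speech stage — can be illustrated with a
couple of string filters. The stress-digit-after-vowel convention (e.g.
"HEH4LOW", and "RAH4STIY" in Rusty's sig below) is the narrator-style phoneme
notation; the helper names and the pause trick are this sketch's own
assumptions, not part of any Amiga library, and real tuning would of course
happen in C on the machine itself.

```python
# Hedged sketch: post-process a translator-style phoneme string before
# handing it to the speech device.  Stress is marked by a digit 1-9
# after a vowel phoneme (e.g. "HEH4LOW"); automatic output is often
# too flat, so one easy tweak is to exaggerate the existing stresses.
import re

def boost_stress(phonemes: str, delta: int = 2) -> str:
    """Raise every stress digit in a phoneme string, clamped to 9."""
    return re.sub(r"[1-9]",
                  lambda m: str(min(9, int(m.group()) + delta)),
                  phonemes)

def insert_pause(phonemes: str, after_word: int) -> str:
    """Insert a pause marker (a lone '.' token here -- an assumption of
    this sketch) after the Nth word, to break up run-together output."""
    words = phonemes.split()
    words.insert(after_word, ".")
    return " ".join(words)
```

For example, `boost_stress("HEH2LOW")` yields `"HEH4LOW"` — the same phonemes,
with the stressed syllable pushed harder.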
jesup@cbmvax.UUCP (Randell Jesup) (01/18/89)
In article <11263@swan.ulowell.edu> page@swan.ulowell.edu (Bob Page) writes:
>>It sounds like a transvestite swedish chef.
>
>Some of my best friends are transvestite Swedish chefs. :-)  It also
>sounds like it's holding its nose.

Try it with a 4-bitplane 640 screen up front: it sounds like it's holding
its nose AND is under water. :-)  Plus it tends to skip words, just for
extra amusement (only in 640, 4 bitplane).

--
Randell Jesup, Commodore Engineering  {uunet|rutgers|allegra}!cbmvax!jesup
mjw@f.gp.cs.cmu.edu (Michael Witbrock) (01/19/89)
Using digitised phones to generate the Amiga's speech would make things
worse, not better.  Guaranteed.

However, there may be another way to skin this cat.

In a recent tech report, Terry Sejnowski at JHU described an experiment
with an artificial 'neural' network which learns to pronounce English
words.  The paper is:

@techreport( SEJNOWSKI86A,
  key         = "Sejnowski",
  author      = "Terrence Sejnowski and Charles Rosenberg",
  title       = "NETtalk: A Parallel Network that Learns to Read Aloud",
  institution = "Johns Hopkins University",
  number      = "JHU-EECS-86-01",
  year        = "1986",
  keywords    = "applications, NETtalk, speech generation",
  annote      = "Backprop was used to teach a 3-layer network how to
                 pronounce textual input.  Input layer = 7-character
                 window of local chars; output layer = distributed
                 phoneme features.  Impressive performance; improved by
                 more hidden units, an extra hidden layer, and/or using
                 pure dictionary words (rather than inconsistent
                 real-life data).  Damage-resistant, since distributed.
                 [jmt/***]",
  bibdate     = "Wed Mar 2 11:10:30 1988",
)

The network takes a sliding window over the text and outputs a description
of the phone (voiced/unvoiced, dental/palatal, etc.) required to utter it.
From my experience in working with connectionist speech recognition, I have
little doubt that the network could be extended to learn to produce the
actual waveform (or LPC coefficients or whatever) for the speech.  Perhaps
C-A would like to offer me a summer job trying :-).

In any case, if C-A is thinking of improving the speech on the Amiga, it
would certainly be worth their while to get a copy of this TR.

Michael
--
Michael.Witbrock@cs.cmu.edu  mjw@cs.cmu.edu
US Mail: Michael Witbrock / Dept of Computer Science /
         Carnegie Mellon University / Pittsburgh PA 15213 / USA
Telephone: (412) 268 3621 [Office]  (412) 441 1724 [Home]
dykimber@phoenix.Princeton.EDU (Daniel Yaron Kimberg) (01/19/89)
In article <4050@pt.cs.cmu.edu> mjw@f.gp.cs.cmu.edu (Michael Witbrock) writes:
>Using digitised phones to generate Amiga's speech would make things worse,
>not better. Guaranteed.
>
>However, there may be another way to skin this cat.
>
>In a recent tech report, Terry Sejnowski at JHU described an experiment with
>an artificial 'neural' network which learns to pronounce English words.

[description of NETtalk techreport ref deleted]

When I heard NETtalk, it wasn't a heck of a lot better sounding than the
Amiga in terms of intonation.  Its main asset is that it's good at learning
what sounds to make based on a window of text (and it doesn't always sound
like a native speaker).  Perhaps a better way to apply backprop learning to
getting the Amiga to talk would be to train it on a corpus of phonetic
transcriptions, to learn intonation.  Or the two could be used in tandem.
The question is really whether or not it'll work in real time.  I don't
think it will.

-Dan
rusty@hocpa.UUCP (M.W.HADDOCK) (01/20/89)
Is there a switch on SPEAK: or 'say' that will turn off the expansion of
abbreviations?  For example, the sentence

	I am the Wizard of Oz.

comes out sounding like

	I am the wizard of ounces.

I know you're saying that the need for "Oz" at the end of a sentence is
minuscule, but there is at least "Dr." (for doctor when I wanted drive).
There are other abbreviations but you catch my drift.  Right now, the only
way I've come up with to defeat this is to put a space before a terminating
period.  Ah, but this means I have to "edit" my incoming text.

-Rusty-
RAH4STIY /HAE4DAAK
--
Rusty Haddock  {uunet!likewise,att,arpa}!hocpa!rusty
AT&T Consumer Products Laboratories - Human Factors Laboratory
Holmdel, New Joyzey 07733  (201) 834-1023  rusty@hocpa.att.com
** Genius may have its limitations but stupidity is not thus handicapped.
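[Editorial sketch.] Rusty's space-before-the-period workaround can at least
be automated instead of done by hand. This filter pads the period only after
words known to be mis-expanded, so the rest of the text is untouched before
it reaches the translator. The word list is purely illustrative, and as far
as the thread establishes, SPEAK:/say offer no built-in switch for this.

```python
# Hedged sketch: automate the "space before the terminating period"
# trick for words the translator would mis-expand as abbreviations.
SUSPECT = {"oz", "dr", "st"}   # illustrative list of troublesome words

def protect_abbrevs(text: str) -> str:
    out = []
    for word in text.split():
        if word.endswith(".") and word[:-1].lower() in SUSPECT:
            word = word[:-1] + " ."   # the space defeats the expansion
        out.append(word)
    return " ".join(out)
```

So "I am the Wizard of Oz." goes in, and "I am the Wizard of Oz ." comes out,
which the translator then reads as the word rather than the abbreviation.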
lee@uhccux.uhcc.hawaii.edu (Greg Lee) (01/20/89)
[How to get better sounding speech on the Amiga ...]

FM synthesis.  That's the way to go.  Here's the reasoning:

(1) In FM music synthesis, real instruments are mimicked in a way that is
very economical so far as the parameter data that has to be stored or
transferred goes.  The trick is not to try to reproduce the wave form of
the real instrument, nor to try to reproduce the steady-state frequency
spectrum, but to reproduce the rapid changes in harmonic complexity (among
other aspects of the note envelope).  Why does this work?  It must be that
human perception of instrumental sounds in fact has more to do with
transitions than with steady states or frequency spectra per se.

(2) There is some reason to think that in another area of perception we
really perceive changes in frequency spectrum rather than the spectra
themselves: color perception.  Some may recall Land's dramatic experiment
reported in Scientific American a few years ago.  He projected a picture of
a *natural* scene in two frequencies of red light, photographed through two
corresponding red filters.  People saw it in full color.  Why does this
work?  Color perception must concern the edges in a scene between different
spectra.

(3) Back to speech.  Conventional speech synthesis that is not based on
sampled speech works by reproducing the steady-state frequency spectra of
vowels -- the formant structure -- in human speech, then stitching various
vowel sounds together with appropriate formant transitions, a little noise,
etc.  If this corresponds to the way humans perceive speech, we must have
to change gears, so to speak, when we have been listening to a man speak
and we begin listening to a woman or a child.  This is because the spectra
of vowel sounds are quite different, due to the differing sizes of the oral
cavities that produce the formants.  So, there should be a certain
adaptation time in these circumstances that we could measure.

In the book Three Experiments in Phonetics, Peter Ladefoged reports the
results of an attempt to measure this adaptation time.  Guess what he
found.  I'll tell you -- zero.  Maybe there's no adaptation time because
there's no adaptation involved in the first place.  (Ladefoged does not
draw this conclusion, however.)

(4) So far as I know, FM speech synthesis has never been tried.  The upside
of trying it is that you'd be breaking new theoretical ground if you could
get it to work.  I can think of a couple of downsides.  One is that it
might simply not work -- the above reasoning is all a priori, and may be
wrong at that.  Another is that there is probably no practical way to do it
a step at a time.  You couldn't expect to get recognizable speech until you
got fairly close to the "right" parameter values for a considerable stretch
of speech, judging by the analogy to Land's experiment, because there the
effect of color perception appears only for natural scenes.  And I don't
think you could find appropriate measurements of human speech reported in
the journal literature.  It would be cut and try.

Greg, lee@uhccux.uhcc.hawaii.edu
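[Editorial sketch.] The mechanism Greg's point (1) rests on is compact
enough to show directly: a single carrier whose phase is modulated by a
single modulator, y(t) = sin(2*pi*fc*t + I(t)*sin(2*pi*fm*t)). As the
modulation index I(t) rises, energy spreads into sidebands around the
carrier, so sweeping I over the course of a note produces exactly the rapid
change in harmonic complexity he describes. The parameter values and the
linear index envelope below are illustrative choices of this sketch only.

```python
# Hedged sketch of classic two-oscillator FM: the time-varying
# modulation index I(t) is what changes the harmonic complexity
# over the note (bright attack decaying to a pure tone here).
import math

def fm_note(fc=440.0, fm=440.0, dur=0.5, rate=10000):
    """Render y(t) = sin(2*pi*fc*t + I(t)*sin(2*pi*fm*t)) with
    the index I decaying linearly from 5 to 0 over the note."""
    n = int(dur * rate)
    out = []
    for i in range(n):
        t = i / rate
        index = 5.0 * (1.0 - t / dur)        # I(t): rich -> pure
        out.append(math.sin(2 * math.pi * fc * t
                            + index * math.sin(2 * math.pi * fm * t)))
    return out
```

Whether driving formant-like spectra with such operators can carry
intelligible speech is precisely the open question in the post; this only
shows the machinery that would be doing the driving.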
lphillips@lpami.wimsey.bc.ca (Larry Phillips) (01/22/89)
In <506@hocpa.UUCP>, rusty@hocpa.UUCP (M.W.HADDOCK) writes:
>Is there a switch on SPEAK: or 'say' that will turn off the expansion of
>abbreviations?  For example, the sentence
>
>	I am the Wizard of Oz.
>
>comes out sounding like
>
>	I am the wizard of ounces.
>
>I know you're saying that the need for "Oz" at the end of a sentence is
>minuscule, but there is at least "Dr." (for doctor when I wanted drive).
>There are other abbreviations but you catch my drift.  Right now, the only
>way I've come up with to defeat this is to put a space before a
>terminating period.  Ah, but this means I have to "edit" my incoming text.

Chip Orange has written a rather nifty little utility called BetterSpeech.
It intercepts the output of the 'say' command and accesses a translation
file that stores your translator exceptions in pairs.  Typical entries
might look like:

	ascii	asky
	copied	copeed
	ieee	i triple e
	ibm	brain dead pile of silicon scrap

I know it's on CompuServe, either in AmigaTech or AmigaArts, but don't know
where else it might be available.

-larry
--
Frisbeetarianism: The belief that when you die, your soul goes up
on the roof and gets stuck.
+----------------------------------------------------------------------+
|  //  Larry Phillips                                                  |
| \X/  lphillips@lpami.wimsey.bc.ca or uunet!van-bc!lpami!lphillips    |
|      COMPUSERVE: 76703,4322                                          |
+----------------------------------------------------------------------+
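[Editorial sketch.] The exception-pair mechanism Larry describes is simple
to model: each line of the translation file holds a word and its preferred
spoken replacement, and the filter substitutes matches before the text
reaches the translator. Whole-word, case-insensitive matching is this
sketch's assumption; the real BetterSpeech's matching rules may well differ.

```python
# Hedged sketch of a BetterSpeech-style exception filter: word pairs
# from a translation file are applied before the text is spoken.
EXCEPTIONS_FILE = """\
ascii   asky
copied  copeed
ieee    i triple e
"""

def load_exceptions(text):
    """Parse 'word replacement...' lines into a lookup table."""
    table = {}
    for line in text.splitlines():
        word, _, replacement = line.partition(" ")
        table[word.lower()] = replacement.strip()
    return table

def apply_exceptions(sentence, table):
    """Replace each whole word that has an exception entry."""
    return " ".join(table.get(w.lower(), w) for w in sentence.split())
```

Note that one word may map to several ("ieee" to "i triple e"), which is why
the replacement is everything after the first space on the line.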