[comp.sys.amiga] Phonemes: why not just digitize them?

sean@ukma.UUCP (02/10/87)

You know, with a 512k amiga and a low sampling rate, you could probably
fit all the digitized phonemes you ever wanted in memory.  They'd
probably sound a hell of a lot better, too.  And you could make them
sound like Leonard Nimoy, or perhaps, Mae West :-).
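
Quick back-of-the-envelope, with numbers I'm just guessing at: figure about
40 English phonemes averaging maybe 0.2 seconds each, sampled 8-bit at 8kHz.
That's 40 * 0.2 * 8000 = 64K bytes -- a small corner of a 512K machine, with
room to spare even at a higher sampling rate.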

BTW - I finally got my Amiga!!!!!

Sean
-- 
===========================================================================
Sean Casey      UUCP:  cbosgd!ukma!sean           CSNET:  sean@ms.uky.csnet
		ARPA:  ukma!sean@anl-mcs.arpa    BITNET:  sean@UKMA.BITNET

flaps@utcsri.UUCP (Alan J Rosenthal) (02/11/87)

In a recent article sean@ukmj.ms.uky.csnet (Sean Casey) writes:
>You know, with a 512k amiga and a low sampling rate, you could probably
>fit all the digitized phonemes you ever wanted in memory.  They'd
>probably sound a hell of a lot better, too.

They'd sound terrible.  The Narrator does a lot of work on intonation and
the like.  That work is currently based on a model of human speech and could
not be adapted directly to digitized sounds.  Assuming you didn't have this
in mind, just cutting up phonemes and pasting them together would sound like
those talking clocks that say "The time.  is?  five!  thirty.."  except much
worse, because the oddness would be at the phoneme level rather than the
word level.

-- 

Alan J Rosenthal

UUCP: {backbone}!seismo!mnetor!utgpu!flaps, ubc-vision!utai!utgpu!flaps,
      or utzoo!utgpu!flaps (among other possibilities)
ARPA: flaps@csri.toronto.edu
CSNET: flaps@toronto
BITNET: flaps at utorgpu

lachac@topaz.UUCP (02/12/87)

In article <4111@utcsri.UUCP> flaps@utcsri.UUCP (Alan J Rosenthal) writes:
>
>In a recent article sean@ukmj.ms.uky.csnet (Sean Casey) writes:
>> {about digitized phonemes}
>
>They'd sound terrible.  The Narrator does a lot of work on intonation and
>the like.  That work is currently based on a model of human speech and could
>not be adapted directly to digitized sounds.  Assuming you didn't have this
>in mind, just cutting up phonemes and pasting them together would sound like
>those talking clocks that say "The time.  is?  five!  thirty.."  except much
>worse, because the oddness would be at the phoneme level rather than the
>word level.


In that case, how are the phone companies' recorded messages done?
("The number you have reached..")???  Aren't those computer generated?

Also, at work we have a Periphonics with a FANTASTIC female voice.  I
understand that the Amiga would probably be incapable of doing this
(maybe with 8 megs....?), but I was wondering if the voice on the Amy
could be improved...





-- 
----------------------------------------------------------------------------
	"Isn't fun the best thing to have?"

			lachac@topaz.rutgers.edu

cjp@vax135.UUCP (02/12/87)

While this subject is current, I'd like to post a few ramblings.  First
let me say that I value the accomplishments of the current Narrator,
and to some extent appreciate the difficulty of the enhancements I
propose here.  I would, however, like to see something better.

In playing with Narrator through "say" in phoneme mode (i.e. not
invoking Translator), I have found it difficult to achieve a
satisfactorily natural-sounding voice.  Impossible, really.  One
problem is that there is not enough control available.  The phoneme
syntax gives you only one adjustable parameter, called "stress"
(ignoring parameters like "pitch" or "male", which affect the whole
utterance).

I feel there is a need for control of, say, three factors: volume,
duration, and pitch.  These things need to be controllable *at least* at
the resolution of individual phonemes.  I argue that even finer control is
necessary for fully expressive, natural sounding voice.  It is not just
Thai that needs the ability to change pitch during a vowel (or other
voiced) sound.  You need to be able to slide the pitch and volume of
the phoneme between two limits from its start to its end.  Perhaps you
even need to say what the "shape" of that slide is, chosen from a few
such as exponential, negative exponential, linear.  Try listening
critically to the pitch and speed of someone talking -- using a tape
recording (or digitized voice) helps -- and notice how much of a
person's attitude and intent are communicated through intonation.
I'm sure you've already noticed how little of it comes through in
Narrator's voice.
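
To make that concrete, here is the kind of per-phoneme record I have in
mind, sketched in C (all the names and field sizes are strictly my own
invention):

    /* shape of a pitch or volume slide across one phoneme */
    enum SlideShape { LINEAR, EXPONENTIAL, NEG_EXPONENTIAL };

    /* One phoneme event.  Pitch and volume each get a start and an end
     * value, so they can slide during the phoneme, plus a shape for the
     * slide; duration is explicit rather than implied by a single
     * stress digit.
     */
    struct PhonemeEvent {
        char           code[4];      /* phoneme code, e.g. "AE" */
        unsigned short duration;     /* length in milliseconds */
        unsigned short startPitch;   /* pitch in Hz at onset */
        unsigned short endPitch;     /* pitch in Hz at end */
        unsigned char  startVolume;  /* amplitude 0..255 at onset */
        unsigned char  endVolume;    /* amplitude 0..255 at end */
        unsigned char  pitchShape;   /* one of SlideShape */
        unsigned char  volumeShape;  /* one of SlideShape */
    };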

Let me call this hypothetical voice generator the Expressor.  Now
clearly, this type of voice generator is not meant to be driven by an
automatic text translator.  There is generally not enough information
in text for even humans to derive accurate intentions and attitudes,
let alone the problem of generating parameters which re-evoke those
attitudes.  But I think there would be many good and impressive uses
for "canned" strings of phonemes, generated manually.  I estimate that
even a fully parameterized, inflected, modulated, and warbled word,
expressed as a string of phonemes in Expressor syntax, would require a
tiny fraction of the storage of a digitized sound sample saying more or
less the same thing.  One could store maybe hours of expressive,
intelligible talk on a single disk instead of the (I forget the exact
time) less than a minute of sampled sound.  If done properly, if the
parameters are given enough range and resolution (*much* more than 1 to
9), one could even take a good shot at synthesized singing.
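
Some rough figures behind that guess (all ballpark numbers of my own): at
the roughly 28kHz top rate of the Amiga audio hardware, an 880K floppy
holds on the order of half a minute of raw 8-bit sound.  A per-phoneme
record like the one sketched above comes to about 14 bytes, and ordinary
speech runs maybe 10 to 15 phonemes per second, so call it 150 to 200
bytes per second -- better than an hour of talk on the same floppy.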

Well, enough of me talking through my hat.  I certainly don't know how
hard it would be to implement.  It would be neat though.  Comments,
especially informed comments, are requested.

Charles Poirier   USENET vax135!cjp

flaps@utcsri.UUCP (Alan J Rosenthal) (02/13/87)

>>In a recent article sean@ukmj.ms.uky.csnet (Sean Casey) writes:
>>> {about digitized phonemes}

I, flaps@utcsri.UUCP (Alan J Rosenthal), wrote:
>>They'd sound terrible.

lachac@topaz.rutgers.edu (Gerard Lachac) writes:
>In that case, how are the phone companies' recorded messages done?
>("The number you have reached..")???  Aren't those computer generated?
>
>Also at work we have a Periphonics with a FANTASTIC female voice.

Well, I have a different phone company than you do, but I don't think this
is really on the same topic.  There is no obstacle to computer generated
speech based on sampled phonemes, but it would have to be done with features
like the attack and decay in the demo called "NewMusic" (the one with the
window entitled "Audio Demo"), and not just by pasting recordings next to
each other.
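
To give a rough idea of what I mean by that (a made-up sketch, not anything
from the actual NewMusic code): at the very least each stored phoneme would
need an amplitude envelope shaped onto it, so that neighbouring phonemes
could be overlapped smoothly rather than just butted together:

    /* Shape one digitized phoneme with a linear attack and decay so
     * that adjacent phonemes can be overlapped smoothly instead of
     * simply pasted end to end.  Samples are signed 8-bit, as the
     * Amiga audio hardware plays.
     */
    void envelope(signed char *buf, long len, long attack, long decay)
    {
        long i;

        if (attack + decay > len)        /* keep both ramps in bounds */
            attack = decay = len / 2;

        for (i = 0; i < attack; i++)     /* ramp up from silence */
            buf[i] = (signed char)((long)buf[i] * i / attack);

        for (i = 0; i < decay; i++)      /* ramp back down to silence */
            buf[len - 1 - i] =
                (signed char)((long)buf[len - 1 - i] * i / decay);
    }

Presumably a real job would also have to smooth pitch across the joins,
which is the sort of thing the model-based approach is there to handle.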

In any case, if something that always begins with "The number you have
reached" is based on sampled sound, I would assume that they recorded
someone saying "The number you have reached" rather than asking them to
say each word separately.  Does it say a telephone number?  Listen to the
relative inflexion between digits.  It sounds terrible, as I originally
said, if it's like the ones here (or like any others I've heard where
beauty of speech sound was not a design objective).

And just to reiterate, my original article was a response to an article
saying "why not just record each phoneme separately and splice them
together?  They'd probably sound better than the current narrator" or
somesuch.

-- 

Alan J Rosenthal

UUCP: {backbone}!seismo!mnetor!utgpu!flaps, ubc-vision!utai!utgpu!flaps,
      or utzoo!utgpu!flaps (among other possibilities)
ARPA: flaps@csri.toronto.edu
CSNET: flaps@toronto
BITNET: flaps at utorgpu

gwe@cbosgd.UUCP (02/13/87)

In article <9146@topaz.RUTGERS.EDU> lachac@topaz.rutgers.edu (Gerard Lachac) writes:
>In article <4111@utcsri.UUCP> flaps@utcsri.UUCP (Alan J Rosenthal) writes:
>>
>>In a recent article sean@ukmj.ms.uky.csnet (Sean Casey) writes:
>>> {about digitized phonemes}
>>talking clocks that say "The time.  is?  five!  thirty.."  except much
>>worse, because the oddness would be at the phoneme level rather than the
>>word level.
>
>
>In that case, how are the phone companies' recorded messages done?
>("The number you have reached..")???  Aren't those computer generated?
>
While I don't work in that area, I do know that the recording you speak of
is just that: a recording.  The system plays one section, "The number you
have reached...", digitized as spoken by somebody, then assembles the digits
(each one individually digitized), then plays the trailer, "has been
disconnected".

The message isn't constructed from phonemes, or even at the word level.
It's basically an answering machine with 4k RAM. :-)
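
In other words (a made-up sketch, certainly not the phone company's actual
code), the logic is about this sophisticated:

    #include <stdio.h>

    #define CLIP_HEADER  10     /* "The number you have reached" */
    #define CLIP_TRAILER 11     /* "has been disconnected" */
    /* clips 0 through 9 are the individually recorded digits */

    /* Stand-in for whatever routine actually feeds a stored
     * recording to the audio hardware.
     */
    static void play_clip(int clip)
    {
        printf("playing clip %d\n", clip);
    }

    void announce(const char *number)   /* e.g. "5551234" */
    {
        play_clip(CLIP_HEADER);
        while (*number) {
            if (*number >= '0' && *number <= '9')
                play_clip(*number - '0');
            number++;
        }
        play_clip(CLIP_TRAILER);
    }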

------------------------------clip and save----------------------------------

	Bill Thacker    	cbatt!cbosgd!cbdkc1!serial!wbt
DISCLAIMER: Farg 'em if they can't take a joke !

If you love something, set it free. If it doesn't come back to you,
	track it down and kill it.

-----------------------------valuable coupon---------------------------------

jimh@hpsadla.UUCP (02/15/87)

Just a quick note on the Narrator...

     Has anyone out there ever played with SAM (the Software Automatic
Mouth), which was a program for the Apple ][, C-64, and Atari 6502
machines?  Am I the only one who recognizes the voice in `SAY'?

     Speaking of improved Amiga speech, has anyone else heard or played
with the (gasp!) Macintosh's `Smoothtalker'?  Do it.  You will be
pleasantly surprised.  Now if a sub-8MHz 68000 can do that without DMA,
why can't that code find its way to the Amiga?

	Jim Horn	{The World}!hplabs!hpfcla!hpsrla!hpsadla!jimh