[sci.lang] Better Speech.

lee@uhccux.uhcc.hawaii.edu (Greg Lee) (01/20/89)

[How to get better sounding speech on the Amiga ...]

FM synthesis.  That's the way to go.  Here's the reasoning:

(1) In FM music synthesis, real instruments are mimicked in a way
that is very economical in the amount of parameter data that has
to be stored or transferred.  The trick is not to try to
reproduce the waveform of the real instrument, nor to try
to reproduce the steady-state frequency spectrum, but to
reproduce the rapid changes in harmonic complexity (among other
aspects of the note envelope).  Why does this work?  It must
be that human perception of instrumental sounds in fact has
more to do with transitions than with steady states or
frequency spectra per se.
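(A sketch, not from the original post, of the technique point (1)
describes: Chowning-style FM with a single carrier/modulator pair.
The harmonic complexity of the output is governed by the modulation
index I(t), so a whole note can be specified by a few envelope
parameters rather than a stored waveform.  All names and parameter
values below are invented for illustration.)

```python
import math

def fm_note(f_carrier, f_mod, index_env, sr=8000, dur=0.5):
    """Render one FM note: y(t) = sin(2*pi*fc*t + I(t)*sin(2*pi*fm*t)).

    index_env maps normalized time in [0, 1] to the modulation index I.
    A rapidly changing I is what mimics the attack transients of real
    instruments -- the "rapid changes in harmonic complexity" above.
    """
    n = int(sr * dur)
    out = []
    for i in range(n):
        t = i / sr
        I = index_env(i / n)  # time-varying modulation index
        phase = (2 * math.pi * f_carrier * t
                 + I * math.sin(2 * math.pi * f_mod * t))
        out.append(math.sin(phase))
    return out

# A brass-like gesture: the index (hence brightness) spikes at the
# attack and decays toward a duller steady state.
brassy = fm_note(440.0, 440.0, lambda x: 5.0 * math.exp(-4.0 * x))
```

Note how little data this takes: two frequencies and one envelope,
versus thousands of stored sample values.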

(2) There is some reason to think that in another area of
perception we really perceive changes in frequency spectrum
rather than spectra.  That is color perception.  Some
may recall Land's dramatic experiment reported in Scientific
American a few years ago.  He projected a picture in two frequencies
of red light of a *natural* scene photographed through
two corresponding red filters.  People saw it in full color.
Why does this work?  Color perception must concern the edges
in a scene between different spectra.

(3) Back to speech.  Conventional speech synthesis that is
not based on sampled speech works by reproducing the steady
state frequency spectra of vowels -- the formant structure --
in human speech, then stitching various vowel sounds together
with appropriate formant transitions, a little noise, etc.
If this corresponds to the way humans perceive speech, we
must have to change gears, so to speak, when we have been
listening to a man speak and we begin listening to a woman
or a child.  This is because the spectra of vowel sounds
are quite different due to differing sizes of the oral cavities
that produce the formants.  So, there should be a certain
adaptation time in these circumstances that we could measure.
In the book Three Areas of Experimental Phonetics, Peter Ladefoged
reports the results of an attempt to measure this adaptation
time.  Guess what he found.  I'll tell you -- zero.  Maybe
there's no adaptation time because there's no adaptation
involved in the first place. (Ladefoged does not draw this
conclusion, however.)
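(A sketch, not from the original post, of the conventional approach
point (3) describes: a glottal pulse train driving a cascade of
second-order resonators, one per formant.  The formant values used,
roughly F1 = 700 Hz and F2 = 1100 Hz for an /a/-like vowel, are
textbook ballpark figures for a male voice, not measurements.)

```python
import math

def resonator(signal, freq, bw, sr):
    """Second-order IIR resonator, the standard digital formant filter."""
    r = math.exp(-math.pi * bw / sr)
    theta = 2 * math.pi * freq / sr
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    gain = 1 - a1 - a2  # normalizes gain at DC; close enough for a sketch
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out

def vowel(f0, formants, sr=8000, dur=0.3):
    """Excite a formant cascade with an impulse train at pitch f0."""
    n = int(sr * dur)
    period = int(sr / f0)
    sig = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        sig = resonator(sig, freq, bw, sr)
    return sig

# Steady-state /a/-ish vowel at a 120 Hz pitch.
ah = vowel(120.0, [(700, 90), (1100, 110)])
```

A full synthesizer would then stitch such steady states together by
interpolating the formant frequencies over time, which is exactly the
spectrum-centered assumption the adaptation-time result calls into
question.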

(4) So far as I know, FM speech synthesis has never been tried.
The upside of trying it is that you'd be breaking new
theoretical ground if you could get it to work.  I can think
of a couple of downsides.  One is that it may simply not work --
the above reasoning is all a priori, and may be wrong at that.
Another is that there is probably no practical way to do it
a step at a time.  You couldn't expect to get recognizable
speech until you got fairly close to the "right" parameter
values for a considerable stretch of speech, judging by the
analogy to Land's experiment, because there the effect of
color perception appears only for natural scenes.  And I
don't think you could find appropriate measurements of
human speech reported in the journal literature.  It would
be cut and try.

		Greg, lee@uhccux.uhcc.hawaii.edu