lee@uhccux.uhcc.hawaii.edu (Greg Lee) (01/20/89)
[How to get better sounding speech on the Amiga ...]

FM synthesis. That's the way to go. Here's the reasoning:

(1) In FM music synthesis, real instruments are mimicked in a way that is very economical in terms of the parameter data that has to be stored or transferred. The trick is not to try to reproduce the waveform of the real instrument, nor its steady-state frequency spectrum, but to reproduce the rapid changes in harmonic complexity (among other aspects of the note envelope). Why does this work? It must be that human perception of instrumental sounds in fact has more to do with transitions than with steady states or frequency spectra per se.

(2) There is some reason to think that in another area of perception we really perceive changes in frequency spectrum rather than the spectra themselves: color perception. Some may recall Land's dramatic experiment reported in Scientific American a few years ago. He projected a picture of a *natural* scene in two frequencies of red light, photographed through two corresponding red filters. People saw it in full color. Why does this work? Color perception must concern the edges in a scene between different spectra.

(3) Back to speech. Conventional speech synthesis that is not based on sampled speech works by reproducing the steady-state frequency spectra of vowels -- the formant structure -- of human speech, then stitching the various vowel sounds together with appropriate formant transitions, a little noise, etc. If this corresponds to the way humans perceive speech, we must have to change gears, so to speak, when we have been listening to a man speak and begin listening to a woman or a child, because the spectra of the vowel sounds are quite different, owing to the differing sizes of the oral cavities that produce the formants. So there should be a certain adaptation time in these circumstances that we could measure.
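The economy claimed in point (1) can be sketched in a few lines. This is not Amiga code, just a generic Chowning-style FM oscillator in Python; the sample rate, the 220 Hz carrier and modulator, and the decaying index envelope are all illustrative assumptions, not measured values. The point is that one time-varying parameter, the modulation index I(t), controls the whole evolution of the note's harmonic complexity:

```python
import math

SR = 8000  # sample rate in Hz (an assumption for the sketch)

def fm_note(fc, fm, index_env, dur, sr=SR):
    """Chowning-style FM: y(t) = sin(2*pi*fc*t + I(t)*sin(2*pi*fm*t)).

    index_env(u) gives the modulation index I at normalized time u in [0, 1].
    A large I spreads energy into many sidebands (a bright, complex tone);
    as I decays the tone becomes nearly a pure sine.  Sweeping I over the
    note is the "rapid change in harmonic complexity" the post refers to,
    and it costs only a handful of stored parameters.
    """
    n = int(dur * sr)
    out = []
    for i in range(n):
        t = i / sr
        I = index_env(i / n)
        out.append(math.sin(2 * math.pi * fc * t + I * math.sin(2 * math.pi * fm * t)))
    return out

# A brass-like gesture: brightness (modulation index) starts high and decays.
brassy = fm_note(220.0, 220.0, lambda u: 5.0 * math.exp(-3.0 * u), 0.5)
```

With carrier and modulator at the same frequency the sidebands fall on harmonics of 220 Hz, so the decaying index gives a harmonic tone that dulls as it sustains -- the kind of transition the argument says listeners actually track.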
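For contrast, the conventional formant approach described in point (3) can be sketched under the same caveats (pure Python, and the pitch, formant frequencies, and bandwidths below are rough textbook-style figures for an /a/-like vowel, not measurements): an impulse train standing in for the glottal source is passed through cascaded second-order resonators, one per formant, which is exactly a steady-state spectrum reproduction:

```python
import math

SR = 8000  # sample rate in Hz (an assumption for the sketch)

def resonator(x, freq, bw, sr=SR):
    """Second-order IIR resonator: one formant at `freq` Hz, bandwidth `bw` Hz."""
    r = math.exp(-math.pi * bw / sr)          # pole radius from bandwidth
    c = 2 * r * math.cos(2 * math.pi * freq / sr)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = s + c * y1 - r * r * y2           # y[n] = x[n] + c*y[n-1] - r^2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out

def vowel(f0, formants, dur, sr=SR):
    """Steady-state vowel: impulse train at pitch f0 through cascaded resonators."""
    n = int(dur * sr)
    period = int(sr / f0)
    sig = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:
        sig = resonator(sig, freq, bw)
    return sig

# Rough /a/-like formants (illustrative values only).
ah = vowel(100.0, [(700.0, 90.0), (1200.0, 110.0)], 0.3)
```

Note that everything here is a fixed spectrum: connected speech then has to be stitched together by interpolating the formant values between targets, which is the step the post is questioning.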
In the book Three Areas of Experimental Phonetics, Peter Ladefoged reports the results of an attempt to measure this adaptation time. Guess what he found. I'll tell you -- zero. Maybe there's no adaptation time because there's no adaptation involved in the first place. (Ladefoged does not draw this conclusion, however.)

(4) So far as I know, FM speech synthesis has never been tried. The upside of trying it is that you'd be breaking new theoretical ground if you could get it to work. I can think of a couple of downsides. One is that it might simply not work -- the above reasoning is all a priori, and may be wrong at that. Another is that there is probably no practical way to do it a step at a time. Judging by the analogy to Land's experiment, where the color effect appears only for natural scenes, you couldn't expect to get recognizable speech until you got fairly close to the "right" parameter values for a considerable stretch of speech. And I don't think you could find appropriate measurements of human speech reported in the journal literature. It would be cut and try.

	Greg, lee@uhccux.uhcc.hawaii.edu