[comp.compression] Voice Compression

rdippold@lajolla.qualcomm.com (Ron Dippold) (06/13/91)

I'm looking at different methods of sound compression for voice.  That means
that it can be lossy, as we only need to preserve sound in a certain range.
Even better, it's for a phone system, so we only need a 3 kHz range.

It's hard to keep up with all the advances in the field, especially when
it's not my primary field.  If anyone has any references for sound compression,
be it journal articles or ftp sites, I'd be grateful.

-- 
Standard disclaimer applies, you legalistic hacks.     |     Ron Dippold

wcs) (06/20/91)

In article <1991Jun12.191313.16540@qualcomm.com> rdippold@lajolla.qualcomm.com (Ron Dippold) writes:
] I'm looking at different methods of sound compression for voice.  

What kind of bit rate do you need?  Are you sure it will all be
voice, and not modem-data?  How much are you willing to spend for a
compressor?  What sound quality do you need - decent speech where
you can recognize the speaker, or synthetic Speak-And-Spell (tm)?
I assume it needs to run in real-time?  Is it important to use a
standard compression algorithm (probably)?

There are lots of kinds of voice compression, and AT&T and
presumably other phone companies have done infinite amounts of research :-)
The AT&T Technical Journal (Formerly Bell System Technical Journal,
AT&T Bell Laboratories Technical Journal, etc.) often has articles
on speech compression.  Also, we've published a number of books,
and articles in lots of papers - I suspect the various IEEE journals
are a good place to look, as are books on Digital Signal Processing (DSP).
(Disclaimer: I'm not a speech-hacker.)

Typical telephone voice is 64 kbps, with 8000 8-bit samples per second
(non-linearly companded - a linear encoding would probably be 12 bits?).
Compression to 32 kbps is easy using ADPCM (Adaptive Differential
Pulse Code Modulation - the basic 64 kbps stuff is regular PCM.)
There's a lot of commercial telecomm equipment that does 32kbps ADPCM.
Also, if your application is actually telephony, a significant
amount of compression is possible simply by detecting silence -
most conversations only have one speaker at a time, and gaps between
words are non-trivial.
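
As a rough illustration of the 64 kbps figure and of what "non-linearly companded" means, here is a minimal Python sketch of mu-law companding. It uses the continuous textbook formula with mu = 255 followed by a uniform 8-bit quantizer. That is an assumption made for clarity, not the bit-exact segmented encoding real telephone codecs use, and the function names are made up for this example.

```python
import math

MU = 255.0  # mu-law parameter (assumed; the value used in North American telephony)

def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def encode_8bit(x: float) -> int:
    """Clip, compand, then quantize uniformly to one of 256 codes (0..255)."""
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    return min(255, int((y + 1.0) / 2.0 * 256))

def decode_8bit(code: int) -> float:
    """Reconstruct from the center of the quantization cell, then expand."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mulaw_expand(y)

# The bit-rate arithmetic from the paragraph above.
SAMPLE_RATE = 8000      # samples per second
BITS_PER_SAMPLE = 8
bit_rate = SAMPLE_RATE * BITS_PER_SAMPLE  # bits per second

if __name__ == "__main__":
    print("bit rate:", bit_rate, "bps")
    for x in (0.001, 0.01, 0.1, 0.9):
        print(x, "->", encode_8bit(x), "->", round(decode_8bit(encode_8bit(x)), 5))
```

The point of the logarithmic curve is that quiet samples get finer quantization steps than loud ones, which is why 8 companded bits can stand in for a wider linear word.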

There are a number of techniques for getting to speeds in the 8-16
kbps range, without seriously degraded quality.
Essentially you're trading bandwidth vs. quality vs. processing complexity,
and the advances in DSP chips have made processing MUCH faster and
cheaper than it used to be.  One of the families of coding algorithms 
used is called Linear Predictive Coding - essentially, you're
predicting what sounds will come next, and sending the difference
between the prediction and the actual sound.  Another common technique
is to split up the speech energy into different frequency bands,
and separately encode the different bands.
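
The "predict, then send the difference" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it fits the predictor per frame with the autocorrelation method and the Levinson-Durbin recursion, leaves the residual unquantized, and uses function names invented for this example. A real coder would also quantize and transmit the coefficients and the residual.

```python
import math

def autocorrelation(frame, order):
    """r[lag] = sum of frame[i] * frame[i + lag] for lag = 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Return a[0..order-1] such that x[n] is predicted as
    sum(a[k-1] * x[n-k] for k in 1..order)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame; stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def predict(history, coeffs):
    """Predict the next sample from already-known samples (zeros before the start)."""
    return sum(c * history[-k] for k, c in enumerate(coeffs, start=1)
               if k <= len(history))

def lpc_residual(frame, coeffs):
    """Sender side: the difference between each sample and its prediction."""
    return [x - predict(frame[:n], coeffs) for n, x in enumerate(frame)]

def lpc_reconstruct(residual, coeffs):
    """Receiver side: rebuild samples by adding the residual to the same prediction."""
    out = []
    for e in residual:
        out.append(e + predict(out, coeffs))
    return out

if __name__ == "__main__":
    # A 200 Hz tone sampled at 8000 Hz, one 20 ms frame (160 samples).
    frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
    coeffs = levinson_durbin(autocorrelation(frame, 2), 2)
    res = lpc_residual(frame, coeffs)
    print("signal energy:  ", sum(x * x for x in frame))
    print("residual energy:", sum(e * e for e in res))
```

For a predictable signal the residual carries far less energy than the signal itself, so it can be coded with fewer bits. That is where the savings come from.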

Encryption equipment typically uses 2400, 4800, and 9600 bps voice,
which gets digitally encrypted and sent over modems.
It's not bad voice quality, though there's a bit of processing delay
(fractions of a second, but enough to notice.)
It's certainly good enough for typical voice-mail or answering machine.

People have been mentioning extremely-low-bit rate synthetic algorithms here,
with rates like 300 bps, which are basically identifying the
phonemes in your words (a bit simpler than speech-to-text, since you
don't have to disambiguate spelling), and reconstructing them at the
far end, with a voice that presumably doesn't resemble you at all.
-- 
				Pray for peace;		  Bill
# Bill Stewart 908-949-0705 erebus.att.com!wcs AT&T Bell Labs 4M-312 Holmdel NJ
# No, that's covered by the Drug Exception to the Fourth Amendment.
# You can read it here in the fine print.