[comp.dsp] Compression Techniques for Speech

dcollins@sug.org (Daniel Collins) (12/13/90)

I am working on a project where I have to compress a speech signal
by a 20-to-1 ratio in real time.  If 20-to-1 is not practical, what are the
limiting factors, and what is achievable?  I will be using an 8-bit ADC
with an 8 kHz conversion rate.  Can anyone offer any pointers to DSP
chips, other special-purpose chips, literature, code fragments, etc.?

Dan Collins
dcollins@pittslug.sug.org

drake@drake.almaden.ibm.com (12/13/90)

In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
>I am working on a project where I have to compress a speech signal
>by a 20-to-1 ratio in real time.  If 20-to-1 is not practical, what are the
>limiting factors, and what is achievable?  I will be using an 8-bit ADC
>with an 8 kHz conversion rate.

So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
per second?  Pretty aggressive.  There's a company with a product that
hooks to a standard serial port that claims to be able to do speech at
1100 bits per second; the product is from Digispeech, called the DS201.  
The compression algorithm seems to be proprietary.
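
The arithmetic above can be checked with a minimal Python sketch.  The
function and variable names are illustrative only, and the 1100 bps figure
is simply the vendor claim mentioned above, not something verified here.

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz):
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return bits_per_sample * sample_rate_hz


def compression_ratio(raw_bps, compressed_bps):
    """How many times smaller the compressed stream is than the raw one."""
    return raw_bps / compressed_bps


raw_bps = pcm_bit_rate(8, 8000)   # 8-bit ADC at 8 kHz -> 64000 bits/s
target_bps = raw_bps // 20        # 20:1 compression   -> 3200 bits/s

print("raw rate:   ", raw_bps, "bps")
print("20:1 target:", target_bps, "bps")
# Ratio implied by the claimed 1100 bps product (claim, not a measurement).
print("ratio at 1100 bps: %.1f:1" % compression_ratio(raw_bps, 1100))
```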

Sam Drake / IBM Almaden Research Center 
Internet:  drake@ibm.com            BITNET:  DRAKE at ALMADEN
Usenet:    ...!uunet!ibmarc!drake   Phone:   (408) 927-1861

jbuck@galileo.berkeley.edu (Joe Buck) (12/14/90)

In article <367@rufus.UUCP>, drake@drake.almaden.ibm.com writes:
> In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
> >I am working on a project where I have to compress a speech signal
> >by a 20-to-1 ratio in real time.  If 20-to-1 is not practical, what are the
> >limiting factors, and what is achievable?  I will be using an 8-bit ADC
> >with an 8 kHz conversion rate.

> So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
> per second?  Pretty aggressive.

Pretty standard, actually, if you go a bit further to 2400 bps.  You can
get a 2400/4800 bps speech vocoder from a number of places (my former
company, Entropic Speech, Inc, sells one); if you're interested in algorithms,
you can get the government standard LPC-10e algorithm (sorry, I don't
know contacts for this; check out back issues of Speech Technology
magazine).
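
For readers who want to see what "LPC" means computationally, here is a
minimal sketch of linear-prediction analysis using the autocorrelation
method and the Levinson-Durbin recursion.  It is NOT the LPC-10e standard
(which also specifies framing, pitch and voicing analysis, and parameter
quantization); the function names and the toy test signal are assumptions
made for illustration.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum of x[i] * x[i + lag] over the frame, for lag 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err) where the predictor is
        x_hat[n] = a[0]*x[n-1] + a[1]*x[n-2] + ... + a[order-1]*x[n-order]
    and err is the remaining prediction-error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or degenerate frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for this stage
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


# Toy signal (assumption): impulse response of a one-pole filter with the
# pole at 0.9, so an ideal predictor is x_hat[n] = 0.9 * x[n-1].
signal = [0.9 ** n for n in range(200)]
r = autocorrelation(signal, 2)
coeffs, residual = levinson_durbin(r, 2)
print(coeffs, residual)
```

A real coder would run this once per frame on windowed speech, typically
at order 10 for 8 kHz input, and then quantize the resulting parameters.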

Speech quality is artificial but usable in many applications.  There's been
great progress recently in "medium-range" (4800 to 12K bps) speech compression;
algorithms in this range, while computationally expensive, now have very
low distortion.  The upper end of this range may soon be used in digital
cellular phones.

It's possible to compress speech to much lower rates, but I haven't heard
anything below 2400 bps that I consider generally usable (though you can
make nice sounding tapes in a lab of 1200 bps systems, they tend to be
very speaker-dependent and lack robustness).

--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck

turner@lance.tis.llnl.gov (Michael Turner) (12/14/90)

In article <367@rufus.UUCP> drake@drake.almaden.ibm.com writes:
>In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
>>I am working on a project where I have to compress a speech signal
>>by a 20-to-1 ratio in real time.  If 20-to-1 is not practical, what are the
>>limiting factors, and what is achievable?  I will be using an 8-bit ADC
>>with an 8 kHz conversion rate.
>
>So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
>per second?  Pretty aggressive.  There's a company with a product that
>hooks to a standard serial port that claims to be able to do speech at
>1100 bits per second; the product is from Digispeech, called the DS201.  
>The compression algorithm seems to be proprietary.
>
>Sam Drake / IBM Almaden Research Center 
>Internet:  drake@ibm.com            BITNET:  DRAKE at ALMADEN
>Usenet:    ...!uunet!ibmarc!drake   Phone:   (408) 927-1861

And that's with all the time in the world to compress.  The best REAL-TIME
compression I've heard about that preserves the signal significantly is some
CELP (code-excited linear prediction, large codebook) technique (see recent
IEEE AS&SP issues) that gets you down to 9600 bps.  However, you need a
significant fraction of a Cray to run at that rate, according to the author.

Of course, there's the wonderful Apple Macintosh compression technique that
runs in sublinear time: just throw out samples.  But I assume you want to
be able to understand the speech when it's played back.
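
A small sketch of why "just throw out samples" damages the signal.  This
is the naive technique in general, not a description of what any Apple
software actually does; the pure 3 kHz test tone and all names are
assumptions for illustration.  Keeping every second sample of an 8 kHz
stream halves the rate to 4 kHz, so anything above 2 kHz folds back
(aliases) to a lower frequency unless it is filtered out first.

```python
import math


def tone(freq_hz, rate_hz, n_samples):
    """Cosine test tone sampled at rate_hz."""
    return [math.cos(2.0 * math.pi * freq_hz * n / rate_hz)
            for n in range(n_samples)]


def drop_samples(x, keep_every):
    """Naive decimation: keep one sample in keep_every, no filtering."""
    return x[::keep_every]


original = tone(3000, 8000, 64)          # 3 kHz tone at 8 kHz
decimated = drop_samples(original, 2)    # effective rate is now 4 kHz

# At 4 kHz the 3 kHz tone is indistinguishable from a 1 kHz tone.
alias = tone(1000, 4000, len(decimated))
max_diff = max(abs(a - b) for a, b in zip(decimated, alias))
print("largest difference from a 1 kHz tone:", max_diff)
```

A proper decimator low-pass filters before discarding samples, which
avoids the aliasing but still loses everything above the new Nyquist
frequency.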

I suggest you revise your constraints, either the compression ratio or
the real-time response, or both.  I'm no expert (yet) at speech compression,
but I think you're out of luck.

On the subject, however: I'm always on the look-out for NON-real-time
compression algorithms (similar sampling rates, accuracy and compression
ratio to the above problem).  I know about Moser, etc.  I'm most interested
in techniques that exploit knowledge of perceptual limitations in hearing
and production limitations in speech to figure out what parts of the
raw signal can be thrown out.  Assume that the speech has already been
"recognized" down to something like the phoneme level, and that this
information can be used in the compression algorithm.  Assume also a
single non-singing speaker with little background noise.  I'm interested
in good extraction and reproduction of nasal antiresonances, subglottal
coupling, pitch-pulse shape, etc.  For higher (16KHz) rates, getting
believable sibilance is high on my list as well.*  A parametric
representation that allows control of variation of stress factors
(duration, pitch, amplitude) is important.

I see that the relevant techniques are out there, but I'm having trouble
finding them all and putting them together.  It doesn't help that I'm
a nearly-total neophyte with DSP, which seems to be the dialect of Greek that
most of the relevant literature is written in.  On the other hand, I do know
something about phonetics and phonology, which is Swahili to a lot of
DSP folk.
---
Michael Turner
turner@tis.llnl.gov

* Just today I was talking on some bandwidth-limited line to a friend
  who couldn't understand that I was talking about our mutual friend
  BRUCE, not somebody I'd never met named RUTH.

jbuck@galileo.berkeley.edu (Joe Buck) (12/14/90)

In article <1192@ncis.tis.llnl.gov>, turner@lance.tis.llnl.gov (Michael Turner) writes:
> And that's with all the time in the world to compress.  The best REAL-TIME
> compression I've heard about that preserves the signal significantly is some
> CELP (code-excited linear prediction, large codebook) technique (see recent
> IEEE AS&SP issues) that gets you down to 9600 bps.  However, you need a
> significant fraction of a Cray to run at that rate, according to the author.

There are several tricks to make CELP substantially faster without much loss
in quality.  Most involve imposing some structure on the codebook of vectors
so you can find the best match without computing distortions for every vector
(there are some connections with the theory of error-correcting codes here;
I don't understand them all).  Look in the proceedings of recent ICASSP
conferences for details.
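
To make the cost concrete, here is a sketch of the brute-force search
that the structured codebooks are meant to avoid: every code vector is
passed through the LPC synthesis filter and compared against the target.
This is a simplified illustration, not any published coder; it leaves out
perceptual weighting and the pitch (adaptive) codebook, and the codebook
size, vector length, and filter coefficients are arbitrary assumptions.

```python
import random


def synthesize(excitation, a):
    """All-pole filter 1/A(z), zero initial state:
    y[n] = e[n] + a[0]*y[n-1] + a[1]*y[n-2] + ..."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def search_codebook(target, codebook, a):
    """Exhaustive search: return (index, gain, squared error) of the best
    code vector, using the optimal gain for each candidate."""
    best = (-1, 0.0, float("inf"))
    target_energy = dot(target, target)
    for index, code in enumerate(codebook):
        y = synthesize(code, a)          # one filtering pass per vector
        energy = dot(y, y)
        if energy == 0.0:
            continue
        corr = dot(target, y)
        gain = corr / energy             # least-squares optimal gain
        error = target_energy - corr * corr / energy
        if error < best[2]:
            best = (index, gain, error)
    return best


random.seed(0)
lpc = [0.9, -0.2]                        # stable toy filter (poles 0.5, 0.4)
codebook = [[random.gauss(0.0, 1.0) for _ in range(20)] for _ in range(64)]

# Build a target that is exactly entry 17 scaled by 2.5, then find it.
target = [2.5 * v for v in synthesize(codebook[17], lpc)]
index, gain, error = search_codebook(target, codebook, lpc)
print(index, gain, error)
```

The work grows as (codebook size) x (vector length) x (filter order) per
subframe, which is why imposing structure on the codebook pays off.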

> On the subject, however: I'm always on the look-out for NON-real-time
> compression algorithms (similar sampling rates, accuracy and compression
> ratio to the above problem).  I know about Moser, etc.  I'm most interested
> in techniques that exploit knowledge of perceptual limitations in hearing
> and production limitations in speech to figure out what parts of the
> raw signal can be thrown out.  Assume that the speech has already been
> "recognized" down to something like the phoneme level, and that this
> information can be used in the compression algorithm.  Assume also a
> single non-singing speaker with little background noise.  I'm interested
> in good extraction and reproduction of nasal antiresonances, subglottal
> coupling, pitch-pulse shape, etc.

Unfortunately, compression techniques that match up that specifically to
assumptions about the human speech production system tend to make large,
bad-sounding errors when things don't match the model.  That's why there's
a general movement away from, for instance, models that depend strongly
on a voicing decision (like LPC).  You get big croaks when it screws up.
Still, for speech recognition purposes, LPC parameters contain valuable
information.
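
As an illustration of the kind of hard voicing decision at issue, here is
a sketch of a simple voiced/unvoiced classifier based on the normalized
autocorrelation peak.  It is not the detector used in LPC-10; the
threshold, lag range, frame length, and test frames are all assumptions.
The point is that the output is a single yes/no per frame, so a frame
that lands on the wrong side of the threshold gets entirely the wrong
excitation.

```python
import random


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Return (peak, lag): the largest r(lag)/r(0) over candidate pitch lags."""
    r0 = sum(s * s for s in frame)
    if r0 == 0.0:
        return 0.0, 0
    best_value, best_lag = 0.0, 0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r / r0 > best_value:
            best_value, best_lag = r / r0, lag
    return best_value, best_lag


def is_voiced(frame, min_lag=20, max_lag=160, threshold=0.5):
    """Hard binary decision: periodic enough to be called voiced?"""
    peak, _ = normalized_autocorr_peak(frame, min_lag, max_lag)
    return peak >= threshold


# 30 ms frames at 8 kHz (240 samples).
# Voiced-like: pulse train with an 80-sample period (100 Hz pitch).
voiced_frame = [1.0 if n % 80 == 0 else 0.0 for n in range(240)]
# Unvoiced-like: white noise.
random.seed(1)
unvoiced_frame = [random.uniform(-1.0, 1.0) for _ in range(240)]

print(is_voiced(voiced_frame), is_voiced(unvoiced_frame))
```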

>  For higher (16KHz) rates, getting believable sibilance is high on my list
> as well.*  

George Kang of Naval Research Laboratory, who had a lot to do with the
design of the government LPC-10 algorithm, did a good deal of work on
this.  He argues that the old-fashioned carbon microphone's nonlinearities
are actually beneficial in the telephone system, because they map the
high frequencies of sibilants (especially for female speakers) down into
the passband of the phone system where they can be heard.  He did some
research on various types of nonlinear distortions to apply to speech
sampled at 16 KHz before downsampling to 8KHz, so that female speakers
would sound better when processed by LPC (female speakers generally
sound a good deal worse in LPC because of their smaller pitch periods
and higher-frequency sibilants).  He gave a very amusing talk at
an ICASSP about six years ago, using the test sentence

"Her purse was full of useless trash"

as a source of sibilants. :-)

It appears that the main problem with sibilants is the anti-aliasing
filter; there just isn't much energy in sibilants below 3.2 KHz.
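
A sketch of how a nonlinearity can move high-frequency energy down into
the passband.  This is only an illustration of the general principle, not
Kang's method and not a model of a carbon microphone: the square-law
nonlinearity, the two test tones, and all names are assumptions.  Two
tones at 5.0 and 5.6 kHz (both above a 3.2 kHz passband) have no energy
at 600 Hz, but squaring the signal creates a difference tone there.

```python
import math

RATE_HZ = 16000
N = 1600          # 0.1 s, so every tone below fits a whole number of cycles


def amplitude_at(x, freq_hz, rate_hz):
    """Amplitude of the freq_hz component of x (single-bin DFT)."""
    re = sum(v * math.cos(2.0 * math.pi * freq_hz * i / rate_hz)
             for i, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * freq_hz * i / rate_hz)
             for i, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


# Stand-in for sibilant energy: two tones above the telephone passband.
high = [math.cos(2.0 * math.pi * 5000 * n / RATE_HZ) +
        math.cos(2.0 * math.pi * 5600 * n / RATE_HZ) for n in range(N)]

# Assumed square-law nonlinearity applied before any downsampling.
distorted = [v * v for v in high]

before = amplitude_at(high, 600, RATE_HZ)
after = amplitude_at(distorted, 600, RATE_HZ)
print("600 Hz amplitude before:", before, "after:", after)
```

The difference tone at 5600 - 5000 = 600 Hz survives a 3.2 kHz low-pass
filter, whereas the original tones would be removed entirely.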

--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck