dcollins@sug.org (Daniel Collins) (12/13/90)
I am working on a project where I have to compress a speech signal
by a 20-to-1 ratio in real time. If 20-to-1 is not practical, what are
the limiting factors, and what is achievable? I will be using an 8-bit ADC
with an 8 kHz conversion rate. Can anyone offer any pointers to DSP chips,
other special-purpose chips, literature, code fragments, etc.?

Dan Collins
dcollins@pittslug.sug.org
drake@drake.almaden.ibm.com (12/13/90)
In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
>I am working on a project where I have to compress a speech signal
>by 20-to-1 ratio in real-time. If 20-to-1 not practical, what are
>limiting factors, and what is achievable I will be using an 8 bit ADC
>with a 8KHz conversion rate.

So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
per second? Pretty aggressive. There's a company with a product that
hooks to a standard serial port that claims to be able to do speech at
1100 bits per second; the product is from Digispeech, called the DS201.
The compression algorithm seems to be proprietary.

Sam Drake / IBM Almaden Research Center
Internet: drake@ibm.com BITNET: DRAKE at ALMADEN
Usenet: ...!uunet!ibmarc!drake Phone: (408) 927-1861
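The rate arithmetic in this exchange can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 20 ms frame length is an assumption, not something either poster specified.

```python
def pcm_bit_rate(bits_per_sample, sample_rate_hz):
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return bits_per_sample * sample_rate_hz


def target_bit_rate(raw_bps, ratio):
    """Bit rate left after compressing raw_bps by ratio-to-1."""
    return raw_bps / ratio


def bits_per_frame(bps, frame_ms):
    """Bit budget available to encode one frame of frame_ms milliseconds."""
    return bps * frame_ms / 1000.0


raw = pcm_bit_rate(8, 8000)          # 8-bit ADC at 8 kHz, as in the question
target = target_bit_rate(raw, 20)    # the requested 20-to-1 ratio
print("raw:", raw, "bps; 20:1 target:", target, "bps")

# Compression ratios implied by the two rates mentioned above.
for rate in (3200, 1100):
    print(rate, "bps is about", round(raw / rate, 1), "to 1")

# ASSUMPTION: a 20 ms analysis frame, chosen only for illustration.
print("bits per 20 ms frame at the target rate:", bits_per_frame(target, 20))
```

Seen this way, the question is less about the ratio than about how few bits per frame a coder can live with.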
jbuck@galileo.berkeley.edu (Joe Buck) (12/14/90)
In article <367@rufus.UUCP>, drake@drake.almaden.ibm.com writes:
> In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
> >I am working on a project where I have to compress a speech signal
> >by 20-to-1 ratio in real-time. If 20-to-1 not practical, what are
> >limiting factors, and what is achievable I will be using an 8 bit ADC
> >with a 8KHz conversion rate.
> So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
> per second? Pretty aggressive.

Pretty standard, actually, if you go a bit further to 2400 bps. You can
get a 2400/4800 bps speech vocoder from a number of places (my former
company, Entropic Speech, Inc, sells one); if you're interested in
algorithms, you can get the government standard LPC-10e algorithm (sorry,
I don't know contacts for this; check out back issues of Speech Technology
magazine). Speech quality is artificial but usable in many applications.

There's been great progress recently in "medium-range" (4800 to 12K bps)
speech compression; algorithms in this range, while computationally
expensive, now have very low distortion. The upper end of this range may
soon be used in digital cellular phones.

It's possible to compress speech to much lower rates, but I haven't heard
anything below 2400 bps that I consider generally usable (though you can
make nice sounding tapes in a lab of 1200 bps systems, they tend to be
very speaker-dependent and lack robustness).

--
Joe Buck
jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
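For readers who want to see what the analysis half of an LPC vocoder does, here is a minimal sketch of the textbook autocorrelation method with the Levinson-Durbin recursion. It is not the LPC-10e reference algorithm, and it leaves out pitch and voicing estimation and all quantization; the function names, the predictor order of 10 and the synthetic test frame are assumptions made for illustration.

```python
import math


def hamming(frame):
    """Apply a Hamming window to reduce edge effects in the frame."""
    n = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]


def autocorrelation(x, max_lag):
    """Return r[0..max_lag], where r[lag] = sum of x[i] * x[i + lag]."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for a[1..order] in x[n] ~ sum_k a[k] * x[n - k].

    Returns (coefficients, residual_energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k           # energy left after order-i prediction
    return a[1:], err


# ASSUMPTION: a synthetic "voiced" frame of 180 samples (22.5 ms at 8 kHz),
# made by driving a two-pole resonator with a pulse every 60 samples.
frame = []
for n in range(180):
    pulse = 1.0 if n % 60 == 0 else 0.0
    prev1 = frame[n - 1] if n >= 1 else 0.0
    prev2 = frame[n - 2] if n >= 2 else 0.0
    frame.append(1.3 * prev1 - 0.8 * prev2 + pulse)

r = autocorrelation(hamming(frame), 10)
coeffs, residual_energy = levinson_durbin(r, 10)
print("predictor coefficients:", [round(c, 3) for c in coeffs])
if residual_energy > 0.0:
    print("prediction gain:", r[0] / residual_energy)
```

A vocoder of this family sends a quantized form of these coefficients plus pitch, voicing and gain for each frame instead of the waveform itself, which is where the large compression ratio comes from.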
turner@lance.tis.llnl.gov (Michael Turner) (12/14/90)
In article <367@rufus.UUCP> drake@drake.almaden.ibm.com writes:
>In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
>>I am working on a project where I have to compress a speech signal
>>by 20-to-1 ratio in real-time. If 20-to-1 not practical, what are
>>limiting factors, and what is achievable I will be using an 8 bit ADC
>>with a 8KHz conversion rate.
>
>So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
>per second? Pretty aggressive. There's a company with a product that
>hooks to a standard serial port that claims to be able to do speech at
>1100 bits per second; the product is from Digispeech, called the DS201.
>The compression algorithm seems to be proprietary.
>
>Sam Drake / IBM Almaden Research Center
>Internet: drake@ibm.com BITNET: DRAKE at ALMADEN
>Usenet: ...!uunet!ibmarc!drake Phone: (408) 927-1861

And that's with all the time in the world to compress. The best REAL-TIME
compression I've heard about that preserves the signal significantly is some
CELP (code-excited linear prediction, large codebook) technique (see recent
IEEE AS&SP issues) that gets you down to 9600 baud. However, you need a
significant fraction of a Cray to run at that rate, according to the author.

Of course, there's the wonderful Apple Macintosh compression technique that
runs in sublinear time: just throw out samples. But I assume you want to
be able to understand the speech when it's played back.

I suggest you revise your constraints, either the compression ratio or the
real-time response, or both. I'm no expert (yet) at speech compression,
but I think you're out of luck.

On the subject, however: I'm always on the look-out for NON-real-time
compression algorithms (similar sampling rates, accuracy and compression
ratio to the above problem). I know about Moser, etc. I'm most interested
in techniques that exploit knowledge of perceptual limitations in hearing
and production limitations in speech to figure out what parts of the
raw signal can be thrown out. Assume that the speech has already been
"recognized" down to something like the phoneme level, and that this
information can be used in the compression algorithm. Assume also a
single non-singing speaker with little background noise. I'm interested
in good extraction and reproduction of nasal antiresonances, subglottal
coupling, pitch-pulse shape, etc. For higher (16KHz) rates, getting
believable sibilance is high on my list as well.* A parametric
representation that allows control of variation of stress factors
(duration, pitch, amplitude) is important.

I see that the relevant techniques are out there, but I'm having trouble
finding them all and putting them all together. It doesn't help that I'm
a nearly-total neophyte with DSP, which seems to be the dialect of Greek
that most of the relevant literature is written in. On the other hand,
I do know something about phonetics and phonology, which is Swahili to a
lot of DSP folk.

---
Michael Turner
turner@tis.llnl.gov

* Just today I was talking on some bandwidth-limited line to a friend who
couldn't understand that I was talking about our mutual friend BRUCE, not
somebody I'd never met named RUTH.
jbuck@galileo.berkeley.edu (Joe Buck) (12/14/90)
In article <1192@ncis.tis.llnl.gov>, turner@lance.tis.llnl.gov (Michael Turner) writes:
> And that's with all the time in the world to compress. The best REAL-TIME
> compression I've heard about that preserves the signal significantly is some
> CELP (code-excited linear prediction, large codebook) technique (see recent
> IEEE AS&SP issues) that gets you down to 9600 baud. However, you need a
> significant fraction of a Cray to run at that rate, according to the author.

There are several tricks to make CELP substantially faster without much
loss in quality. Most involve imposing some structure on the codebook of
vectors so you can find the best match without computing distortions for
every vector (there are some connections with the theory of
error-correcting codes here; I don't understand them all). Look in the
proceedings of recent ICASSP conferences for details.

> On the subject, however: I'm always on the look-out for NON-real-time
> compression algorithms (similar sampling rates, accuracy and compression
> ratio to the above problem). I know about Moser, etc. I'm most interested
> in techniques that exploit knowledge of perceptual limitations in hearing
> and production limitations in speech to figure out what parts of the
> raw signal can be thrown out. Assume that the speech has already been
> "recognized" down to something like the phoneme level, and that this
> information can be used in the compression algorithm. Assume also a
> single non-singing speaker with little background noise. I'm interested
> in good extraction and reproduction of nasal antiresonances, subglottal
> coupling, pitch-pulse shape, etc.

Unfortunately, compression techniques that match up that specifically to
assumptions about the human speech reproduction system tend to make large,
bad-sounding errors when things don't match the model. That's why there's
a general movement away from, for instance, models that depend strongly on
a voicing decision (like LPC). You get big croaks when it screws up.
Still, for speech recognition purposes, LPC parameters contain valuable
information.

> For higher (16KHz) rates, getting believable sibilance is high on my list
> as well.*

George Kang of Naval Research Laboratory, who had a lot to do with the
design of the government LPC-10 algorithm, did a good deal of work on this.
He argues that the old-fashioned carbon microphone's nonlinearities are
actually beneficial in the telephone system, because they map the high
frequencies of sibilants (especially for female speakers) down into the
passband of the phone system where they can be heard. He did some research
on various types of nonlinear distortions to apply to speech sampled at
16 KHz before downsampling to 8KHz, so that female speakers would sound
better when processed by LPC (female speakers generally sound a good deal
worse in LPC because of their smaller pitch periods and higher-frequency
sibilants). He gave a very amusing talk at an ICASSP about six years ago,
using the test sentence "Her purse was full of useless trash" as a source
of sibilants. :-)

It appears that the main problem with sibilants is the anti-aliasing
filter; there just isn't much energy in sibilants below 3.2 KHz.

--
Joe Buck
jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
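To make the CELP cost argument concrete, the sketch below does the brute-force analysis-by-synthesis search that the structured-codebook tricks are meant to avoid: every codebook vector is run through the synthesis filter and scored against the target. It is a simplified illustration, not any published coder. There is no perceptual weighting or pitch predictor, and the codebook size, subframe length, filter coefficients and function names are all assumptions.

```python
import random


def synthesize(excitation, a):
    """All-pole synthesis, zero initial state:
    y[n] = e[n] + sum_k a[k-1] * y[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        y = e + sum(a[k] * out[n - 1 - k]
                    for k in range(len(a)) if n - 1 - k >= 0)
        out.append(y)
    return out


def search_codebook(target, codebook, a):
    """Exhaustive search: return (index, gain) of the code vector whose
    filtered, optimally scaled version is closest to target."""
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        score = corr * corr / energy     # error reduction at the best gain
        if score > best_score:
            best_index, best_gain, best_score = index, corr / energy, score
    return best_index, best_gain


def brute_force_macs_per_second(codebook_size, order, sample_rate_hz):
    """Rough multiply-add count: per sample and per code vector, `order`
    for the filter plus 2 for the energy and correlation sums."""
    return codebook_size * sample_rate_hz * (order + 2)


# ASSUMPTIONS: 128 random code vectors of 40 samples, and a stable
# two-tap predictor, all picked only for the demonstration.
random.seed(0)
codebook = [[random.gauss(0.0, 1.0) for _ in range(40)] for _ in range(128)]
a = [0.9, -0.4]

# Build a target that is exactly entry 37 scaled by 2.5, then recover it.
target = [2.5 * v for v in synthesize(codebook[37], a)]
index, gain = search_codebook(target, codebook, a)
print("chosen index:", index, "gain:", round(gain, 3))

# Hypothetical full-size coder: 1024 vectors, 10th-order filter, 8 kHz.
print("multiply-adds per second:", brute_force_macs_per_second(1024, 10, 8000))
```

Under these assumptions the exhaustive search alone needs on the order of 10^8 multiply-adds per second, which is the kind of load the remark about needing a fraction of a Cray refers to, and why a codebook with exploitable structure matters.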