[comp.dsp] Speech recognition: state-of-the-art ?

csirpd@quagga.ru.ac.za (Paul Ducklin) (12/18/90)

I'm trying to get a handle on the state-of-the-art in speech
recognition systems. Could anyone in netland let me have some
idea of (a) where we are now and (b) where we'll be in 2-5
years' time re...

  * for a specific voice, and a non-mega$ desktop machine, what's
    a good recognition vocabulary? 5000 words? 10000 words? I
    hear that there's a product out called "Dragon Dictate" for
    which a 30000 word vocabulary has been mentioned (h/w reqd.
    = 386-type). Is this genuwyne? What happens if you get a cold?
    Will you need to train all 30000 words again? What sort of
    recognition speed is likely on a non-Cray?

  * for "generic voice" (eg: all American-speaking females), what is
    a good vocabulary? What sort of reliability is attainable?

  * what's good vis-a-vis "natural" or continuous speech? How
    capable are recognition systems at handling speech without
    staccato-type interword pauses?

Paul Ducklin
------------
CSIR, Pretoria, RSA
------------

schultz@halley.tmc.edu (John C. Schultz) (12/20/90)

In article <1990Dec17.202616.3021@quagga.ru.ac.za> csirpd@quagga.ru.ac.za (Paul Ducklin) writes:
>
>  * for a specific voice, and a non-mega$ desktop machine, what's
>    a good recognition vocabulary? 5000 words? 10000 words? I
>
>  * for "generic voice" (eg: all American-speaking females), what is
>    a good vocabulary? What sort of reliability is attainable?
>
>  * what's good vis-a-vis "natural" or continuous speech? How
>    capable are recognition systems at handling speech without
>    staccato-type interword pauses?


I would like to add to this list of questions:

   * How reliable is voice recognition in noisy environments with
     respect to vocabulary size?  For example if the system only
     needs to recognize 50 or so words, is that more robust than 
     a 5000 word system? How much more reliable? "Don't know" answers
     would be preferable to wrong answers.

--
John C. Schultz                    EMAIL: schultz@halley.est.3m.com
3M Company,  Building 518-01-1     WRK: +1 (612) 733-4047
1865 Woodlane Drive, Dock 4,       Woodbury, MN  55125

ray@ariel.ucs.unimelb.edu.au (Douglas Ray) (12/24/90)

From article <1990Dec19.215611.10659@mmm.serc.3m.com>, by schultz@halley.tmc.edu (John C. Schultz):
> In article <1990Dec17.202616.3021@quagga.ru.ac.za> csirpd@quagga.ru.ac.za (Paul Ducklin) writes:
>>
>>  * for a specific voice, and a non-mega$ desktop machine, what's
>>    a good recognition vocabulary? 5000 words? 10000 words? I
>>
>>  * for "generic voice" (eg: all American-speaking females), what is
>>    a good vocabulary? What sort of reliability is attainable?
>>
>>  * what's good vis-a-vis "natural" or continuous speech? How
>>    capable are recognition systems at handling speech without
>>    staccato-type interword pauses?
> 
>    * How reliable is voice recognition in noisy environments with
>      respect to vocabulary size?  For example if the system only
>      needs to recognize 50 or so words, is that more robust than 
>      a 5000 word system? How much more reliable? "Don't know" answers
>      would be preferable to wrong answers.

initial response:

I'm not qualified in this field, but if I haven't misinterpreted the
figures, here are summaries of papers presented at the 3rd International
Conference on Speech Science and Technology, Melbourne, Australia,
November 1990.

The general attitude at the conference was to call vocabs of 20 - 200
words "small" and vocabs of 500 - 1000 words "large".

[only first authors quoted]

  C. Rowles (Telecom Australia)
    state of art for speaker independent, continuous speech, modest vocab.
    (200-500w ?): 95% word recognition.

This 95% figure comes up a lot:

  W.A. Smith (Waikato, N.Z.)
    presented a feature selection algorithm for speaker independent, isolated
    word recognition, vocab. 20w: 95% word recognition

  Tracy Clark (Canterbury, N.Z.)
    compares various methods in isolation, comments on accent dependence;
    speaker dependent, isolated word, 10w vocab.: best up to 96% word
    recognition

but for larger vocabs you can't expect this:

  Tony Robinson (Cambridge, U.K.)
    Preliminary work on word recognition without grammatical constraints:
    speaker independent, continuous speech, using the DARPA 1000 word
    Resource Management Task: 52.1% word recognition rate (43.3% accuracy),
    but quotes the Sphinx system at 81.9%.

There was also some work on language recognition, eg:

  Walter Weigel (Munich, Germany)
    speaker independent, continuous speech, 132w vocab., 40 rule context-free
    grammar subset of German: 74% sentence recognition
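
Sentence recognition rates are not directly comparable with the word
recognition rates above, since a sentence is typically counted as correct
only if every word in it is correct. As a rough illustration only, assuming
word errors are independent and all sentences have the same length (neither
is stated in the paper summary, and the function name is hypothetical), the
relation can be sketched in Python:

```python
def sentence_rate(word_rate, words_per_sentence):
    """Fraction of sentences recognized with every word correct.

    Assumes each word is recognized independently with probability
    word_rate; real recognizers violate this, so treat it as a rough guide.
    """
    return word_rate ** words_per_sentence


if __name__ == "__main__":
    # Illustrative inputs only, not figures from any of the papers.
    for length in (3, 6, 10):
        print(length, round(100 * sentence_rate(0.95, length), 1))
```

Under these assumptions a given word rate translates into a noticeably
lower sentence rate as sentences get longer.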

[The conference proceedings contain around 80 papers in over 500 pp.;
inquiries to the Secretary, Australian Speech Science and Technology
Association, GPO Box 143, Canberra ACT 2601, Australia]

Eric.Thayer@cs.cmu.edu (Eric H. Thayer) (01/03/91)

In article <377@ariel.ucs.unimelb.edu.au> ray@ariel.ucs.unimelb.edu.au (Douglas Ray) writes:
> but for larger vocabs you can't expect this:
> 
>   Tony Robinson (Cambridge, U.K.)
>     Preliminary work on word recognition without grammatical constraints:
>     speaker independent, continuous speech, using the DARPA 1000 word
>     Resource Management Task: 52.1% word recognition rate (43.3% accuracy),
>     but quotes the Sphinx system at 81.9%.

I am not sure what conditions he is quoting, but the 'best' results are
above 95% correct for the SPHINX system on the resource management task
(speaker independent, continuous). It is generally not very useful to quote
raw performance numbers because there are many factors which improve or
degrade recognition accuracy.
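
One such factor is the scoring rule itself. The pair of figures quoted
above (52.1% recognition rate, 43.3% accuracy) is consistent with a common
convention in which "percent correct" ignores insertion errors while
"accuracy" subtracts them; whether the paper used exactly this convention
is an assumption here. A minimal Python sketch of the two scores, with
hypothetical function names and a made-up example sentence:

```python
def align_counts(ref, hyp):
    """Count hits, substitutions, deletions and insertions in a
    minimum-edit-distance alignment of hyp against ref (lists of words)."""
    # Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, 0, i, 0)      # everything deleted
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, 0, j)      # everything inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)      # hit
            else:
                diag = (c + 1, h, s + 1, d, n)  # substitution
            c, h, s, d, n = table[i - 1][j]
            dele = (c + 1, h, s, d + 1, n)      # deletion
            c, h, s, d, n = table[i][j - 1]
            ins = (c + 1, h, s, d, n + 1)       # insertion
            table[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    return table[-1][-1][1:]


def percent_correct(ref, hyp):
    """100 * hits / reference words; insertions are not penalized."""
    hits, _, _, _ = align_counts(ref, hyp)
    return 100.0 * hits / len(ref)  # assumes a non-empty reference


def word_accuracy(ref, hyp):
    """100 * (hits - insertions) / reference words."""
    hits, _, _, ins = align_counts(ref, hyp)
    return 100.0 * (hits - ins) / len(ref)  # assumes a non-empty reference


if __name__ == "__main__":
    ref = "show all ships in port".split()       # made-up reference
    hyp = "show the all ships in port".split()   # one inserted word
    print(percent_correct(ref, hyp), word_accuracy(ref, hyp))
```

With a recognizer that inserts many spurious words, the two scores can
differ by several points on the same output, so a bare percentage means
little without the scoring rule and test conditions.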

----------------------------------
Replies can have NeXT attachments in them
Phone: (412)268-7679