[comp.sys.att] UNIX PC Voice Power: unlocking the untapped capabilities?

lenny@icus.islp.ny.us (Lenny Tropiano) (11/08/88)

I've posed this before, but now I have proof that it's possible.  I've
spoken with various people (some who were on the original Voice Power
development team) who couldn't give me "specifics" but said it was
possible.   Voice Recognition, how?  That's the question... Since my
involvement with the Voice Power product on the UNIX pc, I've learned
a lot.  Learning bits and pieces about CODEC's, PCM (pulse code modulation),
DSP's (digital signal processors), sub bands, mu-law, a-law, etc...  It's
still very technical, and way over my head, but I'm learning...  [side note:
if there is anyone out there who can give me help in the above topics
please feel free to contact me].

I was fortunate to get a copy of the _AT&T Technical Journal_, Sept/Oct 1986,
volume 65, issue 5, entitled "Speech Processing Theory", from someone on
the Voice Power team.  This issue of the magazine was dedicated to the
technical aspects of speech processing and shed some light on certain
topics.  [In most places it's very technical.]

In the article, "Speech Processing for AT&T Workstations", by 
John G. Ackenhusen, Syed S. Ali, James G. Josenhans, John W. Moffett, 
Reuel R. Robertson and Jaime R. Tormos, they discussed the Voice Power
product for the UNIX PC.  

-- the following paraphrases passages from the 8-page article --

..."    This paper describes the Voice Power speech processor, a speech
processing option for the AT&T UNIX PC.  It is a peripheral card (with
software) that slides into an expansion slot on the workstation and adds 
the capability for speech store and playback, speech recognition, and
text-to-speech synthesis to the UNIX PC.

	The initial offering of the application software is also described.
This software uses a subset of the hardware's capabilities:  speech storage
and playback, and text-to-speech synthesis.
...
	Future Speech Processing Capabilities

	The Voice Power speech processor can support speech recognition
and text-to-speech synthesis.  These capabilities are under development.

	Speech Recognition.  The Voice Power speech processor's speech
recognition capability permits the automatic identification of an
unknown word or phrase from a vocabulary of 50 words or phrases.  First
the user or an automatic application compiler must select the words; then, 
the user trains the recognizer... "


Text-to-speech works, and is implemented with the "vtts(1) command
plus various library calls".  Throughout the manuals they mention:
...
     CAVEATS
	  Text-to-speech components of Voice Power are provided	for
	  evaluation only and are not supported.
...

It does have some problems, but it does work nevertheless.

Speech recognition is mentioned in various header files and commands, but
there is *no* software that utilizes it and no documentation on
how to interface to it.  How do you do it?!   I spoke with a person
whose name is plastered all over everything having to do with Voice Power
at AT&T, and he said that the recognition parameters (and format) are
proprietary and he couldn't say any more ... He did say that the
recognition parameters are a representation of filters of the vocal tract.
Gee, that helps me a lot! :-}

Anyone able to help?   

The manual page for vrecord(1) also whets your appetite, but that's about
it.

     vrecord(1)			VOICE POWER		    vrecord(1)


     NAME
	  vrecord - record voice data file

     SYNOPSIS
	  vrecord [-c card] [-v] [-x] [-q] [-o]	[-l n] [-f n] [-16]
	  [-24]	[-64] [-s]
		      [-t n] [-e n] [-g] [-i] [-D] [file]

     ...
	  -f n Set format control word (v2_ctrl.format)	to n.

	       0    16k	sub-band with silence compression

	       4    16k	sub-band, no silence compression, default.

	       2    24k	sub-band, with silence compression.

	       6    24k	sub-band, no silence compression.

	       36   Recognition
		    ^^^^^^^^^^^
	       8    64k	mu Law
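
[A small sketch of the table above, for anyone scripting around vrecord.
The format values are copied from the man page excerpt; everything else is
an assumption.  In particular, "16k", "24k" and "64k" are read here as bits
per second, which the excerpt does not state, and the names FORMATS,
describe and max_bytes are made up for illustration.]

```python
# Format control word values (v2_ctrl.format) from the vrecord(1) excerpt.
# ASSUMPTION: the "16k"/"24k"/"64k" labels are bit rates in bits per second.
FORMATS = {
    0: ("16k sub-band, silence compression", 16_000),
    4: ("16k sub-band, no silence compression (default)", 16_000),
    2: ("24k sub-band, silence compression", 24_000),
    6: ("24k sub-band, no silence compression", 24_000),
    36: ("recognition", None),  # data format is undocumented
    8: ("64k mu-law", 64_000),
}


def describe(fmt):
    """Return the man page description for a format value."""
    return FORMATS[fmt][0]


def max_bytes(fmt, seconds):
    """Rough upper bound on recorded data size, ignoring any file header.

    Silence compression can only make a recording smaller, so this is a
    ceiling rather than an exact figure.
    """
    name, bits_per_second = FORMATS[fmt]
    if bits_per_second is None:
        raise ValueError("no known bit rate for format: " + name)
    return bits_per_second * seconds // 8


if __name__ == "__main__":
    for fmt in (0, 4, 2, 6, 8):
        print(fmt, describe(fmt), max_bytes(fmt, 60), "bytes per minute")
```

Under that assumption the 64k mu-law figure matches ordinary telephone PCM:
8000 samples per second at 8 bits each is 64000 bits, or 8000 bytes, per
second.  Format 36 is left without a rate because nothing in the excerpt
says what the recognition data looks like.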


-Lenny
-- 
Lenny Tropiano             ICUS Software Systems         [w] +1 (516) 582-5525
lenny@icus.islp.ny.us      Telex; 154232428 ICUS         [h] +1 (516) 968-8576
{talcott,decuac,boulder,hombre,pacbell,sbcs}!icus!lenny  attmail!icus!lenny
        ICUS Software Systems -- PO Box 1; Islip Terrace, NY  11752

clb@loci.UUCP (Charles Brunow) (11/09/88)

In article <540@icus.islp.ny.us>, lenny@icus.islp.ny.us (Lenny Tropiano) writes:
> I've posed this before, but now I have proof that it's possible.  I've
> spoken with various people (some who were on the original Voice Power
> development team) who couldn't give me "specifics" but said it was
> possible.   Voice Recognition, how?  That's the question... Since my
> involvement with the Voice Power product on the UNIX pc, I've learned
> a lot.  Learning bits and pieces about CODEC's, PCM (pulse code modulation),
> DSP's (digital signal processors), sub bands, mu-law, a-law, etc...  It's
> still very technical, and way over my head, but I'm learning...  [side note:
> if there is anyone out there who can give me help in the above topics
> please feel free to contact me].
> 
	If you don't already know this stuff pat then you're years away
	from speech recognition (SR).  The coding method and companding
	are basic stuff which you can find in telco references.  There's
	a bit in "Transmission Systems for Communications", by "Members
	of the Technical Staff - Bell Telephone Laboratories",  and you
	could profit from "Digital Signal Processing" by Alan V. Oppenheim
	and Ronald W. Schafer (Prentice-Hall, 1975).  There are bound
	to be other references which are basically equivalent.
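
[For readers who want something concrete on companding, here is a minimal
sketch of the continuous mu-law curve with mu = 255.  It is the textbook
formula, not the segmented bit-exact encoding used by telephone CODECs, and
it is not taken from Voice Power.  The function names and the 8-bit
quantizer are assumptions made for illustration.]

```python
import math

MU = 255.0  # value used for North American mu-law telephony


def compress(x):
    """Map a linear sample in [-1, 1] onto the mu-law curve, also in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress(): recover the linear sample from a companded one."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(y, bits=8):
    """Round a value in [-1, 1] to a signed integer code (hypothetical uniform quantizer)."""
    return round(y * (2 ** (bits - 1) - 1))


def dequantize(code, bits=8):
    """Map an integer code back into [-1, 1]."""
    return code / (2 ** (bits - 1) - 1)


if __name__ == "__main__":
    quiet = 0.01
    # Quantizing the quiet sample directly leaves very few usable levels.
    print("linear code:   ", quantize(quiet))
    # Companding first spreads quiet samples over many more levels.
    code = quantize(compress(quiet))
    print("companded code:", code)
    print("reconstructed: ", expand(dequantize(code)))
```

The point of the curve is that quiet samples get far more of the 8-bit code
space than they would with linear quantization, at the cost of coarser steps
for loud samples.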

	Another source might be the app notes put out by TI a few years
	back when they were trying to convince the world that they had
	the best speech stuff.  Some of it is very specific, like how
	the vocal tract simulations work (schematics).  My archives are
	too confused to find copies so maybe someone else can lay their
	hands on a copy for you.

	Ultimately the process probably consists of determining the
	coefficients for the filter nodes and looking for the best
	match with the set of known words and updating the coefficients
	either completely or with a damping factor for learning.  The
	problem is that knowing that doesn't get you much closer to
	actually doing it.  There are loads of raw data (assume an 8 kHz
	sample rate) which have to be reduced to a form which can be
	efficiently processed while keeping enough data to distinguish
	similar words from different people.  Many people have spent
	lots of time on it without significant breakthroughs.
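
[The matching-and-updating step described above can be sketched in a few
lines.  This is only an illustration under heavy assumptions: each word is
reduced to one fixed-length list of numbers standing in for the filter
coefficients, the match is plain Euclidean distance, and the names
best_match and update are hypothetical.  It says nothing about how Voice
Power actually does it, and it skips the hard part, which is getting from
raw samples to coefficients.]

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(coeffs, templates):
    """Return the known word whose stored coefficients are closest to coeffs."""
    return min(templates, key=lambda word: distance(coeffs, templates[word]))


def update(templates, word, coeffs, damping=0.2):
    """Pull the stored template toward a new utterance of the same word.

    damping=1.0 replaces the template completely; smaller values move it
    only part of the way, so one odd utterance does not wipe out training.
    """
    old = templates[word]
    templates[word] = [(1 - damping) * o + damping * c
                       for o, c in zip(old, coeffs)]


if __name__ == "__main__":
    # Made-up two-coefficient templates for a two-word vocabulary.
    templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
    heard = [0.9, 0.2]
    word = best_match(heard, templates)
    update(templates, word, heard)
    print(word, templates[word])
```

A real recognizer would compare sequences of such vectors over time rather
than a single vector per word, which is a large part of why the problem is
harder than this sketch suggests.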

-- 
			CLBrunow - KA5SOF
	clb@loci.uucp, loci@csccat.uucp, loci@killer.dallas.tx.us
	  Loci Products, POB 833846-131, Richardson, Texas 75083