pfuetz@baloo.uucp (Matthias Pfuetzner) (02/14/91)
Hello all! I'd like to know whether anybody out there has the information I'm looking for.

Last week I read in a German magazine on hi-fi and stereophony (the magazine is called "Stereoplay") that Philips is doing a lot to get its DCC (Digital Compact Cassette) onto the market. They announced the first recorders for spring 1992. Because these are to use stationary rather than rotating heads (S-DAT), the amount of digital data to be recorded has to be reduced (from about 2.5 Mbit/s for CD down to 0.77 Mbit/s for DCC, including subcode and error information). The algorithm used is called PASC (Precision Adaptive Subband Coding). There are some other algorithms around (MUSICAM, MASCAM, ASPEC, OCF), and as far as I read, the ISO (International Organization for Standardization) will release a standard on this subject.

What I'm looking for now is information on the basic algorithms, the steps ISO has already taken, etc. The reason is that Sun SPARCstations have audio capabilities and I'd like to make some use of them. So any hint is highly appreciated.

Thanks in advance,
  Matthias
-----
        @work:              |  Matthias Pfuetzner  |        @home:
ZGDV, Wilhelminenstrasse 7  |                      |  Lichtenbergstrasse 73
6100 Darmstadt, FRG         |                      |
+49 6151 155-164 or -101    \     <- Tel.nr. ->    /  +49 6151 75717
pfuetzner@agd.fhg.de    pfuetzner@zgdvda.UUCP    XBR1YD3U@DDATHD21.BITNET
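To give a rough idea of what "subband coding" means here, the following is an illustrative sketch in Python/NumPy of the general principle behind PASC/MUSICAM-style coders. It is not the actual PASC algorithm; the block size, band count and bit budget below are arbitrary assumptions. Each block of audio is split into frequency bands, the quietest bands are dropped, and the rest are quantised coarsely, with more bits spent where the signal is strong.

import numpy as np

BLOCK = 512        # samples per block (arbitrary)
N_BANDS = 32       # PASC/MUSICAM-style coders work with 32 subbands
TOTAL_BITS = 1024  # crude bit budget per block (arbitrary)

def encode_block(block):
    # Window + FFT, then group the bins into N_BANDS subbands.
    spectrum = np.fft.rfft(block * np.hanning(len(block)))
    bands = np.array_split(spectrum, N_BANDS)

    # Signal level per band in dB.  A real coder compares this against a
    # psychoacoustic masking threshold; here the level alone decides how
    # audible each band is assumed to be.
    levels = np.array([10 * np.log10(np.mean(np.abs(b) ** 2) + 1e-12)
                       for b in bands])

    # Hand out bits in proportion to how far each band rises above the
    # quietest one; bands at the bottom get nothing and are dropped.
    rel = levels - levels.min()
    bits = np.floor(TOTAL_BITS * rel / (rel.sum() + 1e-12)).astype(int)

    coded = []
    for band, nbits in zip(bands, bits):
        if nbits == 0:
            coded.append(None)                   # treated as inaudible
            continue
        scale = np.max(np.abs(band)) + 1e-12     # scale factor (side info)
        steps = 2 ** max(nbits // len(band), 1)  # quantiser resolution
        coded.append((scale, steps, np.round(band / scale * steps)))
    return coded

For example, encode_block(np.random.randn(512)) returns one coded block; a decoder would rebuild each kept band as q / steps * scale. A real coder replaces the crude "loudest band wins" rule with a psychoacoustic model that estimates how much quantisation noise each band can hide.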
Alvin@cup.portal.com (Alvin Henry White) (02/17/91)
Thinking about projects. I have been turning these ideas over in my mind. It seems that all the technologies needed to accomplish this already exist. The problem seems to be that each of the technologies is born of a different specialty, and what is lacking is the technology for bringing them together.

What I would like to accomplish: I have many Bible audio tapes in many languages. I also have some classic stories in several languages, including the Chinese classics of Confucius and Lao Tze, the I-Jing or I Ching, and an old Indian song called the Bhagavad Gita (the latter from India, not American Indian). I also have the written text. I want to have the computer show the words like the old bouncing-ball sing-along, and at the same time put out the sound of the words. I want the text to be in what is approximately called "interlinear transliteration": word for word, with the translated clue word starting right under the word in the original tongue, language 1, or master language. For example:

    The tree  is green.        El arbol es verde.
    El  arbol    verde.        Th tree     green.

The computer-synthesized speech I want in stereo: one language in each ear, with the bouncing ball or a highlighted phoneme showing which phoneme is sounding. It seems the UNIX PC could do this if two Voice Power cards could be made to operate at the same time in sync. For the songs you would need one to four more Voice Power cards to play the music; thus for "Rock of Ages" you could have six Voice Power cards. To render the art form as four voices (Father, Mother, Son and Daughter) accompanied by an organ recital of four-note polyphony would take at least four cards for the English parts and four cards for the Chinese parts, and you might be able to get away with somehow dividing the four musical tones between both languages. In other words, a total of 12 Voice Power cards operating in sync.

Now if the graphics were treated equally in both directions, you would need to add two more lines to each line treated interlinearly: one line of Chinese Kanji indicating in Kanji the sound of the English line, and one line of Kanji showing the Chinese text. The English line accompanying this would be what is known as phonetic transliteration. In other words, an English singer singing that string of alpha characters would make sounds that a Chinese-speaking person listening on the other end of the telephone line could interpret in the Chinese language, sufficient to send an order of egg rolls and record your Visa card number. So here we need at least 12 AT&T Voice Power cards running in sync with the multilingual multimedia graphics on the AT&T UNIX PC.

There are programs called approximately "phase vocoders" that can change the tempo, time, or speed of an audio sample without changing the pitch, but it is said that these require a lot of processing power (a rough sketch of the idea follows below). The subject of phase vocoders gets considerable attention in a recent book by F. Richard Moore of the University of California at San Diego, "Elements of Computer Music". I get the idea that F. R. Moore is the author of a package occasionally referred to as "CMUSIC" in the computer music section of Usenet.

I have been reading my local papers (San Jose, California) for the last few days, and there were stories about 64-meg memory chips around 1994 and, soon now, 100-megahertz 486 chips. Also not far away, we are told, is an increase in parallel processing.
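To make the phase-vocoder idea concrete, here is a minimal sketch in Python/NumPy of time-stretching without pitch change. This is my own simplified illustration, not the algorithm from Moore's book or CMUSIC, and the FFT size and hop length are arbitrary assumptions: analysis frames are read back at a different rate than they were taken, and the per-bin phase is re-accumulated so the partials keep their original frequencies.

import numpy as np

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Return x played back 1/rate times as long, at the original pitch."""
    window = np.hanning(n_fft)
    # Analysis: short-time spectra taken every `hop` samples.
    frames = np.array([np.fft.rfft(window * x[i:i + n_fft])
                       for i in range(0, len(x) - n_fft, hop)])

    # Expected phase advance per hop for each frequency bin.
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft

    # Synthesis: step through the analysis frames at fractional positions
    # spaced by `rate`, keeping the phase consistent between output frames.
    positions = np.arange(0, len(frames) - 1, rate)
    phase = np.angle(frames[0])
    out = np.zeros(int(len(positions) * hop + n_fft))

    for k, pos in enumerate(positions):
        i = int(pos)
        f0, f1 = frames[i], frames[i + 1]
        # True phase increment between the two analysis frames, unwrapped.
        dphi = np.angle(f1) - np.angle(f0) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
        # Interpolate the magnitude, impose the accumulated phase, overlap-add.
        frac = pos - i
        mag = (1 - frac) * np.abs(f0) + frac * np.abs(f1)
        grain = np.fft.irfft(mag * np.exp(1j * phase))
        out[k * hop:k * hop + n_fft] += window * grain
    return out

With this, time_stretch(x, 0.5) plays x at half tempo and the same pitch, which is exactly the "vary the tempo but not the pitch" operation described above; and yes, doing it in real time on several voices at once is where the processing power goes.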
So I don't feel it unreasonable to think that what I have suggested could reach the home market, church market, and school market within four years. The synchronization problems will take that long to work out anyway. Thus I want the option, if one word takes a different length of time than its counterpart, to be able to speed up or slow down the tempo of the second language's clue word without changing its pitch.

I have bought much of the computerized text-to-speech hardware and software over the years. One of the problems is that they do not speak the English language as I want it spoken. Now there are speech recognition programs coming onto the market. One of the most notable is Dragon Systems' Dragon Dictate, which can reportedly recognize 30,000 English words. It is said to have an artificial-intelligence capability to learn the particular speech of an individual. It is not sold with a speech synthesizer. Here you see the specialization I mentioned above: they see themselves as being in the speech recognition business and having nothing to do with the speech synthesis business, and people in the speech synthesis business generally have nothing to do with speech recognition. Of course it is rare that one or the other knows their own discipline and also how to incorporate artificial-intelligence learning. I can understand that each of these disciplines takes a lifetime to learn, and even one accomplishment is a miracle, let alone multiple. But we the people need the combined abilities of all to help save our lives and the lives of our loved ones. So while we are asking a lot, it is for a good cause. To get the two groups to sing four-part harmony simultaneously, a considerable knowledge of music is necessary; here again is a lifetime's worth of study. But if we are going to hear God's word and the word of Confucius in a language that we can all understand, and provide justice for all, it needs to be done.

Some local people have a sound card that does sampling and allows visual editing by viewing a graphic depiction of the waveform on an IBM compatible. If I could take the readings of the classic stories from audio tape, read by what I consider to be good readers for my purpose, and clip each word into a sound-file dictionary database, then as the graphics terminal generates the sing-along it could call up the necessary sound files and either render, or edit and generate as necessary, the several sound tracks.

There is work going on in "continuous speech recognition". You can guess what that is: it is the leading edge of the art of speech recognition. I have a task that would be simpler for them. Where they have little clue what words are coming next, I have the complete text in machine-readable form. What I want done automatically is to determine where one word ends and the next begins, then to write each word's sampled sound file out to mass storage in a database, with each record cataloged by language, chapter, verse and word. I clipped part of a Usenet article about automatically segmenting speech signals and feature extraction using neural nets for phoneme recognition. A possible reference would be Professor T. Kohonen, who used a Fast Fourier Transform to preprocess voice signals in the Finnish language: Kohonen, T., "Self-Organization and Associative Memory".
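As a very rough sketch of the "determine where one word ends and the next begins" step, here is an energy-threshold toy in Python/NumPy. It is nothing like what the continuous-speech-recognition people actually do; the silence threshold, frame size, and the assumption of a 16-bit mono WAV file are all mine.

import wave
import numpy as np

def split_words(wav_path, silence_db=-40.0, min_gap=0.25, frame=0.02):
    """Return (start_sec, end_sec) spans of non-silent audio in a
    16-bit mono WAV file (an assumed format)."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    audio = audio.astype(np.float64) / 32768.0

    # Frame-by-frame level in dB relative to full scale.
    hop = int(rate * frame)
    levels = np.array([
        20 * np.log10(np.sqrt(np.mean(audio[i:i + hop] ** 2)) + 1e-12)
        for i in range(0, len(audio) - hop, hop)])
    voiced = levels > silence_db

    # Collect runs of voiced frames.
    runs, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        if not v and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(voiced)))

    # Merge runs whose gap is shorter than min_gap, so a brief pause
    # inside a word does not split it in two.
    merged = []
    for s, e in runs:
        if merged and (s - merged[-1][1]) * frame < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [(s * frame, e * frame) for s, e in merged]

Each clip found this way could then be written out under a path such as english/john/3/16/word_0007.wav (a hypothetical naming scheme), so the sing-along display can look words up by language, chapter, verse and word number, exactly the catalog described above.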
Another opportunity for artificial intelligence and computer speech analysis is to compare all the various ways a text word has been rendered into audio and define a "most usual" rendering, which seems to be what the Dragon is doing. A person could then have various options as to how they wanted to hear the text read back. I want to hear it produced close to Latin or Spanish, in that I want each alpha character to have one and only one audio representation. I also want to see if I can define what I think should be the standard speed for machine-read American English. I would like to be able to have text-to-speech generated such that "come" and "home" rhyme. Some people might like "cum hum", others could prefer "comb" "home", and I want to hear "co may" "ho may". Any of you UNIX PC'ers figured out how to make this thing speak Italian? At some point one would like to vary the tempo but not the pitch.

But, getting back to the standard speed and standard audio representation of American English: people who are trying to produce low-cost speech recognition equipment would have their task greatly simplified if the speaker could conform to this "standard American speech and beat". The beat could be prompted by the recognition machine; just add a drum machine. People hereabouts have considerable practice singing to the beat.

Another option I want is to have the machine generate speech as a teaching tool. It would give an example of how it thinks a word should be rendered, and then the student could try to imitate it. If the operator does not like the machine's version, the operator can give the machine an example and tell the machine to imitate the operator more closely. You could take the machine to church, have several people speak the word of God into it, and the machine would come back and tell you how each person scored. Anybody heard of "Pastor for a Day"? Rightly dividing the word of truth. Speakometer. Just think of the opportunity for a public speaking class: an automated machine to tell you who is most able to speak properly, while allowing the operator of the machine to tell it what is supposed to be right. The machine might be unpopular with the educational authorities (read: "financial administration") unless it could be properly adjusted such that the purchaser would always be shown to be the best speaker. Well, I'd better get off of this before we open up the subject of politics.

-alvin

Alvin H. White, Gen. Sect.
G.O.D.S.B.R.A.I.N.
Government Online Database Systems Bureau for Resource Allocations to Information Networks
alvin@cup.portal.com
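For what the "Speakometer" scoring might look like, here is a small sketch in Python/NumPy, purely an illustration of one possible approach rather than any existing product: compare the student's recording against a reference reading frame by frame, using dynamic time warping so that speaking a little faster or slower is not penalised, and report the residual spectral distance as the score (lower is better).

import numpy as np

def spectrogram(x, n_fft=512, hop=128):
    """Log-magnitude spectrogram frames of a mono signal."""
    win = np.hanning(n_fft)
    frames = [np.fft.rfft(win * x[i:i + n_fft])
              for i in range(0, len(x) - n_fft, hop)]
    return np.log(np.abs(np.array(frames)) + 1e-9)

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two frame sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Normalise by path length so longer words are not punished.
    return cost[n, m] / (n + m)

def score(reference, attempt):
    """Lower is better; 0 would mean a spectrally identical rendering."""
    return dtw_distance(spectrogram(reference), spectrogram(attempt))

Ranking each congregation member by score(reference, their_recording) is the whole contest; and, as noted above, whoever supplies the reference recording decides what "proper" means.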