fordjm@byuvax.UUCP (11/15/87)
Recently someone on the net commented on a program or method for rating the "Englishness" of words according to how frequently various letter sequences occur. I am currently involved in a project in which this approach might prove useful, but I have lost the original posting. Could the author please contact me with more information about his or her project?

Thanks in advance,
John M. Ford fordjm@byuvax.bitnet
131 Starcrest Drive
Orem, UT 84058
kfl@SPEECH2.CS.CMU.EDU (Kai-Fu Lee) (11/17/87)
In article <32fordjm@byuvax.bitnet>, fordjm@byuvax.bitnet writes:
>
> Recently someone on the net commented on a program or method of rating
> the "Englishness" of words according to the frequency of occurrence of
> various letters in sequence, etc.
>

I don't know anything about that post, but you might be interested in the following article:

Cave and Neuwirth, Hidden Markov Models for English, Proceedings of the Symposium on Application of Hidden Markov Models to Text and Speech, Princeton, NJ 1980.

Here's the editor's summary of the paper:

L.P. Neuwirth discusses the application of hidden Markov analysis to English newspaper text (26 letters plus word space, without punctuation). This work showed that the technique is capable of automatically discovering linguistically important categorizations (e.g., vowels and consonants). Moreover, a calculation of the entropy of these models shows that some of them are stronger than the ordinary digraphic model, yet employ only half as many parameters. But one of the most interesting points, from a philosophical point of view, is the completely automatic nature of the process of obtaining the model: only the size of the state space, and a long example of English text, are given. No a priori structure of the state transition matrix, or of the output probabilities, is assumed.

Since hidden Markov models can be used for both generation and recognition, it is possible to train a model on English text and then "score" any previously unseen word with the probability that it was generated by that model.

> Thanks in advance,
> John M. Ford fordjm@byuvax.bitnet
> 131 Starcrest Drive
> Orem, UT 84058
>

Kai-Fu Lee
Computer Science Department
Carnegie-Mellon University
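
To make the "score an unseen word" idea concrete, here is a minimal sketch. Note that it is not the hidden Markov model from the Cave and Neuwirth paper: it uses the simpler digraphic (letter-pair) model that the summary mentions as a baseline, with add-one smoothing. The function names, the smoothing choice, and the tiny training sample are all assumptions made for illustration; a real model would need a long stretch of English text.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus word space


def normalize(text):
    """Lowercase, turn anything outside ALPHABET into a space, collapse spaces."""
    chars = [c if c in ALPHABET else " " for c in text.lower()]
    return " ".join("".join(chars).split())


def train_digraph_model(text):
    """Count how often each symbol follows each other symbol in the text."""
    counts = defaultdict(lambda: defaultdict(int))
    padded = " " + normalize(text) + " "
    for prev, cur in zip(padded, padded[1:]):
        counts[prev][cur] += 1
    return counts


def englishness(word, counts):
    """Average log-probability per letter transition (higher = more English-like).

    Add-one smoothing (an assumption of this sketch) keeps unseen
    letter pairs from producing a probability of zero.
    """
    padded = " " + normalize(word) + " "
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        row = counts.get(prev, {})
        prob = (row.get(cur, 0) + 1) / (sum(row.values()) + len(ALPHABET))
        total += math.log(prob)
    return total / (len(padded) - 1)


# Hypothetical toy training text; far too short for serious use.
SAMPLE = ("the quick brown fox jumps over the lazy dog and then "
          "the other thing happened there")
model = train_digraph_model(SAMPLE)

for candidate in ("the", "then", "xqz"):
    print(candidate, englishness(candidate, model))
```

Dividing by the number of transitions keeps long words from being penalized simply for their length. Swapping in a hidden Markov model would change how the per-letter probabilities are computed, but the scoring step would look much the same.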