[sci.lang] Measures of "Englishness"?

fordjm@byuvax.UUCP (11/15/87)

  Recently someone on the net commented on a program or method of rating
the "Englishness" of words according to the frequency of occurrence of
various letters in sequence, etc.
     
  I am currently involved in a project in which this approach might prove
useful, but I have lost the original posting.  Could the author please
contact me with more information about his or her project?

Thanks in advance,
John M. Ford               fordjm@byuvax.bitnet
131 Starcrest Drive
Orem, UT 84058


     

kfl@SPEECH2.CS.CMU.EDU (Kai-Fu Lee) (11/17/87)

In article <32fordjm@byuvax.bitnet>, fordjm@byuvax.bitnet writes:
> 
>   Recently someone on the net commented on a program or method of rating
> the "Englishness" of words according to the frequency of occurrence of
> various letters in sequence, etc.
>      

I don't know anything about the post you mention, but you might be
interested in the following article:
	Cave and Neuwirth, Hidden Markov Models for English, Proceedings
	of the Symposium on Application of Hidden Markov Models to Text
	and Speech, Princeton, NJ 1980.

Here's the editor's summary of the paper:

	L.P. Neuwirth discusses the application of hidden Markov analysis to
	English newspaper text (26 letters plus word space, without 
	punctuation).  This work showed that the technique is capable 
	of automatically discovering linguistically important categorizations
	(e.g., vowels and consonants).  Moreover, a calculation of the
	entropy of these models shows that some of them are stronger than
	the ordinary digraphic model, yet employ only half as many parameters.
	But one of the most interesting points, from a philosophical point
	of view, is the completely automatic nature of the process of
	obtaining the model: only the size of the state space, and a
	long example of English text, are given.  No a priori structure of the 
	state transition matrix, or of the output probabilities is assumed.
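
For comparison, the "ordinary digraphic model" mentioned in the summary can
be sketched as a 27 x 27 table of letter-pair probabilities (26 letters plus
word space).  The following is only a rough illustration, not the method of
the paper: the names train_digraph_model, digraph_score and SAMPLE_TEXT are
made up for this example, the add-one smoothing is an assumption, and the
sample text is far too short to give meaningful estimates.

```python
import math
from collections import Counter

# 26 letters plus word space, as in the summary above.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def train_digraph_model(text):
    """Count letter pairs in `text` and return log P(next | current)."""
    cleaned = "".join(ch for ch in text.lower() if ch in ALPHABET)
    pair_counts = Counter(zip(cleaned, cleaned[1:]))
    first_counts = Counter(cleaned[:-1])
    size = len(ALPHABET)
    # Add-one smoothing (an assumption) keeps unseen pairs above zero.
    return {
        (a, b): math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + size))
        for a in ALPHABET
        for b in ALPHABET
    }


def digraph_score(word, model):
    """Average log-probability per letter pair; higher means more typical."""
    # Pad with word spaces so the first and last letters are scored too.
    padded = " " + word.lower() + " "
    pairs = list(zip(padded, padded[1:]))
    # Characters outside ALPHABET raise KeyError in this sketch.
    return sum(model[pair] for pair in pairs) / len(pairs)


# Hypothetical toy corpus; a real model needs a long sample of English.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog "
    "and then the other dog ran home"
)
model = train_digraph_model(SAMPLE_TEXT)
```

Calling digraph_score("the", model) and digraph_score("xqz", model) then
gives two numbers that can be compared directly; dividing by the number of
pairs keeps long words from being penalized just for their length.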

Since hidden Markov models can be used for generation and recognition,
it is possible to train a model for English, and "score" any previously
unseen word with a probability that it was generated by the model for
English.
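
As a rough sketch of what such scoring could look like, the code below runs
the standard forward algorithm over a word and returns its log-probability
under a given model.  The two-state model here is hand-set purely for
illustration (one state that favours vowels, one that favours consonants
and word space); it is not trained and is not the model from the paper.
The names make_toy_hmm, log_likelihood and englishness are invented for
this example.  A real scorer would first estimate the parameters from a
long sample of English text.

```python
import math

# 26 letters plus word space, without punctuation.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
VOWELS = set("aeiou")


def make_toy_hmm():
    """Return (start, trans, emit) for a hand-set, untrained 2-state HMM."""
    start = [0.4, 0.6]
    trans = [
        [0.2, 0.8],  # from the vowel-favouring state
        [0.6, 0.4],  # from the consonant-favouring state
    ]
    emit = []
    for favours_vowels in (True, False):
        # Favoured symbols get weight 9, the rest weight 1 (arbitrary).
        weights = [
            9.0 if (ch in VOWELS) == favours_vowels else 1.0
            for ch in ALPHABET
        ]
        total = sum(weights)
        emit.append([w / total for w in weights])
    return start, trans, emit


def log_likelihood(word, hmm):
    """Forward algorithm with per-step scaling: log P(word | hmm)."""
    if not word:
        raise ValueError("word must not be empty")
    start, trans, emit = hmm
    states = range(len(start))
    # ALPHABET.index raises ValueError for characters outside ALPHABET.
    obs = [ALPHABET.index(ch) for ch in word.lower()]
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs[t]]
                for s in states
            ]
        # Rescale at every step to avoid underflow on long inputs.
        scale = sum(alpha)
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def englishness(word, hmm):
    """Log-probability per letter, so words of unequal length compare."""
    return log_likelihood(word, hmm) / len(word)


hmm = make_toy_hmm()
```

With this toy model, englishness("banana", hmm) comes out higher than
englishness("bcdfgh", hmm), since alternating consonants and vowels is
what the hand-set transition probabilities favour.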

> Thanks in advance,
> John M. Ford               fordjm@byuvax.bitnet
> 131 Starcrest Drive
> Orem, UT 84058
> 

Kai-Fu Lee
Computer Science Department
Carnegie-Mellon University