[sci.crypt] simple language statistics

doug@eris (Doug Merritt) (04/24/88)

I've written a program that categorizes files by the apparent language
they're written in. So far it distinguishes English from non-English
(e.g. scripts, software source, etc.) simply by letter frequency, as
derived from a simple statistical analysis I did. Some other languages
are recognized by the presence or absence of non-Latin alphabetical
characters encoded in 8-bit ANSI (e.g. hexadecimal F1 represents an
'N' with a diacritical mark, which I presumptuously assume means the
file is Spanish).
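
A minimal sketch of the two checks described above, not the actual
program. The English percentages are approximate, commonly cited figures
rather than the poster's own statistics, and the names english_distance
and has_byte are made up for illustration. A lower distance means the
letter mix looks more like English; no cutoff is suggested here.

```python
import string

# Approximate, commonly cited English letter frequencies (percent).
# Assumption: these are NOT the statistics derived in the post.
ENGLISH_PERCENT = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
    "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8,
    "u": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0,
    "p": 1.9, "b": 1.5, "v": 1.0, "k": 0.8, "j": 0.15, "x": 0.15,
    "q": 0.10, "z": 0.07,
}
_TOTAL = sum(ENGLISH_PERCENT.values())
ENGLISH = {c: p / _TOTAL for c, p in ENGLISH_PERCENT.items()}


def english_distance(text):
    """Sum of absolute differences between the text's letter
    frequencies and the reference English frequencies."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters) or 1  # avoid dividing by zero on letterless input
    return sum(
        abs(letters.count(c) / total - ENGLISH[c])
        for c in string.ascii_lowercase
    )


def has_byte(data, value):
    """True if the raw file contents contain the given byte value,
    e.g. 0xF1 as the (presumptuous) hint for Spanish."""
    return value in data
```

For example, english_distance() of a paragraph of ordinary prose should
come out well below that of a line of C source, and
has_byte(contents, 0xF1) flags the 8-bit character mentioned above.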

I'd like to do a better job of recognition. In particular I'd like to
be able to recognize common languages transcribed into the ASCII/ANSI
Latin alphabet.

To do this I need at least a letter frequency table for various languages
(especially the Romance languages), and possibly a digram frequency
table if it's absolutely necessary for accuracy. Similar information
for 8-bit ANSI would also be useful (my system does not support
multi-byte ANSI so that's not an issue; it *does* support 8-bit Icelandic,
for instance, so I'm currently recognizing that, among others).
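
One way such letter and digram tables might be built from sample text,
as a rough sketch. The function names ngram_counts and to_frequencies
are hypothetical, and counting only within runs of a-z is an assumption
made to keep the example short; 8-bit characters would need their own
handling.

```python
import re
from collections import Counter


def ngram_counts(text, n=1):
    """Count n-letter sequences inside words: n=1 gives single
    letters, n=2 gives digrams. Sequences never span a word break."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def to_frequencies(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if not total:
        return {}
    return {gram: c / total for gram, c in counts.items()}
```

For instance, ngram_counts("the then", 2) counts "th" twice, "he" twice,
and "en" once.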

Does anyone have such information online that they could mail me?
Embedded in other software is fine, I can extract it. References to
printed material also welcome; *especially* if there's a single source
I could use to find the most frequent letters (or digrams) for, say,
Spanish, German, French, Italian, Japanese, or any other languages that
are frequently transcribed into the Latin alphabet.

The end product is a general purpose file identifier, like "file"
on Unix (but a lot smarter), so any other, possibly bizarre, but
easily recognizable languages not implied by the above description
would also be of potential interest.

Oh yeah, if anyone has ANSI standards documentation online that applies
to any of this, that'd be great too.

Thanks for any and all help!

	Doug Merritt		doug@mica.berkeley.edu (ucbvax!mica!doug)
			or	ucbvax!unisoft!certes!doug
			or	sun.com!cup.portal.com!doug-merritt

lampman@heurikon.UUCP (Ray Lampman) (04/26/88)

__________________________________________________________________________

In article <9141@agate.BERKELEY.EDU> doug@eris.UUCP (Doug Merritt) writes:
| I've written a program that categorizes files by the apparent language
| they're written in.
__________________________________________________________________________

I will be interested in this tool when you are satisfied it is complete.
Please send mail or post news; others may be interested as well.

Given enough example texts for letter, digram, or trigram frequencies, it
should be possible to identify just about any language. Will your program
recognize computer as well as human languages?

How about providing a way of `teaching' your program about languages it
does not yet recognize? If I can provide a sample text of my favorite
language `X', can your program assimilate the sample and recognize other
samples of the same language? What should the program do if there is no
statistical difference between a language it already `knows' and a new
one you are trying to teach it? Hope some of this is useful,
-- 
                                        - Ray Lampman (lampman@heurikon.UUCP)
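
The `teaching' idea above could be sketched roughly as follows. This is
an illustration under stated assumptions, not a description of Doug's
program: the class LanguageGuesser, its method names, the single-letter
profile, and the min_separation value of 0.1 are all made up here. When
a newly taught language is statistically too close to a known one,
teach() reports the clash instead of silently accepting it.

```python
import string
from collections import Counter


def profile(text):
    """Relative frequency of each letter a-z in the text."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values()) or 1
    return {c: counts[c] / total for c in string.ascii_lowercase}


def distance(p, q):
    """Sum of absolute differences between two letter profiles."""
    return sum(abs(p[c] - q[c]) for c in string.ascii_lowercase)


class LanguageGuesser:
    def __init__(self, min_separation=0.1):
        # min_separation is an arbitrary illustrative threshold.
        self.profiles = {}
        self.min_separation = min_separation

    def teach(self, name, sample):
        """Learn a language from a sample; return the names of known
        languages that are not statistically distinguishable from it."""
        new = profile(sample)
        clashes = [
            other for other, known in self.profiles.items()
            if other != name and distance(new, known) < self.min_separation
        ]
        self.profiles[name] = new
        return clashes

    def guess(self, text):
        """Return the name of the closest known language, or None."""
        if not self.profiles:
            return None
        p = profile(text)
        return min(self.profiles, key=lambda n: distance(p, self.profiles[n]))
```

A non-empty result from teach() is the signal that letter frequencies
alone cannot separate the two languages, and that digram or trigram
tables (or a larger sample) would be needed.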

lpb@csadfa.oz (Lawrie Brown) (04/27/88)

> Does anyone have such information online that they could mail me?
> Embedded in other software is fine, I can extract it. References to
> printed material also welcome; *especially* if there's a single source
> I could use to find the most frequent letters (or digrams) for, say,
> Spanish, German, French, Italian, Japanese, or any other languages that
> are frequently transcribed into the Latin alphabet.
> 
	The following book has just been published by Prentice-Hall:

	Jennifer Seberry and Josef Pieprzyk, Cryptography: An Introduction to
	Computer Security, Prentice-Hall, 1988.

The book contains frequency tables for the following languages:

	Arabic, Danish, Dutch, English, Finnish, French, German, Italian,
	Greek, Hebrew, Japanese, Latin, Malay, Norwegian, Portuguese,
	Russian, Sanskrit, Spanish, and Swedish.

Further, programs similar to those being proposed have already been written as
part of the following honours theses: 

	Susy Deck, Computational Methods for Foreign Language Identification
	with Application to Cryptography, University of Sydney, 1985.

	Hui-eng Tchun, Towards the Cryptanalysis of Bahasa Malaysia,
	University of Sydney, 1985.

	Sing Guat Ong, Towards the Cryptanalysis of Mandarin-Pinyin,
	University of Sydney, 1987.

and in a Postgraduate Diploma student's minor thesis:

	Robert Ramsay, More Computational Methods for Foreign Language
	Identification, with Application to Cryptography,
	University of Sydney, 1987.

This work is being continued at present by students of this Department
on some Indonesian and New Guinean languages. We also have raw data and
some statistics on a number of other languages, including Kurdish and
Serbo-Croatian. It is expected that a number of technical reports and
other publications of this work will be released in the near future.

If you have any queries please feel free to contact me, or our Professor
and Head, Dr. Jennifer Seberry <jennie@cs.adfa.oz@UUNET.UU.NET>.

	Regards
	Lawrie Brown.

----
Mr. Lawrie Brown,		Phone ISD:   +61 62 688167   Fax: +61 62 470702
Dept. Computer Science,		Telex:	     ADFADM AA62030
University College, UNSW,	ACSNET/CSNET:	lpb@cs.adfa.oz
Aust. Defence Force Academy,	UUCP:		...!uunet!munnari!cs.adfa.oz!lpb
Canberra. ACT 2600.		ARPA:		lpb%cs.adfa.oz@uunet.uu.net
AUSTRALIA			JANET:		lpb@oz.csadfa
				Other Gateways:	see CACM 29(10) Oct. 1986