[sci.lang] simple language statistics

doug@eris (Doug Merritt) (04/24/88)

I've written a program that categorizes files by the apparent language
they're written in. So far it distinguishes English from non-English
(e.g. scripts, software source, etc.) simply by letter frequency, as
derived from a simple statistical analysis I did. Some other languages
are recognized by the presence or absence of non-Latin alphabetical
characters encoded in 8-bit ANSI (e.g. hexadecimal F1 represents 'ñ',
an 'n' with a diacritical mark, which I presumptuously assume means the
file is Spanish).
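The letter-frequency scheme described above can be sketched roughly as
follows. This is my own illustrative sketch, not the poster's program: the
reference percentages are rough figures rather than a measured table, and
the function names are made up.

```python
# Sketch of classification by letter frequency. The reference
# percentages are rough illustrative figures, not a measured table;
# a real classifier would cover the whole alphabet.
from collections import Counter
import string

REFERENCE = {
    "english": {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "n": 6.7},
    "spanish": {"e": 13.7, "a": 12.5, "o": 8.7, "s": 8.0, "n": 6.7},
}

def letter_frequencies(text):
    """Relative frequency (percent) of each a-z letter in text."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1
    return {c: 100.0 * n / total for c, n in counts.items()}

def score(text, reference):
    """Sum of squared deviations from a reference table (lower = closer)."""
    freqs = letter_frequencies(text)
    return sum((freqs.get(c, 0.0) - ref) ** 2 for c, ref in reference.items())

def classify(text):
    """Name of the reference language whose table the text matches best."""
    return min(REFERENCE, key=lambda lang: score(text, REFERENCE[lang]))
```

With real per-language tables in place of the toy figures, the same
lowest-score rule extends to as many languages as there are tables.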

I'd like to do a better job of recognition. In particular, I'd like to
be able to recognize common languages transcribed in the ASCII/ANSI
Latin alphabet.

To do this I need at least a letter frequency table for various languages
(especially the Romance languages), and possibly a digram frequency
table if it's absolutely necessary for accuracy. Similar information
for 8-bit ANSI would also be useful (my system does not support
multi-byte ANSI so that's not an issue; it *does* support 8-bit Icelandic,
for instance, so I'm currently recognizing that, among others).
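A digram table of the kind asked for here can also be derived from any
sample corpus you can get your hands on. A minimal sketch (mine, not the
poster's; it counts pairs within words only):

```python
# Build a digram (adjacent-letter-pair) frequency table from sample
# text. Pairs are counted within words only, so "the cat" contributes
# "th" and "he" but not a spurious "ec" across the space.
from collections import Counter
import string

def digram_table(text):
    pairs = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c in string.ascii_lowercase]
        pairs.update(zip(letters, letters[1:]))
    total = sum(pairs.values()) or 1
    return {a + b: n / total for (a, b), n in pairs.items()}
```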

Does anyone have such information online that they could mail me?
Embedded in other software is fine; I can extract it. References to
printed material are also welcome, *especially* if there's a single source
I could use to find the most frequent letters (or digrams) for, say,
Spanish, German, French, Italian, Japanese, or any other language that
is frequently transcribed into the Latin alphabet.

The end product is a general purpose file identifier, like "file"
on Unix (but a lot smarter), so any other, possibly bizarre, but
easily recognizable languages not implied by the above description
would also be of potential interest.

Oh yeah, if anyone has ANSI standards documentation online that applies
to any of this, that'd be great too.

Thanks for any and all help!

	Doug Merritt		doug@mica.berkeley.edu (ucbvax!mica!doug)
			or	ucbvax!unisoft!certes!doug
			or	sun.com!cup.portal.com!doug-merritt

lampman@heurikon.UUCP (Ray Lampman) (04/26/88)

__________________________________________________________________________

In article <9141@agate.BERKELEY.EDU> doug@eris.UUCP (Doug Merritt) writes:
| I've written a program that categorizes files by the apparent language
| they're written in.
__________________________________________________________________________

I will be interested in this tool when you are satisfied it is complete.
Please send mail or post news; others may be interested as well.

Given enough example texts for letter, digram, or trigram frequencies, it
should be possible to identify just about any language. Will your program
recognize computer as well as human languages?

How about providing a way of `teaching' your program about languages it
does not yet recognize? If I can provide a sample text of my favorite
language `X', can your program assimilate the sample and recognize other
samples of the same language? What should the program do if there is no
statistical difference between a language it already `knows' and a new
one you are trying to teach it? Hope some of this is useful,
-- 
                                        - Ray Lampman (lampman@heurikon.UUCP)

doug@eris (Doug Merritt) (04/26/88)

Crossposted from sci.lang; edit Newsgroups if appropriate!

In article <201@heurikon.UUCP> lampman@heurikon.UUCP (Ray Lampman) writes:
>In article <9141@agate.BERKELEY.EDU> doug@eris.UUCP (Doug Merritt) writes:
>> I've written a program that categorizes files by the apparent language
>> they're written in.
> [...] be possible to identify just about any language. Will your program
>recognize computer as well as human languages?

Yes, it does. It is extremely thorough, compared with the Unix "file"
program. But it is biased towards things I can test. Lisp, for instance,
is recognized by a simple heuristic to avoid the problem of tangling
with the issue of zillions of dialects.
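The post doesn't say what the Lisp heuristic actually is; one plausible
dialect-independent test (my guess at the kind of rule meant, not the
author's code) keys on parenthesis density:

```python
# Guessed dialect-independent Lisp test: most non-blank lines of Lisp
# source open with '(' or a ';' comment, and parentheses make up a
# noticeable share of the characters.
def looks_like_lisp(text):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    opens = sum(ln.startswith(("(", ";")) for ln in lines)
    parens = text.count("(") + text.count(")")
    return opens / len(lines) > 0.5 and parens / len(text) > 0.05
```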

It currently recognizes perhaps a dozen common programming languages,
a half dozen human languages, transportable file formats (e.g. GIF,
arc, sit, pit, zoo, etc), executables (Unix, Amiga, Atari ST, Macintosh),
about thirty (maybe sixty, it's late, my mind is fading) file formats
that are purely native to the Amiga (devices, sound, graphics, animation,
fonts, etc), since that's the target machine, and a couple dozen formats
from other machines that seem likely to pop up. I've got the initial
documentation finished, but I'm too tired to read it to refresh my memory.

So I keep trying to stretch its capabilities. I'm currently investigating
some information I found about ASCII transliterations of Hebrew,
more because I've got the info than because I think I'll run into any
Hebrew. If I can dig up cluster frequencies or something, I'd like
to recognize ASCII translits of common Romance languages, Japanese
and Russian. And maybe others. Why not?

Sorry, Pig Latin and Jive show up as English. :-) But it knows the
difference between Usenet news files and Unix mail, for instance.
Handy.
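The news-vs-mail distinction presumably comes down to header lines. A
sketch of one such test (my assumption about the rule, not the author's
code): Usenet articles carry a "Newsgroups:" header, while Unix mailbox
messages begin with a "From " separator line.

```python
# Plausible news-vs-mail test on the header block (everything before
# the first blank line).
def classify_message(text):
    header = text.split("\n\n", 1)[0]
    lines = header.splitlines()
    if any(ln.startswith("Newsgroups:") for ln in lines):
        return "usenet news"
    if lines and lines[0].startswith("From "):
        return "unix mail"
    return "unknown"
```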

>How about providing a way of `teaching' your program about languages it
>does not yet recognize?

I'm working on it...

>If I can provide a sample text of my favorite
>language `X', can your program assimilate the sample and recognize other
>samples of the same language? What should the program do if there is no
>statistical difference between a language it already `knows' and a new
>one you are trying to teach it? Hope some of this is useful,

Well, it's impossible to do a perfect job, so inevitably I accept that it'll
be wrong sometimes. Finnegans Wake would give it a heart attack. :-)
Also, it would be *very* slow (cpu intensive) to do even the best job
possible. So I pick a reasonable tradeoff.

As for using it on arbitrary new languages via example, that sounds
like a tough objective, but I'll think it over.
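The 'teaching' idea Ray raised could amount to storing a frequency
profile per sample and refusing a new sample that collides with a known
language. A sketch under that assumption (names and threshold are mine):

```python
# Sketch of 'teach by example': each language is a letter-frequency
# profile built from a sample; a new sample is rejected when it is
# statistically indistinguishable from a language already known.
from collections import Counter
import string

def profile(text):
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1
    return {c: counts[c] / total for c in string.ascii_lowercase}

def distance(p, q):
    return sum((p[c] - q[c]) ** 2 for c in string.ascii_lowercase)

def learn(known, name, sample, min_gap=0.001):
    """Add a profile for 'name' unless it collides with a known language."""
    new = profile(sample)
    for other, prof in known.items():
        if distance(new, prof) < min_gap:
            raise ValueError("sample indistinguishable from " + other)
    known[name] = new
```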

Porting it to a non-Amiga machine would be mostly easy: just chop
off the parts that are system-specific or that you don't care about
(e.g. analysing Amiga device names) and replace a few centralized I/O
routines. It's not *quite* ready for full release (still hoping
to find more statistics), but I could give out pieces of source code
to aid particular specialized purposes if anyone is dying for it.
-Doug-

	Doug Merritt		doug@mica.berkeley.edu (ucbvax!mica!doug)
			or	ucbvax!unisoft!certes!doug
			or	sun.com!cup.portal.com!doug-merritt