[net.internat] Lexicographers

pn@stc.UUCP (01/23/86)

Are there any lexicographers out there?  I believe that's the correct term.
I need to find a list of the most commonly used English words in WRITTEN
TEXT (not oral usage).  The most common 8000 would do nicely, to coin a
phrase.  Note no smiley here, I'm serious about the 8000, but I suspect
that after the first few hundred most common, there's not much to choose
between them.  Also in the same vein, I need the most common English word
endings, preferably ranked in order of probability.
Any pointers gratefully welcomed.
-- 

Phil Norris	<pn@stc.UUCP>
		{root44, ukc, datlog, idec, stl, creed, iclbra, iclkid}!stc!pn

dave@lsuc.UUCP (David Sherman) (01/30/86)

Phil Norris <pn@stc.UUCP> writes:
> Are there any lexicographers out there?  I believe that's the correct term.
> I need to find a list of the most commonly used English words in WRITTEN
> TEXT (not oral usage).  The most common 8000 would do nicely, to coin a
> phrase.  Note no smiley here, I'm serious about the 8000, but I suspect
> that after the first few hundred most common, there's not much to choose
> between them.  Also in the same vein, I need the most common English word
> endings, preferably ranked in order of probability.

You can get this kind of information from the Brown corpus.
That is a random selection of 500 samples of about 2000 words
each, taken from writing published in 1961, grouped into
categories including fiction, non-fiction, etc., all from
(I believe) generally available U.S. publications. It's a bit
over a million words, because each sample is carried to the end
of the current sentence instead of cut at 2000 words.

The Brown corpus came out of Brown University in Rhode Island.
When I was doing some linguistics research a few years back,
I found a copy of the tape at the U of Toronto computing centre,
along with a book that was published at the same time. The book
contains answers to some of the more common questions, such as
what the most common words are.
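
For anyone who has a corpus in machine-readable form, the two rankings
asked for (most common words, most common word endings) can be computed
directly. What follows is only a minimal sketch in Python, not the method
used for the book. The function names, the tokenizing rule (runs of
letters, with internal apostrophes kept), and the fixed ending length are
all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (letters, internal apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def most_common_words(text, n):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(tokenize(text)).most_common(n)


def most_common_endings(text, length, n):
    """Return the n most frequent word endings of the given length.

    Only words longer than `length` are counted, so a word is never its
    own ending. Each result is (ending, probability), where probability
    is the ending's share among the counted words.
    """
    endings = Counter(w[-length:] for w in tokenize(text) if len(w) > length)
    total = sum(endings.values())
    return [(ending, count / total) for ending, count in endings.most_common(n)]


if __name__ == "__main__":
    # Hypothetical sample text; a real run would read the corpus file instead.
    sample = "The cat was walking and the dog was talking, so the bird walked."
    print(most_common_words(sample, 3))
    print(most_common_endings(sample, 3, 3))
```

A fixed ending length is the simplest choice; running it for lengths 1
through 4 and comparing the lists gives a rough picture of which endings
dominate.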

I also copied the tape (with dd) into an ASCII format, and wrote
a C program to find given words in context (a 5-line context).
There are some symbols (punctuation and paragraphing) which are
represented by asterisk codes and need to be programmed around.
If you do get hold of the Brown corpus, I could probably dig up
the source. (This was 1977-78, though.)
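
A word-in-context search of the kind described above could look
roughly like the following. This is a sketch in Python rather than the
original C source, which is not reproduced here. The handling of the
asterisk codes is an assumption: any whitespace-delimited token starting
with "*" is treated as a code and dropped, which may not match the real
tape format. The names strip_codes and find_in_context are hypothetical.

```python
import re


def strip_codes(line):
    """Drop tokens that start with '*' (ASSUMED code format, not verified)."""
    return " ".join(tok for tok in line.split() if not tok.startswith("*"))


def find_in_context(lines, word, context=5):
    """Find whole-word, case-insensitive matches and return their context.

    Returns a list of (line_number, context_lines) pairs, where line_number
    is 1-based and context_lines holds up to `context` cleaned lines
    centred on the matching line (fewer near the start or end of the text).
    """
    cleaned = [strip_codes(line) for line in lines]
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    before = context // 2
    after = context - before - 1
    hits = []
    for i, line in enumerate(cleaned):
        if pattern.search(line):
            start = max(0, i - before)
            end = min(len(cleaned), i + after + 1)
            hits.append((i + 1, cleaned[start:end]))
    return hits


if __name__ == "__main__":
    # Hypothetical input; a real run would read lines from the corpus file.
    text = ["one", "two *P", "the cat sat", "four", "five", "six", "Cat again"]
    for number, lines_around in find_in_context(text, "cat"):
        print(number, lines_around)
```

Stripping the codes before matching keeps them out of the displayed
context; if the codes carry information you need (paragraph breaks, for
instance), they would have to be translated rather than dropped.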

Check your local university library and/or computer centre.

Dave Sherman
The Law Society of Upper Canada
Osgoode Hall
Toronto, Canada  M5H 2N6
+1 416 947 3466
-- 
{ ihnp4!utzoo  pesnta  utcs  hcr  decvax!utcsri  } !lsuc!dave