[comp.compression] word frequency in English

mjward@adobe.COM (Michael J. Ward) (04/10/91)

Where can I find a list/dictionary/datafile of English words sorted by
relative frequency in various classes of usage. For example, is "a" the most
commonly used English word. followed by "the"? How about "that" compared to
"sesquipedalianism"? Who's doing binary lookup tables based on word
frequency?  --Mike Ward

drenze@umaxc.weeg.uiowa.edu (Douglas Renze) (04/10/91)

In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"? How about "that" compared to
>"sesquipedalianism"? Who's doing binary lookup tables based on word
>frequency?  --Mike Ward

Well, I could be totally off base, but have you checked any linguistics
textbooks or textbooks written to teach English to non-native English speakers?
I'd almost say that the latter is more likely to be of help, because language
texts usually teach the basic 500 (or whatever) words in a language first.

Hope this helps, but I doubt it.

herrickd@iccgcc.decnet.ab.com (04/13/91)

In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) writes:
> Where can I find a list/dictionary/datafile of English words sorted by
> relative frequency in various classes of usage. For example, is "a" the most
> commonly used English word. followed by "the"? How about "that" compared to
> "sesquipedalianism"? Who's doing binary lookup tables based on word
> frequency?  --Mike Ward

One such histogram is Strong's Concordance of the King James Bible, a
19th century exhaustive (down to a complete list of occurrences of
the words "a", "the", and "and") list of occurrences of the words
in an English book published in 1611.  A photo reproduction of the
concordance is available in a bookstore near you for ten or twelve dollars.

dan herrick
herrickd@iccgcc.decnet.ab.com

PS.  I know he doesn't want a paper document, but Strong's is a
fantastic work done before von Neumann was born and looking it
over is a valuable educational experience for people interested
in what Mike asked about.

gwyn@smoke.brl.mil (Doug Gwyn) (04/13/91)

In article <5408@ns-mx.uiowa.edu> drenze@umaxc.weeg.uiowa.edu (Douglas Renze) writes:
>In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
>>Where can I find a list/dictionary/datafile of English words sorted by
>>relative frequency in various classes of usage. For example, is "a" the most
>>commonly used English word. followed by "the"? How about "that" compared to
>>"sesquipedalianism"?

Any reasonable English-language cryptanalysis text should cover this to
some extent.  Personally I would use the tables in Callimahos and
Friedman's "Military Cryptanalytics", Part 1, Volume 2, which has been
reprinted by Aegean Park Press in Laguna Hills, CA (and often stocked at
the Computer Literacy Bookstore in San Jose, CA).

emv@ox.com (Ed Vielmetti) (04/13/91)

In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:

   Where can I find a list/dictionary/datafile of English words sorted by
   relative frequency in various classes of usage. For example, is "a" the most
   commonly used English word. followed by "the"? How about "that" compared to
   "sesquipedalianism"? Who's doing binary lookup tables based on word
   frequency?  --Mike Ward

The frequency of English words depends a lot on the body of text that
you're looking at.  As a first pass, it's relatively easy to scan
through a representative Usenet newsgroup and count word frequencies
with something like "wordcount", a perl program on p.39 of the perl
book (or on uunet.uu.net:/nutshell/perl/).  

You've just thrown off the count for "sesquipedalianism", though ...

-- 
 Msen	Edward Vielmetti
/|---	moderator, comp.archives
	emv@msen.com

"With all of the attention and publicity focused on gigabit networks,
not much notice has been given to small and largely unfunded research
efforts which are studying innovative approaches for dealing with
technical issues within the constraints of economic science."  
							RFC 1216

melby@daffy.yk.Fujitsu.CO.JP (John B. Melby) (04/15/91)

>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"?

In fact, the word "the" is generally considered to be the most common word
in the English language.  As was pointed out, this is dependent on the body
of text that is used as a reference - for instance, on telegrams the word
"olive" might be more common.

You might want to look at a non-technical book on codes and ciphers.  Some
of these books also contain other interesting pointers (the most common
two-letter combination in the English language is "th," the letters most
frequently found in the English language are e,t,a,o,n,[r,i | i,r],s,h,
and so on....)
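
The letter and two-letter counts mentioned above can be tallied from any
sample text with standard tools.  A rough sketch, assuming POSIX tr, fold,
awk, and sort; the function names letterfreq and digraphfreq are made up
for illustration, and digraphs are counted only inside words, not across
spaces or punctuation.

```shell
# letterfreq: print "count letter" lines, most frequent letter first.
# Reads the named files, or standard input when no files are given.
letterfreq() {
    cat "$@" |
    tr '[:upper:]' '[:lower:]' |   # fold to lower case
    tr -cd 'a-z' |                 # delete everything that is not a letter
    fold -w 1 |                    # one letter per line
    sort |
    uniq -c |
    sort -k1,1nr -k2,2             # by count descending, ties alphabetical
}

# digraphfreq: print "count digraph" lines for adjacent letter pairs.
digraphfreq() {
    cat "$@" |
    tr '[:upper:]' '[:lower:]' |
    tr -cs 'a-z' '\n' |            # one word per line
    awk '{ for (i = 1; i < length($0); i++) n[substr($0, i, 2)]++ }
         END { for (d in n) print n[d], d }' |
    sort -k1,1nr -k2,2
}
```

The results depend on the sample, so the ordering you get from a given
body of text may differ from the lists printed in cipher books.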

-----
John B. Melby
Fujitsu Limited, Machida, Japan
melby%yk.fujitsu.co.jp@uunet

fox@allegra.att.com (David Fox) (04/16/91)

In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:

   Where can I find a list/dictionary/datafile of English words sorted by
   relative frequency in various classes of usage. For example, is "a" the most
   commonly used English word. followed by "the"? How about "that" compared to
   "sesquipedalianism"? Who's doing binary lookup tables based on word
   frequency?  --Mike Ward

Just find a bunch of text and run it through this shell script:

#!/bin/sh
tr -c '-a-zA-Z' ' ' $* | \
tr 'A-Z' 'a-z' | \
tr -s ' ' '\012' | \
sort | \
uniq -c | \
sort -r

Isn't Unix wonderful?  A very quick test has convinced me that
"hotel" is the most common English word.  It could be a problem
with my sample data, though.

-david

fox@allegra.att.com (David Fox) (04/16/91)

It has been pointed out to me that you may as well omit the $* from
my shell script, as "tr" only reads standard input.
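
A minimal sketch of the pipeline with that correction applied, wrapped in
a shell function.  Assumptions not in the original script: the name
wordfreq is made up for illustration, "cat" is used to pick up any file
arguments, hyphens are treated as separators rather than as part of a
word, and the final sort is numeric so that counts of different widths
order correctly.

```shell
# wordfreq: print "count word" lines, most frequent word first.
# Reads the named files, or standard input when no files are given.
wordfreq() {
    cat "$@" |                     # cat, unlike tr, accepts file arguments
    tr '[:upper:]' '[:lower:]' |   # fold to lower case
    tr -cs '[:alpha:]' '\n' |      # each run of non-letters becomes one newline
    sed '/^$/d' |                  # drop the empty line a leading non-letter leaves
    sort |
    uniq -c |
    sort -k1,1nr -k2,2             # by count descending, ties alphabetical
}
```

Typical use would be something like "wordfreq *.txt | head -20", or
piping text into it on standard input.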

ts@uwasa.fi (Timo Salmi) (04/20/91)

In article <91100.124514II70038@MAINE.BITNET> II70038@MAINE.BITNET (Jonathan A. Kimball) writes:
>In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) says:
>>
>>Where can I find a list/dictionary/datafile of English words sorted by
>>relative frequency in various classes of usage.
> 
>I have also been looking for this.  I could be way off on this, but I
:

For general information.  If you want to make the frequency counts
yourself there are facilities for that in /pc/ts/tspell24.arc
(wordlist.exe in there), and /pc/fileutil/wlist10.arc. 

The wares are available by anonymous ftp from garbo.uwasa.fi, Vaasa,
Finland, 128.214.12.37, or by using our mail server (use the latter
if, and only if, you don't have anonymous ftp).  If you are not
familiar with anonymous ftp or mail servers, I am prepared to send
prerecorded instructions on request.  (If you don't get the
instructions from me within a few days, it will mean that your email
address cannot be reached by a simple email reply.  In that case,
contact your system manager for devising a proper mail path for you,
because unless you do, you wouldn't be able to utilize the mail
server anyway.  If you are in North America first consider using an
ftp site near you to spare the overseas load.)

...................................................................
Prof. Timo Salmi
Moderating at garbo.uwasa.fi anonymous ftp archives 128.214.12.37
School of Business Studies, University of Vaasa, SF-65101, Finland
Internet: ts@chyde.uwasa.fi Funet: gado::salmi Bitnet: salmi@finfun

wells@parens.zko.dec.com (Richard A. Wells) (04/23/91)

In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) writes:
>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"? How about "that" compared to
>"sesquipedalianism"? Who's doing binary lookup tables based on word
>frequency?  --Mike Ward

You didn't explicitly say it had to be an on-line source.  I inherited
a copy of the following book from someone who used to write word-processing
software for Wang:

[Title page:]

"The American Heritage

WORD FREQUENCY BOOK"

John B. Carroll
Senior Research Psychologist
Educational Testing Service

Peter Davies
Editor in Chief
Dictionary Division
American Heritage Publishing Co., Inc.

Barry Richman
Executive Editor
Dictionary Division
American Heritage Publishing Co., Inc.

Houghton Mifflin Company
Boston o New York o Atlanta o Geneva, Illinois o Dallas o Palo Alto

American Heritage Publishing Co., Inc.
New York"

[Catalog data:]

"Copyright (c) 1971 by American Heritage Publishing Co., Inc.
Library of Congress Catalog Card Number: 72-181517
ISBN: 0-395-13570-2"

[Also, it mentions:]

"Computer composition of word data for publication by R. R. Donnelley and Sons
Company, Electronic Graphics (R) Division, Chicago, Illinois"

[maybe they have electronic sources]

This book is over 800 pages of tables and charts of word frequencies, both
alphabetic and ranked, plus smaller frequency tables by specialized categories,
with graphs showing differences in the categories.

Hope this helps.

Richard

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Richard A. Wells                    | Internet:   wells@tle.enet.dec.com
  Technical Languages & Environments  | UUCP:       ..!decwrl!tle.enet!wells
  Digital Equipment Corporation       | CompuServe: 76256,2277 (personal only)
  ZKO2-3/N30      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Nashua NH 03062 | This message reflects my personal opinions, not Digital's.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~