mjward@adobe.COM (Michael J. Ward) (04/10/91)
Where can I find a list/dictionary/datafile of English words sorted by relative frequency in various classes of usage? For example, is "a" the most commonly used English word, followed by "the"? How about "that" compared to "sesquipedalianism"? Who's doing binary lookup tables based on word frequency? --Mike Ward
drenze@umaxc.weeg.uiowa.edu (Douglas Renze) (04/10/91)
In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"? How about "that" compared to
>"sesquipedalianism"? Who's doing binary lookup tables based on word
>frequency? --Mike Ward
Well, I could be totally off base, but have you checked any linguistics textbooks or textbooks written to teach English to non-native English speakers? I'd almost say that the latter is more likely to be of help, because language texts usually teach the basic 500 (or whatever) words in a language first.
Hope this helps, but I doubt it.
herrickd@iccgcc.decnet.ab.com (04/13/91)
In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) writes:
> Where can I find a list/dictionary/datafile of English words sorted by
> relative frequency in various classes of usage. For example, is "a" the most
> commonly used English word. followed by "the"? How about "that" compared to
> "sesquipedalianism"? Who's doing binary lookup tables based on word
> frequency? --Mike Ward
One such histogram is Strong's Concordance of the King James Bible, a 19th century exhaustive (down to a complete list of occurrences of the words "a", "the", and "and") list of occurrences of the words in an English book published in 1611. A photo reproduction of the concordance is available in a bookstore near you for ten or twelve dollars.
dan herrick
herrickd@iccgcc.decnet.ab.com
PS. I know he doesn't want a paper document, but Strong's is a fantastic work done before von Neumann was born and looking it over is a valuable educational experience for people interested in what Mike asked about.
gwyn@smoke.brl.mil (Doug Gwyn) (04/13/91)
In article <5408@ns-mx.uiowa.edu> drenze@umaxc.weeg.uiowa.edu (Douglas Renze) writes:
>In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
>>Where can I find a list/dictionary/datafile of English words sorted by
>>relative frequency in various classes of usage. For example, is "a" the most
>>commonly used English word. followed by "the"? How about "that" compared to
>>"sesquipedalianism"?
Any reasonable English-language cryptanalysis text should cover this to some extent. Personally I would use the tables in Callimahos and Friedman's "Military Cryptanalytics", Part 1, Volume 2, which has been reprinted by Aegean Park Press in Laguna Hills, CA (and often stocked at the Computer Literacy Bookstore in San Jose, CA).
emv@ox.com (Ed Vielmetti) (04/13/91)
In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
Where can I find a list/dictionary/datafile of English words sorted by
relative frequency in various classes of usage. For example, is "a" the most
commonly used English word. followed by "the"? How about "that" compared to
"sesquipedalianism"? Who's doing binary lookup tables based on word
frequency? --Mike Ward
The frequency of English words depends a lot on the body of text that
you're looking at. As a first pass, it's relatively easy to scan
through a representative Usenet newsgroup and count word frequencies
with something like "wordcount", a perl program on p.39 of the perl
book (or on uunet.uu.net:/nutshell/perl/).
You've just thrown off the count for "sesquipedalianism", though ...
--
Msen Edward Vielmetti
/|--- moderator, comp.archives
emv@msen.com
"With all of the attention and publicity focused on gigabit networks,
not much notice has been given to small and largely unfunded research
efforts which are studying innovative approaches for dealing with
technical issues within the constraints of economic science."
RFC 1216
melby@daffy.yk.Fujitsu.CO.JP (John B. Melby) (04/15/91)
>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"?
In fact, the word "the" is generally considered to be the most common word in the English language. As was pointed out, this is dependent on the body of text that is used as a reference - for instance, on telegrams the word "olive" might be more common.
You might want to look at a non-technical book on codes and ciphers. Some of these books also contain other interesting pointers (the most common two-letter combination in the English language is "th," the letters most frequently found in the English language are e,t,a,o,n,[r,i | i,r],s,h, and so on....)
-----
John B. Melby
Fujitsu Limited, Machida, Japan
melby%yk.fujitsu.co.jp@uunet
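The letter and two-letter claims above are easy to check against your own text. What follows is only a rough sketch, assuming plain ASCII English input; the function names letterfreq and digraphfreq are made up for illustration and are not from any posting in this thread. Digraphs are counted inside words only, so a pair never spans a space or punctuation mark.

```sh
#!/bin/sh
# letterfreq (hypothetical name): count how often each letter a-z occurs,
# ignoring case, in the named files or on standard input.
letterfreq() {
    cat "$@" |
    tr 'A-Z' 'a-z' |   # fold case
    tr -cd 'a-z' |     # delete everything that is not a letter
    fold -w 1 |        # one letter per line
    sort |
    uniq -c |
    sort -rn           # most frequent first
}

# digraphfreq (hypothetical name): count adjacent letter pairs within words.
digraphfreq() {
    cat "$@" |
    tr 'A-Z' 'a-z' |
    tr -c 'a-z' '\012' |   # every non-letter becomes a newline: one word per line
    awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }' |
    sort |
    uniq -c |
    sort -rn
}
```

Something like "letterfreq somefile | head" shows the top of the table. As with word counts, the ranking depends on the sample, so expect the order below the first few letters to move around between texts.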
fox@allegra.att.com (David Fox) (04/16/91)
In article <13834@adobe.UUCP> mjward@adobe.COM (Michael J. Ward) writes:
Where can I find a list/dictionary/datafile of English words sorted by
relative frequency in various classes of usage. For example, is "a" the most
commonly used English word. followed by "the"? How about "that" compared to
"sesquipedalianism"? Who's doing binary lookup tables based on word
frequency? --Mike Ward
Just find a bunch of text and run it through this shell script:
#!/bin/sh
tr -c '-a-zA-Z' ' ' $* | \
tr 'A-Z' 'a-z' | \
tr -s ' ' '\012' | \
sort | \
uniq -c | \
sort -rn
Isn't Unix wonderful? A very quick test has convinced me that
"hotel" is the most common English word. It could be a problem
with my sample data, though.
-david
fox@allegra.att.com (David Fox) (04/16/91)
It has been pointed out to me that you may as well omit the $* from my shell script, as "tr" only reads standard input.
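If you would still like to name files on the command line, one possible variant is to let cat read them and feed the pipeline. This is a sketch rather than the original poster's script: the function name wordfreq is invented here, the hyphen is moved to the end of the tr set so it cannot be mistaken for an option, and the final sort is numeric so that counts order correctly whatever padding uniq uses.

```sh
#!/bin/sh
# wordfreq (hypothetical name): list words by descending frequency.
# Reads the named files, or standard input when no files are given.
wordfreq() {
    cat "$@" |
    tr -c 'a-zA-Z-' ' ' |   # anything that is not a letter or hyphen becomes a space
    tr 'A-Z' 'a-z' |        # fold case so "The" and "the" count together
    tr -s ' ' '\012' |      # one word per line, squeezing repeated separators
    sort |
    uniq -c |               # prefix each distinct word with its count
    sort -rn                # highest count first
}
```

Usage would be along the lines of "wordfreq file1 file2 | head -20", or "wordfreq < file1". Input that begins with a non-letter produces one empty "word" in the output, which is harmless for a quick look.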
ts@uwasa.fi (Timo Salmi) (04/20/91)
In article <91100.124514II70038@MAINE.BITNET> II70038@MAINE.BITNET (Jonathan A. Kimball) writes:
>In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) says:
>>
>>Where can I find a list/dictionary/datafile of English words sorted by
>>relative frequency in various classes of usage.
>
>I have also been looking for this. I could be way off on this, but I
:
For general information. If you want to make the frequency counts yourself there are facilities for that in /pc/ts/tspell24.arc (wordlist.exe in there), and /pc/fileutil/wlist10.arc. The wares are available by anonymous ftp from garbo.uwasa.fi, Vaasa, Finland, 128.214.12.37, or by using our mail server (use the latter if, and only if, you don't have anonymous ftp). If you are not familiar with anonymous ftp or mail servers, I am prepared to send prerecorded instructions on request. (If you don't get the instructions from me within a few days, it will mean that your email address cannot be reached by a simple email reply. In that case, contact your system manager about devising a proper mail path for you, because unless you do, you wouldn't be able to utilize the mail server anyway. If you are in North America, first consider using an ftp site near you to spare the overseas load.)
...................................................................
Prof. Timo Salmi
Moderating at garbo.uwasa.fi anonymous ftp archives 128.214.12.37
School of Business Studies, University of Vaasa, SF-65101, Finland
Internet: ts@chyde.uwasa.fi Funet: gado::salmi Bitnet: salmi@finfun
wells@parens.zko.dec.com (Richard A. Wells) (04/23/91)
In article <13834@adobe.UUCP>, mjward@adobe.COM (Michael J. Ward) writes:
>Where can I find a list/dictionary/datafile of English words sorted by
>relative frequency in various classes of usage. For example, is "a" the most
>commonly used English word. followed by "the"? How about "that" compared to
>"sesquipedalianism"? Who's doing binary lookup tables based on word
>frequency? --Mike Ward
You didn't explicitly say it had to be an on-line source. I inherited a copy of the following book from someone who used to write word-processing software for Wang:
[Title page:]
"The American Heritage WORD FREQUENCY BOOK"
John B. Carroll
Senior Research Psychologist
Educational Testing Service
Peter Davies
Editor in Chief
Dictionary Division
American Heritage Publishing Co., Inc.
Barry Richman
Executive Editor
Dictionary Division
American Heritage Publishing Co., Inc.
Houghton Mifflin Company
Boston o New York o Atlanta o Geneva, Illinois o Dallas o Palo Alto
American Heritage Publishing Co., Inc.
New York
[Catalog data:]
"Copyright (c) 1971 by American Heritage Publishing Co., Inc.
Library of Congress Catalog Card Number: 72-181517
ISBN: 0-395-13570-2"
[Also, it mentions:]
"Computer composition of word data for publication by R. R. Donnelley and Sons Company, Electronic Graphics (R) Division, Chicago, Illinois"
[maybe they have electronic sources]
This book is over 800 pages of tables and charts of word frequencies, both alphabetic and ranked, plus smaller frequency tables by specialized categories, with graphs showing differences in the categories.
Hope this helps.
Richard
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Richard A. Wells | Internet: wells@tle.enet.dec.com
Technical Languages & Environments | UUCP: ..!decwrl!tle.enet!wells
Digital Equipment Corporation | CompuServe: 76256,2277 (personal only)
ZKO2-3/N30 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nashua NH 03062 | This message reflects my personal opinions, not Digital's.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~