pn@stc.UUCP (01/23/86)
Are there any lexicographers out there? I believe that's the correct term.
I need to find a list of the most commonly used English words in WRITTEN
TEXT (not oral usage). The most common 8000 would do nicely, to coin a
phrase. Note no smiley here, I'm serious about the 8000, but I suspect
that after the first few hundred most common, there's not much to choose
between them.

Also in the same vein, I need the most common English word endings,
preferably ranked in order of probability.

Any pointers gratefully welcomed.
--
Phil Norris <pn@stc.UUCP>
{root44, ukc, datlog, idec, stl, creed, iclbra, iclkid}!stc!pn
dave@lsuc.UUCP (David Sherman) (01/30/86)
Phil Norris <pn@stc.UUCP> writes:
> Are there any lexicographers out there? I believe that's the correct term.
> I need to find a list of the most commonly used English words in WRITTEN
> TEXT (not oral usage). The most common 8000 would do nicely, to coin a
> phrase. Note no smiley here, I'm serious about the 8000, but I suspect
> that after the first few hundred most common, there's not much to choose
> between them. Also in the same vein, I need the most common English word
> endings, preferably ranked in order of probability.

You can get this kind of information from the Brown corpus. That is a
random sampling of 2000 500-word samples of writing published in 1963,
grouped into categories including fiction, non-fiction, etc., all from
(I believe) generally available U.S. publications. It's a bit over a
million words, because each sample is carried to the end of the current
sentence instead of being cut at 500 words. (I might have it backwards -
it could be 500 2000-word samples.)

The Brown corpus came out of Brown University in Rhode Island. When I was
doing some linguistics research a few years back, I found a copy of the
tape at the U of Toronto computing centre, along with a book that was
published at the same time. The book contains answers to some of the more
common questions, such as what the most common words are.

I also copied the tape (with dd) into an ASCII format, and wrote a
C program to find given words in context (a 5-line context). There are
some symbols (punctuation and paragraphing) which are represented by
asterisk codes and need to be programmed around. If you do get hold of
the Brown corpus, I could probably dig up the source. (This was 1977-78,
though.)

Check your local university library and/or computer centre.

Dave Sherman
The Law Society of Upper Canada
Osgoode Hall
Toronto, Canada M5H 2N6
+1 416 947 3466
--
{ ihnp4!utzoo pesnta utcs hcr decvax!utcsri } !lsuc!dave
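
For anyone who gets hold of the corpus as plain ASCII text, the three things
discussed in this thread (most common words, most common word endings ranked
by probability, and words shown in a 5-line context) could be sketched roughly
as below. This is only an illustration in Python, not the C program mentioned
above. It assumes the asterisk codes have already been stripped from the text,
it treats an "ending" as simply the last three letters of a word (an arbitrary
choice), and all of the function names are made up for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with one apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return WORD_RE.findall(text.lower())


def most_common_words(text, n=8000):
    """Return up to n (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common(n)


def ending_probabilities(text, length=3, n=20):
    """Rank fixed-length word endings by relative frequency.

    Only words longer than `length` are counted, so a short word is
    never treated as its own ending.
    """
    endings = Counter(w[-length:] for w in tokenize(text) if len(w) > length)
    total = sum(endings.values())
    return [(ending, count / total) for ending, count in endings.most_common(n)]


def lines_in_context(lines, word, context=5):
    """Yield (line_number, window) for each line containing `word`.

    `window` holds up to `context` consecutive lines centred on the match;
    it is shorter near the start or end of the input.
    """
    half = context // 2
    target = word.lower()
    for i, line in enumerate(lines):
        if target in tokenize(line):
            start = max(0, i - half)
            yield i + 1, lines[start:i + half + 1]
```

To use it, read the cleaned text into a string for the first two functions
(e.g. most_common_words(text, 8000)) and into a list of lines for the third.
A three-letter ending is crude; ranking endings of several lengths, or using
a proper suffix list, would probably serve better.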