wolit@mhuxd.UUCP (Jan Wolitzky) (12/11/85)
> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?

Try:

deroff -w filename | dd conv=lcase 2>/dev/null | sort -u | wc -l

"deroff -w" breaks the file up into single words, one per line.
"dd" converts everything to lower case (so "word" and "Word" count
as the same thing).  ("dd" is verbose, so I redirect stderr.)
"sort -u" keeps just one copy of each line.  "wc -l" counts the lines.

If you're going to run this frequently, stick it in a file, make it
executable, replace "filename" with "$*" so you can pass it file
names as arguments, and you're off.
-- 
Jan Wolitzky, AT&T Bell Labs, Murray Hill, NJ; 201 582-2998; mhuxd!wolit
(Affiliation given for identification purposes only)
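For reference, the executable-file form described in that last
paragraph might look like the following (the script name "uwc" is
illustrative, not from the original posting):

	# uwc -- count unique words in the named files
	# assumes deroff(1) and a dd(1) that supports conv=lcase
	deroff -w $* | dd conv=lcase 2>/dev/null | sort -u | wc -l

	$ chmod +x uwc
	$ uwc chapter1 chapter2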
heiby@cuae2.UUCP (Heiby) (12/11/85)
Here's something I threw together.  This sequence assumes that case
is not significant.  Also, its idea of what is a word may not match
yours.  For example, this will count troff controls as words.  The
first "tr" is from the SVR2 tr man page.  The text says, "The
following example creates a list of all the words in file1 one per
line in file2, where a word is taken to be a maximal string of
alphabetics."  (I am using pipes rather than files, though.)

cat FILE |                       # FILE is the input file
tr -cs "[A-Z][a-z]" "[\012*]" |  # split the words
tr "[A-Z]" "[a-z]" |             # make all lower case
sort |                           # sort them
uniq |                           # remove duplicates
wc -l                            # display final count
-- 
Ron Heiby {NAC|ihnp4}!cuae2!heiby  Moderator: mod.newprod & mod.unix
AT&T-IS, /app/eng, Lisle, IL  (312) 810-6109
"I am not a number!  I am a free man!" (#6)
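As a quick illustration of what the stages do (the sample file is
invented for this example), running everything but the final count
over a one-line file gives:

	$ cat FILE
	The cat saw the cat.
	$ cat FILE | tr -cs "[A-Z][a-z]" "[\012*]" |
	> tr "[A-Z]" "[a-z]" | sort | uniq
	cat
	saw
	the

so the trailing "wc -l" would report 3 unique words.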
friesen@psivax.UUCP (Stanley Friesen) (12/12/85)
In article <3699@mhuxd.UUCP> wolit@mhuxd.UUCP (Jan Wolitzky) writes:
>> I need a way to count unique words in a document.
>> Does any one have suggestions on a simple way to do this?
>
>Try:
>
>deroff -w filename | dd conv=lcase 2>/dev/null | sort -u | wc -l
>
	This looks quite inefficient; tr will do the case conversion
much more efficiently than dd, and it can also split the file into
one-word lines.  So try:

	tr 'A-Z\011 ' 'a-z\012' < filename | sort -u | wc -l

or

	deroff -w filename | tr 'A-Z' 'a-z' | sort -u | wc -l

depending on whether you wish to remove nroff macros or not.
-- 
Sarima (Stanley Friesen)
UUCP: {ttidca|ihnp4|sdcrdcf|quad1|nrcvax|bellcore|logico}!psivax!friesen
ARPA: ttidca!psivax!friesen@rand-unix.arpa
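One behavioral difference worth noting (an observation added here,
not from the original posting): the first form maps only tabs and
blanks to newlines, so punctuation stays attached to the words, and
"cat" and "cat." are counted as two different words:

	$ echo 'The cat saw the cat.' | tr 'A-Z\011 ' 'a-z\012' | sort -u
	cat
	cat.
	saw
	the

The second form avoids this, since "deroff -w" emits one word per
line with the punctuation stripped.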
darin@ut-dillo.UUCP (Darin Adler) (12/13/85)
Here is the method I normally use to count words:

tr A-Z a-z | tr -cs -a-z0-9\'\" '\012' | sort -u | wc -l

The first "tr" command takes care of capitalization.  The second
"tr" command separates the file into a word per line (where a word
is a sequence of characters [-A-Za-z0-9'"]).  The "sort" command
eliminates duplicates and the "wc" gives us the number of lines in
the result.
-- 
Darin Adler	{gatech,harvard,ihnp4,seismo}!ut-sally!ut-dillo!darin
"Such a mass of motion -- do not know where it goes"	P. Gabriel
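Because this word set includes hyphens and apostrophes, this variant
keeps hyphenated words and contractions intact, where the
alphabetics-only pipelines above would split them apart.  A small
made-up example (note the pipeline reads its standard input):

	$ echo "Jan's re-usable test" | tr A-Z a-z |
	> tr -cs -a-z0-9\'\" '\012' | sort -u
	jan's
	re-usable
	test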
carl@bdaemon.UUCP (carl) (12/13/85)
> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?

The following is a fancy version of what you want.  NOTE: The precise
syntax of 'tr' varies among versions, so some diddling may be needed.
Good Luck!

------------------------------------------------------------
cat $* |                  # tr reads the standard input
tr "[A-Z]" "[a-z]" |      # Convert all upper case to lower case
tr -cs "[a-z]\'" "\012" | # Replace each run of characters outside
                          # a-z and ' with a newline, i.e. one word
                          # per line
sort |                    # uniq expects sorted input
uniq -c |                 # Count the number of times each word appears
sort +0nr +1d |           # Sort first from most to least frequent,
                          # then alphabetically
pr -w80 -4 -h "Concordance for $*"  # Print in four columns
------------------------------------------------------------

Carl Brandauer
daemon associates, Inc.
1760 Sunset Boulevard
Boulder, CO 80302
303-442-1731
{allegra|amd|attunix|cbosgd|ucbvax|ut-sally}!nbires!bdaemon!carl
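A side note on the "sort +0nr +1d" line: that is the old-style key
syntax (numeric, reversed, on the count field, then dictionary order
on the word).  On a system with a POSIX sort (an assumption beyond
the original posting), the same keys would be spelled:

	sort -k1nr -k2d |         # most frequent first, then alphabetical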
steve@jplgodo.UUCP (Steve Schlaifer x3171 156/224) (12/13/85)
> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?
> gpw

A simple (Bourne) shell script on System V is:

awk '{ for(i=1; i<=NF; ++i) print $i }' file | sort | uniq -c

which prints a sorted list of the words in *file*, each preceded by
a count of the number of times it appeared in the file.  I expect
this will work on other systems but don't have enough experience to
really know.

To just count the number of distinct words in file, change "uniq -c"
to "uniq | wc -l".  To list words that occur only once, add

	| awk '$1==1 { print $2 }'

to the end of the pipe.  There are many other variations on this
that will occur to you as you play with it.

Enjoy,
Steve Schlaifer ( {group3 | smeagol}!jplgodot!steve )
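For convenience, the variations above can hang off one small script
(the name "wfreq" and the layout are illustrative; the pipelines
themselves are from the article above):

	# wfreq -- print each word of a file with its frequency
	awk '{ for(i=1; i<=NF; ++i) print $i }' $1 | sort | uniq -c

	$ wfreq chapter1                            # full frequency list
	$ wfreq chapter1 | wc -l                    # number of distinct words
	$ wfreq chapter1 | awk '$1==1 { print $2 }' # words used exactly once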