[net.unix] Unique Word Counter Needed

wolit@mhuxd.UUCP (Jan Wolitzky) (12/11/85)

> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?

Try:

deroff -w filename | dd conv=lcase 2>/dev/null | sort -u | wc -l

"deroff -w" breaks the file up into single words, one per line.
"dd" converts everything to lower case (so "word" and "Word" count as
    the same thing). ("dd" is verbose, so I redirect stderr.)
"sort -u" keeps just one copy of each line.
"wc -l" counts the lines.

If you're going to run this frequently, stick it in a file, make it
executable, replace "filename" with "$*" so you can pass it file names
as arguments, and you're off.
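A minimal sketch of that script (the name "uniqwords" is my own invention;
it assumes deroff(1) is on hand, and uses "$@" rather than "$*" so each
file name comes through as a separate argument):

```shell
#!/bin/sh
# uniqwords -- count the unique words, ignoring case, in the named files.
deroff -w "$@" | dd conv=lcase 2>/dev/null | sort -u | wc -l
```

Invoke it as "uniqwords chapter1 chapter2" and it prints a single number.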
-- 
Jan Wolitzky, AT&T Bell Labs, Murray Hill, NJ; 201 582-2998; mhuxd!wolit
(Affiliation given for identification purposes only)

heiby@cuae2.UUCP (Heiby) (12/11/85)

Here's something I threw together.  This sequence assumes that case
is not significant.  Also, its idea of what is a word may not match
yours.  For example, this will count troff controls as words.  The
first "tr" is from the SVR2 tr man page.  The text says, "The
following example creates a list of all the words in file1 one per
line in file2, where a word is taken to be a maximal string of
alphabetics."  (I am using pipes rather than files, though.)

cat FILE |			# FILE is the input file
tr -cs "[A-Z][a-z]" "[\012*]" |	# split the words
tr "[A-Z]" "[a-z]" |		# make all lower case
sort |				# sort them
uniq |				# remove duplicates
wc -l				# display final count
-- 
Ron Heiby {NAC|ihnp4}!cuae2!heiby   Moderator: mod.newprod & mod.unix
AT&T-IS, /app/eng, Lisle, IL	(312) 810-6109
"I am not a number!  I am a free man!" (#6)

friesen@psivax.UUCP (Stanley Friesen) (12/12/85)

In article <3699@mhuxd.UUCP> wolit@mhuxd.UUCP (Jan Wolitzky) writes:
>> I need a way to count unique words in a document.
>> Does any one have suggestions on a simple way to do this?
>
>Try:
>
>deroff -w filename | dd conv=lcase 2>/dev/null | sort -u | wc -l
>
	This looks quite inefficient; tr will do the case conversion
much more efficiently than dd, and it can also split the file into
one-word lines.  So try:

tr 'A-Z\011 ' 'a-z\012' < filename | sort -u | wc -l

or

deroff -w filename | tr 'A-Z' 'a-z' | sort -u | wc -l

depending on whether you wish to remove nroff macros or not.
-- 

				Sarima (Stanley Friesen)

UUCP: {ttidca|ihnp4|sdcrdcf|quad1|nrcvax|bellcore|logico}!psivax!friesen
ARPA: ttidca!psivax!friesen@rand-unix.arpa

darin@ut-dillo.UUCP (Darin Adler) (12/13/85)

Here is the method I normally use to count words:

tr A-Z a-z | tr -cs -a-z0-9\'\" '\012' | sort -u | wc -l

The first "tr" command takes care of capitalization.  The second "tr" command
separates the file into a word per line (where a word is a sequence of
characters [-A-Za-z0-9'"]).  The "sort" command eliminates duplicates and
the "wc" gives us the number of lines in the result.
-- 
Darin Adler	{gatech,harvard,ihnp4,seismo}!ut-sally!ut-dillo!darin

"Such a mass of motion -- do not know where it goes"	P. Gabriel

carl@bdaemon.UUCP (carl) (12/13/85)

> 
> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?

The following is a fancy version of what you want.  NOTE:  The precise
syntax of 'tr' varies among versions, so some diddling may be needed.
Good Luck!

------------------------------------------------------------
cat $* |			# tr reads the standard input
tr "[A-Z]" "[a-z]" |		# Convert all upper case to lower case
tr -cs "[a-z]\'" "\012" |	# Replace runs of characters other than
				# a-z and ' with a newline,
				# i.e. one word per line

sort |				# uniq expects sorted input
uniq -c |			# Count the number of times each word appears
sort +0nr +1d |			# Sort first from most to least frequent,
				# then alphabetically.

pr -w80 -4 -h "Concordance for $*" 	# Print in four columns
------------------------------------------------------------
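On the portability point above: the bracketed ranges are System V usage,
where tr wants "[a-z]" notation; the V7/Berkeley tr writes its ranges
bare and treats the brackets as literal characters.  A quick check with
the bare form:

```shell
# V7/BSD-style tr: ranges written without brackets.
echo "Hello World" | tr A-Z a-z
# prints: hello world
```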

Carl Brandauer
daemon associates, Inc.
1760 Sunset Boulevard
Boulder, CO 80302
303-442-1731
{allegra|amd|attunix|cbosgd|ucbvax|ut-sally}!nbires!bdaemon!carl

steve@jplgodo.UUCP (Steve Schlaifer x3171 156/224) (12/13/85)

> 
> I need a way to count unique words in a document.
> Does any one have suggestions on a simple way to do this?
> gpw
> 
> -- 
A simple (Bourne) shell command line on System V is:

awk '{ for(i=1; i<=NF; ++i) print $i }' file | sort | uniq -c

This prints a sorted list of the words in *file*, each preceded by a count
of the number of times it appeared.  I expect this will work on other
systems, but I don't have enough experience to really know.  To just count
the number of distinct words in the file, change "uniq -c" to "uniq | wc -l".
To list words that only occur once, add 

	| awk '$1==1 { print $2 }'

to the end of the pipe.  There are many other variations on this that will
occur to you as you play with it.
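Putting those pieces together, the whole singleton-finding pipe reads as
below (I've substituted a printf for the input file so the example is
self-contained):

```shell
# Sample text standing in for "file" above:
printf 'the cat saw the dog\n' |
awk '{ for (i = 1; i <= NF; ++i) print $i }' |	# split: one word per line
sort |						# group duplicates together
uniq -c |					# count each distinct word
awk '$1 == 1 { print $2 }'			# keep words seen exactly once
# prints: cat, dog, saw (one per line)
```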

	Enjoy, Steve Schlaifer  ( {group3 | smeagol}!jplgodot!steve )