[comp.unix.questions] Counting characters with unix utilities

rouben@math9.math.umbc.edu (09/24/90)

How can I count the number of occurrences of a given character in a file?
It can be done rather trivially in C, but I wonder if it can also be done
using standard unix utilities like awk, sed, tr, wc, etc.

The closest  I have come to this is the following construction:

cat file | tr -c 'A' '' | wc -c

which attempts to count the number of occurrences of the character "A"
in the file.  The "tr" command replaces all characters different from
"A" by the null character, then "wc" counts all characters in its input
(unfortunately) also counting the null characters :-(
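
The effect described here can be reproduced with a visible replacement
character.  This is only a sketch: what tr does with an empty second
string differs between implementations, so it assumes 'x' as the
replacement, and the sample text is made up.

```shell
# Translating never changes the length of the stream, so wc -c still
# sees every character.  'x' is an assumed stand-in for the replacement.
sample='BANANA'

translated=$(printf '%s\n' "$sample" | tr -c 'A' 'x')
total=$(printf '%s\n' "$sample" | tr -c 'A' 'x' | wc -c)

echo "$translated"   # every non-A character, the newline included, is now x
echo "$total"        # size of the whole input, not the number of A's
```

Because translation is one-for-one, the count can only drop if the
unwanted characters are removed from the stream rather than replaced.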

I feel that I am missing something, and that there should be an easy way
to count characters a la unix.  Any hints?

[If it matters, the operating system is ultrix and the shells are sh and csh.]

--
Rouben Rostamian                               Telephone: (301) 455-2458
Department of Mathematics and Statistics       e-mail:
University of Maryland Baltimore County        rostamian@umbc.bitnet
Baltimore, MD 21228,  U.S.A.                   rostamian@umbc3.umbc.edu

emv@math.lsa.umich.edu (Edward Vielmetti) (09/24/90)

In article <4002@umbc3.UMBC.EDU> rouben@math9.math.umbc.edu writes:

   How can I count the number of occurrences of a given character in a file?
   It can be done rather trivially in C, but I wonder if it can also be done
   using standard unix utilities like awk, sed, tr, wc, etc.

   The closest  I have come to this is the following construction:

   cat file | tr -c 'A' '' | wc -c

This is what I came up with in perl, after about 15 minutes of digging
in the perl info pages:

   cat file | perl -ne '$c += tr/A/A/; if (eof()) {print "$c\n";}'

Going back to the tr man page this one seems to work too:

   cat file | tr -cd 'A' | wc -c

I don't see an easy perl equivalent of the "tr -cd" idiom.
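
For reference, the way -c and -d combine can be sketched as a small
shell function.  The name count_char and the sample input are made up
for illustration, and the sketch assumes an ordinary single character
that tr does not treat specially.

```shell
# count_char CHAR: read standard input, print how many times CHAR occurs.
# -c complements the set (everything except CHAR) and -d deletes that
# complement, so only copies of CHAR are left for wc to count.
count_char() {
    tr -cd "$1" | wc -c
}

printf 'BANANA\nAARDVARK\n' | count_char A
```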

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
moderator, comp.archives

ted@nmsu.edu (Ted Dunning) (09/24/90)

i didn't want to answer this one, but

In article <EMV.90Sep23181658@picasso.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:

   In article <4002@umbc3.UMBC.EDU> rouben@math9.math.umbc.edu writes:

	...
      cat file | tr -c 'A' '' | wc -c

	...

   ...
   cat file | tr -cd 'A' | wc -c



ed must have been kidding when he left the cat in place, instead of

tr -cd 'A' < file | wc -c

--
ted@nmsu.edu					+---------+
						| In this |
						|  style  |
						|__10/6___|

skwu@boulder.Colorado.EDU (WU SHI-KUEI) (09/24/90)

The solution:

	cat file | tr -c 'A' '' | wc -c

is very close.  Just change it to:

	cat file | tr -cd 'A' | wc -c

and you'll count nothing but A's.

mwm@raven.pa.dec.com (Mike (My Watch Has Windows) Meyer) (09/25/90)

   ed must have been kidding when he left the cat in place, instead of

   tr -cd 'A' < file | wc -c

Why? Some people prefer the form

	cat file | utility args

to

	utility args < file

Why change from one to the other, except for efficiency? And even
then, there isn't enough difference to bother with for a command line.
For a script, you'd want to use the latter, though. But then the
difference is hidden from the user.

	<mike

--
Il brilgue: les toves lubricilleux			Mike Meyer
Se gyrent en vrillant dans le guave,			mwm@relay.pa.dec.com
Enmimes sont les gougebosqueux,				decwrl!mwm
Et le momerade horsgrave.

emv@math.lsa.umich.edu (Edward Vielmetti) (09/25/90)

In article <TED.90Sep24090053@kythera.nmsu.edu> ted@nmsu.edu (Ted Dunning) writes:

      cat file | tr -cd 'A' | wc -c

   ed must have been kidding when he left the cat in place, instead of
   tr -cd 'A' < file | wc -c

nope.  I cat files all the time.  That way when I go back to edit this command
to work on the output of a command instead of the contents of a file
there's no extra work or thinking involved.

      grep ^Subject file | tr -cd 'A' | wc -c

--Ed

Edward Vielmetti, U of Michigan math dept <emv@math.lsa.umich.edu>
moderator, comp.archives

"No funding, no fixing."

merlyn@iwarp.intel.com (Randal Schwartz) (09/25/90)

In article <MWM.90Sep24143855@raven.pa.dec.com>, mwm@raven (Mike (My Watch Has Windows) Meyer) writes:
| Why? Some people prefer the form
| 
| 	cat file | utility args
| 
| to
| 
| 	utility args < file
| 
| Why change from one to the other, except for efficiency? And even
| then, there isn't enough difference to bother with for a command line.
| For a script, you'd want to use the latter, though. But then the
| difference is hidden from the user.

Efficiency.  Yow.  Try using a slow machine sometime, where five people
have bogged it down with extra "cat" processes.  Besides, if you want
similar forms, use

	<file utility args

which changes to

	<file util2 args | utility args

and

	<file util3 args | util2 args | utility args

see... pretty orthogonal.  (What?  You didn't know that you could
stick redirects over there?  Shame on you! :-)

I sometimes do

	<fromfile cmd arg arg arg >tofile

so that when I edit the command line, it stays clean.
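
A quick check that the placement of the redirect does not change the
result.  This is a sketch only; the scratch file and the variable names
are made up for the illustration.

```shell
# Hypothetical scratch file, created only for this comparison.
tmp=$(mktemp) || exit 1
printf 'BANANA\nAARDVARK\n' > "$tmp"

trailing=$(tr -cd 'A' < "$tmp" | wc -c)        # utility args < file
leading=$(< "$tmp" tr -cd 'A' | wc -c)         # <file utility args
with_cat=$(cat "$tmp" | tr -cd 'A' | wc -c)    # cat file | utility args

echo "trailing redirect: $trailing"
echo "leading redirect:  $leading"
echo "extra cat:         $with_cat"

rm -f "$tmp"
```

All three forms feed tr the same bytes; the only difference is whether
an additional cat process sits in front of the pipeline.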

Just another shell hacker,
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

bob@wyse.wyse.com (Bob McGowen x4312 dept208) (09/25/90)

In article <EMV.90Sep24165437@picasso.math.lsa.umich.edu> emv@math.lsa.umich.edu (Edward Vielmetti) writes:
>In article <TED.90Sep24090053@kythera.nmsu.edu> ted@nmsu.edu (Ted Dunning) writes:
>
>      cat file | tr -cd 'A' | wc -c
>
>   ed must have been kidding when he left the cat in place, instead of
....
>
>nope.  I cat files all the time.  That way when I go back to edit this command
>to work on the output of a command instead of the contents of a file
...

Also, a script using:

	cat $* | tr -cd 'A' | wc -c

can use command line args OR read its standard input.
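
A sketch of that idea as a shell function, with made-up names.  It uses
"$@" where the line above has $*, a substitution made here so that file
names containing spaces survive; with no arguments cat still falls back
to standard input.

```shell
# count_A [FILE...]: count A's in the named files, or in standard input
# when no files are given (cat without arguments copies its stdin).
count_A() {
    cat "$@" | tr -cd 'A' | wc -c
}

# Hypothetical scratch file with a space in its name.
dir=$(mktemp -d) || exit 1
printf 'ALABAMA\n' > "$dir/some file"

count_A "$dir/some file"       # reads the named file
printf 'BANANA\n' | count_A    # no arguments: reads standard input

rm -rf "$dir"
```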

Bob McGowan  (standard disclaimer, these are my own ...)
Product Support, Wyse Technology, San Jose, CA
..!uunet!wyse!bob
bob@wyse.com

ror@grassys.bc.ca (Richard O'Rourke) (09/25/90)

In article <4002@umbc3.UMBC.EDU>, rouben@math9.math.umbc.edu writes:
> How can I count the number of occurrences of a given character in a file?
[ stuff deleted ]
> 
> cat file | tr -c 'A' '' | wc -c
> 
> which attempts to count the number of occurrences of the character "A"
> in the file.  The "tr" command replaces all characters different from
> "A" by the null character, then "wc" counts all characters in its input
> (unfortunately) also counting the null characters :-(

I'm not sure that what you think `tr` does in this case is what
happens in reality.  I respectfully suggest re-reading the tr man page.

> 
> I feel that I am missing something, and that there should be an easy way
> to count characters a la unix.  Any hints?

I did not test this extensively, and I'm sure that it will work only on
textual files.  Sure to be bettered and/or critiqued:

#
#  Count # of 'A' chars in a file 
#  use $0 filename
#
set `sed 's/[^A]//g
     /^$/d
' $1 | sed 's/ //g' | wc`
echo $3 - $2 | bc
#
#  End of script

> 
> [If it matters, the operating system is ultrix and the shells are sh and csh.]

The above seems to work with sh.

> 
> --
> Rouben Rostamian                               Telephone: (301) 455-2458
> Department of Mathematics and Statistics       e-mail:
> University of Maryland Baltimore County        rostamian@umbc.bitnet
> Baltimore, MD 21228,  U.S.A.                   rostamian@umbc3.umbc.edu

Richard O'Rourke: (604)438-8249      | Grass Root Systems: 436-1995
UUCP: uunet!van-bc!mplex!grassys!ror | Smart UUCP: ror@grassys.bc.ca
ror@grassys.wimsey.bc.ca             |

ellis@motcid.UUCP (John T Ellis) (09/25/90)

In article <4002@umbc3.UMBC.EDU> rouben@math9.math.umbc.edu () writes:
>How can I count the number of occurrences of a given character in a file?
>It can be done rather trivially in C, but I wonder if it can also be done
>using standard unix utilities like awk, sed, tr, wc, etc.
>
>The closest  I have come to this is the following construction:
>
>cat file | tr -c 'A' '' | wc -c
>
>which attempts to count the number of occurrences of the character "A"
>in the file.  The "tr" command replaces all characters different from
>"A" by the null character, then "wc" counts all characters in its input
>(unfortunately) also counting the null characters :-(
[Text Deleted]

Try the following:

 cat file | tr -cs A-Za-z A-Za-z'\012' | sort | uniq -c

If I understand tr, which is not necessarily true at this ungodly hour
of the morning :-), this should take the first occurrence of either
A-Z or a-z and map it to A-Z or a-z with a line feed.  Hence, you get
a long list with single characters.  Sort it and push it through
the uniq filter, which with the -c option tells you the number
of times a character appeared.

Note: This will differentiate between A and a.
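
The one-character-per-line listing described above can also be produced
with fold.  The sketch below is an alternative, not the command as
posted: it assumes a fold that accepts -w 1 and single-byte text, and
the name letter_histogram is made up.

```shell
# letter_histogram: read standard input, print "letter count" pairs.
#   tr -cd     keeps letters only
#   fold -w 1  puts each letter on a line of its own
#   sort       runs in the C locale so that A and a stay distinct
#   uniq -c    prefixes each distinct letter with its count
#   awk        drops the padding uniq adds and prints "letter count"
letter_histogram() {
    tr -cd 'A-Za-z' |
    fold -w 1 |
    LC_ALL=C sort |
    uniq -c |
    awk '{ print $2, $1 }'
}

printf 'Banana\nAardvark\n' | letter_histogram
```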

John
-- 
---------------------------------------------------+----------------------------
Any sufficiently advanced technology               | John T. Ellis  708-632-7857
   is indistinguishable from magic.  :-}           |     Motorola Cellular
                                                   |   ...uunet!motcid!ellis

dmt@PacBell.COM (Dave Turner) (09/26/90)

In article <4002@umbc3.UMBC.EDU> rouben@math9.math.umbc.edu () writes:
>How can I count the number of occurrences of a given character in a file?
>It can be done rather trivially in C, but I wonder if it can also be done
>using standard unix utilities like awk, sed, tr, wc, etc.
>
>I feel that I am missing something, and that there should be an easy way
>to count characters a la unix.  Any hints?

The following will count all the occurrences of all character types in
a file. Simple modifications could limit it to those of interest.

cat file | sed -n -e "s/./&\\
/gp" | sort | uniq -c

Note: the first line of output is the number of newline characters.
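
One possible form of the "simple modification" is to filter with tr
before splitting.  This is a sketch under assumptions: the function name
histogram_of is made up, its argument is taken to be a plain tr
character set, and the extra sed only drops an empty trailing line that
some sed implementations may produce.

```shell
# histogram_of SET: count only the characters named in SET.
histogram_of() {
    tr -cd "$1" |
    sed 's/./&\
/g' |
    sed '/^$/d' |
    LC_ALL=C sort |
    uniq -c
}

printf 'BANANA\nAARDVARK\n' | histogram_of 'AN'
```

Since tr has already removed the newlines, the newline count no longer
appears in the output.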




-- 
Dave Turner	415/823-2001	{att,bellcore,sun,ames,decwrl}!pacbell!dmt

george@hls0.hls.oz (George Turczynski) (09/27/90)

In article <4002@umbc3.UMBC.EDU>, rouben@math9.math.umbc.edu writes:

	[...Deleted...]

> The closest  I have come to this is the following construction:
> 
> cat file | tr -c 'A' '' | wc -c
> 
> which attempts to count the number of occurrences of the character "A"
> in the file.

	[...Deleted...]

OK, try this:

	awk -F'A' '{ sum+= (NF-1) } END { print sum }' file

The single quotes around the "A" here are only to point out the "A",
and aren't really necessary.  It simply makes "A" the field separator
and adds the number of fields (NF) less one to the total.  You will
see why the "-1" is necessary if you think about it.
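
To make the "-1" concrete: a line containing n separators splits into
n+1 fields.  A quick check, with a made-up function name and sample
lines:

```shell
# show_fields: for each input line, print NF and NF-1 with A as the
# field separator.
show_fields() {
    awk -F'A' '{ print $0 ": NF=" NF ", count=" (NF - 1) }'
}

printf 'BANANA\nxyz\n' | show_fields
```

BANANA splits into B, N, N and an empty trailing field, four fields for
three A's; a line without any A is a single field.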

On some systems you may have to initialize sum to zero, with:-
	BEGIN { sum= 0 }

> [If it matters, the operating system is ultrix and the shells are sh and csh.]

Just for interest's sake, this is under SunOS 4.0.3.

I trust this is what you were looking for !

-- 
George P. J. Turczynski,   Computer Systems Engineer. Highland Logic Pty Ltd.
ACSnet: george@highland.oz |^^^^^^^^^^^^^^^^^^^^^^^^| Suite 1, 348-354 Argyle St
Phone:  +61 48 683490      |  Witty remarks are as  | Moss Vale, NSW. 2577
Fax:    +61 48 683474      |  hard to come by as is | Australia.
---------------------------   space to put them !    ---------------------------

haberman@msi.umn.edu (Joe Habermann) (09/29/90)

george@hls0.hls.oz (George Turczynski) writes:

>OK, try this:

>	awk -F'A' '{ sum+= (NF-1) } END { print sum }' file

This is close.  Doesn't seem to work when the file contains empty
lines, though.  For an empty line NF = 0 and the awk will add -1 to the sum.

How about:

	awk -F'A' '{ if (NF > 0) sum += (NF-1) } END { print sum }' file
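
A quick check of the guard on made-up input that contains an empty
line.  An empty line has NF = 0, so the unguarded version subtracts one
for it.  The function names are hypothetical, and the "sum + 0" in the
END block is an addition of this sketch so that input with no lines at
all prints 0 rather than an empty string.

```shell
count_unguarded() {
    awk -F'A' '{ sum += (NF - 1) } END { print sum + 0 }'
}

count_guarded() {
    awk -F'A' '{ if (NF > 0) sum += (NF - 1) } END { print sum + 0 }'
}

printf 'BANANA\n\nxyz\n' | count_unguarded   # the empty line costs one
printf 'BANANA\n\nxyz\n' | count_guarded     # the empty line is skipped
```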

Joe Habermann / haberman@msi.umn.edu

bob@wyse.wyse.com (Bob McGowen x4312 dept208) (09/29/90)

In article <1990Sep28.173033.292@msi.umn.edu> haberman@msi.umn.edu (Joe Habermann) writes:
>george@hls0.hls.oz (George Turczynski) writes:
>
...
Deleted examples of awk scripts.
...

Original postings on this topic used tr and wc.  Following that line
I decided to try my hand at a script for counting characters.  In the
meantime things seem to have moved away from the "simple" solutions
into more esoteric (still interesting) ways to solve the problem.

Nevertheless, I will present my script for comment and feedback.
The basic design is to take advantage of the tr command's character-set
notation and provide a tool that will allow the user to count the
set of characters named or their inverse.  So:

	chrcnt abc file
	chrcnt -n abc file

will count all occurrences of the letters a, b and c followed by a count
of all characters that are not a, b or c.  This will work with white
space as well and handles cases where there are no matches.  The use
of cat allows you to specify one or more files on the command line or
have the script read its standard input.  One final note is that if you
should want to look for dashes and n's, use n- as the pattern (or --n,
if you want).
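
The n- ordering can be checked directly with tr.  In this sketch the
function name and the sample text are made up; the dash is placed last
in the set so that a tr which understands a-z style ranges cannot read
it as one.

```shell
# Count dashes and n's together; the set is 'n-' with the dash last.
count_dashes_and_ns() {
    tr -cd 'n-' | wc -c
}

printf 'a-n-b\n' | count_dashes_and_ns
```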

------------script follows----------

#!/bin/sh

case $# in
   0)
      # the following is because cmd aliasing can produce absolute paths
      CMD=`basename "$0"`
      printf '%s:  usage:  %s [-n] reg_expression [files...]\n' "$CMD" "$CMD" >&2
      printf '\twhere -n means not the following pattern characters.\n' >&2
      exit 1
   ;;
   1) # if only one arg it must be the pattern
      TR_ARGS=-cd
      pattern="$1"
   ;;
   *) # all other cases may or may not have -n as the first arg
      case $1 in
	 -n)
	    TR_ARGS=-d
	    pattern="$2"
	    shift;shift
	    files="$*"    # if only two args, files is null
	 ;;
	 *)
	    TR_ARGS=-cd
	    pattern="$1"
	    shift
	    files="$*"
	 ;;
      esac
   ;;
esac

cat $files |
tr $TR_ARGS "$pattern" |
wc -c

Bob McGowan  (standard disclaimer, these are my own ...)
Product Support, Wyse Technology, San Jose, CA
..!uunet!wyse!bob
bob@wyse.com

mickey@ncst.ernet.in (R Chandrasekar) (10/04/90)

In article <4002@umbc3.UMBC.EDU> rouben@math9.math.umbc.edu () writes:
>How can I count the number of occurrences of a given character in a file?
>...
>The closest  I have come to this is the following construction:
>
>cat file | tr -c 'A' '' | wc -c

Try:

    tr -dc 'A' < file | wc -c

tr -dc 'A'
  deletes all chars which are in the complement set of 'A'.
Voila etc etc! Works, but is a 'dumb' way to count chars. Is there
a better way?

>
>Rouben Rostamian                               Telephone: (301) 455-2458

  -- Chandrasekar