[comp.lang.c] Bit counting

thor@stout.ucar.edu (Rich Neitzel) (01/11/89)

Recently jim@athsys.uucp (Jim Becker) wrote:

>	The problem is the most efficient method to simply count the
>number of on bits in a sixteen bit short. This is a really easy
>exercise, but one that has a variety of different approaches. I would
>be interested in the opinions of programmers out there, an efficient
>algorithm that is understandable.
>
>	Assume that this is a basic sample algorithm:
>
>		int	
>		BitCount( num )
>		short	num;
>		{
>			int	count	= 0;
>
>			while( num ) {
>
>				count	+= (num &  1);
>				num	 = (num >> 1);
>			}
>
>			return count;
>		}
>
>	Is this efficient? Is this understandable? What is a better
>algorithm?

First a comment on this, what I call the 'classical shift' version. As
listed above, it will not function on many cpus, since it uses right
shifts. Since many cpus propagate the high bit, the loop will never
terminate.

Second, I do not consider this to be an efficient algorithm. While it
appears to be good, since it terminates when there are no more set
bits, it does not function well if there are many set bits or the only
set bit is in the last position tested. 

A common solution is to use a lookup table and parse the word into
nibbles or bytes. This is faster, but you must be certain that the
table is correct. I do not relish the thought of entering 256 table
entries. Efficiency suffers from the fact that the table lookup
requires pointer indexing.

I have found that the most efficient algorithm is one involving 'bit 
pair' addition. Attached are two routines that implement this
algorithm. They rely on 'parallel' addition. The word involved is
added as successively larger bit pairs. Both a 16 bit and 32 bit form
are provided. I have chosen not to attempt to apply it to any other
word lengths. This is because the algorithm is not truely 'portable',
that is, the idea is portable, but the implemenation is dependent on 
the word size. The larger the word size the more pairs are involved.

These routines have been tested in two environments. The first was on
a PDP-11 under RSX-11M-Plus and the second on a Motorola MV133 (20 MHz
68020) under VxWorks. In both cases the test routine was the only job.
They easily out performed the other algorithms tested. The completeing
algorithms were the 'classical' shift and nibble lookup tables. The
results were:

	Classical shift		lut		pair addition
RSX	   12.3			 7.4		  4.6
VxWorks    25			12                8

RSX times are seconds/65K iterations, VxWorks times are microseconds
per iterations. The lookup table code would increase in speed if a
byte table were used, but I doubt it would increase enough to beat the
pair addition time.

Supplied here are both the 16 and 32 bit routines, as well as the
classic and lut routines.
---------------------------------------------------------------------

/*
  Module: s_bitcnt.c
  Author: Richard E. K. Neitzel

revision history
----------------
1.0,9jan89,rekn            written


  Explanation:
      This routine takes an integer and returns the number of bits it contains
      which are set. It does this via 'parallel' addition, turning the integer
      into sets of bit pairs of increasing size. WARNING! This assumes that
      shorts are 16 bits!
*/

int s_bitcnt(word)
short word;
{
    register short i, j, k;	/* our work area */

    if (!word)			/* Why bother? */
      return(0);

    j = word;

    k = j & 0x5555;		/* Clear every other bit */
    j >>= 1;			/* Repeat for other bits */
    j &= 0x5555;		
    j += k;			/* 1st pairs */

    k = j & 0x3333;		/* Clear every two bits */
    j >>= 2;			/* shift and repeat */
    j &= 0x3333;
    j += k;			/* 2nd pairs */

    k = j & 0xff;		/* Get lower pairs */
    j &= 0xff00;		/* Get upper pairs */
    j >>= 8;
    j += k;			/* 3rd pairs */

    k = j & 0xf;		/* Get last pairs */
    j >>= 4;
    j += k;

    return(j);			/* bye */
}
---------------------------------------------------------------------
/*
  Module: l_bitcnt.c
  Author: Richard E. K. Neitzel

revision history
----------------
1.0,9jan89,rekn            written


  Explanation:
      This routine takes an integer and returns the number of bits it contains
      which are set. It does this via 'parallel' addition, turning the integer
      into sets of bit pairs of increasing size. WARNING! This assumes that
      integers are 32 bits! 
*/

int l_bitcnt(word)
int word;
{
    register int j, k;		/* Our work area */

    if (!word)			/* Why bother? */
      return(0);

    j = word;

    k = j & 0x55555555;		/* Clear every other bit */
    j >>= 1;			/* Do again for cleared bits */
    j &= 0x55555555;
    j += k;			/* 1st pairs done */

    k = j & 0x33333333;		/* Clear every two bits */
    j >>= 2;
    j &= 0x33333333;
    j += k;			/* 2nd pairs done */

    k = j & 0xffff;		/* Only need last 4 sets */
    j &= 0xffff0000;		/* and top 4 here */
    j = j >> 16;		/* ready - */
    j += k;			/* add 3rd pairs */

    k = j & 0xf0f;		/* Clear bits */
    j >>= 4;			/* Repeat */
    j &= 0xf0f;
    j += k;			/* add 4th pairs */

    k = j & 0xff;		/* Now add the 8 bit pairs */
    j >>= 8;
    j += k;

    return(j);			/* bye */
}
---------------------------------------------------------------------
int bitCount( num )
short	num;
{
    register short count = 0;
    register short tmp;

    tmp = num;

    while(tmp)		/* Note fix to avoid looping forever */ 
      {
	  count	+= (tmp &  0x8000);
	  tmp <<= 1;
      }

    return(count);
}
---------------------------------------------------------------------
short lut[] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

int luts(word)
short word;
{
    register short j, k,l;

    if (!word)
      return(0);

    j = word;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    return(l);
}

-------------------------------------------------------------------------------

			Richard Neitzel
			National Center For Atmospheric Research
			Box 3000
			Boulder, CO 80307-3000
			303-497-2057

			thor@thor.ucar.edu

    	Torren med sitt skjegg		Thor with the beard
    	lokkar borni under sole-vegg	calls the children to the sunny wall
    	Gjo'i med sitt shinn		Gjo with the pelts
    	jagar borni inn.		chases the children in.



-------------------------------------------------------------------------------

thor@stout.ucar.edu (Rich Neitzel) (01/11/89)

BTW, I just saw a posting of the 'and' algorithm for bit counting. It
suffers from the same problem as the classic shift - it loops 
repeatedly for cases of multiple set bits. I have also tested this and
find pair addition faster.

-------------------------------------------------------------------------------

			Richard Neitzel
			National Center For Atmospheric Research
			Box 3000
			Boulder, CO 80307-3000
			303-497-2057

			thor@thor.ucar.edu

    	Torren med sitt skjegg		Thor with the beard
    	lokkar borni under sole-vegg	calls the children to the sunny wall
    	Gjo'i med sitt shinn		Gjo with the pelts
    	jagar borni inn.		chases the children in.




-------------------------------------------------------------------------------

andrew@alice.UUCP (Andrew Hume) (01/11/89)

i admired the futzy but clever-looking algorithm to do parallel addition
to determine bit counts reported by rich neitzel. i myself had guessed
that table lookup by bytes was an appropriate space/time tradeoff.
i present times for 300K calls on a mips 120-5 below (8 is table lookup
for 8 bit chunks, 4 is for 4bit chunks and s is the parallel aaddition alg).
note that on the mips, the 3x speedup for 's' found on the vax isn't.
tuttle=; timex a.out 8
real        4.01
user        3.98
sys         0.03
tuttle=; timex a.out 4
real       11.61
user       11.59
sys         0.01
tuttle=; timex a.out s
real       11.33
user       11.24
sys         0.03
this was without optimising; -O3 gains a bit more:
tuttle=; timex a.out 8
real        3.64
user        3.62
sys         0.01
tuttle=; timex a.out 4
real       11.13
user       11.05
sys         0.02
tuttle=; timex a.out s
real       10.34
user       10.32
sys         0.02

thor@stout.ucar.edu (Rich Neitzel) (01/12/89)

As a followup to my posting of the "pair addition" algorithm for bit
counting, I received the following mail message from dunc@Sun.COM
(duncs home):

>That bitcounting algorithm you posted is really pretty; it's a technique I'll
>keep in mind, as it's no doubt adaptable to other things.  It turns out that
>on my machine the 8 bit table scheme does win, but not by much:

>int alternate(word) /* assumes char's are 8 bits and int's are 32 bits */
>    int word;
>{
>    static char bits[] = {
>        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
>        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
>        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
>        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
>        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
>        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
>        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
>        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
>    };
>    register int result;
>    register unsigned char *p = (unsigned char *)&word;
>    result = bits[*p++];
>    result += bits[*p++];
>    result += bits[*p++];
>    result += bits[*p];
>}
>
>index  %time    self descendents  called+self    name           index
>[5]     14.2    2.04        0.00  100000         _l_bitcnt [5]
>[6]     12.0    1.72        0.00  100000         _alternate [6]

I tested this on my VxWorks system, after making the following
changes:

	add if (!word)
		return(0);

	add return(result);
This made it perform like the pair routine and gave back the result. I
tried both a 16 and 32 bit version. I would have tested them under RSX
but I don't have a PDP anymore. The results are:

	Shift	Nibble lut	Byte lut	Pair add.
RSX	12.3	7.4		n/a		4.6
VxWorks1 25	12		8		8
VxWorks2 51	25		11		11

As before RSX times are seconds/65k iterations, while VxWorks times
are microseconds/iteration. The times marked 1 are for 16 bit words
and times marked 2 are for 32 bit words. 

I too find no real difference between the two algorithms, so I stand
corrected. For general purpose use, I would probably recommend the
byte lut approach. However, the pair addition method was originally
for use on small memory machines and embedded systems with limited
memory. In that case the difference in memory size may weigh heavily
in favor of the pair addition method. On my VxWorks system the memory
use was:
	byte lut - 372 bytes
	pair addition - 144 bytes
both for 32 bit words.

-------------------------------------------------------------------------------

			Richard Neitzel
			National Center For Atmospheric Research
			Box 3000
			Boulder, CO 80307-3000
			303-497-2057

			thor@thor.ucar.edu

    	Torren med sitt skjegg		Thor with the beard
    	lokkar borni under sole-vegg	calls the children to the sunny wall
    	Gjo'i med sitt shinn		Gjo with the pelts
    	jagar borni inn.		chases the children in.



-------------------------------------------------------------------------------

henry@utzoo.uucp (Henry Spencer) (01/16/89)

In article <1208@ncar.ucar.edu> thor@thor.ucar.edu (Rich Neitzel) writes:
>... I do not relish the thought of entering 256 table
>entries...

Who would?  Compute them.  "Work smart, not hard."
-- 
"God willing, we will return." |     Henry Spencer at U of Toronto Zoology
-Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bet@dukeac.UUCP (Bennett Todd) (01/18/89)

: This is a sharchive -- extract the files by running through sh.
echo x - README
sed 's/^X//' <<\Shar_Eof >README
XI wanted to see for myself whether the byte-at-a-time table could perform
Xcomparably to the "parallel addition" algorithm presented; so I wrote a
Xlittle benchmark code to compare. I feel (and folks are welcome to disagree
Xwith me here, obviously) that taking advantage of pointer type punning to
Xconvert from shift-and-mask extractions to bump-a-pointer-and-dereference is
Xlegitimate and (as far as I know!) portable, and not unreasonably ugly when
Xyou are already wading down in the bit-bashing.
X
XI did this using GCC 1.31 on a Sun 3/50 with 4M under SunOS 3.5 (no suntools
Xrunning). The machine wasn't quite idle; I had an rlogin to another machine
Xwhere I was reading netnews. However, it was close enough to idle that I
Xthink the user times are reasonable for comparison. I ran it several times
Xand the runs didn't vary by as much as a second in the user time. 1000000
Xiterations didn't have quite enough separation to convince me for sure this
Xwas a solid result, so I bumped it to 5000000.
X
XThe timing results came out:
X
XByte at a time table lookup:   75.0 user
X          Parallel addition:   82.7 user
X
XSo, given this coding style, and this compiler technology, on this sort of
Xarchitecture, it appears that the byte-wise table lookup isn't particularly
Xslower. It also isn't importantly faster. I personally find it more
Xunderstandable.
X
XI compared the output for the first 1000 results (on the output-generating
Xversion) and they were identical, so I hope this code is correctly computing
Xthe desired function. It compiled with the options given in the Makefile
X(optization cranked way up, fairly strict diagnostics) without error or
Xwarning messages.
X
XNote that the generator program in the comment is really a one-shot;
XI built it with a library I run against that performs system and library
Xcall error checking for me, because I am extremely lazy. It should be clear
Xhow to extract that into a stand-alone.
X
XFinally, this is my first try ever at coding a timing benchmark, so it is
Xquite possible I have committed some classic benchmark blunder. I hope not.
XThe separately-compiled "consume" procedure might not have been absolutely
Xnecessary, but it kept me from having to examine assembler or figure out
Xwhat a heavy-duty optimizing compiler might be capable of omitting.
Shar_Eof
echo x - Makefile
sed 's/^X//' <<\Shar_Eof >Makefile
X#
X# Choose one of the following four OPTIONS definitions, which
X# cover generating 1000-point validation versions on each algorithm,
X# and generating 5000000-point silent versions for timing.
X#
X
X# OPTIONS=-DLIM=5000000 -DBENCHMARK -DBYTE_ORIENTED_TABLE
XOPTIONS=-DLIM=5000000 -DBENCHMARK -DPARALLEL_ADDITION
X# OPTIONS=-DLIM=1000 -DVERIFY -DBYTE_ORIENTED_TABLE
X# OPTIONS=-DLIM=1000 -DVERIFY -DPARALLEL_ADDITION
X
XOBJ=bench.o consume.o
X
XCC=gcc
XCFLAGS=-O -g -Wall -Wwrite-strings -msoft-float -fstrength-reduce \
X       -finline-functions -I/usr/local/include $(OPTIONS)
X
Xbench : $(OBJ)
X	$(CC) $(CFLAGS) -o $@ $(OBJ)
X
X$(OBJ) : Makefile
Shar_Eof
echo x - bench.c
sed 's/^X//' <<\Shar_Eof >bench.c
X#include <stdio.h>
X
X#ifndef LIM
X#define LIM 1000000
X#endif
X
Xvoid consume(int);
Xint bitcount(int);
X
Xint main() {
X    int i;
X
X    for (i=0; i<LIM; i++)
X        consume(bitcount(i));
X
X    return(0);
X}
X
X#ifdef BYTE_ORIENTED_TABLE
Xint bitcount(int num) {
X    register unsigned char *ptr = (unsigned char *) &num;
X    register int count = 0;
X    /*
X     * The following table was generated, of course. Here's the
X     * (one-shot, quick-and-dirty) generating program (derived
X     * from a program in
X     *   Harbison and Steele, "C: A Reference Manual", Second Edition
X     * on page 175):
X     *
X     *  #include <bent.h>
X     *
X     *  char syntax_args[] = "";
X     *
X     *  int main(int argc, char **argv) {
X     *      register unsigned i;
X     *
X     *      progname = argv[0];
X     *      if (argc != 1)
X     *          syntax();
X     *
X     *      for (i = 0; i<256; i++) {
X     *          register unsigned tmp = i;
X     *          register int count = 0;
X     *          while (tmp != 0) {
X     *              tmp ^= (tmp & -tmp);
X     *              count++;
X     *          }
X     *          eprintf("%d%s", count, (i%32 == 31) ? ",\n" : ",");
X     *      }
X     *      return(0);
X     *  }
X     */
X    static char bits_per_byte[] = {
X        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
X        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
X        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
X        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
X        1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
X        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
X        2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
X        3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
X    };
X
X    /*
X     * The following lines are the loop:
X     *
X     *  for (counter = 0; counter < sizeof(int); counter++)
X     *      count += bits_per_byte[*ptr++];
X     *
X     * unrolled.
X     */
X    count += bits_per_byte[*ptr++];
X    count += bits_per_byte[*ptr++];
X    count += bits_per_byte[*ptr++];
X    count += bits_per_byte[*ptr++];
X    return(count);
X}
X#endif /* BYTE_ORIENTED_TABLE */
X
X#ifdef PARALLEL_ADDITION
X/*
X  Module: l_bitcnt.c
X  Author: Richard E. K. Neitzel
X
Xrevision history
X----------------
X1.0,9jan89,rekn            written
X
X
X  Explanation:
X      This routine takes an integer and returns the number of bits it contains
X      which are set. It does this via 'parallel' addition, turning the integer
X      into sets of bit pairs of increasing size. WARNING! This assumes that
X      integers are 32 bits! 
X*/
X
X/*
X * Excised from netnews posting, and entry point renamed "bitcount";
X * changes made for ANSI C.
X *
X * In other words, this is Richard E. K. Neitzel's work, not mine, but
X * I have monkeyed with it, so don't blame him.
X * Bennett Todd
X */
X
Xint bitcount(int word) {
X    register int j, k;      /* Our work area */
X
X    if (!word)          /* Why bother? */
X      return(0);
X
X    j = word;
X
X    k = j & 0x55555555;     /* Clear every other bit */
X    j >>= 1;            /* Do again for cleared bits */
X    j &= 0x55555555;
X    j += k;         /* 1st pairs done */
X
X    k = j & 0x33333333;     /* Clear every two bits */
X    j >>= 2;
X    j &= 0x33333333;
X    j += k;         /* 2nd pairs done */
X
X    k = j & 0xffff;     /* Only need last 4 sets */
X    j &= 0xffff0000;        /* and top 4 here */
X    j = j >> 16;        /* ready - */
X    j += k;         /* add 3rd pairs */
X
X    k = j & 0xf0f;      /* Clear bits */
X    j >>= 4;            /* Repeat */
X    j &= 0xf0f;
X    j += k;         /* add 4th pairs */
X
X    k = j & 0xff;       /* Now add the 8 bit pairs */
X    j >>= 8;
X    j += k;
X
X    return(j);          /* bye */
X}
X#endif /* PARALLEL_ADDITION */
Shar_Eof
echo x - consume.c
sed 's/^X//' <<\Shar_Eof >consume.c
X#include <stdio.h>
X
Xint printf(const char *, ...);
X
Xvoid consume(int val) {
X
X#ifdef VERIFY
X    (void) printf("%d\n", val);
X#endif /* VERIFY */
X
X#ifdef BENCHMARK
X    /* Then we simply return.... */
X#endif /* BENCHMARK */
X
X}
Shar_Eof
exit 0

-Bennett
bet@orion.mc.duke.edu