[comp.unix.questions] Bit counting

thor@stout.ucar.edu (Rich Neitzel) (01/11/89)

Recently jim@athsys.uucp (Jim Becker) wrote:

>	The problem is the most efficient method to simply count the
>number of on bits in a sixteen bit short. This is a really easy
>exercise, but one that has a variety of different approaches. I would
>be interested in the opinions of programmers out there, an efficient
>algorithm that is understandable.
>
>	Assume that this is a basic sample algorithm:
>
>		int	
>		BitCount( num )
>		short	num;
>		{
>			int	count	= 0;
>
>			while( num ) {
>
>				count	+= (num &  1);
>				num	 = (num >> 1);
>			}
>
>			return count;
>		}
>
>	Is this efficient? Is this understandable? What is a better
>algorithm?

First a comment on this, what I call the 'classical shift' version. As
listed above, it will not function on many cpus, since it uses right
shifts. Since many cpus propagate the high bit, the loop will never
terminate.

Second, I do not consider this to be an efficient algorithm. While it
appears to be good, since it terminates when there are no more set
bits, it does not function well if there are many set bits or the only
set bit is in the last position tested. 

A common solution is to use a lookup table and parse the word into
nibbles or bytes. This is faster, but you must be certain that the
table is correct. I do not relish the thought of entering 256 table
entries. Efficiency suffers from the fact that the table lookup
requires pointer indexing.

I have found that the most efficient algorithm is one involving 'bit 
pair' addition. Attached are two routines that implement this
algorithm. They rely on 'parallel' addition. The word involved is
added as successively larger bit pairs. Both a 16 bit and 32 bit form
are provided. I have chosen not to attempt to apply it to any other
word lengths. This is because the algorithm is not truely 'portable',
that is, the idea is portable, but the implemenation is dependent on 
the word size. The larger the word size the more pairs are involved.

These routines have been tested in two environments. The first was on
a PDP-11 under RSX-11M-Plus and the second on a Motorola MV133 (20 MHz
68020) under VxWorks. In both cases the test routine was the only job.
They easily out performed the other algorithms tested. The completeing
algorithms were the 'classical' shift and nibble lookup tables. The
results were:

	Classical shift		lut		pair addition
RSX	   12.3			 7.4		  4.6
VxWorks    25			12                8

RSX times are seconds/65K iterations, VxWorks times are microseconds
per iterations. The lookup table code would increase in speed if a
byte table were used, but I doubt it would increase enough to beat the
pair addition time.

Supplied here are both the 16 and 32 bit routines, as well as the
classic and lut routines.
---------------------------------------------------------------------

/*
  Module: s_bitcnt.c
  Author: Richard E. K. Neitzel

revision history
----------------
1.0,9jan89,rekn            written


  Explanation:
      This routine takes an integer and returns the number of bits it contains
      which are set. It does this via 'parallel' addition, turning the integer
      into sets of bit pairs of increasing size. WARNING! This assumes that
      shorts are 16 bits!
*/

int s_bitcnt(word)
short word;
{
    register short i, j, k;	/* our work area */

    if (!word)			/* Why bother? */
      return(0);

    j = word;

    k = j & 0x5555;		/* Clear every other bit */
    j >>= 1;			/* Repeat for other bits */
    j &= 0x5555;		
    j += k;			/* 1st pairs */

    k = j & 0x3333;		/* Clear every two bits */
    j >>= 2;			/* shift and repeat */
    j &= 0x3333;
    j += k;			/* 2nd pairs */

    k = j & 0xff;		/* Get lower pairs */
    j &= 0xff00;		/* Get upper pairs */
    j >>= 8;
    j += k;			/* 3rd pairs */

    k = j & 0xf;		/* Get last pairs */
    j >>= 4;
    j += k;

    return(j);			/* bye */
}
---------------------------------------------------------------------
/*
  Module: l_bitcnt.c
  Author: Richard E. K. Neitzel

revision history
----------------
1.0,9jan89,rekn            written


  Explanation:
      This routine takes an integer and returns the number of bits it contains
      which are set. It does this via 'parallel' addition, turning the integer
      into sets of bit pairs of increasing size. WARNING! This assumes that
      integers are 32 bits! 
*/

int l_bitcnt(word)
int word;
{
    register int j, k;		/* Our work area */

    if (!word)			/* Why bother? */
      return(0);

    j = word;

    k = j & 0x55555555;		/* Clear every other bit */
    j >>= 1;			/* Do again for cleared bits */
    j &= 0x55555555;
    j += k;			/* 1st pairs done */

    k = j & 0x33333333;		/* Clear every two bits */
    j >>= 2;
    j &= 0x33333333;
    j += k;			/* 2nd pairs done */

    k = j & 0xffff;		/* Only need last 4 sets */
    j &= 0xffff0000;		/* and top 4 here */
    j = j >> 16;		/* ready - */
    j += k;			/* add 3rd pairs */

    k = j & 0xf0f;		/* Clear bits */
    j >>= 4;			/* Repeat */
    j &= 0xf0f;
    j += k;			/* add 4th pairs */

    k = j & 0xff;		/* Now add the 8 bit pairs */
    j >>= 8;
    j += k;

    return(j);			/* bye */
}
---------------------------------------------------------------------
int bitCount( num )
short	num;
{
    register short count = 0;
    register short tmp;

    tmp = num;

    while(tmp)		/* Note the fix to avoid looping always */ 
      {
	  count	+= (tmp &  0x8000);
	  tmp <<= 1;
      }

    return(count);
}
---------------------------------------------------------------------
short lut[] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

int luts(word)
short word;
{
    register short j, k,l;

    if (!word)
      return(0);

    j = word;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    j >>= 4;

    k = j & 0xf;

    l += lut[k];

    return(l);
}

-------------------------------------------------------------------------------

			Richard Neitzel
			National Center For Atmospheric Research
			Box 3000
			Boulder, CO 80307-3000
			303-497-2057

			thor@thor.ucar.edu

    	Torren med sitt skjegg		Thor with the beard
    	lokkar borni under sole-vegg	calls the children to the sunny wall
    	Gjo'i med sitt shinn		Gjo with the pelts
    	jagar borni inn.		chases the children in.



-------------------------------------------------------------------------------

aeb@cwi.nl (Andries Brouwer) (01/16/89)

In article <1207@ncar.ucar.edu> thor@thor.ucar.edu (Rich Neitzel)
gave a number of algorithms of which his `bit pair addition' was
the fastest. My favourite routine is the following:

	/* find number of 1-bits in a */
	bitcnt(a) register int a; {
	register int ct = 0;
		while(a){
			a &= (a-1);
			ct++;
		}
		return(ct);
	}

It is twice as fast on the average as the usual bit count versions,
and comparable in speed to Neitzels routine (but much shorter).
An important advantage is that it does not depend on the word size.
Timing results:                this routine    Neitzels
        VAX 200000 iterations    12 sec         10 sec
     HARRIS 800000 iterations     7 sec         12 sec

-- 
      Andries Brouwer -- CWI, Amsterdam -- uunet!mcvax!aeb -- aeb@cwi.nl

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/17/89)

In article <7825@boring.cwi.nl> aeb@cwi.nl (Andries Brouwer) writes:
>			a &= (a-1);

Such algorithms should use unsigned integers, to avoid possible
overflow problems on certain bit patterns.

leo@philmds.UUCP (Leo de Wit) (01/17/89)

In article <7825@boring.cwi.nl> aeb@cwi.nl (Andries Brouwer) writes:
|In article <1207@ncar.ucar.edu> thor@thor.ucar.edu (Rich Neitzel)
|gave a number of algorithms of which his `bit pair addition' was
|the fastest. My favourite routine is the following:
|
|	/* find number of 1-bits in a */
|	bitcnt(a) register int a; {
|	register int ct = 0;
|		while(a){
|			a &= (a-1);
|			ct++;
|		}
|		return(ct);
|	}
|
|It is twice as fast on the average as the usual bit count versions,
|and comparable in speed to Neitzels routine (but much shorter).
|An important advantage is that it does not depend on the word size.
|Timing results:                this routine    Neitzels
|        VAX 200000 iterations    12 sec         10 sec
|     HARRIS 800000 iterations     7 sec         12 sec

But the speed DOES depend on the number of bits set in the integer!
Each iteration of the loop removes and counts the least significant
bit, so this means 32 iterations on a unsigned int, 32 bits wide, with
all bits set. In this particular situation, it may well end up being
slower than an ordinary shift-&-count-until-zero.

What input data were used for those timing results???

	  Leo.