thor@stout.ucar.edu (Rich Neitzel) (01/11/89)
Recently jim@athsys.uucp (Jim Becker) wrote: > The problem is the most efficient method to simply count the >number of on bits in a sixteen bit short. This is a really easy >exercise, but one that has a variety of different approaches. I would >be interested in the opinions of programmers out there, an efficient >algorithm that is understandable. > > Assume that this is a basic sample algorithm: > > int > BitCount( num ) > short num; > { > int count = 0; > > while( num ) { > > count += (num & 1); > num = (num >> 1); > } > > return count; > } > > Is this efficient? Is this understandable? What is a better >algorithm? First a comment on this, what I call the 'classical shift' version. As listed above, it will not function on many cpus, since it uses right shifts. Since many cpus propagate the high bit, the loop will never terminate. Second, I do not consider this to be an efficient algorithm. While it appears to be good, since it terminates when there are no more set bits, it does not function well if there are many set bits or the only set bit is in the last position tested. A common solution is to use a lookup table and parse the word into nibbles or bytes. This is faster, but you must be certain that the table is correct. I do not relish the thought of entering 256 table entries. Efficiency suffers from the fact that the table lookup requires pointer indexing. I have found that the most efficient algorithm is one involving 'bit pair' addition. Attached are two routines that implement this algorithm. They rely on 'parallel' addition. The word involved is added as successively larger bit pairs. Both a 16 bit and 32 bit form are provided. I have chosen not to attempt to apply it to any other word lengths. This is because the algorithm is not truely 'portable', that is, the idea is portable, but the implemenation is dependent on the word size. The larger the word size the more pairs are involved. These routines have been tested in two environments. The first was on a PDP-11 under RSX-11M-Plus and the second on a Motorola MV133 (20 MHz 68020) under VxWorks. In both cases the test routine was the only job. They easily out performed the other algorithms tested. The completeing algorithms were the 'classical' shift and nibble lookup tables. The results were: Classical shift lut pair addition RSX 12.3 7.4 4.6 VxWorks 25 12 8 RSX times are seconds/65K iterations, VxWorks times are microseconds per iterations. The lookup table code would increase in speed if a byte table were used, but I doubt it would increase enough to beat the pair addition time. Supplied here are both the 16 and 32 bit routines, as well as the classic and lut routines. --------------------------------------------------------------------- /* Module: s_bitcnt.c Author: Richard E. K. Neitzel revision history ---------------- 1.0,9jan89,rekn written Explanation: This routine takes an integer and returns the number of bits it contains which are set. It does this via 'parallel' addition, turning the integer into sets of bit pairs of increasing size. WARNING! This assumes that shorts are 16 bits! */ int s_bitcnt(word) short word; { register short i, j, k; /* our work area */ if (!word) /* Why bother? */ return(0); j = word; k = j & 0x5555; /* Clear every other bit */ j >>= 1; /* Repeat for other bits */ j &= 0x5555; j += k; /* 1st pairs */ k = j & 0x3333; /* Clear every two bits */ j >>= 2; /* shift and repeat */ j &= 0x3333; j += k; /* 2nd pairs */ k = j & 0xff; /* Get lower pairs */ j &= 0xff00; /* Get upper pairs */ j >>= 8; j += k; /* 3rd pairs */ k = j & 0xf; /* Get last pairs */ j >>= 4; j += k; return(j); /* bye */ } --------------------------------------------------------------------- /* Module: l_bitcnt.c Author: Richard E. K. Neitzel revision history ---------------- 1.0,9jan89,rekn written Explanation: This routine takes an integer and returns the number of bits it contains which are set. It does this via 'parallel' addition, turning the integer into sets of bit pairs of increasing size. WARNING! This assumes that integers are 32 bits! */ int l_bitcnt(word) int word; { register int j, k; /* Our work area */ if (!word) /* Why bother? */ return(0); j = word; k = j & 0x55555555; /* Clear every other bit */ j >>= 1; /* Do again for cleared bits */ j &= 0x55555555; j += k; /* 1st pairs done */ k = j & 0x33333333; /* Clear every two bits */ j >>= 2; j &= 0x33333333; j += k; /* 2nd pairs done */ k = j & 0xffff; /* Only need last 4 sets */ j &= 0xffff0000; /* and top 4 here */ j = j >> 16; /* ready - */ j += k; /* add 3rd pairs */ k = j & 0xf0f; /* Clear bits */ j >>= 4; /* Repeat */ j &= 0xf0f; j += k; /* add 4th pairs */ k = j & 0xff; /* Now add the 8 bit pairs */ j >>= 8; j += k; return(j); /* bye */ } --------------------------------------------------------------------- int bitCount( num ) short num; { register short count = 0; register short tmp; tmp = num; while(tmp) /* Note fix to avoid looping forever */ { count += (tmp & 0x8000); tmp <<= 1; } return(count); } --------------------------------------------------------------------- short lut[] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4}; int luts(word) short word; { register short j, k,l; if (!word) return(0); j = word; k = j & 0xf; l += lut[k]; j >>= 4; k = j & 0xf; l += lut[k]; j >>= 4; k = j & 0xf; l += lut[k]; j >>= 4; k = j & 0xf; l += lut[k]; return(l); } ------------------------------------------------------------------------------- Richard Neitzel National Center For Atmospheric Research Box 3000 Boulder, CO 80307-3000 303-497-2057 thor@thor.ucar.edu Torren med sitt skjegg Thor with the beard lokkar borni under sole-vegg calls the children to the sunny wall Gjo'i med sitt shinn Gjo with the pelts jagar borni inn. chases the children in. -------------------------------------------------------------------------------
thor@stout.ucar.edu (Rich Neitzel) (01/11/89)
BTW, I just saw a posting of the 'and' algorithm for bit counting. It suffers from the same problem as the classic shift - it loops repeatedly for cases of multiple set bits. I have also tested this and find pair addition faster. ------------------------------------------------------------------------------- Richard Neitzel National Center For Atmospheric Research Box 3000 Boulder, CO 80307-3000 303-497-2057 thor@thor.ucar.edu Torren med sitt skjegg Thor with the beard lokkar borni under sole-vegg calls the children to the sunny wall Gjo'i med sitt shinn Gjo with the pelts jagar borni inn. chases the children in. -------------------------------------------------------------------------------
andrew@alice.UUCP (Andrew Hume) (01/11/89)
i admired the futzy but clever-looking algorithm to do parallel addition to determine bit counts reported by rich neitzel. i myself had guessed that table lookup by bytes was an appropriate space/time tradeoff. i present times for 300K calls on a mips 120-5 below (8 is table lookup for 8 bit chunks, 4 is for 4bit chunks and s is the parallel aaddition alg). note that on the mips, the 3x speedup for 's' found on the vax isn't. tuttle=; timex a.out 8 real 4.01 user 3.98 sys 0.03 tuttle=; timex a.out 4 real 11.61 user 11.59 sys 0.01 tuttle=; timex a.out s real 11.33 user 11.24 sys 0.03 this was without optimising; -O3 gains a bit more: tuttle=; timex a.out 8 real 3.64 user 3.62 sys 0.01 tuttle=; timex a.out 4 real 11.13 user 11.05 sys 0.02 tuttle=; timex a.out s real 10.34 user 10.32 sys 0.02
thor@stout.ucar.edu (Rich Neitzel) (01/12/89)
As a followup to my posting of the "pair addition" algorithm for bit counting, I received the following mail message from dunc@Sun.COM (duncs home): >That bitcounting algorithm you posted is really pretty; it's a technique I'll >keep in mind, as it's no doubt adaptable to other things. It turns out that >on my machine the 8 bit table scheme does win, but not by much: >int alternate(word) /* assumes char's are 8 bits and int's are 32 bits */ > int word; >{ > static char bits[] = { > 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5, > 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, > 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, > 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, > 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, > 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, > 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, > 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8 > }; > register int result; > register unsigned char *p = (unsigned char *)&word; > result = bits[*p++]; > result += bits[*p++]; > result += bits[*p++]; > result += bits[*p]; >} > >index %time self descendents called+self name index >[5] 14.2 2.04 0.00 100000 _l_bitcnt [5] >[6] 12.0 1.72 0.00 100000 _alternate [6] I tested this on my VxWorks system, after making the following changes: add if (!word) return(0); add return(result); This made it perform like the pair routine and gave back the result. I tried both a 16 and 32 bit version. I would have tested them under RSX but I don't have a PDP anymore. The results are: Shift Nibble lut Byte lut Pair add. RSX 12.3 7.4 n/a 4.6 VxWorks1 25 12 8 8 VxWorks2 51 25 11 11 As before RSX times are seconds/65k iterations, while VxWorks times are microseconds/iteration. The times marked 1 are for 16 bit words and times marked 2 are for 32 bit words. I too find no real difference between the two algorithms, so I stand corrected. For general purpose use, I would probably recommend the byte lut approach. However, the pair addition method was originally for use on small memory machines and embedded systems with limited memory. In that case the difference in memory size may weigh heavily in favor of the pair addition method. On my VxWorks system the memory use was: byte lut - 372 bytes pair addition - 144 bytes both for 32 bit words. ------------------------------------------------------------------------------- Richard Neitzel National Center For Atmospheric Research Box 3000 Boulder, CO 80307-3000 303-497-2057 thor@thor.ucar.edu Torren med sitt skjegg Thor with the beard lokkar borni under sole-vegg calls the children to the sunny wall Gjo'i med sitt shinn Gjo with the pelts jagar borni inn. chases the children in. -------------------------------------------------------------------------------
henry@utzoo.uucp (Henry Spencer) (01/16/89)
In article <1208@ncar.ucar.edu> thor@thor.ucar.edu (Rich Neitzel) writes: >... I do not relish the thought of entering 256 table >entries... Who would? Compute them. "Work smart, not hard." -- "God willing, we will return." | Henry Spencer at U of Toronto Zoology -Eugene Cernan, the Moon, 1972 | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bet@dukeac.UUCP (Bennett Todd) (01/18/89)
: This is a sharchive -- extract the files by running through sh. echo x - README sed 's/^X//' <<\Shar_Eof >README XI wanted to see for myself whether the byte-at-a-time table could perform Xcomparably to the "parallel addition" algorithm presented; so I wrote a Xlittle benchmark code to compare. I feel (and folks are welcome to disagree Xwith me here, obviously) that taking advantage of pointer type punning to Xconvert from shift-and-mask extractions to bump-a-pointer-and-dereference is Xlegitimate and (as far as I know!) portable, and not unreasonably ugly when Xyou are already wading down in the bit-bashing. X XI did this using GCC 1.31 on a Sun 3/50 with 4M under SunOS 3.5 (no suntools Xrunning). The machine wasn't quite idle; I had an rlogin to another machine Xwhere I was reading netnews. However, it was close enough to idle that I Xthink the user times are reasonable for comparison. I ran it several times Xand the runs didn't vary by as much as a second in the user time. 1000000 Xiterations didn't have quite enough separation to convince me for sure this Xwas a solid result, so I bumped it to 5000000. X XThe timing results came out: X XByte at a time table lookup: 75.0 user X Parallel addition: 82.7 user X XSo, given this coding style, and this compiler technology, on this sort of Xarchitecture, it appears that the byte-wise table lookup isn't particularly Xslower. It also isn't importantly faster. I personally find it more Xunderstandable. X XI compared the output for the first 1000 results (on the output-generating Xversion) and they were identical, so I hope this code is correctly computing Xthe desired function. It compiled with the options given in the Makefile X(optization cranked way up, fairly strict diagnostics) without error or Xwarning messages. X XNote that the generator program in the comment is really a one-shot; XI built it with a library I run against that performs system and library Xcall error checking for me, because I am extremely lazy. It should be clear Xhow to extract that into a stand-alone. X XFinally, this is my first try ever at coding a timing benchmark, so it is Xquite possible I have committed some classic benchmark blunder. I hope not. XThe separately-compiled "consume" procedure might not have been absolutely Xnecessary, but it kept me from having to examine assembler or figure out Xwhat a heavy-duty optimizing compiler might be capable of omitting. Shar_Eof echo x - Makefile sed 's/^X//' <<\Shar_Eof >Makefile X# X# Choose one of the following four OPTIONS definitions, which X# cover generating 1000-point validation versions on each algorithm, X# and generating 5000000-point silent versions for timing. X# X X# OPTIONS=-DLIM=5000000 -DBENCHMARK -DBYTE_ORIENTED_TABLE XOPTIONS=-DLIM=5000000 -DBENCHMARK -DPARALLEL_ADDITION X# OPTIONS=-DLIM=1000 -DVERIFY -DBYTE_ORIENTED_TABLE X# OPTIONS=-DLIM=1000 -DVERIFY -DPARALLEL_ADDITION X XOBJ=bench.o consume.o X XCC=gcc XCFLAGS=-O -g -Wall -Wwrite-strings -msoft-float -fstrength-reduce \ X -finline-functions -I/usr/local/include $(OPTIONS) X Xbench : $(OBJ) X $(CC) $(CFLAGS) -o $@ $(OBJ) X X$(OBJ) : Makefile Shar_Eof echo x - bench.c sed 's/^X//' <<\Shar_Eof >bench.c X#include <stdio.h> X X#ifndef LIM X#define LIM 1000000 X#endif X Xvoid consume(int); Xint bitcount(int); X Xint main() { X int i; X X for (i=0; i<LIM; i++) X consume(bitcount(i)); X X return(0); X} X X#ifdef BYTE_ORIENTED_TABLE Xint bitcount(int num) { X register unsigned char *ptr = (unsigned char *) # X register int count = 0; X /* X * The following table was generated, of course. Here's the X * (one-shot, quick-and-dirty) generating program (derived X * from a program in X * Harbison and Steele, "C: A Reference Manual", Second Edition X * on page 175): X * X * #include <bent.h> X * X * char syntax_args[] = ""; X * X * int main(int argc, char **argv) { X * register unsigned i; X * X * progname = argv[0]; X * if (argc != 1) X * syntax(); X * X * for (i = 0; i<256; i++) { X * register unsigned tmp = i; X * register int count = 0; X * while (tmp != 0) { X * tmp ^= (tmp & -tmp); X * count++; X * } X * eprintf("%d%s", count, (i%32 == 31) ? ",\n" : ","); X * } X * return(0); X * } X */ X static char bits_per_byte[] = { X 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5, X 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, X 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, X 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, X 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6, X 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, X 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7, X 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8, X }; X X /* X * The following lines are the loop: X * X * for (counter = 0; counter < sizeof(int); counter++) X * count += bits_per_byte[*ptr++]; X * X * unrolled. X */ X count += bits_per_byte[*ptr++]; X count += bits_per_byte[*ptr++]; X count += bits_per_byte[*ptr++]; X count += bits_per_byte[*ptr++]; X return(count); X} X#endif /* BYTE_ORIENTED_TABLE */ X X#ifdef PARALLEL_ADDITION X/* X Module: l_bitcnt.c X Author: Richard E. K. Neitzel X Xrevision history X---------------- X1.0,9jan89,rekn written X X X Explanation: X This routine takes an integer and returns the number of bits it contains X which are set. It does this via 'parallel' addition, turning the integer X into sets of bit pairs of increasing size. WARNING! This assumes that X integers are 32 bits! X*/ X X/* X * Excised from netnews posting, and entry point renamed "bitcount"; X * changes made for ANSI C. X * X * In other words, this is Richard E. K. Neitzel's work, not mine, but X * I have monkeyed with it, so don't blame him. X * Bennett Todd X */ X Xint bitcount(int word) { X register int j, k; /* Our work area */ X X if (!word) /* Why bother? */ X return(0); X X j = word; X X k = j & 0x55555555; /* Clear every other bit */ X j >>= 1; /* Do again for cleared bits */ X j &= 0x55555555; X j += k; /* 1st pairs done */ X X k = j & 0x33333333; /* Clear every two bits */ X j >>= 2; X j &= 0x33333333; X j += k; /* 2nd pairs done */ X X k = j & 0xffff; /* Only need last 4 sets */ X j &= 0xffff0000; /* and top 4 here */ X j = j >> 16; /* ready - */ X j += k; /* add 3rd pairs */ X X k = j & 0xf0f; /* Clear bits */ X j >>= 4; /* Repeat */ X j &= 0xf0f; X j += k; /* add 4th pairs */ X X k = j & 0xff; /* Now add the 8 bit pairs */ X j >>= 8; X j += k; X X return(j); /* bye */ X} X#endif /* PARALLEL_ADDITION */ Shar_Eof echo x - consume.c sed 's/^X//' <<\Shar_Eof >consume.c X#include <stdio.h> X Xint printf(const char *, ...); X Xvoid consume(int val) { X X#ifdef VERIFY X (void) printf("%d\n", val); X#endif /* VERIFY */ X X#ifdef BENCHMARK X /* Then we simply return.... */ X#endif /* BENCHMARK */ X X} Shar_Eof exit 0 -Bennett bet@orion.mc.duke.edu