zenith@uunet.UU.NET (Steven Zenith) (01/18/89)
RE: MOBY <mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET> RE: Subject: Clever programming tricks wanted Greg Wilson of Edinburgh Concurrent Supercomputer Project writes.. > I'm looking for "clever" programming tricks. For example, > ... > I would like to hear about similar tricks that people have invented/ > discovered. In particular, I'm looking for fast ways to count the > number of 1 bits in a word (on machines that don't have this as a > machine instruction :-), .... Bit counting. This Occam function *bitcount* takes any sized array of integers, and returns the number of bits set in the array. The algorithm is particularly effective for counting the bits in a sparse array. -- occam program *bitcount*. 18/1/89 -- Steven Ericsson Zenith: Architecture Group -- INMOS Limited, Bristol, UK. #USE userio INT FUNCTION bitcount (VAL []INT array) INT count : VALOF SEQ count := 0 INT x : SEQ i = 0 FOR SIZE array SEQ x := array[i] WHILE x <> 0 SEQ x := x /\ (x MINUS 1) count := count + 1 RESULT count : [10]INT array : SEQ -- first input an array of ten integers INT char, num : SEQ i = 0 FOR 10 SEQ write.int (screen, i, 5) write.full.string (screen, " #") read.echo.char (keyboard, screen, char) read.echo.hex.int (keyboard, screen, num, char) array[i] := num -- count bits in array and write result to the screen newline (screen) write.full.string (screen, "Set bits counted in above array = ") write.hex.int (screen, bitcount (array), 8) -- wait for any key press INT any: keyboard ? any It would is a simple matter to a timers here to time the *bitcount* function, we could even look at the code generation if you like. The key to the algorithm lies in the expression (x /\ (x MINUS 1)). David May brought this to my attention some time ago, I forget to whom he attributed the original observation. Look at what happens. Take a value, subtract 1 (ignoring overflow), and BITAND it with the original value. Each time we do this we cause one bit to disappear. When the value equals zero we have counted each set bit. * Steven Ericsson Zenith Snail mail: INMOS Limited, 1000 Aztec West, | zenith@inmos.uucp Almondsbury, Bristol BS12 4SQ. UK. zenith@inmos.co.uk Voice: (UK)0454 616616x513 ** All can know beauty as beauty only because there is ugliness **
sestrich@Alliant.COM (Joe Sestrich) (01/20/89)
We've seen two algorithms for doing a population count: Use a neat trick to clear one bit at a time: A and (A-1) clears the bottom bit of A. This takes at least 4 instructions per loop-- 0 -> C LP: C + 1 -> C A - 1 -> T1 A and T1 -> T2 Jump to LP if T2 is not zero The number of cycles is dependant on the data, however, so assuming a uniform, independant probability of each bit being set, the time is [SUM i=1,32 (33-i)/2**i] * 4 = 31*4 + 1 (init cycle) = 125 The next idea is to divide the bits in half, and add them up until there is only one sum. This looks as follows: A AND 55555555 -> T1 A AND AAAAAAAA -> T2 Shift T2 right one bit -> T2 T1 + T2 -> A A AND 33333333 -> T1 A AND CCCCCCCC -> T2 Shift T2 right one bit -> T2 T1 + T2 -> A There are a total of FIVE similar copies of this for the complete algorithm (The original poster said 4). This is a common way of implementing the function in hardware, since the shifts and ands are only wires, and half of the adders need only be half adders. the total time is 5 * 4 = 20 units. Pretty good improvement! A third method is to group the bits, and use a table lookup to do the counting: 0 -> C LP: A AND M -> I T(I) + C -> C Shift A right by N bits -> A Jump to LP if A is not zero Where M is a mask with N bits set, and T is an array initialized as follows: 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 .... that is T(I) = the number of bits in I. If we pick n = 11, T has 2048 elements, and we have to go around the loop, at most 3 times, giving a time of 3*4 + 1 = 13. In fact, by unrolling the loop, we can get down to a time of 8. If indexing isn't free, the times would be 16 or 11.... Why the hell did I waste all this time on this silliness? It must be fun..... -- -- Joe Sestrich UUCP: ...!linus!alliant!sestrich Phone: (617) 486-1241 Mail: Alliant Computer Systems, One Monarch Drive, Littleton, MA 01460
dave@vax1.cc.uakron.edu (David Stoutamire) (01/21/89)
In article <4147@hubcap.UUCP>, sestrich@Alliant.COM (Joe Sestrich) writes: > There are a total of FIVE similar copies of this for > the complete algorithm (The original poster said 4). The four copies were for the 16-bit example. 32 bits does of course require five repetitions. Joe and several others mentioned using table lookup. For my particular needs, I rejected this: 1) This was (due to the nature of the contest) a program that had to be written entirely in Pascal under VM on an IBM 3090. Array/table lookups would require bounds checking, not to mention address calculation w/multiplication (yes, you can optimize your way out of these things, but in this case I couldn't). This version of Pascal happened to offer the non-`Standard' bit operations. 2) On a virtual memory machine, using large lookup tables increases the risk of being delayed for demand paging. 3) Generating the tables take time; a 64K lookup table to do two sixteen bit counts does initially require those 64K computations. Precalculation (before execution time) was not a possibility. 4) (Because this is comp.parallel) If this were being done on a parallel machine, memory bandwidth is something one wants to keep at a minimum. Local caches don't help much if your lookups don't exhibit spacial (memorial? :-) locality. This would depend on the machine, of course, but in any general purpose machine also running other users' code, one cannot count on the availability of fast-access memory. Table lookups are grand, but are the most silicon expensive way of implementing a function, over any other non-redundant means. Generality vs. speed and real estate. This doesn't mean they aren't great for specific applications, just not this particular one. By the way, I did use a lookup table for board evaluation results, but that required lookups *far* less often. The program never lost a game... -= David Stoutamire =-