[comp.parallel] clever programming tricks.

zenith@uunet.UU.NET (Steven Zenith) (01/18/89)

RE: MOBY <mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET>
RE: Subject: Clever programming tricks wanted

Greg Wilson of Edinburgh Concurrent Supercomputer Project writes..

> I'm looking for "clever" programming tricks.  For example,
> ...
> I would like to hear about similar tricks that people have invented/
> discovered.  In particular, I'm looking for fast ways to count the
> number of 1 bits in a word (on machines that don't have this as a
> machine instruction :-), ....

Bit counting.

This Occam function *bitcount* takes any sized array of integers, and
returns the number of bits set in the array.
The algorithm is particularly effective for counting
the bits in a sparse array.

-- occam program *bitcount*. 18/1/89
-- Steven Ericsson Zenith: Architecture Group
-- INMOS Limited, Bristol, UK.
#USE userio

INT FUNCTION bitcount (VAL []INT array)
  INT count :
  VALOF
    SEQ
      count := 0
      INT x :
      SEQ i = 0 FOR SIZE array
        SEQ
          x := array[i]
          WHILE x <> 0
            SEQ
              x := x /\ (x MINUS 1)
              count := count + 1
    RESULT count
:

[10]INT array :
SEQ
  -- first input an array of ten integers
  INT char, num :
  SEQ i = 0 FOR 10
    SEQ
      write.int         (screen, i, 5)
      write.full.string (screen, " #")
      read.echo.char    (keyboard, screen, char)
      read.echo.hex.int (keyboard, screen, num, char)
      array[i] := num
  -- count bits in array and write result to the screen
  newline           (screen)
  write.full.string (screen, "Set bits counted in above array = ")
  write.hex.int     (screen, bitcount (array), 8)
  -- wait for any key press
  INT any:
  keyboard ? any

It would be a simple matter to add timers here to time the *bitcount*
function; we could even look at the code generation if you like.

The key to the algorithm lies in the expression (x /\ (x MINUS 1)).
David May brought this to my attention some time ago; I forget to
whom he attributed the original observation. Look at what happens.
Take a value, subtract 1 (ignoring overflow), and BITAND it with
the original value. Each time we do this we clear the least
significant set bit. When the value reaches zero we have counted
every set bit.


 *  Steven Ericsson Zenith     Snail mail: INMOS Limited, 1000 Aztec West,
 |    zenith@inmos.uucp                    Almondsbury, Bristol BS12 4SQ. UK.
      zenith@inmos.co.uk                   Voice: (UK)0454 616616x513
      ** All can know beauty as beauty only because there is ugliness ** 

sestrich@Alliant.COM (Joe Sestrich) (01/20/89)

	We've seen two algorithms for doing a population count:
		Use a neat trick to clear one bit at a time:
			A and (A-1)  clears the lowest set bit of A.
		This takes at least 4 instructions per loop--
			0 -> C
		    LP:	C + 1 -> C
			A - 1 -> T1
			A and T1 -> T2
			Jump to LP if T2 is not zero
		The number of cycles is dependent on the data, however,
		so assuming a uniform, independent probability of each bit
		being set, the time is   [SUM i=1,32 (33-i)/2**i] * 4
		= 31*4 + 1 (init cycle) = 125

		The next idea is to divide the bits in half, and add
		them up until there is only one sum. This looks as follows:
			A AND 55555555 -> T1
			A AND AAAAAAAA -> T2
			Shift T2 right one bit -> T2
			T1 + T2 -> A

			A AND 33333333 -> T1
			A AND CCCCCCCC -> T2
			Shift T2 right two bits -> T2
			T1 + T2 -> A

		There are a total of FIVE similar copies of this for 
		the complete algorithm (The original poster said 4).
		This is a common way of implementing the function in
		hardware, since the shifts and ands are only wires,
		and half of the adders need only be half adders.
		The total time is 5 * 4 = 20 units.  Pretty good improvement!

		A third method is to group the bits, and use a table
		lookup to do the counting:
			0 -> C
		   LP:	A AND M -> I
			T(I) + C -> C
			Shift A right by N bits -> A
			Jump to LP if A is not zero
		
		Where M is a mask with N bits set, and T is an array
		initialized as follows: 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 ....
		that is T(I) = the number of bits in I.
		If we pick N = 11, T has 2048 elements, and we have to 
		go around the loop, at most 3 times, giving a time of
		3*4 + 1 = 13.  In fact, by unrolling the loop, we can get
		down to a time of 8.  If indexing isn't free, the times
		would be 16 or 11....

	Why the hell did I waste all this time on this silliness?  It
	must be fun.....

-- 
						-- Joe Sestrich

UUCP: ...!linus!alliant!sestrich       Phone: (617) 486-1241
Mail:  Alliant Computer Systems,  One Monarch Drive,  Littleton, MA  01460

dave@vax1.cc.uakron.edu (David Stoutamire) (01/21/89)

In article <4147@hubcap.UUCP>, sestrich@Alliant.COM (Joe Sestrich) writes:
> 		There are a total of FIVE similar copies of this for 
> 		the complete algorithm (The original poster said 4).

The four copies were for the 16-bit example.  32 bits does
of course require five repetitions.

Joe and several others mentioned using table lookup.  For my
particular needs, I rejected this:

   1) This was (due to the nature of the contest) a program that
      had to be written entirely in Pascal under VM on an IBM 3090.
      Array/table lookups would require bounds checking,
      not to mention address calculation w/multiplication
      (yes, you can optimize your way out of these things, but
      in this case I couldn't).  This version of Pascal happened
      to offer the non-`Standard' bit operations.

   2) On a virtual memory machine, using large lookup tables
      increases the risk of being delayed for demand paging.

   3) Generating the tables takes time; a 64K lookup table to do
      two sixteen bit counts does initially require those
      64K computations.  Precalculation (before execution time)
      was not a possibility.

   4) (Because this is comp.parallel) If this were being done
      on a parallel machine, memory bandwidth is something
      one wants to keep at a minimum.  Local caches don't
      help much if your lookups don't exhibit spatial (memorial? :-)
      locality.  This would depend on the machine, of course,
      but in any general purpose machine also running other users'
      code, one cannot count on the availability of fast-access
      memory.  Table lookups are grand, but they are the most
      silicon-expensive way of implementing a function compared
      with any other non-redundant means.  Generality vs. speed
      and real estate.
      This doesn't mean they aren't great for specific applications,
      just not this particular one.

By the way, I did use a lookup table for board evaluation results,
but that required lookups *far* less often.  The program never
lost a game...

					     -= David Stoutamire =-