[comp.arch] Leading zero and pop count

pmk@hall.cray.com (Peter Klausler) (11/03/88)

In article <998@l.cc.purdue.edu>, Herman Rubin proposes that HW be able to:
>	Find the distance to the next one in a bit stream FAST. It would
> be good to have an exception handler if one is not found.  I am considering
> algorithms not worth implementing if the operation is slow.

This is the "leading zero count" instruction common to Cray instruction sets.
The CFT77 compiler provides an intrinsic LEADZ function to get at this
instruction. The Cray-2 also has a vector version.

The leading zero count and population count (add up the 1 bits) instructions are
fertile ground for some nifty bit-twisting tricks. For example, the code
sequence I generate for the C logical negation operator ("!") is:

	S1	ZS0	; take leading zero count of argument
	S1	S1>6	; shift right six bits

This turns 0 into 1 and everything else into 0 on a 64-bit machine.

Population count can be handy when applied to vector masks.
After constructing an iota vector from a mask, the pop count gives you
its length.

I've never seen these useful operations in another architecture, though.
Have you? I'd be interested in hearing about it. I'm ignorant of HW design,
but they can't be too hard to implement.

-Peter Klausler / Cray Research compiler development / Mendota Heights, MN

smryan@garth.UUCP (Steven Ryan) (11/05/88)

>I've never seen these useful operations in another architecture, though.
>Have you? I'd be interested in hearing about it. I'm ignorant of HW design,
>but they can't be too hard to implement.
>
>-Peter Klausler / Cray Research compiler development / Mendota Heights, MN

Obviously collusion of CDC and Cray management. CDC Cyber 205 provided all
these and more. The compiler group, however, never knew anything about what
was in a Cray.
-- 
                                                   -- s m ryan
+-------------------------------+----------------------------------------------+
| Home of the brave and land of |  Congress shall make no law respecting the   |
| [deleted by the authority of  |  establishment of religion,...; or abridging |
| the Official Secrets Act].    |  the freedom of speech, or of the press;...  |
+-------------------------------+----------------------------------------------+

aglew@urbsdc.Urbana.Gould.COM (11/06/88)

>This is the "leading zero count" instruction common to Cray instruction sets.
>The CFT77 compiler provides an intrinsic LEADZ function to get at this
>instruction. The Cray-2 also has a vector version.
>
>The leading zero count and population count (add up the 1 bits) instructions are
>fertile ground for some nifty bit-twisting tricks. For example, the code
>sequence I generate for the C logical negation operator ("!") is:
>
>	S1	ZS0	; take leading zero count of argument
>	S1	S1>6	; shift right six bits
>
>This turns 0 into 1 and everything else into 0 on a 64-bit machine.
>
>Population count can be handy when applied to vector masks.
>After constructing an iota vector from a mask, the pop count gives you
>its length.
>
>I've never seen these useful operations in another architecture, though.
>Have you? I'd be interested in hearing about it. I'm ignorant of HW design,
>but they can't be too hard to implement.
>
>-Peter Klausler / Cray Research compiler development / Mendota Heights, MN

Gould had the SACZ (shift and count zeroes)  instruction,
which actually shifted on early models, but decoded in one cycle
on high end models.

MC680x0s have BFF0 (bit find first 0) (I always get the mnemonic
for this wrong). 

Many other machines have similar.

POPulation count instructions are found in most machines that have
had NSA interest.

newbery@rata.vuw.ac.nz (Michael Newbery) (11/08/88)

The Burrogues B6700 Leading One Test instruction had the mnemonic LOG2 since
that is what the op approximates to. It also had a CBON (count binary ones)
instruction.

One thing that is difficult to do in software (counter examples are welcomed
please!) and easy to do in hardware is to reverse a bit stream.

Michael Newbery <NEWBERY@rs1.vuw.ac.nz>

thomas@uplog.se (Thomas Hameenaho) (11/08/88)

In article <28200233@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>MC680x0s have BFF0 (bit find first 0) (I always get the mnemonic
>for this wrong). 
>

Not only the mnemonics!
Its BFFFO, Bit Field Find First One.
                                ^^^



-- 
Real life:	Thomas Hameenaho		Email:	thomas@uplog.{se,uucp}
Snail mail:	TeleLOGIC Uppsala AB		Phone:	+46 18 189406
		Box 1218			Fax:	+46 18 132039
		S - 751 42 Uppsala, Sweden

loeliger@mozart.uucp (Jon Loeliger) (11/10/88)

In article <10882@hall.cray.com> pmk@hall.cray.com (Peter Klausler) writes:
>Population count can be handy when applied to vector masks.
>After constructing an iota vector from a mask, the pop count gives you
>its length.
>
>I've never seen these useful operations in another architecture, though.
>Have you? I'd be interested in hearing about it. I'm ignorant of HW design,
>but they can't be too hard to implement.
>
>-Peter Klausler / Cray Research compiler development / Mendota Heights, MN

Yep, they are used elsewhere.  Convex C1 and C2 both have trailing-zero-count
and population counts.  There is actually an esoteric reason to prefer a
leading zero count over a trailing count....  If I'm not mistaking, there
is also now a leading-ones-place instruction microcoded into the C2 set.

Disclaimer: I'm not a hardware jock, just a compiler weenie...
jdl
------------------------------------------------------------------------------
Jon Loeliger			    loeliger@convex.com
Convex Computer Corporation	   "You should never attack an anarchist.  He
701 N Plano, Richardson, TX 75081   has nothing to lose."   Joe Bob's friend



------------------------------------------------------------------------------
Jon Loeliger			    loeliger@convex.com
Convex Computer Corporation	   "You should never attack an anarchist.  He
701 N Plano, Richardson, TX 75081   has nothing to lose."   Joe Bob's friend

dik@cwi.nl (Dik T. Winter) (11/11/88)

In article <705@convex.UUCP> loeliger@convex.com (Jon Loeliger) writes:
 > Yep, they are used elsewhere.  Convex C1 and C2 both have trailing-zero-count
 > and population counts.  There is actually an esoteric reason to prefer a
 > leading zero count over a trailing count....

The best reason is of course that, given a popcount instruction, getting a
trailing zero count takes only four instructions(*).  Getting a leading
zero count takes a few more.

(*) POPCOUNT(AND(NOT(X),X-1))
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

bgb@ihlpg.ATT.COM (Beuning) (11/12/88)

If anyone remembers the chess 0.5 program published in
BYTE magazine a long time ago, it had two functions that
did these functions in software.

They use the idea of a bit board.  A bit board has 64 bits
(one for each chess square) with a bit being 1 if something
interesting was in that square.  For example there is
a bit board for all white knights.  The pop count function
(cntrs in the program) would count the number of white
knights and the leading zero function (nxtrs in the program)
would find the first square that held a white knight.
The nxtrs function was the highest time using function
so the authors provided an assembly language version
that made use of the floating point normalize instruction.

wsmith@m.cs.uiuc.edu (11/20/88)

>One thing that is difficult to do in software (counter examples are welcomed
>please!) and easy to do in hardware is to reverse a bit stream.

What exactly to you mean by bit stream?   If you mean a single word, it is
very easy.  If it is more than a word the same trick may be used on a larger
scale:  (This is actually Pete Madany's code on loan.)

union WC {
	unsigned int	w;
	unsigned char	c[4];
} in, out;

out.c[3] = table[in.c[0]];
out.c[2] = table[in.c[1]];
out.c[1] = table[in.c[2]];
out.c[0] = table[in.c[3]];

The 256 byte array, table,  begins { 0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0 ... }

Bill Smith
wsmith@cs.uiuc.edu
uiucdcs!wsmith

cik@l.cc.purdue.edu (Herman Rubin) (11/20/88)

In article <3300039@m.cs.uiuc.edu>, wsmith@m.cs.uiuc.edu writes:
> 
> >One thing that is difficult to do in software (counter examples are welcomed
> >please!) and easy to do in hardware is to reverse a bit stream.
> 
> What exactly to you mean by bit stream?   

	[Rest of article deleted.]

By _bit stream_ I mean a consecutive collection of bits.  These bits can have
an arbitrary starting and stopping point.

As an example of the use of bit streams, there are methods for generating
non-uniform random variables which make heavy use of the following bit stream
operations.

	Find the distance to the next one in a bit stream.

	Branch on the next bit in the stream.

In these situations, the bits read are removed from the stream (i.e., the 
pointer is advanced).  Also, it is necessary to create an exception if the
stream is exhausted.  Since these are tools, and other algorithms are
available which are computationally more complex but are better suited to
computer architecture, slow methods of accomplishing the ends will not be
used.  (BTW, the different methods yield different results.)

It should be evident that a procedure which only looks at a few bits is
computationally simpler than the addition of two numbers.  But if it takes
20 times as long, it will be only used if absolutely necessary.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

dtynan@sultra.UUCP (Der Tynan) (11/22/88)

In article <3300039@m.cs.uiuc.edu>, wsmith@m.cs.uiuc.edu writes:
> 
> >One thing that is difficult to do in software (counter examples are welcomed
> >please!) and easy to do in hardware is to reverse a bit stream.
> 
> What exactly to you mean by bit stream?   If you mean a single word, it is
> very easy.  If it is more than a word the same trick may be used on a larger
> scale:  (This is actually Pete Madany's code on loan.)
> 
> The 256 byte array, table,  begins { 0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0 ... }
> 
> Bill Smith

The original author was talking about bit-reversal within a word.  The classic
example of this, is the addressing problem in a Fast Fourier Transform.  The
address bits must be reversed, during the last iteration.  The problem is not
so much the reversal (as shown by your table), but the fact that it is
dependant on the transform size.  For example, a 64-pt FFT must be bit-reversed
in six bits.  Your table can be used for this, but the result must be shifted
twice.  In real architectures, the bit-reversal can be anywhere from six to
twenty-one bits.  Thus, the table approach won't work.  TI has a barrel shifter
which does this (to sixteen bits -- the partnumber escapes me) bit-reversal,
which goes to prove the original point, that it can be done in hardware easier
than software.
						- Der
-- 
	dtynan@zorba.Tynan.COM  (Dermot Tynan @ Tynan Computers)
	{apple,mips,pyramid,uunet}!Tynan.COM!dtynan

 ---  If the Law is for the People, then why do we need Lawyers? ---