[comp.arch] Bitfield and loop instructions--a good idea?

cs450a03@uc780.umd.edu (04/16/91)

Zalman Stern writes:
>    rrib	Takes bit 0 of source register, rotates it by an amount in
>		a register and inserts it into the destination register.
>		(I'm not sure what this is used for.)

That's for setting up bit arrays where you're applying some predicate
to a large quantity of data.

>All in all this stuff is neat and adds character to the architecture.
>However, I'm not sure it is the best way to do things. Some codes
>will perform better due to the bit manipulation instructions, but I
>doubt they show up very often at all.

They don't in programs written in C.  C has no support for generic bit
crunching.  (The same goes for FORTRAN, and so on...)

Raul Rockwell

zalman@mips.com (Zalman Stern) (04/16/91)

Check out the IBM RS/6000 (POWER architecture for He-Man fans). It has
interesting mask and loop counter instructions.

The mask instructions come in many flavors. The basic idea is to construct
a mask from two 5 bit quantities called Mask Begin (MB) and Mask End (ME).
The following is from the IBM manual (*) page 5-205 (text in square
brackets is mine):

    If the MB value is less than the ME value + 1, then the mask bits between
    and including the starting point [bit numbered MB] and the end point [bit
    numbered ME] are set to ones. All other bits are set to zeros.

    If the MB value is the same as the ME value + 1, then all 32 mask bits are
    set to ones.

    If the MB value is greater than the ME value + 1, then the mask bits
    between and including the ME value + 1 and the MB value - 1 are set to
    zeros. All other bits are set to ones.

    [I'm pretty sure bit 0 is most significant.]

(*) IBM RISC System/6000 Assembler Language Reference (SC23-2197-00)

Here is a brief listing of instructions that use the mask mechanism:

    maskg	Given MB and ME in registers, construct a mask and put it
                a third register.
    maskir	Given an already constructed mask in a register, insert a
		source register into a destination register under the mask.
		That is, every where the is a one in the mask, copy the bit
		from the source register to the destination register. Where
		there are zeros in the mask, leave the bits alone. Note
		that the mask can be more general than the above
		description of masks.
    rlimi	Rotate left immediate then mask insert. Takes rotation
		amount, MB, and ME as 5 bit immediates. Rotates source
		register and then does a mask insert into destination
		register. (See above description of insertion.)
    rlinm	Rotate left immediate then AND with mask. Takes rotation
		amount, MB, and ME as 5 bit immediates. Rotates source
		register then ands result with mask. Places result of and in
		destination register.
    rlmi	Same as rlimi except rotate amount is in a register.
    rlnm	Same as rlinm except rotate amount is in a register.
    rrib	Takes bit 0 of source register, rotates it by an amount in
		a register and inserts it into the destination register.
		(I'm not sure what this is used for.)

Rlimi does left and right logical immediate shifts. (I think it corresponds
to what most shifter hardware looks like internally as well.) I hacked some
assembly for this machine and was amazed at how often rlimi came in handy.
(This instruction is so cool I'd almost advocate adding it to our
instruction set. But Earl would probably hit me over the head with an R4000
architecture manual and make me write "Thou shalt pay attention to dynamic
instruction frequency statistics!" 100 times on my whiteboard.)

The POWER architecture also provides a vast set of shift instructions.
These can be used to do double precision shifts and multiple precision
shifts with great efficiency. Add in the count leading zeros and you have a
fairly complete set of bit flicking instructions. (Herman Rubin might even
like this machine. Then again he'd probably bitch about it being worthless
due to the difficulty of using the FP hardware on integer data :-))

As to loop type instructions, the RIOS has a count register in the branch
unit (a separate chip in the current implementation). There is an
instruction that will decrement the count register and conditionally branch
based on the result of the decrement and possibly a condition register
field. This lends itself to loops which are testing a bound and an exit
condition. (I.e stepping through an array looking for a value.) The count
register cannot easily be used as an index since moving it from the branch
chip to the fixed point chip is expensive (i.e. it is not a general purpose
register (GPR)).

Loops which fit this model incur no overhead for the counter or the branch
as the branch unit executes instructions in parallel with the fixed point
and floating point units. In fact it is key to the architecture since
maintaining the counter in the fixed point unit would likely impose a delay
(up to three cycles) on every iteration of the loop. This will show up on
more complex loop conditions anyway. (Loops with function calls also lose
as the count register is caller save.) I'm convinced that compare-and
branch instructions (like on MIPS R-series and HP PA) are the way to go.
(Some other time maybe I'll post a comparison of the R4000 and RIOS branch
instructions. If the RIOS had a load instruction that set the condition
codes it would do better...)

All in all this stuff is neat and adds character to the architecture.
However, I'm not sure it is the best way to do things. Some codes will
perform better due to the bit manipulation instructions, but I doubt they
show up very often at all. (And for multi-precision shifts, there better be
one very important code to make it worthwhile because they never show up in
general purpose stuff.)

The only real way to know is to measure these things across large bodies of
code. I would like to hear what sort of measurements the IBM people did and
how they justified many of these instructions. Both MIPS and HP have been
rumored to use the "one percent rule" in that if an instruction can't boost
performance by 1%, then it doesn't belong in the architecture. For
measurement tools, MIPS has had pixie from very early on and Sun now has
spix for SPARC. What sort of tools do the IBM people use to analyze their
architecture? (At least one RIOS architect has posted here recently...)

Beyond performance, I'd argue that these features limit future
architectural flexibility.  For example, I bet doing an upward compatible
64 bit RIOS architecture is going to be much more difficult than going from
MIPS-II (R3000) to MIPS-III (R4000). (With 64 bit registers you now need 6
bits for those mask constants. It is a less important to make double
precision shifts fast when single precision will now do 64 bits. Etc...)

meissner@osf.org (Michael Meissner) (04/16/91)

In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:

| Rlimi does left and right logical immediate shifts. (I think it corresponds
| to what most shifter hardware looks like internally as well.) I hacked some
| assembly for this machine and was amazed at how often rlimi came in handy.
| (This instruction is so cool I'd almost advocate adding it to our
| instruction set. But Earl would probably hit me over the head with an R4000
| architecture manual and make me write "Thou shalt pay attention to dynamic
| instruction frequency statistics!" 100 times on my whiteboard.)

I wouldn't be surprised if a workstation running X, actually could do
a fair bit of these instructions in the server bitblt's.  Of course if
you don't have a direct display, but use an X terminal, you wouldn't
see these.  However, it's probably harder to measure dynamic
instruction frequency in such a case.....

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

zalman@mips.com (Zalman Stern) (04/17/91)

In article <2302@spim.mips.COM> I (zalman@mips.com) write:
[...]
>MIPS-II (R3000) to MIPS-III (R4000). (With 64 bit registers you now need 6

The mapping from R-series implementations to versions of the
architecture is something like:

    R2000  MIPS-I
    R2000A MIPS-I (same die as R3000)
    R3000  MIPS-I
    R3000A MIPS-I plus reverse endian bit
    R4000  MIPS-III
    R6000  MIPS-II (ECL implementation, all others CMOS)

MIPS-I is the instruction set documented in the Kane book. MIPS-II and
MIPS-III will be documented in a forthcoming revision of that book.
MIPS-II has a number of minor architectural improvements and MIPS-III does
64 bits. Each revision is of course upwards compatible with previous ones
(at least for user code).

My apologies for the above quoted confusion. I'm sure someone will correct
me if I got it wrong here too.

john@acorn.co.uk (John Bowler) (04/17/91)

In article <MEISSNER.91Apr16125239@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:
>
>| Rlimi does left and right logical immediate shifts. (I think it corresponds
>| to what most shifter hardware looks like internally as well.) I hacked some
>| assembly for this machine and was amazed at how often rlimi came in handy.
>
>I wouldn't be surprised if a workstation running X, actually could do
>a fair bit of these instructions in the server bitblt's.  Of course if
>you don't have a direct display, but use an X terminal, you wouldn't
>see these.  However, it's probably harder to measure dynamic
>instruction frequency in such a case.....

The inner bitblt loop of an X server becomes very important under some
circumstances.  Particularly text scrolling and window movement (these both
have a significant effect on the user perception of performance).

The main problem with bit blt in X is reducing memory access.  Frequently
screen memory is not cached ('cause it tends not to be accessed in a
localised fashion), so what matters is avoiding accessing the same word
more than once.  The general scan line raster op is (I may get this
wrong...):-

	dst := dst & ~mask | (dst OP src) & mask

(mask for the plane mask and to deal with the start and end of the scan
line).  Given this and a machine architecture which favours word aligned
access (or, indeed, mandates it!) it is convenient to align the ``src''
value to the ``dst''.  Then the algorithm becomes:-

	load a word of dst
	load the next word of src
*	use previous src and newly loaded word to calculate a
		dst-aligned src word
	perform the calculation (above)
	loop

All the bit twiddling is in (*).  If the architecture *only* has one bit
shifts, resign and go to work for a company which uses a sensible
architecture.  If the architecture *only* has immediate constant shifts
write a raster-op compiler for the above code which will compile the
appropriate instructions for a given dst/src mis-alignment.  Do all the
optimisations you can (eg for co-aligned dst/src; as encountered in vertical
scrolling (xterm) - no shifts required!)

If the architecture has shifts by amounts held in registers, forget
raster-op compilers (the setup cost is too great), do the shifts in-line.
The best architecture allows the combination of logical shift operations
with logical arithmetic; you make ``src'' using (on a little endian
architecture):-

	src = prev >> shift;
	prev = *srcptr++;
	src |= prev << (BPW-shift);

(BPW = Bits Per Word).

Given that (BPW-shift) is ammenable to CSE by the compiler this is about
four instructions (well, it's three instructions on the ARM; for those who
understand the ARM assembler:-

	MOV	src, prev, LSR shift
	LDR	prev, [srcptr], #4
	ORR	src, src, prev, LSL BPWshift
)

Conclusions?  The general bit extract instruction is not necessary -
the bits are (by careful arrangement of the algorithm) always at one
end or the other of the word.  Logical shifts by variable amounts
are essential, being able to specify the amount in a register helps
(incidentally doing this doubles the execution time on the ARM -
four registers in the instruction rather than the normal three).
Being able to combine shifts with logical operations helps slightly.

The only way in which I can see a general bit extract helping is
if an instruction of the form:-

	extract 32 bits from this register *PAIR*

exists; then the above can be reduced to two instructions.  Is this
really worth it?

This loop is for a general raster op (ie the loop does all the possible X
raster ops, not just one of them) and for any source/destination alignment.
On an ARM the loop amounts to 17 instructions of which 14 are arithmetic
(logical), and 3 access memory.  The whole algorithm can be expressed in C
- not only are there no hardware ``tricks'' involved, there is no use of
inaccessible machine operations such as rotate.

Of these instructions only two involve bit shifts.  Hence a further
conclusion; even in code where bit level operations are (apparently)
important they are still relatively infrequent.  In the above loop (on the
ARM) the shift instructions account for under one seventh of the total time.
I'm not sure that any meaningful conclusions can be drawn from this, but,
for comparison, the memory access takes over one third of the time and
doing the raster op takes about one third.

When the source is being copied to the destination and the mask is
inactive (all ones) - the common case, the raster op is not necessary
and one memory access can be saved (loading dst).  In this case (again
on the ARM) 6 instructions are needed and almost one quarter of the
time goes in the shift operations.  One half goes in memory access
(the other quarter is in the two instructions which do the DBRA
equivalent - no delayed branches!).  This code is equivalent to
optimised ARM code for bcopy/memcpy; so the case is important in other
programs too.  Only in this case is the overhead of bit twiddling or
of loop handling starting to appear - and this is the most extreme
case which I can think of.

In practice (for memcpy) loop unwinding and special casing of the various
shifts starts to pay off.  But loop unwinding allows sequential memory
access for load/store which is a big win, potentially reducing memory access
time to a half of what it would otherwise be; forget the saving on loop
instructions!  Special casing the shifts halves the cost on the ARM - but
this is a matter of having the instuctions go fast, not of having wacky
instructions which do appropriate things.  The real irony is that these
*programming* techniques make the advantage of even the special purpose CISC
byte string movement instructions relatively small; even non-existent if
they are not actually implemented optimally (in particular if they do not
take advantage of sequential mode access to DRAM when other instructions
can).

John Bowler (jbowler@acorn.co.uk)

glew@pdx007.intel.com (Andy Glew) (04/19/91)

>The only way in which I can see a general bit extract helping is
>if an instruction of the form:-
>
>	extract 32 bits from this register *PAIR*
>
>exists; then the above can be reduced to two instructions.

The i386(tm) architecture's SHLD and SHRD (shift left double and shift
right double) allow 32 bits to be extracted from a register pair. They
overwrite one of their source operands, though, so a MOV is required
in the general case (although usually you can set things up so as not
to need one of the operands afterwards).

--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

tim@nucleus.amd.com (Tim Olson) (04/20/91)

In article <6444@acorn.co.uk> john@acorn.co.uk (John Bowler) writes:
| 	src = prev >> shift;
| 	prev = *srcptr++;
| 	src |= prev << (BPW-shift);
| 
| (BPW = Bits Per Word).
| 
| Given that (BPW-shift) is ammenable to CSE by the compiler this is about
| four instructions (well, it's three instructions on the ARM; for those who
| understand the ARM assembler:-
| 
| 	MOV	src, prev, LSR shift
| 	LDR	prev, [srcptr], #4
| 	ORR	src, src, prev, LSL BPWshift
| )
| 
| The only way in which I can see a general bit extract helping is
| if an instruction of the form:-
| 
| 	extract 32 bits from this register *PAIR*
| 
| exists; then the above can be reduced to two instructions.  Is this
| really worth it?

Sometimes it is.  That is exactly what the extract instruction on the
Am29000 processor does -- extract a 32-bit word from the concatenation
of two register values, starting at any bit.  It is useful for
BITBLTs, as well as performing arbitrary byte-aligned memory transfers
at an aligned word transfer rate.

--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)