cs450a03@uc780.umd.edu (04/16/91)
Zalman Stern writes: > rrib Takes bit 0 of source register, rotates it by an amount in > a register and inserts it into the destination register. > (I'm not sure what this is used for.) That's for setting up bit arrays where you're applying some predicate to a large quantity of data. >All in all this stuff is neat and adds character to the architecture. >However, I'm not sure it is the best way to do things. Some codes >will perform better due to the bit manipulation instructions, but I >doubt they show up very often at all. They don't in programs written in C. C has no support for generic bit crunching. (The same goes for FORTRAN, and so on...) Raul Rockwell
zalman@mips.com (Zalman Stern) (04/16/91)
Check out the IBM RS/6000 (POWER architecture for He-Man fans). It has interesting mask and loop counter instructions. The mask instructions come in many flavors. The basic idea is to construct a mask from two 5 bit quantities called Mask Begin (MB) and Mask End (ME). The following is from the IBM manual (*) page 5-205 (text in square brackets is mine): If the MB value is less than the ME value + 1, then the mask bits between and including the starting point [bit numbered MB] and the end point [bit numbered ME] are set to ones. All other bits are set to zeros. If the MB value is the same as the ME value + 1, then all 32 mask bits are set to ones. If the MB value is greater than the ME value + 1, then the mask bits between and including the ME value + 1 and the MB value - 1 are set to zeros. All other bits are set to ones. [I'm pretty sure bit 0 is most significant.] (*) IBM RISC System/6000 Assembler Language Reference (SC23-2197-00) Here is a brief listing of instructions that use the mask mechanism: maskg Given MB and ME in registers, construct a mask and put it a third register. maskir Given an already constructed mask in a register, insert a source register into a destination register under the mask. That is, every where the is a one in the mask, copy the bit from the source register to the destination register. Where there are zeros in the mask, leave the bits alone. Note that the mask can be more general than the above description of masks. rlimi Rotate left immediate then mask insert. Takes rotation amount, MB, and ME as 5 bit immediates. Rotates source register and then does a mask insert into destination register. (See above description of insertion.) rlinm Rotate left immediate then AND with mask. Takes rotation amount, MB, and ME as 5 bit immediates. Rotates source register then ands result with mask. Places result of and in destination register. rlmi Same as rlimi except rotate amount is in a register. rlnm Same as rlinm except rotate amount is in a register. rrib Takes bit 0 of source register, rotates it by an amount in a register and inserts it into the destination register. (I'm not sure what this is used for.) Rlimi does left and right logical immediate shifts. (I think it corresponds to what most shifter hardware looks like internally as well.) I hacked some assembly for this machine and was amazed at how often rlimi came in handy. (This instruction is so cool I'd almost advocate adding it to our instruction set. But Earl would probably hit me over the head with an R4000 architecture manual and make me write "Thou shalt pay attention to dynamic instruction frequency statistics!" 100 times on my whiteboard.) The POWER architecture also provides a vast set of shift instructions. These can be used to do double precision shifts and multiple precision shifts with great efficiency. Add in the count leading zeros and you have a fairly complete set of bit flicking instructions. (Herman Rubin might even like this machine. Then again he'd probably bitch about it being worthless due to the difficulty of using the FP hardware on integer data :-)) As to loop type instructions, the RIOS has a count register in the branch unit (a separate chip in the current implementation). There is an instruction that will decrement the count register and conditionally branch based on the result of the decrement and possibly a condition register field. This lends itself to loops which are testing a bound and an exit condition. (I.e stepping through an array looking for a value.) The count register cannot easily be used as an index since moving it from the branch chip to the fixed point chip is expensive (i.e. it is not a general purpose register (GPR)). Loops which fit this model incur no overhead for the counter or the branch as the branch unit executes instructions in parallel with the fixed point and floating point units. In fact it is key to the architecture since maintaining the counter in the fixed point unit would likely impose a delay (up to three cycles) on every iteration of the loop. This will show up on more complex loop conditions anyway. (Loops with function calls also lose as the count register is caller save.) I'm convinced that compare-and branch instructions (like on MIPS R-series and HP PA) are the way to go. (Some other time maybe I'll post a comparison of the R4000 and RIOS branch instructions. If the RIOS had a load instruction that set the condition codes it would do better...) All in all this stuff is neat and adds character to the architecture. However, I'm not sure it is the best way to do things. Some codes will perform better due to the bit manipulation instructions, but I doubt they show up very often at all. (And for multi-precision shifts, there better be one very important code to make it worthwhile because they never show up in general purpose stuff.) The only real way to know is to measure these things across large bodies of code. I would like to hear what sort of measurements the IBM people did and how they justified many of these instructions. Both MIPS and HP have been rumored to use the "one percent rule" in that if an instruction can't boost performance by 1%, then it doesn't belong in the architecture. For measurement tools, MIPS has had pixie from very early on and Sun now has spix for SPARC. What sort of tools do the IBM people use to analyze their architecture? (At least one RIOS architect has posted here recently...) Beyond performance, I'd argue that these features limit future architectural flexibility. For example, I bet doing an upward compatible 64 bit RIOS architecture is going to be much more difficult than going from MIPS-II (R3000) to MIPS-III (R4000). (With 64 bit registers you now need 6 bits for those mask constants. It is a less important to make double precision shifts fast when single precision will now do 64 bits. Etc...)
meissner@osf.org (Michael Meissner) (04/16/91)
In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes: | Rlimi does left and right logical immediate shifts. (I think it corresponds | to what most shifter hardware looks like internally as well.) I hacked some | assembly for this machine and was amazed at how often rlimi came in handy. | (This instruction is so cool I'd almost advocate adding it to our | instruction set. But Earl would probably hit me over the head with an R4000 | architecture manual and make me write "Thou shalt pay attention to dynamic | instruction frequency statistics!" 100 times on my whiteboard.) I wouldn't be surprised if a workstation running X, actually could do a fair bit of these instructions in the server bitblt's. Of course if you don't have a direct display, but use an X terminal, you wouldn't see these. However, it's probably harder to measure dynamic instruction frequency in such a case..... -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
zalman@mips.com (Zalman Stern) (04/17/91)
In article <2302@spim.mips.COM> I (zalman@mips.com) write: [...] >MIPS-II (R3000) to MIPS-III (R4000). (With 64 bit registers you now need 6 The mapping from R-series implementations to versions of the architecture is something like: R2000 MIPS-I R2000A MIPS-I (same die as R3000) R3000 MIPS-I R3000A MIPS-I plus reverse endian bit R4000 MIPS-III R6000 MIPS-II (ECL implementation, all others CMOS) MIPS-I is the instruction set documented in the Kane book. MIPS-II and MIPS-III will be documented in a forthcoming revision of that book. MIPS-II has a number of minor architectural improvements and MIPS-III does 64 bits. Each revision is of course upwards compatible with previous ones (at least for user code). My apologies for the above quoted confusion. I'm sure someone will correct me if I got it wrong here too.
john@acorn.co.uk (John Bowler) (04/17/91)
In article <MEISSNER.91Apr16125239@curley.osf.org> meissner@osf.org (Michael Meissner) writes: >In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes: > >| Rlimi does left and right logical immediate shifts. (I think it corresponds >| to what most shifter hardware looks like internally as well.) I hacked some >| assembly for this machine and was amazed at how often rlimi came in handy. > >I wouldn't be surprised if a workstation running X, actually could do >a fair bit of these instructions in the server bitblt's. Of course if >you don't have a direct display, but use an X terminal, you wouldn't >see these. However, it's probably harder to measure dynamic >instruction frequency in such a case..... The inner bitblt loop of an X server becomes very important under some circumstances. Particularly text scrolling and window movement (these both have a significant effect on the user perception of performance). The main problem with bit blt in X is reducing memory access. Frequently screen memory is not cached ('cause it tends not to be accessed in a localised fashion), so what matters is avoiding accessing the same word more than once. The general scan line raster op is (I may get this wrong...):- dst := dst & ~mask | (dst OP src) & mask (mask for the plane mask and to deal with the start and end of the scan line). Given this and a machine architecture which favours word aligned access (or, indeed, mandates it!) it is convenient to align the ``src'' value to the ``dst''. Then the algorithm becomes:- load a word of dst load the next word of src * use previous src and newly loaded word to calculate a dst-aligned src word perform the calculation (above) loop All the bit twiddling is in (*). If the architecture *only* has one bit shifts, resign and go to work for a company which uses a sensible architecture. If the architecture *only* has immediate constant shifts write a raster-op compiler for the above code which will compile the appropriate instructions for a given dst/src mis-alignment. Do all the optimisations you can (eg for co-aligned dst/src; as encountered in vertical scrolling (xterm) - no shifts required!) If the architecture has shifts by amounts held in registers, forget raster-op compilers (the setup cost is too great), do the shifts in-line. The best architecture allows the combination of logical shift operations with logical arithmetic; you make ``src'' using (on a little endian architecture):- src = prev >> shift; prev = *srcptr++; src |= prev << (BPW-shift); (BPW = Bits Per Word). Given that (BPW-shift) is ammenable to CSE by the compiler this is about four instructions (well, it's three instructions on the ARM; for those who understand the ARM assembler:- MOV src, prev, LSR shift LDR prev, [srcptr], #4 ORR src, src, prev, LSL BPWshift ) Conclusions? The general bit extract instruction is not necessary - the bits are (by careful arrangement of the algorithm) always at one end or the other of the word. Logical shifts by variable amounts are essential, being able to specify the amount in a register helps (incidentally doing this doubles the execution time on the ARM - four registers in the instruction rather than the normal three). Being able to combine shifts with logical operations helps slightly. The only way in which I can see a general bit extract helping is if an instruction of the form:- extract 32 bits from this register *PAIR* exists; then the above can be reduced to two instructions. Is this really worth it? This loop is for a general raster op (ie the loop does all the possible X raster ops, not just one of them) and for any source/destination alignment. On an ARM the loop amounts to 17 instructions of which 14 are arithmetic (logical), and 3 access memory. The whole algorithm can be expressed in C - not only are there no hardware ``tricks'' involved, there is no use of inaccessible machine operations such as rotate. Of these instructions only two involve bit shifts. Hence a further conclusion; even in code where bit level operations are (apparently) important they are still relatively infrequent. In the above loop (on the ARM) the shift instructions account for under one seventh of the total time. I'm not sure that any meaningful conclusions can be drawn from this, but, for comparison, the memory access takes over one third of the time and doing the raster op takes about one third. When the source is being copied to the destination and the mask is inactive (all ones) - the common case, the raster op is not necessary and one memory access can be saved (loading dst). In this case (again on the ARM) 6 instructions are needed and almost one quarter of the time goes in the shift operations. One half goes in memory access (the other quarter is in the two instructions which do the DBRA equivalent - no delayed branches!). This code is equivalent to optimised ARM code for bcopy/memcpy; so the case is important in other programs too. Only in this case is the overhead of bit twiddling or of loop handling starting to appear - and this is the most extreme case which I can think of. In practice (for memcpy) loop unwinding and special casing of the various shifts starts to pay off. But loop unwinding allows sequential memory access for load/store which is a big win, potentially reducing memory access time to a half of what it would otherwise be; forget the saving on loop instructions! Special casing the shifts halves the cost on the ARM - but this is a matter of having the instuctions go fast, not of having wacky instructions which do appropriate things. The real irony is that these *programming* techniques make the advantage of even the special purpose CISC byte string movement instructions relatively small; even non-existent if they are not actually implemented optimally (in particular if they do not take advantage of sequential mode access to DRAM when other instructions can). John Bowler (jbowler@acorn.co.uk)
glew@pdx007.intel.com (Andy Glew) (04/19/91)
>The only way in which I can see a general bit extract helping is >if an instruction of the form:- > > extract 32 bits from this register *PAIR* > >exists; then the above can be reduced to two instructions. The i386(tm) architecture's SHLD and SHRD (shift left double and shift right double) allow 32 bits to be extracted from a register pair. They overwrite one of their source operands, though, so a MOV is required in the general case (although usually you can set things up so as not to need one of the operands afterwards). -- Andy Glew, glew@ichips.intel.com Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124-6497 This is a private posting; it does not indicate opinions or positions of Intel Corp.
tim@nucleus.amd.com (Tim Olson) (04/20/91)
In article <6444@acorn.co.uk> john@acorn.co.uk (John Bowler) writes: | src = prev >> shift; | prev = *srcptr++; | src |= prev << (BPW-shift); | | (BPW = Bits Per Word). | | Given that (BPW-shift) is ammenable to CSE by the compiler this is about | four instructions (well, it's three instructions on the ARM; for those who | understand the ARM assembler:- | | MOV src, prev, LSR shift | LDR prev, [srcptr], #4 | ORR src, src, prev, LSL BPWshift | ) | | The only way in which I can see a general bit extract helping is | if an instruction of the form:- | | extract 32 bits from this register *PAIR* | | exists; then the above can be reduced to two instructions. Is this | really worth it? Sometimes it is. That is exactly what the extract instruction on the Am29000 processor does -- extract a 32-bit word from the concatenation of two register values, starting at any bit. It is useful for BITBLTs, as well as performing arbitrary byte-aligned memory transfers at an aligned word transfer rate. -- -- Tim Olson Advanced Micro Devices (tim@amd.com)