[comp.arch] Auto-shifted registers

radford@calgary.UUCP (Radford Neal) (02/21/88)

Several instruction sets, such as the 68020 and the VAX, have
an "indexed" addressing mode, in which a register is shifted 
left by 0, 1, 2, or 3 bits before being added to a displacement,
allowing easy access to arrays of ints, etc.

I has occurred to me that this "auto-shifting" might be useful
for ALL register references. Every reference to a register as
a source operand would be accompanied by two bits giving a
shift amount - equivalet to a multiplication by 1, 2, 4, or 8.

This gets you indexed addressing for no extra work, given
register-offset addressing. It also lets you do various
other operations quickly. For example:

   add r1 = 2*r2+8*r2   Calculates 10*r2 in a single instruction
   
   add r1 = 2*r1+r2     Shifts a bit into the bottom r1

(I assume here an instruction set with three-address, register-to-
register instructions, a la the CDC 6600.)

Speaking from a position of marginal hardware literacy, it seems
to me that the shifting might be done in parallel with the
register address decode. The data path out of the register
bank would contain a one-bit shifter and a two-bit shifter (essentially
a very small barell-shifter). While the register address is
being decoded, the decision as to whether the data is shifted
in either of these two shifters is being made also. When the
data comes out of the register, it propagates through the
two shifters at close to the speed of light.

Anyone want to comment on this idea?

   Radford Neal
   The University of Calgary

kers@otter.hple.hp.com (Christopher Dollin) (02/23/88)

"radford@calgary.UUCP (Radford Neal)" says

|Several instruction sets, such as the 68020 and the VAX, have
|an "indexed" addressing mode, in which a register is shifted
|left by 0, 1, 2, or 3 bits before being added to a displacement,
|allowing easy access to arrays of ints, etc.
|
|I has occurred to me that this "auto-shifting" might be useful
|for ALL register references. Every reference to a register as
|a source operand would be accompanied by two bits giving a
|shift amount - equivalet to a multiplication by 1, 2, 4, or 8.

The Acorn Risc Machine has some of this.

The data manipulation instructions have the general form

    Target := Source1 Op Source2

where Target and Source1 are registers, and Source2 can be an immediate value
(an 8-bit field on an even bit boundary) or a register shifted left, right 
(logical or arithmetic), or rotated right, either by a small constant (0..31)
or by the content of another register (this adds an extra cycle).

The simple load/store instructions compute the memory address as

    Base {+|-} {12Bit | Register_with_Shift}

12Bit is a 12bit byte offset, RegisterWithShift is a register possibly shifted
(left/right (log/arith), rotated right) by a constant 0..31 (the register case
was eliminated from the second-generation chip as no-one ever used it and they
saved some chip area).


Regards,
Kers                                    | "Why Lisp if you can talk Poperly?"

firth@sei.cmu.edu (Robert Firth) (02/23/88)

In article <1370@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>Several instruction sets, such as the 68020 and the VAX, have
>an "indexed" addressing mode, in which a register is shifted 
>left by 0, 1, 2, or 3 bits before being added to a displacement,
>allowing easy access to arrays of ints, etc.
>
>I has occurred to me that this "auto-shifting" might be useful
>for ALL register references. Every reference to a register as
>a source operand would be accompanied by two bits giving a
>shift amount - equivalet to a multiplication by 1, 2, 4, or 8.

The problem with your suggestion, and with scaled-index modes, is that
they introduce yet another nonorthogonality into the architecture.
On the 68020, for example, arrays with component size 1 are a special
case (no scaling); arrays of component size 2, 4, 8 are another special
case (can use index mode), and arrays of any other size are another
special case (can't use index mode).  That's one case too many.

Moreover, if you are going to generate slick code for, say, arrays whose
component size is 12, you have to do all that loop induction, strength
reduction &c stuff.  Surprise: once you've done that, you can use those
algorithms for all arrays, junk the scaled-index mode, and get better
code.

My own preference is for as simple an architecture as possible (but no
simpler, as Einstein remarked).  However, if people want to build a
scaled index mode, PLEASE let the damn thing scale for ANY array a
modern HLL can generate.  In particular for component sizes SMALLER
than the addressable unit as well as larger.  Nothing is more tiresome
than a special feature that works some of the time.

The Data General Nova did some of what you want, and the Acorn does more.
Writing a codegenerator for either of them is a pig, because you have to
handle an unconscionable number of special cases for seemingly simple
things like multiply, add, subtract, &c.

jmd@granite.dec.com (John Danskin) (02/24/88)

In article <1370@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>Several instruction sets, such as the 68020 and the VAX, have
>an "indexed" addressing mode, in which a register is shifted 
>left by 0, 1, 2, or 3 bits before being added to a displacement,
>allowing easy access to arrays of ints, etc.
>
>I has occurred to me that this "auto-shifting" might be useful
>for ALL register references. Every reference to a register as
>a source operand would be accompanied by two bits giving a
>shift amount - equivalet to a multiplication by 1, 2, 4, or 8.
.
.stuff deleted.
.
>Anyone want to comment on this idea?
>
>   Radford Neal
>   The University of Calgary

Add with shift instructions can be very nice. They speed address
calculations, and can be used as part of a multiply by small constant.
At least one RISC floating point oriented chip set incorporates such
instructions. (Sorry, they don't like people to talk about their
chipset so I can't say who they are.) However, based on my own coding
practice, it isn't worth giving up two instruction bits for the
capability. Instruction bits are incredibly useful things, and it is
hard enough to stay at 32 bits without 'using just a couple more
bits' for something that isn't going to be used even every tenth
cycle.
-- 
John Danskin				| decwrl!jmd
DEC Workstation Systems Engineering	| (415)853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.

viggy@hpcupt1.HP.COM (Viggy Mokkarala) (02/24/88)

radford@calgary.UUCP (Radford Neal) writes:

>It has occurred to me that this "auto-shifting" might be useful
>for ALL register references. Every reference to a register as
>a source operand would be accompanied by two bits giving a
>shift amount - equivalet to a multiplication by 1, 2, 4, or 8.

...
...

>	Radford Neal
>	The University of Calgary

The HP Precision Architecture provides for these kinds of operations by its
Shift and Add instructions.  There is a pre-shifter before one of the inputs to
the CPU.  It allows for one of the operands to be pre-shifted by upto 3 bits
before an addition happens.  These Shift and Add instructions come in two
flavors - one which traps on an overflow, and the other which doesn't.  A
general integer multiply routine takes, on the average, less than 20
instructions.

The assumptions and details of this claim can be found in the
paper titled "Integer Multiplication and Division on the HP Precision
Architecture", Proceedings of the Second International Conference on
Architectural Support for Programming Languages and Operating Systems (ASPLOS
II), pp 90 - 99.


Viggy Mokkarala, Hewlett Packard Co., Cupertino, CA.
(hpda!viggy)

jk3k+@andrew.cmu.edu (Joseph G. Keane) (02/24/88)

The problem i see with your suggestion is that it takes up much-needed 
instruction bits that could probably be better used some other way.

--Joe

ok@quintus.UUCP (Richard A. O'Keefe) (02/24/88)

In article <4307@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
: In article <1370@vaxb.calgary.UUCP: radford@calgary.UUCP (Radford Neal) writes:
: :Several instruction sets, such as the 68020 and the VAX, have
: :an "indexed" addressing mode, in which a register is shifted 
: :left by 0, 1, 2, or 3 bits before being added to a displacement,
: :allowing easy access to arrays of ints, etc.
: Moreover, if you are going to generate slick code for, say, arrays whose
: component size is 12, you have to do all that loop induction, strength
: reduction &c stuff.  Surprise: once you've done that, you can use those
: algorithms for all arrays, junk the scaled-index mode, and get better
: code.
 
It's worth noting that doing the indexing in software is FASTER than
using the scaled modes on the MC68020.  The MC68020 has lots of fancy
addressing modes that the MC68010 has, but (no doubt because they had
to be shoehorned into an existing instruction set) it is usually faster
to forget about them.  (The full-size multiply and divide are quite
another matter!)

I wouldn't like to see auto-increment addressing junked, though.

andy@polya.STANFORD.EDU (Andy Freeman) (02/25/88)

In article <188@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
>In article <1370@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>>[Radford Neal suggests that the ability to shift a register's value
>> in any kind of register reference might be a good thing.]
>Add with shift instructions can be very nice. They speed address
>calculations, and can be used as part of a multiply by small constant.
>.... However, based on my own coding practice, it isn't worth giving
>up two instruction bits for the capability. Instruction bits are
>incredibly useful things, and it is hard enough to stay at 32 bits
>without 'using just a couple more bits' for something that isn't going
>to be used even every tenth cycle.

The MIPS-X compute instructions, which are all register to register,
use 12 bits to determine the operation.  There are 17 of them (79 if
you count the shift instructions separately - MIPS-X uses 7 bits to
encode the shift amount but more compact schemes exist).  Now, it may
be that a more compact encoding would slow the chip down more than
``shift any register value'' would help, but that's different than
saying that 32 bits is a tight fit for register reference instructions
in a load-store architecture.

-andy

radford@calgary.UUCP (Radford Neal) (02/26/88)

In article <4307@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:

> The problem with your suggestion, and with scaled-index modes, is that
> they introduce yet another nonorthogonality into the architecture...

> The Data General Nova did some of what you want, and the Acorn does more.
> Writing a codegenerator for either of them is a pig, because you have to
> handle an unconscionable number of special cases for seemingly simple
> things like multiply, add, subtract, &c.


This brings up an interesting point. What distinguishes an unnecessary
complication in the compiler from an advantagous transfer of work
from run time to compile time?

To my mind, replacing a multiply by two by a shift is an entirely 
appropriate activity for a compiler. I don't really mind doing 
more of the same, if it gives better performance. The added complexity
in the compiler is at the level of detail, not overall design. The
shift is *always* better than the multiply, so there's no real difficulty.

On the other hand, I hate two-address instructions, no matter 
how orthogonal, because they introduce more real options for the
compiler to worry about. The locally optimal code destroys one 
operand. Which operand should you destroy? Should you insteadad save
both operands at the cost of another move? Answering these questions
is a lot more work than handling special cases for multiplication.

As for the auto-shifted registers: my idea is that the shifting is
done at no extra cost, in parellel with the register address decode.
If this isn't true, I'd agree that it is a bad idea. If it is, handling
it in the compiler seems like a small price for doing two operations
in parellel.

   Radford Neal

radford@calgary.UUCP (Radford Neal) (02/26/88)

In article <6310005@hpcupt1.HP.COM>, viggy@hpcupt1.HP.COM (Viggy Mokkarala) writes:

> The HP Precision Architecture provides for these kinds of operations by its
> Shift and Add instructions.  There is a pre-shifter before one of the inputs to
> the CPU.  It allows for one of the operands to be pre-shifted by upto 3 bits
> before an addition happens. 

Does this cost you anything (other than chip area), or does the shift
overlap another operation? If it does overlap, is there any reason
not to allow it for all relevant instructions (e.g. and and or)?
Is there any reason to keep the unshifted add (given a shift of zero 
is possible)?

    Radford Neal

viggy@hpcupt1.HP.COM (Viggy Mokkarala) (03/02/88)

/ hpcupt1:comp.arch / radford@calgary.UUCP (Radford Neal) /  3:35 pm  Feb 25, 1988 /
In article <6310005@hpcupt1.HP.COM>, viggy@hpcupt1.HP.COM (Viggy Mokkarala) writes:

> The HP Precision Architecture provides for these kinds of operations by its
> Shift and Add instructions.  There is a pre-shifter before one of the inputs to
> the CPU.  It allows for one of the operands to be pre-shifted by upto 3 bits
> before an addition happens. 

Does this cost you anything (other than chip area), or does the shift
overlap another operation? If it does overlap, is there any reason
not to allow it for all relevant instructions (e.g. and and or)?
Is there any reason to keep the unshifted add (given a shift of zero 
is possible)?

    Radford Neal
----------

viggy@hpcupt1.HP.COM (Viggy Mokkarala) (03/02/88)

radford@calgary.UUCP (Radford Neal) writes:

>In article <6310005@hpcupt1.HP.COM>, viggy@hpcupt1.HP.COM (Viggy Mokkarala) writes:

>> The HP Precision Architecture provides for these kinds of operations by its
>> Shift and Add instructions.  There is a pre-shifter before one of the inputs to
>> the CPU.  It allows for one of the operands to be pre-shifted by upto 3 bits
>> before an addition happens. 

>Does this cost you anything (other than chip area), or does the shift
>overlap another operation? If it does overlap, is there any reason
>not to allow it for all relevant instructions (e.g. and and or)?
>Is there any reason to keep the unshifted add (given a shift of zero 
>is possible)?

>    Radford Neal

Sorry about the previous failed posting.

The pre-shifter before one of the CPU inputs is the same shifter that is used
for the "indexed loads" of the HP Precision Arch.  The index register (which
can be any of the general registers), may be optionally shifted left by 1, 2,
or 3 bits so that integer addressing to half words, words, or double words is
possible.

Therefore, it turns out that the "shift" operation in the shift and add
instructions, which were included as primitives for integer multiplication,
comes for free.  Instructions such as Shift and And,or Shift and Or are not
frequently used instructions, and weren't considered.  There isn't a Shift Zero
and Add instruction (it is simplly called the Add instruction, to keep the
instruction names simple I guess :-) ).

In the HP 9000/840 (the first HPPA product - off the shelf TTL parts), the
pre-shift operation takes place in the beginning of the execute phase (the pipe
is 3 stages deep, and the execute phase is the second one).  The pre-shift
operation takes 6-7 nsec and happens in parallel with the immediate generation.
All register operands do pass through the pre-shifter and only some indexed
loads, and the shift and add instructions get shifted by more than zero.
In this implementation, it required about 8 or 9 packs extra to do this.

Viggy Mokkarala, Hewlett Packard Co., Cupertino, CA.
(hpda!viggy)