[comp.arch] New instructions for RISCs

gerry@zds-ux.UUCP (Gerry Gleason) (02/09/90)

In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>Of course, it would be nice if there were a single machine instruction that
>could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where
>"mask" is either a field of 1 bits followed by a field of zero bits, or else
>vice versa.

This might be a candidate instruction for future RISC designs.  Unfortunately
it has the difficulty of being a three operand instruction, so you would
have to designate a fixed register for the mask operand or it would need
a new instruction format (in conflict with an important RISC design goal).

The question is does it meet other RISC design criterion, I think it might.
Given that RISCs are very often used in workstations with bit mapped
screens, and the inner loops of various graphics functions that can use
such an instruction have very high frequencies of execution, such an
instruction is likely to be a win.  Also, three logical operations should
be pretty easy for the ALU to do fast enough for a single cycle without
stretching the cycle any.

Gerry Gleason

alan@oz.nm.paradyne.com (Alan Lovejoy) (02/09/90)

In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>>Of course, it would be nice if there were a single machine instruction that
>>could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where
>>"mask" is either a field of 1 bits followed by a field of zero bits, or else
>>vice versa.

>This might be a candidate instruction for future RISC designs.  Unfortunately
>it has the difficulty of being a three operand instruction, so you would
>have to designate a fixed register for the mask operand or it would need
>a new instruction format (in conflict with an important RISC design goal).

Yes, I was aware of the problem you raise.  I'm still thinking about how to
ameliorate it.  Let me broadcast some ideas on it, perhaps it will inspire
someone to generate the optimum solution.

What's wanted is an instruction, call it "splice," which does the following:

;assume a word is 32 bits
SPLICE	Rd, Rs, Rb; Rd = ((Rd >> Rb) << Rb) | ((Rs << (32 - Rb)) >> (32 - Rb))

Contents of Rd, Rs and Rb before instruction executes:
Rb = 00000000000000000000000000000111 (7 decimal)
Rs = 00000000000000000000000000000000 
Rd = 11111111111111111111111111111111

Contents of Rd, Rs and Rb after instruction executes:
Rb = 00000000000000000000000000000111 (7 decimal)
Rs = 00000000000000000000000000000000
Rd = 11111111111111111111111110000000

So SPLICE moves the rightmost Rb bits of Rs into Rd, leaving the leftmost
BitsPerWord-Rb bits of Rd unchanged.  Perhaps there should be "splice left"
and "splice right" instructions?

Perhaps more general, and hence more widely usefull would be a "bit merge"
instruction, as follows:

BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm)

Contents of Rd, Rs and Rm before instruction executes:
Rm = 00000000000001111111111100000000
Rs = 11000101011010000000110010110010 
Rd = 00011010111101111100000111101000

Contents of Rd, Rs and Rm after instruction executes:
Rm = 00000000000001111111111100000000
Rs = 11000101011010000000110010110010 
Rd = 00011010111100000000110011101000

For each bit of Rm which is set, BMERGE moves the corresponding bit of Rs
into Rd.  For each bit of Rm which is clear, BMERGE leaves Rd unchanged.

This uses the "destination register" as if it were the destination register
in a two-operand machine, only there are three operands.  Does this cause
undue problems for RISC?


____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

peter@ficc.uu.net (Peter da Silva) (02/09/90)

In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
> In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
> >Of course, it would be nice if there were a single machine instruction that
> >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where
> >"mask" is either a field of 1 bits followed by a field of zero bits, or else
> >vice versa.

To really make this useful you need a "load word bit-aligned" operator, or
you're still going to be dominated by masking and shifting to get the source
word into memory in the first place. It'd also satisfy the people who want
wimpy stuff like byte-aligned reads.

> This might be a candidate instruction for future RISC designs.  Unfortunately
> it has the difficulty of being a three operand instruction,

Why is that a difficulty? Three-operand instructions are common in RISCs.
-- 
 _--_|\  Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \
\_.--._/ Xenix Support -- it's not just a job, it's an adventure!
      v  "Have you hugged your wolf today?" `-_-'

franka@mentor.com (Frank A. Adrian) (02/10/90)

In article <7366@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm)
>
>Contents of Rd, Rs and Rm before instruction executes:
>Rm = 00000000000001111111111100000000
>Rs = 11000101011010000000110010110010 
>Rd = 00011010111101111100000111101000
>
>Contents of Rd, Rs and Rm after instruction executes:
>Rm = 00000000000001111111111100000000
>Rs = 11000101011010000000110010110010 
>Rd = 00011010111100000000110011101000

This is a good idea for an easily implementable instruction.

Instead of your proposed SPLICE instruction, I'd recommend a
BITEXT instruction which (using your notation, but where
[x:y] means bit indexing) looks like:

BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0]

I believe the AMD 29K had a similar pair of instructions.
One problem with these types of instructions is that most
people are so used to poor perfomance with bitfields that
they code their own using the &, ~, ^, and | instructions
in C, rather than using the compiler's built in code gen-
eration.  This means that the instructions won't be used
much by existing software, so they won't increase bench-
mark performance very much, so they won't be included (1/2 :-).
Also to do the BITEXT (or your SPLICE) instruction ef-
ficiently, your barrel shifter has to become part of the
ALU datapath.  This may turn the whole thing into a net loss.

BTW, does anyone have an instruction format which uses the
source register specifiers as an immediate small constant
instead of a register specifier?  It seems to me, that as
most offsets and constants are small, the use of these
fields as direct operands, rather than having to burn a
register and an instruction for loading a small constant,
would be a win (even though it does start to look like
dat ol' debbil "addressing modes" :-).
-- 

Frank A. Adrian
Mentor Graphics, Inc.
franka@mntgfx.com

djh@osc.edu (David Heisterberg) (02/11/90)

In article <79N15F5xds13@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes:
> In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
> > In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
> > >Of course, it would be nice if there were a single machine instruction that
> > >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where
> > >"mask" is either a field of 1 bits followed by a field of zero bits, or else
> > >vice versa.
> 
> To really make this useful you need a "load word bit-aligned" operator, or
> you're still going to be dominated by masking and shifting to get the source
> word into memory in the first place. It'd also satisfy the people who want
> wimpy stuff like byte-aligned reads.
> 
> > This might be a candidate instruction for future RISC designs.  Unfortunately
> > it has the difficulty of being a three operand instruction,
> 
> Why is that a difficulty? Three-operand instructions are common in RISCs.
> -- 
Note that this is a three *source* operand instruction.  As I understand it,
one of the principles behind 3 operand instructions is idempotency.  CRAY
systems implement this instruction, the scalar merge, and have to sacrifice
idempotency in this case by overwriting one of the source operands.  The
usual syntax is:

	Si	Sj!Si&Sk

That is, register Si is replaced by ((Si & Sk) | (Sj & ~Sk)).  Most other
CRAY scalar instructions are of the form:

	Si	Sj op Sk

I don't understand your comment about needing a "load word bit-aligned"
operator; CRAYs certainly don't have such a thing, and are not byte adressable,
yet the scalar merge is a useful instruction.

David J. Heisterberg			djh@osc.edu
The Ohio Supercomputer Center		djh@ohstpy.binet
Columbus, Ohio				ohstpy::djh

peter@ficc.uu.net (Peter da Silva) (02/12/90)

In article <139@oscsunb.osc.edu> djh@osc.edu (David Heisterberg) writes:
> I don't understand your comment about needing a "load word bit-aligned"
> operator; CRAYs certainly don't have such a thing, and are not byte adressable,
> yet the scalar merge is a useful instruction.

The most common compute-intensive use of scalar merge (your terminology) in
workstations seems to be in display management: the BitBlt operator in one of
it's many guises.  And BitBlt would benefit from bit-aligned LOAD, an operation
that's no more expensive in real-estate than byte-aligned LOAD... a barrel-
shifter is a barrel-shifter is a rose.
-- 
 _--_|\  Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \
\_.--._/ Xenix Support -- it's not just a job, it's an adventure!
      v  "Have you hugged your wolf today?" `-_-'

bvs@light.uucp (Bakul Shah) (02/12/90)

In article <1990Feb10.154033.4271@mentor.com> franka@mntgfx.UUCP (Frank A. Adrian) writes:
>In article <7366@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>>BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm)
>
>Instead of your proposed SPLICE instruction, I'd recommend a
>BITEXT instruction which (using your notation, but where
>[x:y] means bit indexing) looks like:
>
>BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0]
>
>I believe the AMD 29K had a similar pair of instructions.

BMERGE, BITEXT (bitfield extract), a bitfield extract with sign-
extension and a bitfield insert instns can all be useful though
the 29K does not have any of these.  It does have `extract' that
concatenates its two src operands into one 64 bit string, shifts
it left by some amount (less than 32, stored in a special
register) and sticks the *high* order bits in the dst reg.  For
example:

	   A        B
	12345678 9abcdef0
	     |||/////
	     6789abcd           C = ((A // B) << 12)[63:32]

			       where // is the concatenate operator

This is quite a versatile instn.  By making either A or B all
zeroes you can extract the top or bottom N bits.  If A & B are the
same register, you have a rotate by N instn.  By using succesive
registers in a sequence of extracts you can move an arbitrarily
long bitstring.  For example, to shift a 96 bit string left by 12
bits, you do the following:

	mtsr    FC, 12  ; funnelshift count = 12
    ; lr10 = 0, lr11,lr12,lr13 = the 96 bit string
	extract lr10, lr10, lr11
	extract lr11, lr11, lr12
	extract lr12, lr12, lr13
    ; lr10//lr11//lr12 = ((lr10//lr11//lr12//lr13) << 12)[127:32]

The last one is very handy for moving bitmaps around (large
bitmaps can be moved at about 3 cycles / 32 bits, assuming memory
can sustain single cycle access in page/static-column mode).

But extract is not ideal for bitfield operations (extracting /
inserting middle bits of a word will take about 4 instns).
Though, I do think it is possible to do bitfield{extract,extract-
wtih-sign-extend,insert} in single cycle -- the 29k already does
something similar for bytes.

>BTW, does anyone have an instruction format which uses the
>source register specifiers as an immediate small constant
>instead of a register specifier?

The 29K does.

-- Bakul Shah ..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs

djh@osc.edu (David Heisterberg) (02/12/90)

In article <Q0P1T38xds13@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes
> 
> The most common compute-intensive use of scalar merge (your terminology) in
> workstations seems to be in display management: the BitBlt operator in one of
> it's many guises.  And BitBlt would benefit from bit-aligned LOAD, an operation
> that's no more expensive in real-estate than byte-aligned LOAD... a barrel-
> shifter is a barrel-shifter is a rose.

"scalar merge" is CRI's terminology, not mine, and, unfortunately, one
can't quite consider a CRAY to be a workstation.  In the case of BitBlt,
though, you're probably right.  However, that's not what I (or the cft77
compiler) use it for, but then there's more of you than there are of me.

In my experience, the main use of scalar merge is in generating masks
for vectorization code.  This is somewhat specialized, I must admit,
but unless your scalar support code is very fast, it, rather than
vector operations, will dominate the run time.

David J. Heisterberg			djh@osc.edu
The Ohio Supercomputer Center		djh@ohstpy.bitnet
Columbus, Ohio				ohstpy::djh

fotland@hpihoah.HP.COM (David Fotland) (02/13/90)

>Instead of your proposed SPLICE instruction, I'd recommend a
>BITEXT instruction which (using your notation, but where
>[x:y] means bit indexing) looks like:

>BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0]

HP Precision Architecture has similar "extract" instructions:

VEXTRU r,len,t
(Variable extract unsigned)

The bit field in register r starting at the bit position specified by
the shift amount register, of length len (constant), is extracted and shifted
to the right of the target register.  The rest of the target is zeroed.
There is also a signed version that sign extends, and a version where the
starting bit position is in the instruction rather than the shift amount 
register.

The complementary "deposit" intructions take a right justified bit field
from the source and merges it into the target at a specified bit position.
There is also a zero and deposit that zeros the target before inserting
the bits instead of merging the bits.

The barrel shifter is in parallel with the ALU so it has little effect on
the cycle time.

>BTW, does anyone have an instruction format which uses the
>source register specifiers as an immediate small constant
>instead of a register specifier?  It seems to me, that as
>most offsets and constants are small, the use of these
>fields as direct operands, rather than having to burn a
>register and an instruction for loading a small constant,
>would be a win (even though it does start to look like
>dat ol' debbil "addressing modes" :-).

HP-PA uses register specifiers as small constants in several instructions.

There are load instructions that do base plus small immediate or base plus
index addressing.  The small immediate form has the immediate in the same place
as the index register specifier in the other instruction.

"Move immediate and branch" moves a small immediate into a register and 
branches.

"Compare immediate" and "add immediate and branch" both use 5 bit immediates in
the register field.

"Deposit immediate" deposits from a 5 bit sign extended immediate.

Some of the system control instructions use the 5 bit immediate.  The 5 bit
immediate always appears in the second register specifier field.

David Fotland

danh@halley.UUCP (Dan Hendrickson) (02/13/90)

In article <79N15F5xds13@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>> >Of course, it would be nice if there were a single machine instruction that
>> >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where
>> >"mask" is either a field of 1 bits followed by a field of zero bits, or else
>> >vice versa.
>
>> This might be a candidate instruction for future RISC designs.  Unfortunately
>> it has the difficulty of being a three operand instruction,
>
>Why is that a difficulty? Three-operand instructions are common in RISCs.

Yeah, but the three operands consist of two sources and one destination.  To
implement another path out of the register file to the ALU for a single inst.
would be a minimal gain.  "But then we could add this other neat xyz inst..."
Gee, isn't that why CISC's were invented?  Everyone on the architecture team
had their favorite instruction, and they wanted it included.  I would be
curious to see how often the above "bit mask" instruction would be used in
compiler-generated code.  Would the instruction be worthwhile if it slowed
the cycle time of the machine down by 10% because of the additional register
port?  Ain't nothin for free.

danh@halley.UUCP (Dan Hendrickson) (02/13/90)

}BTW, does anyone have an instruction format which uses the
}source register specifiers as an immediate small constant
}instead of a register specifier?
}-- 
Every ALU instruction in SPARC can use one of the source register fields
as a small (11 bit?) signed immediate field.  In the MIPS architecture,
there are seperate instructions for instructions which use immediate
values (16 bit signed).

peter@ficc.uu.net (Peter da Silva) (02/14/90)

In article <140@illini.osc.edu> djh@osc.edu (David Heisterberg) writes:
> "scalar merge" is CRI's terminology, not mine, and, unfortunately, one
> can't quite consider a CRAY to be a workstation.

One of these days...

> However, that's not what I (or the cft77
> compiler) use it for, but then there's more of you than there are of me.

Actually, my home workstation (an Amiga) has a BitBlt coprocessor already.
-- 
 _--_|\  Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \
\_.--._/ Xenix Support -- it's not just a job, it's an adventure!
      v  "Have you hugged your wolf today?" `-_-'

baum@Apple.COM (Allen J. Baum) (02/14/90)

[]
>In article <1990Feb10.154033.4271@mentor.com> franka@mntgfx.UUCP (Frank A. Adrian) writes:
>Instead of your proposed SPLICE instruction, I'd recommend a
>BITEXT instruction which (using your notation, but where
>[x:y] means bit indexing) looks like:
>.....
>BTW, does anyone have an instruction format which uses the
>source register specifiers as an immediate small constant
>instead of a register specifier?

A quick summer of the HP precision bitfield instructions:

Extract src<start:start+length-1> -> dest
	start, length immediates in inst. use MSB if start+length-1>31
	takes a contiguous bit field and right justifies it
	variations:	sign extend
			take start pos. from special reg.
Deposit src->dest<start:start+length-1> -> dest
	start, length immediates in inst. use MSB if start+length-1>31
	takes a contiguous bit field and right justifies it
	variations:	zero bits outside field
			take start pos. from special reg.
			use 5 bit signed imm. as src
Double shift src1 cat src2 <start:start+31> -> dest
	start is an  immediate in inst.
	aligns a contiguous word that spans two regs.
	variations:	take start from special reg.


Note that desposit with an immediate source can be used as a bit/bitfield
set or clear.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

ccplumb@lion.waterloo.edu (Colin Plumb) (03/01/90)

In article <2377@castle.ed.ac.uk> aiadrmi@castle.ed.ac.uk (Alasdair Donald Robert McIntyre) writes:
>torbenm@diku.dk (Torben [gidius Mogensen) writes:
>>
>>On the ARM2 (VL 86C010) and ARM3 processors from Acorn Computers / VLSI
>>Technology, the load instruction can specify either a 12 bit signed constant
>>shifted to anywhere in a 32 bit word or an address that can be indexed
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> correction - immediate constants *can-not* be shifted

Well, you're sort of right... the standard 12-bit field is 8 bits of immediate
data and 4 bits of rotate amount, which is multiplied by 2 before application.
Thus, the ARM can form byte offsets from 0-255, word offsets from 0-1020,
etc.
-- 
	-Colin

hugo@acorn.co.uk (Hugo "Bignose" Tyson) (03/02/90)

In article <21417@watdragon.waterloo.edu> ccplumb@lion.waterloo.edu (Colin Plumb) writes:
>In article <2377@castle.ed.ac.uk> aiadrmi@castle.ed.ac.uk (Alasdair Donald Robert McIntyre) writes:
>>torbenm@diku.dk (Torben [gidius Mogensen) writes:
>>>
>>>On the ARM2 (VL 86C010) and ARM3 processors from Acorn Computers / VLSI
>>>Technology, the load instruction can specify either a 12 bit signed constant
>>>shifted to anywhere in a 32 bit word or an address that can be indexed
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> correction - immediate constants *can-not* be shifted
>
>Well, you're sort of right... the standard 12-bit field is 8 bits of immediate
>data and 4 bits of rotate amount, which is multiplied by 2 before application.
>Thus, the ARM can form byte offsets from 0-255, word offsets from 0-1020,
>etc.
>-- 
>	-Colin

Chaps, please, you're both both right and wrong:

The immediate field in Data Processing instructions (ADD, ORR, AND,
MOV, CMP etc.) *IS*
>			the standard 12-bit field of 8 bits of immediate
>data and 4 bits of rotate amount, which is multiplied by 2 before application.
Thus you can make 0-255, 256-1020 4-tuples, 1024-4080 16-tuples etc.
etc., including nice numbers like 0xFC000003, 0x0030C000, etc.
	MOV	R0, #0x00034400		; get a number
	ANDS	R0, R1, #0x83000000	; test some flag bits.
	TEQP	PC, #0xFC000002		; select processor mode 2
	ADD	R2, R2, R2, ASL #4	; R2 *= 17
	RSB	R3, R2, R3, ASR #1	; R3 = R3/2 - R2
	AND	R4, R5, R6, ASL R7	; R4 = R5 + R6 << R7
and other shifts by constants (0-31/1-32) or registers here too!

The immediate field in Data Transfer (LDR[B],STR[B]) instructions is a
12 bit *unsigned* number, which can be added to OR subtracted from the
base register depending on another bit entirely in the instruction.
So if you like it's a *13* bit signed number.  It is not shifted at
all.  Ever.
	LDR	R0, [ R1, #4092 ] 	; get a word from far away
	LDRB	R2, [ R3, #4095 ] 	; get a byte from far away
	LDRB	R4, [ R5, -#4095 ]	; get a byte from far backwards
Also
	LDRB	R6, [ R7, R8 ]		; get ((char *)R7)[R8]
	LDR	R9, [ R10, R11 ASL #4 ] ; get ((int *)R10)[R11]
and other shifts by constants (0-31/1-32) here too!

Note that offsets for a word transfer *are* byte resolution; it's just
that non-word aligned word transfers give undefined (well, consistent,
definable, but not guaranteed not to change from one ARM revision to
another; usual thing) answers on current ARMs.

	- Huge

BTW, I posted the truth about this earlier under "Re: New instructions
for RISCs" or whatever that thread was called.

DISCLAIMER: if there's a typo or lie here blame me, not Acorn, I speak
only for myself, etc.