gerry@zds-ux.UUCP (Gerry Gleason) (02/09/90)
In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: >Of course, it would be nice if there were a single machine instruction that >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where >"mask" is either a field of 1 bits followed by a field of zero bits, or else >vice versa. This might be a candidate instruction for future RISC designs. Unfortunately it has the difficulty of being a three operand instruction, so you would have to designate a fixed register for the mask operand or it would need a new instruction format (in conflict with an important RISC design goal). The question is does it meet other RISC design criterion, I think it might. Given that RISCs are very often used in workstations with bit mapped screens, and the inner loops of various graphics functions that can use such an instruction have very high frequencies of execution, such an instruction is likely to be a win. Also, three logical operations should be pretty easy for the ALU to do fast enough for a single cycle without stretching the cycle any. Gerry Gleason
alan@oz.nm.paradyne.com (Alan Lovejoy) (02/09/90)
In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes: >In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: >>Of course, it would be nice if there were a single machine instruction that >>could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where >>"mask" is either a field of 1 bits followed by a field of zero bits, or else >>vice versa. >This might be a candidate instruction for future RISC designs. Unfortunately >it has the difficulty of being a three operand instruction, so you would >have to designate a fixed register for the mask operand or it would need >a new instruction format (in conflict with an important RISC design goal). Yes, I was aware of the problem you raise. I'm still thinking about how to ameliorate it. Let me broadcast some ideas on it, perhaps it will inspire someone to generate the optimum solution. What's wanted is an instruction, call it "splice," which does the following: ;assume a word is 32 bits SPLICE Rd, Rs, Rb; Rd = ((Rd >> Rb) << Rb) | ((Rs << (32 - Rb)) >> (32 - Rb)) Contents of Rd, Rs and Rb before instruction executes: Rb = 00000000000000000000000000000111 (7 decimal) Rs = 00000000000000000000000000000000 Rd = 11111111111111111111111111111111 Contents of Rd, Rs and Rb after instruction executes: Rb = 00000000000000000000000000000111 (7 decimal) Rs = 00000000000000000000000000000000 Rd = 11111111111111111111111110000000 So SPLICE moves the rightmost Rb bits of Rs into Rd, leaving the leftmost BitsPerWord-Rb bits of Rd unchanged. Perhaps there should be "splice left" and "splice right" instructions? Perhaps more general, and hence more widely usefull would be a "bit merge" instruction, as follows: BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm) Contents of Rd, Rs and Rm before instruction executes: Rm = 00000000000001111111111100000000 Rs = 11000101011010000000110010110010 Rd = 00011010111101111100000111101000 Contents of Rd, Rs and Rm after instruction executes: Rm = 00000000000001111111111100000000 Rs = 11000101011010000000110010110010 Rd = 00011010111100000000110011101000 For each bit of Rm which is set, BMERGE moves the corresponding bit of Rs into Rd. For each bit of Rm which is clear, BMERGE leaves Rd unchanged. This uses the "destination register" as if it were the destination register in a two-operand machine, only there are three operands. Does this cause undue problems for RISC? ____"Congress shall have the power to prohibit speech offensive to Congress"____ Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL. Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me. Mottos: << Many are cold, but few are frozen. >> << Frigido, ergo sum. >>
peter@ficc.uu.net (Peter da Silva) (02/09/90)
In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes: > In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: > >Of course, it would be nice if there were a single machine instruction that > >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where > >"mask" is either a field of 1 bits followed by a field of zero bits, or else > >vice versa. To really make this useful you need a "load word bit-aligned" operator, or you're still going to be dominated by masking and shifting to get the source word into memory in the first place. It'd also satisfy the people who want wimpy stuff like byte-aligned reads. > This might be a candidate instruction for future RISC designs. Unfortunately > it has the difficulty of being a three operand instruction, Why is that a difficulty? Three-operand instructions are common in RISCs. -- _--_|\ Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>. / \ \_.--._/ Xenix Support -- it's not just a job, it's an adventure! v "Have you hugged your wolf today?" `-_-'
franka@mentor.com (Frank A. Adrian) (02/10/90)
In article <7366@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: >BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm) > >Contents of Rd, Rs and Rm before instruction executes: >Rm = 00000000000001111111111100000000 >Rs = 11000101011010000000110010110010 >Rd = 00011010111101111100000111101000 > >Contents of Rd, Rs and Rm after instruction executes: >Rm = 00000000000001111111111100000000 >Rs = 11000101011010000000110010110010 >Rd = 00011010111100000000110011101000 This is a good idea for an easily implementable instruction. Instead of your proposed SPLICE instruction, I'd recommend a BITEXT instruction which (using your notation, but where [x:y] means bit indexing) looks like: BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0] I believe the AMD 29K had a similar pair of instructions. One problem with these types of instructions is that most people are so used to poor perfomance with bitfields that they code their own using the &, ~, ^, and | instructions in C, rather than using the compiler's built in code gen- eration. This means that the instructions won't be used much by existing software, so they won't increase bench- mark performance very much, so they won't be included (1/2 :-). Also to do the BITEXT (or your SPLICE) instruction ef- ficiently, your barrel shifter has to become part of the ALU datapath. This may turn the whole thing into a net loss. BTW, does anyone have an instruction format which uses the source register specifiers as an immediate small constant instead of a register specifier? It seems to me, that as most offsets and constants are small, the use of these fields as direct operands, rather than having to burn a register and an instruction for loading a small constant, would be a win (even though it does start to look like dat ol' debbil "addressing modes" :-). -- Frank A. Adrian Mentor Graphics, Inc. franka@mntgfx.com
djh@osc.edu (David Heisterberg) (02/11/90)
In article <79N15F5xds13@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes: > In article <168@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes: > > In article <7345@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: > > >Of course, it would be nice if there were a single machine instruction that > > >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where > > >"mask" is either a field of 1 bits followed by a field of zero bits, or else > > >vice versa. > > To really make this useful you need a "load word bit-aligned" operator, or > you're still going to be dominated by masking and shifting to get the source > word into memory in the first place. It'd also satisfy the people who want > wimpy stuff like byte-aligned reads. > > > This might be a candidate instruction for future RISC designs. Unfortunately > > it has the difficulty of being a three operand instruction, > > Why is that a difficulty? Three-operand instructions are common in RISCs. > -- Note that this is a three *source* operand instruction. As I understand it, one of the principles behind 3 operand instructions is idempotency. CRAY systems implement this instruction, the scalar merge, and have to sacrifice idempotency in this case by overwriting one of the source operands. The usual syntax is: Si Sj!Si&Sk That is, register Si is replaced by ((Si & Sk) | (Sj & ~Sk)). Most other CRAY scalar instructions are of the form: Si Sj op Sk I don't understand your comment about needing a "load word bit-aligned" operator; CRAYs certainly don't have such a thing, and are not byte adressable, yet the scalar merge is a useful instruction. David J. Heisterberg djh@osc.edu The Ohio Supercomputer Center djh@ohstpy.binet Columbus, Ohio ohstpy::djh
peter@ficc.uu.net (Peter da Silva) (02/12/90)
In article <139@oscsunb.osc.edu> djh@osc.edu (David Heisterberg) writes: > I don't understand your comment about needing a "load word bit-aligned" > operator; CRAYs certainly don't have such a thing, and are not byte adressable, > yet the scalar merge is a useful instruction. The most common compute-intensive use of scalar merge (your terminology) in workstations seems to be in display management: the BitBlt operator in one of it's many guises. And BitBlt would benefit from bit-aligned LOAD, an operation that's no more expensive in real-estate than byte-aligned LOAD... a barrel- shifter is a barrel-shifter is a rose. -- _--_|\ Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>. / \ \_.--._/ Xenix Support -- it's not just a job, it's an adventure! v "Have you hugged your wolf today?" `-_-'
bvs@light.uucp (Bakul Shah) (02/12/90)
In article <1990Feb10.154033.4271@mentor.com> franka@mntgfx.UUCP (Frank A. Adrian) writes: >In article <7366@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: >>BMERGE Rd, Rs, Rm; Rd = (Rd & ~Rm) | (Rs & Rm) > >Instead of your proposed SPLICE instruction, I'd recommend a >BITEXT instruction which (using your notation, but where >[x:y] means bit indexing) looks like: > >BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0] > >I believe the AMD 29K had a similar pair of instructions. BMERGE, BITEXT (bitfield extract), a bitfield extract with sign- extension and a bitfield insert instns can all be useful though the 29K does not have any of these. It does have `extract' that concatenates its two src operands into one 64 bit string, shifts it left by some amount (less than 32, stored in a special register) and sticks the *high* order bits in the dst reg. For example: A B 12345678 9abcdef0 |||///// 6789abcd C = ((A // B) << 12)[63:32] where // is the concatenate operator This is quite a versatile instn. By making either A or B all zeroes you can extract the top or bottom N bits. If A & B are the same register, you have a rotate by N instn. By using succesive registers in a sequence of extracts you can move an arbitrarily long bitstring. For example, to shift a 96 bit string left by 12 bits, you do the following: mtsr FC, 12 ; funnelshift count = 12 ; lr10 = 0, lr11,lr12,lr13 = the 96 bit string extract lr10, lr10, lr11 extract lr11, lr11, lr12 extract lr12, lr12, lr13 ; lr10//lr11//lr12 = ((lr10//lr11//lr12//lr13) << 12)[127:32] The last one is very handy for moving bitmaps around (large bitmaps can be moved at about 3 cycles / 32 bits, assuming memory can sustain single cycle access in page/static-column mode). But extract is not ideal for bitfield operations (extracting / inserting middle bits of a word will take about 4 instns). Though, I do think it is possible to do bitfield{extract,extract- wtih-sign-extend,insert} in single cycle -- the 29k already does something similar for bytes. >BTW, does anyone have an instruction format which uses the >source register specifiers as an immediate small constant >instead of a register specifier? The 29K does. -- Bakul Shah ..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs
djh@osc.edu (David Heisterberg) (02/12/90)
In article <Q0P1T38xds13@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) writes > > The most common compute-intensive use of scalar merge (your terminology) in > workstations seems to be in display management: the BitBlt operator in one of > it's many guises. And BitBlt would benefit from bit-aligned LOAD, an operation > that's no more expensive in real-estate than byte-aligned LOAD... a barrel- > shifter is a barrel-shifter is a rose. "scalar merge" is CRI's terminology, not mine, and, unfortunately, one can't quite consider a CRAY to be a workstation. In the case of BitBlt, though, you're probably right. However, that's not what I (or the cft77 compiler) use it for, but then there's more of you than there are of me. In my experience, the main use of scalar merge is in generating masks for vectorization code. This is somewhat specialized, I must admit, but unless your scalar support code is very fast, it, rather than vector operations, will dominate the run time. David J. Heisterberg djh@osc.edu The Ohio Supercomputer Center djh@ohstpy.bitnet Columbus, Ohio ohstpy::djh
fotland@hpihoah.HP.COM (David Fotland) (02/13/90)
>Instead of your proposed SPLICE instruction, I'd recommend a >BITEXT instruction which (using your notation, but where >[x:y] means bit indexing) looks like: >BITEXT Rd, Rs, Rdescr; Rd = Rs[Rdescr[11:6] : Rdescr[5:0]] >> Rdescr[5:0] HP Precision Architecture has similar "extract" instructions: VEXTRU r,len,t (Variable extract unsigned) The bit field in register r starting at the bit position specified by the shift amount register, of length len (constant), is extracted and shifted to the right of the target register. The rest of the target is zeroed. There is also a signed version that sign extends, and a version where the starting bit position is in the instruction rather than the shift amount register. The complementary "deposit" intructions take a right justified bit field from the source and merges it into the target at a specified bit position. There is also a zero and deposit that zeros the target before inserting the bits instead of merging the bits. The barrel shifter is in parallel with the ALU so it has little effect on the cycle time. >BTW, does anyone have an instruction format which uses the >source register specifiers as an immediate small constant >instead of a register specifier? It seems to me, that as >most offsets and constants are small, the use of these >fields as direct operands, rather than having to burn a >register and an instruction for loading a small constant, >would be a win (even though it does start to look like >dat ol' debbil "addressing modes" :-). HP-PA uses register specifiers as small constants in several instructions. There are load instructions that do base plus small immediate or base plus index addressing. The small immediate form has the immediate in the same place as the index register specifier in the other instruction. "Move immediate and branch" moves a small immediate into a register and branches. "Compare immediate" and "add immediate and branch" both use 5 bit immediates in the register field. "Deposit immediate" deposits from a 5 bit sign extended immediate. Some of the system control instructions use the 5 bit immediate. The 5 bit immediate always appears in the second register specifier field. David Fotland
danh@halley.UUCP (Dan Hendrickson) (02/13/90)
In article <79N15F5xds13@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes: >> >Of course, it would be nice if there were a single machine instruction that >> >could achieve the effect of "(srcWord & mask) | (destWord & ~mask)," where >> >"mask" is either a field of 1 bits followed by a field of zero bits, or else >> >vice versa. > >> This might be a candidate instruction for future RISC designs. Unfortunately >> it has the difficulty of being a three operand instruction, > >Why is that a difficulty? Three-operand instructions are common in RISCs. Yeah, but the three operands consist of two sources and one destination. To implement another path out of the register file to the ALU for a single inst. would be a minimal gain. "But then we could add this other neat xyz inst..." Gee, isn't that why CISC's were invented? Everyone on the architecture team had their favorite instruction, and they wanted it included. I would be curious to see how often the above "bit mask" instruction would be used in compiler-generated code. Would the instruction be worthwhile if it slowed the cycle time of the machine down by 10% because of the additional register port? Ain't nothin for free.
danh@halley.UUCP (Dan Hendrickson) (02/13/90)
}BTW, does anyone have an instruction format which uses the }source register specifiers as an immediate small constant }instead of a register specifier? }-- Every ALU instruction in SPARC can use one of the source register fields as a small (11 bit?) signed immediate field. In the MIPS architecture, there are seperate instructions for instructions which use immediate values (16 bit signed).
peter@ficc.uu.net (Peter da Silva) (02/14/90)
In article <140@illini.osc.edu> djh@osc.edu (David Heisterberg) writes: > "scalar merge" is CRI's terminology, not mine, and, unfortunately, one > can't quite consider a CRAY to be a workstation. One of these days... > However, that's not what I (or the cft77 > compiler) use it for, but then there's more of you than there are of me. Actually, my home workstation (an Amiga) has a BitBlt coprocessor already. -- _--_|\ Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>. / \ \_.--._/ Xenix Support -- it's not just a job, it's an adventure! v "Have you hugged your wolf today?" `-_-'
baum@Apple.COM (Allen J. Baum) (02/14/90)
[] >In article <1990Feb10.154033.4271@mentor.com> franka@mntgfx.UUCP (Frank A. Adrian) writes: >Instead of your proposed SPLICE instruction, I'd recommend a >BITEXT instruction which (using your notation, but where >[x:y] means bit indexing) looks like: >..... >BTW, does anyone have an instruction format which uses the >source register specifiers as an immediate small constant >instead of a register specifier? A quick summer of the HP precision bitfield instructions: Extract src<start:start+length-1> -> dest start, length immediates in inst. use MSB if start+length-1>31 takes a contiguous bit field and right justifies it variations: sign extend take start pos. from special reg. Deposit src->dest<start:start+length-1> -> dest start, length immediates in inst. use MSB if start+length-1>31 takes a contiguous bit field and right justifies it variations: zero bits outside field take start pos. from special reg. use 5 bit signed imm. as src Double shift src1 cat src2 <start:start+31> -> dest start is an immediate in inst. aligns a contiguous word that spans two regs. variations: take start from special reg. Note that desposit with an immediate source can be used as a bit/bitfield set or clear. -- baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
ccplumb@lion.waterloo.edu (Colin Plumb) (03/01/90)
In article <2377@castle.ed.ac.uk> aiadrmi@castle.ed.ac.uk (Alasdair Donald Robert McIntyre) writes: >torbenm@diku.dk (Torben [gidius Mogensen) writes: >> >>On the ARM2 (VL 86C010) and ARM3 processors from Acorn Computers / VLSI >>Technology, the load instruction can specify either a 12 bit signed constant >>shifted to anywhere in a 32 bit word or an address that can be indexed > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > correction - immediate constants *can-not* be shifted Well, you're sort of right... the standard 12-bit field is 8 bits of immediate data and 4 bits of rotate amount, which is multiplied by 2 before application. Thus, the ARM can form byte offsets from 0-255, word offsets from 0-1020, etc. -- -Colin
hugo@acorn.co.uk (Hugo "Bignose" Tyson) (03/02/90)
In article <21417@watdragon.waterloo.edu> ccplumb@lion.waterloo.edu (Colin Plumb) writes: >In article <2377@castle.ed.ac.uk> aiadrmi@castle.ed.ac.uk (Alasdair Donald Robert McIntyre) writes: >>torbenm@diku.dk (Torben [gidius Mogensen) writes: >>> >>>On the ARM2 (VL 86C010) and ARM3 processors from Acorn Computers / VLSI >>>Technology, the load instruction can specify either a 12 bit signed constant >>>shifted to anywhere in a 32 bit word or an address that can be indexed >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> correction - immediate constants *can-not* be shifted > >Well, you're sort of right... the standard 12-bit field is 8 bits of immediate >data and 4 bits of rotate amount, which is multiplied by 2 before application. >Thus, the ARM can form byte offsets from 0-255, word offsets from 0-1020, >etc. >-- > -Colin Chaps, please, you're both both right and wrong: The immediate field in Data Processing instructions (ADD, ORR, AND, MOV, CMP etc.) *IS* > the standard 12-bit field of 8 bits of immediate >data and 4 bits of rotate amount, which is multiplied by 2 before application. Thus you can make 0-255, 256-1020 4-tuples, 1024-4080 16-tuples etc. etc., including nice numbers like 0xFC000003, 0x0030C000, etc. MOV R0, #0x00034400 ; get a number ANDS R0, R1, #0x83000000 ; test some flag bits. TEQP PC, #0xFC000002 ; select processor mode 2 ADD R2, R2, R2, ASL #4 ; R2 *= 17 RSB R3, R2, R3, ASR #1 ; R3 = R3/2 - R2 AND R4, R5, R6, ASL R7 ; R4 = R5 + R6 << R7 and other shifts by constants (0-31/1-32) or registers here too! The immediate field in Data Transfer (LDR[B],STR[B]) instructions is a 12 bit *unsigned* number, which can be added to OR subtracted from the base register depending on another bit entirely in the instruction. So if you like it's a *13* bit signed number. It is not shifted at all. Ever. LDR R0, [ R1, #4092 ] ; get a word from far away LDRB R2, [ R3, #4095 ] ; get a byte from far away LDRB R4, [ R5, -#4095 ] ; get a byte from far backwards Also LDRB R6, [ R7, R8 ] ; get ((char *)R7)[R8] LDR R9, [ R10, R11 ASL #4 ] ; get ((int *)R10)[R11] and other shifts by constants (0-31/1-32) here too! Note that offsets for a word transfer *are* byte resolution; it's just that non-word aligned word transfers give undefined (well, consistent, definable, but not guaranteed not to change from one ARM revision to another; usual thing) answers on current ARMs. - Huge BTW, I posted the truth about this earlier under "Re: New instructions for RISCs" or whatever that thread was called. DISCLAIMER: if there's a typo or lie here blame me, not Acorn, I speak only for myself, etc.