hascall@atanasoff.cs.iastate.edu (John Hascall) (02/04/89)
Well, all this talk about different ways to add endian features to architectures has finally gotten to me. So here is a couple of macros to do endian-reversing moves. What do you think? Any one think of a better method? This one is in VAXmacro, but it should be fairly readable. John Hascall ISU Comp Center -------------------------------cut, fold, spindle, etc here--------- .title ENDTEST .sbttl Ten little endians ; ; Facility: Endian switching macros ; Author: John Hascall ; Written: 2 Feb 1989 ; .page .sbttl Macros ; ; The following macros move a word (16bits) or a longword(32bits) ; while reversing their endian-ness. Note the the condition codes ; are set based on the source operand. ; ; Move and Reverse Endian-ness Word ; .macro MREW src,dst MOVW src,-(SP) ; stack source operand MOVB 0(SP),-(SP) ; re-stack HI byte MOVB 2(SP),-(SP) ; re-stack LO byte MOVW (SP)+,dst ; pop it into destination operand TSTW (SP)+ ; clear stack, set cond codes .endm ; ; Move and Reverse Endian-ness Longword ; .macro MREL src,dst PUSHL src ; stack source operand MOVB 0(SP),-(SP) ; re-stack HI byte MOVB 2(SP),-(SP) ; re-stack HI-MID byte MOVB 4(SP),-(SP) ; re-stack LO-MID byte MOVB 6(SP),-(SP) ; re-stack LO byte MOVL (SP)+,dst ; pop it into destination operand TSTL (SP)+ ; clear stack, set cond code .endm .page .sbttl Readonly and Read/Write Storage .psect READONLY,RD,NOWRT,NOEXE AB: .ascii 'AB' ABCD: .ascii 'ABCD' .psect DATA,RD,WRT,NOEXE DESCR2: .long 2 ; string descriptiors for LIB$PUT_OUTPUT .address TWO DESCR4: .long 4 .address FOUR TWO: .ascii ' ' ; where to put AB and ABCD FOUR: .ascii ' ' .page .sbttl Program ; ; test the MRE{W|L} macros ; .psect CODE,RD,NOWRT,EXE .entry ETEST,^M<> MREW AB,TWO PUSHAQ DESCR2 CALLS #1,g^LIB$PUT_OUTPUT ; expect: BA MREL ABCD,FOUR PUSHAQ DESCR4 CALLS #1,g^LIB$PUT_OUTPUT ; expect: DCBA MOVW AB,R0 MREW R0,TWO PUSHAQ DESCR2 CALLS #1,g^LIB$PUT_OUTPUT ; expect: BA MOVL ABCD,R1 MREL R1,FOUR PUSHAQ DESCR4 CALLS #1,g^LIB$PUT_OUTPUT ; expect: DCBA MREW AB,R0 MREW R0,TWO PUSHAQ DESCR2 CALLS #1,g^LIB$PUT_OUTPUT ; expect: AB MREL ABCD,R1 MREL R1,FOUR PUSHAQ DESCR4 CALLS #1,g^LIB$PUT_OUTPUT ; expect ABCD RET .end ETEST -----------------------------end snipping zone--------------------
jk3k+@andrew.cmu.edu (Joe Keane) (02/07/89)
John Hascall writes: > ; > ; Move and Reverse Endian-ness Longword > ; > .macro MREL src,dst > PUSHL src ; stack source operand > MOVB 0(SP),-(SP) ; re-stack HI byte > MOVB 2(SP),-(SP) ; re-stack HI-MID byte > MOVB 4(SP),-(SP) ; re-stack LO-MID byte > MOVB 6(SP),-(SP) ; re-stack LO byte > MOVL (SP)+,dst ; pop it into destination operand > TSTL (SP)+ ; clear stack, set cond code > .endm That's 11 memory references for something which has nothing to do with memory. It only takes 6 instructions (0 memory references) on the RT.
hascall@atanasoff.cs.iastate.edu (John Hascall) (02/08/89)
In article <MXvZBhy00W428s8WI0@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes: >John Hascall writes: >> ; >> ; Move and Reverse Endian-ness Longword >> ; [....] >That's 11 memory references for something which has nothing to do with memory. >It only takes 6 instructions (0 memory references) on the RT. Well, if you have a better method, then by all means post it. BTW, I could do it in less than 11, if I assumed somethings about the addressing mode of the operands and/or the availability of a register to work in, but for my situation I neeeded a completely general method. John Hascall
zs01+@andrew.cmu.edu (Zalman Stern) (02/08/89)
> *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs Joe* > *Keane@andrew.cmu.edu (713)* > That's 11 memory references for something which has nothing to do with memory. > It only takes 6 instructions (0 memory references) on the RT. > *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs John* > *Hascall@atanasoff.c (636)* > BTW, I could do it in less than 11, if I assumed somethings about > the addressing mode of the operands and/or the availability of > a register to work in, but for my situation I neeeded a completely > general method. > John Hascall The RT has two advantages here. First, has no addressing modes to worry about. Second, the RT has instructions for moving characters around within registers. The first is a result of a RISC load/store architecture. The second might not exactly be RISC, but probably isn't too far off. (I don't have any figures as to how often the MC03 type instructions are used but they shouldn't add much complexity.) For comparison purposes I've jotted down code for the RT, the AMD 29000, and the MIPS R3000. These all assume you are byte switching a value from one register into a distinct register. This is a minor difference from the original macro, but I think it is the reasonable case to consider. Here's the RT code: mc31 SRC, DEST ; DEST[3] = SRC[1] mc23 DEST, DEST ; DEST[2] = DEST[3] mc32 SRC, DEST ; DEST[3] = SRC[2] mc13 DEST, DEST ; DEST[1] = SRC[3] mc30 SRC, DEST ; DEST[3] = SRC[0] mc03 SRC, DEST ; DEST[0] = SRC[3] 6 instructions, no temporary registers or other state modified. The AMD 29000 has byte insertion/extraction functions. In the following, setbp sets a special "byte pointer" register. The exbyte instruction takes the byte of the source addressed by the byte pointer and places it in the low order byte of the destination. (I use constants for the setbp instruction since the actual values depend on the byte order bit in the processor configuration register...) The code looks like so: and DEST, SRC, 0xff ; DEST = SRC & oxff sll DEST, DEST, 8 ; DEST =<< 8 setbp LOWMIDBYTE ; Setup special register with constant exbyte SRC, DEST, DEST ; DEST[0] = SRC[1] essentially sll DEST, DEST, 8 ; DEST =<< 8 setbp LOWHIGHBYTE ; Setup special register with constant exbyte SRC, DEST, DEST ; DEST[0] = SRC[2] sll DEST, DEST, 8 ; DEST =<< 8 setbp HIGHBYTE ; Setup special register with constant exbyte SRC, DEST, DEST ; DEST[0] = SRC[3] 10 single cycle instructions, and no temporary register but the byte pointer is modified. Finally, the R3000 routine, which is fairly generic and could be implemented on other architectures: andi temp, SRC, 0x00ff sll DEST, temp, 24 andi temp, SRC, 0xff00 sll temp, temp, 8 and DEST, temp, DEST lui temp, 0x00ff and temp, SRC, temp srl temp, temp, 8 and DEST, temp, DEST lui temp, 0xff00 and temp, SRC, temp srl temp, temp, 24 and temp, DEST, DEST 13 single cycle instructions, one temporary register. As I don't regularly program in asembly language, much less on all three of the above mentioned processors, I don't make any promises about the correctness or optimality of the above code. It is presented for rough comparison only. I'm sure someone will correct me if I got it wrong :-) Sincerely, Zalman Stern Internet: zs01+@andrew.cmu.edu Usenet: I'm soooo confused... Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890
bimandre@kulcs.uucp (Andre Marien) (02/08/89)
> John Hascall writes: > > ; > > ; Move and Reverse Endian-ness Longword > > ; > > .macro MREL src,dst > > PUSHL src ; stack source operand > > MOVB 0(SP),-(SP) ; re-stack HI byte > > MOVB 2(SP),-(SP) ; re-stack HI-MID byte > > MOVB 4(SP),-(SP) ; re-stack LO-MID byte > > MOVB 6(SP),-(SP) ; re-stack LO byte > > MOVL (SP)+,dst ; pop it into destination operand > > TSTL (SP)+ ; clear stack, set cond code > > .endm > That's 11 memory references for something which has nothing to do with memory. > It only takes 6 instructions (0 memory references) on the RT. This can be done on the VAX too : */ int g,cg; conv() { asm(" movl _g,r0"); asm(" rotl $8,r0,r1"); asm(" rotl $24,r0,r2"); asm(" bicl2 $0xff00ff00,r1"); asm(" bicl2 $0xff00ff,r2"); asm(" addl3 r1,r2,_cg"); } main() { while(1) { scanf("%d",&g); conv(); printf("src = %x dst = %x\n",g,cg); } } /* Andre' Marien B.I.M. Belgium bimandre@cs.kuleuven.ac.be sun!sunuk!sunbim!kulcs!bimandre */
hascall@atanasoff.cs.iastate.edu (John Hascall) (02/08/89)
In article <QXvqHvy00Vs887oooq@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes: >> *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs Joe* >> *Keane@andrew.cmu.edu (713)* >> That's 11 memory references for something which has nothing to do with memory. >> It only takes 6 instructions (0 memory references) on the RT. >The RT has two advantages here. First, has no addressing modes to worry about. >Second, the RT has instructions for moving characters around within registers. >The first is a result of a RISC load/store architecture. The second might not >exactly be RISC, but probably isn't too far off. (I don't have any figures as to >how often the MC03 type instructions are used but they shouldn't add much >complexity.) > ... These all assume you are byte switching a value from one register >into a distinct register. This is a minor difference from the original macro, >but I think it is the reasonable case to consider. >Here's the RT code: >mc31 SRC, DEST ; DEST[3] = SRC[1] >mc23 DEST, DEST ; DEST[2] = DEST[3] >mc32 SRC, DEST ; DEST[3] = SRC[2] >mc13 DEST, DEST ; DEST[1] = SRC[3] >mc30 SRC, DEST ; DEST[3] = SRC[0] >mc03 SRC, DEST ; DEST[0] = SRC[3] At the risk of starting a religous "code-wars", if we assume the case above, (two distinct registers), we can write our VAX code as: ; source = S3,S2,S1,S0 INSV R0,#16,#8,R1 ; R1<7:0> <- R0<23:16> dest = ??,??,??,S2 ROTL #-8,R1 ; rotate R1 8 right dest = S2,??,??,?? INSV R0,#8,#8,R1 ; R1<7:0> <- R0<15:8> dest = S2,??,??,S1 ROTL #-8,R1 ; rotate... dest = S1,S2,??,?? INSV R0,#0,#8,R1 ; R1<7:0> <- R0<7:0> dest = S1,S2,??,S0 ROTL #-8,R1 ; rotate... dest = S0,S1,S2,?? INSV R0,#24,#8,R1 ; R1<7:0> <- R0<31:24> dest = S0,S1,S2,S3 7 instructions, 0 memory references.... [ This will also work with R0 and/or R1 replaced with a memory address (as long as they do not overlap), but that would be hideous :-) ] John Hascall ISU Comp Center p.s. Is ROTL #24,Rx faster/slower than ROTL #-8,Rx ? anyone? I'm guessing it's faster because you can write #24 as a short-literal...
davet@oakhill.UUCP (David Trissel) (02/09/89)
On the M68000 architecture: ROTL.W #8,D0 SWAP.W D0 ROTL.W #8,D0 No extra register or memory references required. -- Dave Trissel Motorola Austin
jensen@granite.dec.com (Paul Jensen) (02/09/89)
In article <772@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes: >In article <MXvZBhy00W428s8WI0@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes: >>John Hascall writes: >>> ; >>> ; Move and Reverse Endian-ness Longword >>> ; > [....] >>That's 11 memory references for something which has nothing to do with memory. >>It only takes 6 instructions (0 memory references) on the RT. > > Well, if you have a better method, then by all means post it. > > BTW, I could do it in less than 11, if I assumed somethings about > the addressing mode of the operands and/or the availability of > a register to work in, but for my situation I neeeded a completely > general method. > > John Hascall I thought I would never write VAX macro again, especially in the context of a subject to which I am utterly indifferent (or agnostic, at any rate). However... if your machine includes rotate instructions, you can do a little better than directly moving the bytes. Rotating by half the word size swaps the upper and lower halves. Rotating a longword by +/- 8 bits moves 2 of the 4 bytes into the correct position. In the context of the VAX, you can save one instruction (while preserving the semantics of your macro) with something like pushl src rotl $8, src, -(sp) movb 4(sp),3(sp) movb 6(sp),1(sp) movl (sp)+,dst tstl (sp)+ I am not certain that one rotl is faster than 2 movb's, but I suspect it is. (The best thing about the endian debate is that it will encourage people to read "Gulliver's Travels", one of the funniest (& most vicious) books ever written.) -- /Paul Jensen Digital Equipment Corp. Palo Alto, CA
jk3k+@andrew.cmu.edu (Joe Keane) (02/09/89)
John Hascall writes VAX code with: > 7 instructions, 0 memory references.... Yeah, but it's 29 bytes instead of 12, and probably a lot slower to boot. Anyway, this is pretty silly; the RT happens to have instructions which make it easy. I don't know why they're there; i've never seen the compiler use them, although gcc might. > p.s. Is ROTL #24,Rx faster/slower than ROTL #-8,Rx ? anyone? > I'm guessing it's faster because you can write #24 as a > short-literal... It should be.
chase@Ozona.orc.olivetti.com (David Chase) (02/09/89)
Didn't we have this discussion just a few months ago already? anyhow, someone writes: >> Well, if you have a better method, then by all means post it. and, authored by some unknown programmer with Acorn, is the 4 instruction, no memory reference endian reversing code sequence: input is in R0, R0 = a,b,c,d r1 := ~ (0xFF << 8) // r1 := F,F,0,F r2 := r0 ^ (r0 ROTR 16) // r2 := a,b,c,d ^ c,d,a,b = a^c,b^d,a^c,b^d r2 := r1 & (r2 >> 8) // r2 := F,F,0,F & 0,a^c,b^d,a^c = 0,a^c,0,a^c r0 := r2 ^ (r0 ROTR 8) // r0 := 0,a^c,0,a^c ^ d,a,b,c = d,c,b,a Each of these lines translates into one Acorn RISC Machine instruction, and each of these instructions executes in one "cycle" (I don't know what they do with their clock, but the rotates don't slow it down). I would expect similar tricks to be possible on the HP Spectrum, since I understand that it also has shift-and-op instructions. (They're great for generating fast constant multiplication sequences, but I KNOW we've already had that discussed here.) Note that the first instruction need not be repeated if you are swapping several registers; thus, it is possible to swap bytes in 12 registers in 37 instructions (of course, then you wait all day to get the registers in and out of memory). I guess I should ask: what else are shift-and-op style instructions good for? So far I have 1) byte swapping 2) fast multiplication by constants David
mark@UNIX386.Convergent.COM (Mark Nudelman) (02/10/89)
In article <37562@oliveb.olivetti.com>, chase@Ozona.orc.olivetti.com (David Chase) writes: > [endian-reversing code optimized for the Acorn RISC machine] Interesting method. This could be rendered in 80386 code as: / input value in %eax movl %eax, %edx / edx = (a b c d) 2 clocks rorl $16, %edx / edx = (c d a b) 3 xorl %eax, %edx / edx = (a^c b^d a^c b^d) 2 shrl $8, %edx / edx = (0 a^c b^d a^c) 3 xorb %dh, %dh / edx = (0 a^c 0 a^c) 2 rorl $8, %eax / eax = (d a b c) 3 xorl %edx, %eax / eax = (d c b a) 2 However, since the 386 has instructions to manipulate bytes within 32 bit registers (and doesn't have the shift-and-op features of the Acorn), a faster and simpler version is: xchgb %al, %ah / eax = (a b d c) 3 clocks rorl $16, %eax / eax = (d c a b) 3 xchgb %al, %ah / eax = (d c b a) 3 Mark Nudelman {sun,decwrl,hplabs}!pyramid!ctnews!UNIX386!mark
maslen@polya.Stanford.EDU (Thomas Maslen) (02/10/89)
In article <780@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes: ... > What an Idea! ... > The "International Obtuse 4-bytes-of-8-bits Endian-reversing > Contest": ... Fine, just don't do it on a normally useful newsgroup such as comp.arch. Thomas Maslen maslen@polya.stanford.edu
cik@l.cc.purdue.edu (Herman Rubin) (02/10/89)
We have seen a substantial discussion on exactly how to reverse endianness in software. Various methods have been proposed, different ones better for different machines. Now several possibilities occur to me. 1. It is not too important to do quickly. In this case, any reasonable method is OK. Whether it pays to do an efficient job becomes debatable, but since it is so short, assembly coding is no problem. 2. It is important to do quickly. That is to say, a significant amount of running time will be tied up by the lack of this. In that case, it should be added to the hardware. A small amount of "wiring" makes this as fast an instruction as a move. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
karl@hpclkwp.HP.COM (Karl Pettis) (02/14/89)
On the Hewlett-Packard Precision Architecture, reversing the bytes in a word can be done in three instructions: ; r1 = "ABCD" SHD r1,r1,16,r2 ; r2 = "CDAB" DEP r2,15,8,r2 ; r2 = "CBAB" SHD r1,r2,8,r2 ; r2 = "DCBA" Explanation: SHD r1,r2,p,t ; Shift Double Concatenates r1 and r2 and shifts the combination right by p bits, putting the low 32 bits of the result into t. DEP r,p,len,t ; Deposit Takes the len low order bits of r and places them into register t with the least significant bit at position p. Other bits of t are unchanged. HPPA numbers bits in a word in big-endian order: 0..31. This code was devised by Jerry Huck. I'm just posting it. - Karl Pettis karl@hpda
mash@mips.COM (John Mashey) (02/15/89)
In article <QXvqHvy00Vs887oooq@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes: >For comparison purposes I've jotted down code for the RT, the AMD 29000, and the >MIPS R3000. These all assume you are byte switching a value from one register >into a distinct register. This is a minor difference from the original macro, .... >Finally, the R3000 routine, which is fairly generic and could be implemented on >other architectures: >andi temp, SRC, 0x00ff >sll DEST, temp, 24 >andi temp, SRC, 0xff00 >sll temp, temp, 8 >and DEST, temp, DEST >lui temp, 0x00ff >and temp, SRC, temp >srl temp, temp, 8 >and DEST, temp, DEST >lui temp, 0xff00 >and temp, SRC, temp >srl temp, temp, 24 >and temp, DEST, DEST > >13 single cycle instructions, one temporary register. Following is a C program that does this, 2 ways: a) Value starts in memory b) Value starts in register The relevant piece of the .s output is shown. The point is NOT that an R3000 is faster than Zalman's example, which was at least a credible try. The point is the relationship of hand-coded routines to code compiled from high-level languages; RISC chips were supposed to be designed for the latter. Could people perhaps post the best compiler-generated code for this function, and compare it with the best hand-coded code? (Use any C code you like). I'll start: C PROGRAM: struct x { unsigned char a, b, c, d}; unsigned y,z; main(p) struct x *p; { y = p->d << 24 | p->c << 16 | p->b << 8 | p->a; z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24); } GEENRATED .s: # 6 y = p->d << 24 | p->c << 16 | p->b << 8 | p->a; lbu $14, 3($4) sll $15, $14, 24 lbu $24, 2($4) sll $25, $24, 16 or $8, $15, $25 lbu $9, 1($4) sll $10, $9, 8 or $11, $8, $10 lbu $12, 0($4) or $13, $11, $12 sw $13, y * NOT COUNTED # 7 z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24); sll $14, $13, 24 and $24, $13, 65280 sll $15, $24, 8 or $25, $14, $15 sll $9, $13, 8 and $8, $9, 65280 or $10, $25, $8 srl $11, $13, 24 or $12, $10, $11 sw $12, z * NOT COUNTED The first case is 10 cycles (+ cache miss, if any); the second case is 9 cycles, both assuming 100% I-cache hit. As noted, I coded the C to take advantage of the fact that ands of 0xffff or less are good things. BTW: this isn't something I'd expect an R3000 to do especially well on. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
andrew@frip.gwd.tek.com (Andrew Klossner) (02/19/89)
And, for completeness, here's the endian-reversal code for the M88k (eyeballed but not tested): ; r2 is both src and dest; r3 is a temp ; r2: 0123 rot r3,r2,24 ; r3: 1230 and.u r3,r3,0xFF ; r3: z230 and r3,r3,0xFF ; r3: z2z0 rot r2,r2,8 ; r2: 3012 and.u r2,r2,0xFF00 ; r2: 3z12 and r2,r2,0xFF00 ; r2: 3z1z or r2,r2,r3 ; r2: 3210 (As you would expect, these are all single-cycle instructions.) -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
yuval@taux01.UUCP (Gideon Yuval) (02/19/89)
If you need wrong-endian language-semantics on right-endian H/W, the fastest way out is as follows: Define a new architecture, in which a byte's logical address is the NEGATION of its physical address (one's complement works better, but that's a 2nd-order effect). the old load/store op-codes become, under this new architecture, load&swap and store&swap op-codes; and if we restrict compiled code to a RISC-style load/store/reg-op architecture, the ONLY op-codes that know about byte-sex are load/store: registers have no high/ low addressing inside them, and memory data have no LSB/MSB in them!. This new architecture (which is simply another way to look at the existing H/W!) lets wrong-endian code get compiled for a right-endian CPU. The main penalty is in pointer-management: to dereference a pointer, we have to negate (or: one's complement) it first; and the same pointer cannot be used for byte, word and Dword access; but the compiler gurus estimate the time-penalty at 10-30% ("C" code), less for Fortran, and as near 0.0% for Cobol as makes no difference. -- Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work) Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel TWX: 33691, fax: +972-52-558322
rpw3@amdcad.AMD.COM (Rob Warnock) (02/25/89)
Took me a little while to get around to it, but below is a fast (4 cycle)
byte-swap for the Am29000, to round out the mix. And here's what I've seem
to date:
machine who reported language cycles notes
======= ============ ======== ====== =====
29k rpw3@amd C 10
MIPS mashey@mips C 9
ARM rwilson@acorn C 7
VAX bimadre@kulcs assembler 5 Instr., not cycles!
88k klossner@tek assembler 7 (2+5n, maybe?)
ARM rwilson@acorn assembler 4 1+3n
29k rpw3@amd assembler 4 1+3n
The "1+3n" for ARM & 29k means that the work done by the first instruction
is not destroyed by the remainder, thus blocks of words can be swapped in
3 cycles, asymptotically. I also think the 88k code could do it in 2+5n if
the 0xFF00FF00 mask were generated explicitly. (But does it matter? Naaah!)
; byte-swap for Am29000: src = a, b, c, d
mtsrim ALU, (1<<5)+24 ; set BP=1, FC=24
extract tmp, src, src ; tmp = d, a, b, c (really a rotate)
exbyte dst, tmp, tmp ; dst = d, a, b, a
inbyte dst, dst, tmp ; dst = d, c, b, a
The 4-cycle method for the 29k thus bears some resemblance to the original
poster's version which did the swap in-memory. In this case, we use the 29k
instructions that support byte load/store (on the 29k you do a word load,
then an EXBYTE.) Normally, the second arg of EXBYTE is an immediate 0, but
here we *use* the fact that EXBYTE is really a specialized merge. In addition,
we use the EXTRACT both-sources-same idiom to get a rotate. Finally, one
instruction is saved by setting up both the Byte Pointer and the Funnel-
shifter Count by loading the entire ALU status reg, rather than explicit
loads of the equivalent BP and FC regs.
(Makes one almost wish for an explicit ROTATE, rather than the 2-cycle
"set FC, extract" idiom. But rotates of varying values probably aren't
worth it, and the 29k can do several rotates of the same amount in 1+1n.)
The C version was compiled with the Metaware C compiler with "-O" turned
on (and a lot of compiler noise elided):
unsigned foo(x)
unsigned int x;
{
return (x << 24) | ((x & 0xff00) << 8) | ((x >> 8) & 0xff00) | (x >> 24);
}
_foo:
sll gr120,lr2,24
const gr119,65280 ; (0xff00)
and gr121,lr2,gr119
sll gr121,gr121,8
or gr120,gr121,gr120
srl gr121,lr2,8
and gr121,gr121,gr119
or gr120,gr121,gr120
srl gr121,lr2,24
jmpi lr0
or gr96,gr121,gr120
-----------
Rob Warnock
Systems Architecture Consultant
UUCP: {amdcad,fortune,sun}!redwood!rpw3
ATTmail: !rpw3
DDD: (415)572-2607
USPS: 627 26th Ave, San Mateo, CA 94403
mash@mips.COM (John Mashey) (02/26/89)
In article <24618@amdcad.AMD.COM> rpw3@amdcad.amd.com (Rob Warnock) writes: >Took me a little while to get around to it, but below is a fast (4 cycle) >byte-swap for the Am29000, to round out the mix. And here's what I've seem >to date: > machine who reported language cycles notes > ======= ============ ======== ====== ===== > 29k rpw3@amd C 10 > MIPS mashey@mips C 9 MIPS mashey@mips assembler 9 (was same as C) > ARM rwilson@acorn C 7 > VAX bimadre@kulcs assembler 5 Instr., not cycles! > 88k klossner@tek assembler 7 (2+5n, maybe?) > ARM rwilson@acorn assembler 4 1+3n > 29k rpw3@amd assembler 4 1+3n Good table. It would really be nice to 1) get the 88K C-compiled code #, and 2) the C & Assembler #s for HP Precision, which ought to be fairly good at this sort of thing, I think. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
w-colinp@microsoft.UUCP (Colin Plumb) (02/27/89)
rpw3@amdcad.amd.com (Rob Warnock) wrote: > (Makes one almost wish for an explicit ROTATE, rather than the 2-cycle > "set FC, extract" idiom. But rotates of varying values probably aren't > worth it, and the 29k can do several rotates of the same amount in 1+1n.) Is there some magical way to synthesize bitfield insertion/extraction that doesn't want lots of shifts? True, these days people don't use a lot of tricky bit-packing, but you still find it in annoying places like device registers. (Read: interrupt service routines. I feel The Need...) For tricky messing about, it would be easier to rotate a bit field to be in the least significant bits and bash on it using short immediate constants, then rotate it back, than to use lots of long immediate masks. (Admittedly, most device registers are 8 bits, so short immediate constants on the 29000 will do fine, but I still think a case can be made.) -- -Colin (uunet!microsoft!w-colinp) "Don't listen to me. I never do."
mahar@weitek.UUCP (Mike Mahar) (02/28/89)
In article <24618@amdcad.AMD.COM> rpw3@amdcad.amd.com (Rob Warnock) writes: >Took me a little while to get around to it, but below is a fast (4 cycle) >byte-swap for the Am29000, to round out the mix. And here's what I've seem >to date: > > machine who reported language cycles notes > ======= ============ ======== ====== ===== > 29k rpw3@amd C 10 > MIPS mashey@mips C 9 > ARM rwilson@acorn C 7 > VAX bimadre@kulcs assembler 5 Instr., not cycles! > 88k klossner@tek assembler 7 (2+5n, maybe?) > ARM rwilson@acorn assembler 4 1+3n > 29k rpw3@amd assembler 4 1+3n > The Weitek XL-Series has an instruction "perfect exchange". It is capable of swaping just about any thing. It does a half word swap, a byte swap, a nybble swap, or a bit swap. pexch rb,im5,ra ra,rb - registers im5 - 5 bit immediate This instruction performs a perfect exchange among the bits of the contents of register rb. The result is placed in register ra. The bits in the immediate field control which parts of the word are exchanged. The most significant bit, of the immediate field, causes the upper half- word to be exchanged with the lower halfword. The second most significant bit causes the lower two bytes to be exchanged and the upper two bytes to be exchanged. The least significant bit in the immediate field causes each pair of adjacent bits to be ex- changed. Assume the immediate field is 1 1 1 1 1. a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F - original word q r s t u v w x y z A B C D E F a b c d e f g h i j k l m n o p - bit 4 applied y z A B C D E F q r s t u v w x i j k l m n o p a b c d e f g h - bit 3 applied C D E F y z A B u v w x q r s t m n o p i j k l e f g h a b c d - bit 2 applied E F C D