[comp.arch] Endian reversing MOVEs

hascall@atanasoff.cs.iastate.edu (John Hascall) (02/04/89)

	Well, all this talk about different ways to add endian
	features to architectures has finally gotten to me.  So
	here is a couple of macros to do endian-reversing moves.

	What do you think?  Any one think of a better method?
	This one is in VAXmacro, but it should be fairly readable.

John Hascall
ISU Comp Center

-------------------------------cut, fold, spindle, etc here---------
	.title	ENDTEST
	.sbttl	Ten little endians
;
;	Facility:	Endian switching macros
;	Author:		John Hascall
;	Written:	2 Feb 1989
;
	.page
	.sbttl	Macros
;
;	The following macros move a word (16bits) or a longword(32bits)
;	while reversing their endian-ness.  Note the the condition codes
;	are set based on the source operand.

;
;	Move and Reverse Endian-ness Word
;
	.macro	MREW	src,dst
	MOVW	src,-(SP)		; stack source operand
	MOVB	0(SP),-(SP)		; re-stack HI byte
	MOVB	2(SP),-(SP)		; re-stack LO byte
	MOVW	(SP)+,dst		; pop it into destination operand
	TSTW	(SP)+			; clear stack, set cond codes
	.endm
;
;	Move and Reverse Endian-ness Longword
;
	.macro	MREL	src,dst
	PUSHL	src			; stack source operand
	MOVB	0(SP),-(SP)		; re-stack HI byte
	MOVB	2(SP),-(SP)		; re-stack HI-MID byte
	MOVB	4(SP),-(SP)		; re-stack LO-MID byte
	MOVB	6(SP),-(SP)		; re-stack LO byte
	MOVL	(SP)+,dst		; pop it into destination operand
	TSTL	(SP)+			; clear stack, set cond code
	.endm

	.page
	.sbttl	Readonly and Read/Write Storage
	.psect	READONLY,RD,NOWRT,NOEXE

AB:	.ascii	'AB'
ABCD:	.ascii	'ABCD'


	.psect	DATA,RD,WRT,NOEXE

DESCR2:	.long	2		; string descriptiors for LIB$PUT_OUTPUT
	.address TWO
DESCR4:	.long	4
	.address FOUR

TWO:	.ascii	'    '		; where to put AB and ABCD
FOUR:	.ascii	'    '


	.page
	.sbttl	Program
;
;	test the MRE{W|L} macros
;
	.psect	CODE,RD,NOWRT,EXE
	.entry	ETEST,^M<>
	MREW	AB,TWO
	PUSHAQ	DESCR2
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect: BA

	MREL	ABCD,FOUR
	PUSHAQ	DESCR4
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect: DCBA

	MOVW	AB,R0
	MREW	R0,TWO
	PUSHAQ	DESCR2
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect: BA

	MOVL	ABCD,R1
	MREL	R1,FOUR
	PUSHAQ	DESCR4
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect: DCBA

	MREW	AB,R0
	MREW	R0,TWO
	PUSHAQ	DESCR2
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect: AB

	MREL	ABCD,R1
	MREL	R1,FOUR
	PUSHAQ	DESCR4
	CALLS	#1,g^LIB$PUT_OUTPUT		; expect ABCD

	RET

	.end	ETEST
-----------------------------end snipping zone--------------------

jk3k+@andrew.cmu.edu (Joe Keane) (02/07/89)

John Hascall writes:
> ;
> ;     Move and Reverse Endian-ness Longword
> ;
>       .macro  MREL    src,dst
>       PUSHL   src                     ; stack source operand
>       MOVB    0(SP),-(SP)             ; re-stack HI byte
>       MOVB    2(SP),-(SP)             ; re-stack HI-MID byte
>       MOVB    4(SP),-(SP)             ; re-stack LO-MID byte
>       MOVB    6(SP),-(SP)             ; re-stack LO byte
>       MOVL    (SP)+,dst               ; pop it into destination operand
>       TSTL    (SP)+                   ; clear stack, set cond code
>       .endm

That's 11 memory references for something which has nothing to do with memory.
It only takes 6 instructions (0 memory references) on the RT.

hascall@atanasoff.cs.iastate.edu (John Hascall) (02/08/89)

In article <MXvZBhy00W428s8WI0@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>John Hascall writes:
>> ;
>> ;     Move and Reverse Endian-ness Longword
>> ;
           [....]
>That's 11 memory references for something which has nothing to do with memory.
>It only takes 6 instructions (0 memory references) on the RT.

   Well, if you have a better method, then by all means post it.

   BTW, I could do it in less than 11, if I assumed somethings about
   the addressing mode of the operands and/or the availability of
   a register to work in, but for my situation I neeeded a completely
   general method.

   John Hascall

zs01+@andrew.cmu.edu (Zalman Stern) (02/08/89)

> *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs Joe*
> *Keane@andrew.cmu.edu (713)*
> That's 11 memory references for something which has nothing to do with memory.
> It only takes 6 instructions (0 memory references) on the RT.
> *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs John*
> *Hascall@atanasoff.c (636)*
>    BTW, I could do it in less than 11, if I assumed somethings about
>    the addressing mode of the operands and/or the availability of
>    a register to work in, but for my situation I neeeded a completely
>    general method.
>    John Hascall
The RT has two advantages here. First, has no addressing modes to worry about.
Second, the RT has instructions for moving characters around within registers.
The first is a result of a RISC load/store architecture. The second might not
exactly be RISC, but probably isn't too far off. (I don't have any figures as to
how often the MC03 type instructions are used but they shouldn't add much
complexity.)

For comparison purposes I've jotted down code for the RT, the AMD 29000, and the
MIPS R3000. These all assume you are byte switching a value from one register
into a distinct register. This is a minor difference from the original macro,
but I think it is the reasonable case to consider.

Here's the RT code:

mc31    SRC, DEST       ; DEST[3] = SRC[1]
mc23    DEST, DEST      ; DEST[2] = DEST[3]
mc32    SRC, DEST       ; DEST[3] = SRC[2]
mc13    DEST, DEST      ; DEST[1] = SRC[3]
mc30    SRC, DEST       ; DEST[3] = SRC[0]
mc03    SRC, DEST       ; DEST[0] = SRC[3]

6 instructions, no temporary registers or other state modified.

The AMD 29000 has byte insertion/extraction functions. In the following, setbp
sets a special "byte pointer" register. The exbyte instruction takes the byte of
the source addressed by the byte pointer and places it in the low order byte of
the destination. (I use constants for the setbp instruction since the actual
values depend on the byte order bit in the processor configuration register...)
The code looks like so:

and     DEST, SRC, 0xff         ; DEST = SRC & oxff
sll     DEST, DEST, 8           ; DEST =<< 8
setbp   LOWMIDBYTE              ; Setup special register with constant
exbyte  SRC, DEST, DEST ; DEST[0] = SRC[1] essentially
sll     DEST, DEST, 8           ; DEST =<< 8
setbp   LOWHIGHBYTE             ; Setup special register with constant
exbyte  SRC, DEST, DEST ; DEST[0] = SRC[2]
sll     DEST, DEST, 8           ; DEST =<< 8
setbp   HIGHBYTE                ; Setup special register with constant
exbyte  SRC, DEST, DEST ; DEST[0] = SRC[3]

10 single cycle instructions, and no temporary register but the byte pointer is
modified.

Finally, the R3000 routine, which is fairly generic and could be implemented on
other architectures:

andi    temp, SRC, 0x00ff
sll     DEST, temp, 24
andi    temp, SRC, 0xff00
sll     temp, temp, 8
and     DEST, temp, DEST
lui     temp, 0x00ff
and     temp, SRC, temp
srl     temp, temp, 8
and     DEST, temp, DEST
lui     temp, 0xff00
and     temp, SRC, temp
srl     temp, temp, 24
and     temp, DEST, DEST

13 single cycle instructions, one temporary register.

As I don't regularly program in asembly language, much less on all three of the
above mentioned processors, I don't make any promises about the correctness or
optimality of the above code. It is presented for rough comparison only. I'm
sure someone will correct me if I got it wrong :-)

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890

bimandre@kulcs.uucp (Andre Marien) (02/08/89)

> John Hascall writes:
> > ;
> > ;     Move and Reverse Endian-ness Longword
> > ;
> >       .macro  MREL    src,dst
> >       PUSHL   src                     ; stack source operand
> >       MOVB    0(SP),-(SP)             ; re-stack HI byte
> >       MOVB    2(SP),-(SP)             ; re-stack HI-MID byte
> >       MOVB    4(SP),-(SP)             ; re-stack LO-MID byte
> >       MOVB    6(SP),-(SP)             ; re-stack LO byte
> >       MOVL    (SP)+,dst               ; pop it into destination operand
> >       TSTL    (SP)+                   ; clear stack, set cond code
> >       .endm

> That's 11 memory references for something which has nothing to do with memory.
> It only takes 6 instructions (0 memory references) on the RT.

This can be done on the VAX too :
*/
int g,cg;

conv()
{
  asm("	movl	_g,r0");
  asm("	rotl	$8,r0,r1");
  asm("	rotl	$24,r0,r2");
  asm("	bicl2	$0xff00ff00,r1");
  asm("	bicl2	  $0xff00ff,r2");
  asm("	addl3	r1,r2,_cg");
}

main()
{ while(1)
  { scanf("%d",&g);
    conv();
    printf("src = %x dst = %x\n",g,cg);
  }
}

/*
Andre' Marien
B.I.M.
Belgium
bimandre@cs.kuleuven.ac.be
sun!sunuk!sunbim!kulcs!bimandre
*/

hascall@atanasoff.cs.iastate.edu (John Hascall) (02/08/89)

In article <QXvqHvy00Vs887oooq@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>> *Excerpts from ext.nn.comp.arch: 7-Feb-89 Re: Endian reversing MOVEs Joe*
>> *Keane@andrew.cmu.edu (713)*
>> That's 11 memory references for something which has nothing to do with memory.
>> It only takes 6 instructions (0 memory references) on the RT.

>The RT has two advantages here. First, has no addressing modes to worry about.
>Second, the RT has instructions for moving characters around within registers.
>The first is a result of a RISC load/store architecture. The second might not
>exactly be RISC, but probably isn't too far off. (I don't have any figures as to
>how often the MC03 type instructions are used but they shouldn't add much
>complexity.)

>       ... These all assume you are byte switching a value from one register
>into a distinct register. This is a minor difference from the original macro,
>but I think it is the reasonable case to consider.

>Here's the RT code:

>mc31    SRC, DEST       ; DEST[3] = SRC[1]
>mc23    DEST, DEST      ; DEST[2] = DEST[3]
>mc32    SRC, DEST       ; DEST[3] = SRC[2]
>mc13    DEST, DEST      ; DEST[1] = SRC[3]
>mc30    SRC, DEST       ; DEST[3] = SRC[0]
>mc03    SRC, DEST       ; DEST[0] = SRC[3]

   At the risk of starting a religous "code-wars", if we assume the case
   above, (two distinct registers), we can write our VAX code as:

                                 ;                      source = S3,S2,S1,S0
      INSV    R0,#16,#8,R1       ; R1<7:0> <- R0<23:16>   dest = ??,??,??,S2
      ROTL    #-8,R1             ; rotate R1 8 right      dest = S2,??,??,??
      INSV    R0,#8,#8,R1        ; R1<7:0> <- R0<15:8>    dest = S2,??,??,S1
      ROTL    #-8,R1             ; rotate...              dest = S1,S2,??,??
      INSV    R0,#0,#8,R1        ; R1<7:0> <- R0<7:0>     dest = S1,S2,??,S0
      ROTL    #-8,R1             ; rotate...              dest = S0,S1,S2,??
      INSV    R0,#24,#8,R1       ; R1<7:0> <- R0<31:24>   dest = S0,S1,S2,S3

   7 instructions, 0 memory references....

 [ This will also work with R0 and/or R1 replaced with a memory address (as long
   as they do not overlap), but that would be hideous :-)  ]

  John Hascall
  ISU Comp Center

  p.s.  Is  ROTL #24,Rx  faster/slower than  ROTL #-8,Rx  ?  anyone?
	I'm guessing it's faster because you can write #24 as a
	short-literal...

davet@oakhill.UUCP (David Trissel) (02/09/89)

On the M68000 architecture:

	ROTL.W   #8,D0
	SWAP.W   D0
	ROTL.W   #8,D0

No extra register or memory references required.

-- Dave Trissel   Motorola Austin

jensen@granite.dec.com (Paul Jensen) (02/09/89)

In article <772@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
>In article <MXvZBhy00W428s8WI0@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
>>John Hascall writes:
>>> ;
>>> ;     Move and Reverse Endian-ness Longword
>>> ;
>           [....]
>>That's 11 memory references for something which has nothing to do with memory.
>>It only takes 6 instructions (0 memory references) on the RT.
>
>   Well, if you have a better method, then by all means post it.
>
>   BTW, I could do it in less than 11, if I assumed somethings about
>   the addressing mode of the operands and/or the availability of
>   a register to work in, but for my situation I neeeded a completely
>   general method.
>
>   John Hascall

I thought I would never write VAX macro again, especially
in the context of a subject to which I am utterly indifferent (or
agnostic, at any rate).  However... if your machine includes
rotate instructions, you can do a little better than directly moving
the bytes.  Rotating by half the word size swaps
the upper and lower halves.  Rotating a longword by +/- 8 bits
moves 2 of the 4 bytes into the correct position.  In the context
of the VAX, you can save one instruction (while preserving
the semantics of your macro) with something like

        pushl   src
        rotl    $8, src, -(sp)
        movb    4(sp),3(sp)
        movb    6(sp),1(sp)
        movl    (sp)+,dst
        tstl    (sp)+

I am not certain that one rotl is faster than 2 movb's, but I suspect it is.
(The best thing about the endian debate is that it will encourage people
to read "Gulliver's Travels", one of the funniest (& most vicious) books
ever written.)


-- 
					/Paul Jensen
					 Digital Equipment Corp.
					 Palo Alto, CA

jk3k+@andrew.cmu.edu (Joe Keane) (02/09/89)

John Hascall writes VAX code with:
> 7 instructions, 0 memory references....
Yeah, but it's 29 bytes instead of 12, and probably a lot slower to boot.
Anyway, this is pretty silly; the RT happens to have instructions which make it
easy.  I don't know why they're there; i've never seen the compiler use them,
although gcc might.

>  p.s.  Is  ROTL #24,Rx  faster/slower than  ROTL #-8,Rx  ?  anyone?
>       I'm guessing it's faster because you can write #24 as a
>       short-literal...
It should be.

chase@Ozona.orc.olivetti.com (David Chase) (02/09/89)

Didn't we have this discussion just a few months ago already?

anyhow, someone writes:
>>   Well, if you have a better method, then by all means post it.

and, authored by some unknown programmer with Acorn, is the 4
instruction, no memory reference endian reversing code sequence:

input is in R0, R0 = a,b,c,d

r1 := ~ (0xFF << 8)     // r1 := F,F,0,F
r2 := r0 ^ (r0 ROTR 16) // r2 := a,b,c,d ^ c,d,a,b = a^c,b^d,a^c,b^d
r2 := r1 & (r2 >> 8)    // r2 := F,F,0,F & 0,a^c,b^d,a^c = 0,a^c,0,a^c
r0 := r2 ^ (r0 ROTR 8)  // r0 := 0,a^c,0,a^c ^ d,a,b,c = d,c,b,a

Each of these lines translates into one Acorn RISC Machine
instruction, and each of these instructions executes in one "cycle" (I
don't know what they do with their clock, but the rotates don't slow
it down).  I would expect similar tricks to be possible on the HP
Spectrum, since I understand that it also has shift-and-op
instructions.  (They're great for generating fast constant
multiplication sequences, but I KNOW we've already had that discussed
here.)

Note that the first instruction need not be repeated if you are
swapping several registers; thus, it is possible to swap bytes in
12 registers in 37 instructions (of course, then you wait all day to
get the registers in and out of memory).

I guess I should ask: what else are shift-and-op style instructions
good for?  So far I have

1) byte swapping
2) fast multiplication by constants

David

mark@UNIX386.Convergent.COM (Mark Nudelman) (02/10/89)

In article <37562@oliveb.olivetti.com>, chase@Ozona.orc.olivetti.com (David Chase) writes:
> [endian-reversing code optimized for the Acorn RISC machine]

Interesting method.  This could be rendered in 80386 code as:

	/ input value in %eax
	movl	%eax, %edx		/ edx = (a b c d)	    2 clocks
	rorl	$16, %edx		/ edx = (c d a b)	    3
	xorl	%eax, %edx		/ edx = (a^c b^d a^c b^d)   2
	shrl	$8, %edx		/ edx = (0 a^c b^d a^c)	    3
	xorb	%dh, %dh		/ edx = (0 a^c 0 a^c)	    2
	rorl	$8, %eax		/ eax = (d a b c)	    3
	xorl	%edx, %eax		/ eax = (d c b a)	    2

However, since the 386 has instructions to manipulate bytes within
32 bit registers (and doesn't have the shift-and-op features of the
Acorn), a faster and simpler version is:

	xchgb	%al, %ah		/ eax = (a b d c)	    3 clocks
	rorl	$16, %eax		/ eax = (d c a b)	    3
	xchgb	%al, %ah		/ eax = (d c b a)	    3

Mark Nudelman
{sun,decwrl,hplabs}!pyramid!ctnews!UNIX386!mark

maslen@polya.Stanford.EDU (Thomas Maslen) (02/10/89)

In article <780@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu
(John Hascall) writes:
...
>   What an Idea!
...
>     The "International Obtuse 4-bytes-of-8-bits Endian-reversing
>     Contest":
...

Fine, just don't do it on a normally useful newsgroup such as comp.arch.

Thomas Maslen					maslen@polya.stanford.edu

cik@l.cc.purdue.edu (Herman Rubin) (02/10/89)

We have seen a substantial discussion on exactly how to reverse endianness
in software.  Various methods have been proposed, different ones better for
different machines.  Now several possibilities occur to me.

1.  It is not too important to do quickly.  In this case, any reasonable
method is OK.  Whether it pays to do an efficient job becomes debatable, but
since it is so short, assembly coding is no problem.

2.  It is important to do quickly.  That is to say, a significant amount of
running time will be tied up by the lack of this.  In that case, it should
be added to the hardware.  A small amount of "wiring" makes this as fast
an instruction as a move.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

karl@hpclkwp.HP.COM (Karl Pettis) (02/14/89)

On the Hewlett-Packard Precision Architecture, reversing the bytes in
a word can be done in three instructions:

                                ;       r1 = "ABCD"
        SHD     r1,r1,16,r2     ;       r2 = "CDAB"
        DEP     r2,15,8,r2      ;       r2 = "CBAB"
        SHD     r1,r2,8,r2      ;       r2 = "DCBA"

Explanation:

        SHD     r1,r2,p,t       ;       Shift Double

          Concatenates r1 and r2 and shifts the combination right by
          p bits, putting the low 32 bits of the result into t.

        DEP     r,p,len,t       ;       Deposit

          Takes the len low order bits of r and places them into
          register t with the least significant bit at position p.
          Other bits of t are unchanged.

HPPA numbers bits in a word in big-endian order:  0..31.

This code was devised by Jerry Huck.  I'm just posting it.

- Karl Pettis
karl@hpda

mash@mips.COM (John Mashey) (02/15/89)

In article <QXvqHvy00Vs887oooq@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:

>For comparison purposes I've jotted down code for the RT, the AMD 29000, and the
>MIPS R3000. These all assume you are byte switching a value from one register
>into a distinct register. This is a minor difference from the original macro,
....
>Finally, the R3000 routine, which is fairly generic and could be implemented on
>other architectures:

>andi    temp, SRC, 0x00ff
>sll     DEST, temp, 24
>andi    temp, SRC, 0xff00
>sll     temp, temp, 8
>and     DEST, temp, DEST
>lui     temp, 0x00ff
>and     temp, SRC, temp
>srl     temp, temp, 8
>and     DEST, temp, DEST
>lui     temp, 0xff00
>and     temp, SRC, temp
>srl     temp, temp, 24
>and     temp, DEST, DEST
>
>13 single cycle instructions, one temporary register.

Following is a C program that does this, 2 ways:
	a) Value starts in memory
	b) Value starts in register
The relevant piece of the .s output is shown.
The point is NOT that an R3000 is faster than Zalman's example, which was at
least a credible try.  The point is the relationship of hand-coded routines
to code compiled from high-level languages; RISC chips were supposed to be
designed for the latter.  Could people perhaps post the best compiler-generated
code for this function, and compare it with the best hand-coded code?
(Use any C code you like).  I'll start:

C PROGRAM:
struct x { unsigned char a, b, c, d};
unsigned y,z;
main(p)
	struct x *p;
{
	y = p->d << 24 | p->c << 16 | p->b << 8 | p->a;
	z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24);
}

GEENRATED .s:
 #   6		y = p->d << 24 | p->c << 16 | p->b << 8 | p->a;
	lbu	$14, 3($4)
	sll	$15, $14, 24
	lbu	$24, 2($4)
	sll	$25, $24, 16
	or	$8, $15, $25
	lbu	$9, 1($4)
	sll	$10, $9, 8
	or	$11, $8, $10
	lbu	$12, 0($4)
	or	$13, $11, $12
	sw	$13, y	* NOT COUNTED

 #   7		z = (y << 24) | ((y & 0xff00) << 8) | ((y << 8) & 0xff00) | (y >> 24);
	sll	$14, $13, 24
	and	$24, $13, 65280
	sll	$15, $24, 8
	or	$25, $14, $15
	sll	$9, $13, 8
	and	$8, $9, 65280
	or	$10, $25, $8
	srl	$11, $13, 24
	or	$12, $10, $11
	sw	$12, z   * NOT COUNTED

The first case is 10 cycles (+ cache miss, if any); the second case is
9 cycles, both assuming 100% I-cache hit.  As noted, I coded the C to take
advantage of the fact that ands of 0xffff or less are good things.
BTW: this isn't something I'd expect an R3000 to do especially well on.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

andrew@frip.gwd.tek.com (Andrew Klossner) (02/19/89)

And, for completeness, here's the endian-reversal code for the M88k
(eyeballed but not tested):

	; r2 is both src and dest; r3 is a temp
					; r2: 0123
	rot	r3,r2,24		; r3: 1230
	and.u	r3,r3,0xFF		; r3: z230
	and	r3,r3,0xFF		; r3: z2z0
	rot	r2,r2,8			; r2: 3012
	and.u	r2,r2,0xFF00		; r2: 3z12
	and	r2,r2,0xFF00		; r2: 3z1z
	or	r2,r2,r3		; r2: 3210

(As you would expect, these are all single-cycle instructions.)

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

yuval@taux01.UUCP (Gideon Yuval) (02/19/89)

If   you  need  wrong-endian  language-semantics  on  right-endian  H/W,  the
fastest way out is as follows:

Define a  new  architecture,  in  which  a  byte's  logical  address  is  the
NEGATION  of  its physical address (one's complement works better, but that's
a  2nd-order  effect).  the  old  load/store  op-codes become, under this new
architecture,   load&swap   and  store&swap  op-codes;  and  if  we  restrict
compiled  code  to  a  RISC-style  load/store/reg-op  architecture,  the ONLY
op-codes that know about byte-sex are load/store:  registers  have  no  high/
low addressing inside them, and memory data have no LSB/MSB in them!.

This  new  architecture  (which is simply another way to look at the existing
H/W!)  lets  wrong-endian  code get compiled for a right-endian CPU. The main
penalty is in pointer-management:  to  dereference  a  pointer,  we  have  to
negate  (or:  one's complement) it first; and the same pointer cannot be used
for byte, word  and  Dword  access;  but  the  compiler  gurus  estimate  the
time-penalty  at  10-30% ("C"  code),  less for Fortran, and as near 0.0% for
Cobol as makes no difference.
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
                                                TWX: 33691, fax: +972-52-558322

rpw3@amdcad.AMD.COM (Rob Warnock) (02/25/89)

Took me a little while to get around to it, but below is a fast (4 cycle)
byte-swap for the Am29000, to round out the mix. And here's what I've seem
to date:

	machine	who reported	language	cycles	notes
	=======	============	========	======	=====
	29k	rpw3@amd	C		10
	MIPS	mashey@mips	C		9
	ARM	rwilson@acorn	C		7
	VAX	bimadre@kulcs	assembler	5	Instr., not cycles!
	88k	klossner@tek	assembler	7	(2+5n, maybe?)
	ARM	rwilson@acorn	assembler	4	1+3n
	29k	rpw3@amd	assembler	4	1+3n

The "1+3n" for ARM & 29k means that the work done by the first instruction
is not destroyed by the remainder, thus blocks of words can be swapped in
3 cycles, asymptotically. I also think the 88k code could do it in 2+5n if
the 0xFF00FF00 mask were generated explicitly. (But does it matter? Naaah!)

	; byte-swap for Am29000:  src = a, b, c, d
	mtsrim	ALU, (1<<5)+24	; set BP=1, FC=24
	extract	tmp, src, src	; tmp = d, a, b, c  (really a rotate)
	exbyte	dst, tmp, tmp	; dst = d, a, b, a
	inbyte	dst, dst, tmp	; dst = d, c, b, a

The 4-cycle method for the 29k thus bears some resemblance to the original
poster's version which did the swap in-memory. In this case, we use the 29k
instructions that support byte load/store (on the 29k you do a word load,
then an EXBYTE.) Normally, the second arg of EXBYTE is an immediate 0, but
here we *use* the fact that EXBYTE is really a specialized merge. In addition,
we use the EXTRACT both-sources-same idiom to get a rotate. Finally, one
instruction is saved by setting up both the Byte Pointer and the Funnel-
shifter Count by loading the entire ALU status reg, rather than explicit
loads of the equivalent BP and FC regs.

(Makes one almost wish for an explicit ROTATE, rather than the 2-cycle
"set FC, extract" idiom. But rotates of varying values probably aren't
worth it, and the 29k can do several rotates of the same amount in 1+1n.)

The C version was compiled with the Metaware C compiler with "-O" turned
on (and a lot of compiler noise elided):

unsigned foo(x)
unsigned int x;
{
  return (x << 24) | ((x & 0xff00) << 8) | ((x >> 8) & 0xff00) | (x >> 24);
}

_foo:
	sll	gr120,lr2,24
	const	gr119,65280	; (0xff00)
	and	gr121,lr2,gr119
	sll	gr121,gr121,8
	or	gr120,gr121,gr120
	srl	gr121,lr2,8
	and	gr121,gr121,gr119
	or	gr120,gr121,gr120
	srl	gr121,lr2,24
	jmpi	lr0
	or	gr96,gr121,gr120

-----------

Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

mash@mips.COM (John Mashey) (02/26/89)

In article <24618@amdcad.AMD.COM> rpw3@amdcad.amd.com (Rob Warnock) writes:
>Took me a little while to get around to it, but below is a fast (4 cycle)
>byte-swap for the Am29000, to round out the mix. And here's what I've seem
>to date:

>	machine	who reported	language	cycles	notes
>	=======	============	========	======	=====
>	29k	rpw3@amd	C		10
>	MIPS	mashey@mips	C		9
	MIPS	mashey@mips	assembler	9	(was same as C)
>	ARM	rwilson@acorn	C		7
>	VAX	bimadre@kulcs	assembler	5	Instr., not cycles!
>	88k	klossner@tek	assembler	7	(2+5n, maybe?)
>	ARM	rwilson@acorn	assembler	4	1+3n
>	29k	rpw3@amd	assembler	4	1+3n

Good table.  It would really be nice to 1) get the 88K C-compiled code #,
and 2) the C & Assembler #s for HP Precision, which ought to be fairly
good at this sort of thing, I think.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

w-colinp@microsoft.UUCP (Colin Plumb) (02/27/89)

rpw3@amdcad.amd.com (Rob Warnock) wrote:
> (Makes one almost wish for an explicit ROTATE, rather than the 2-cycle
> "set FC, extract" idiom. But rotates of varying values probably aren't
> worth it, and the 29k can do several rotates of the same amount in 1+1n.)

Is there some magical way to synthesize bitfield insertion/extraction that
doesn't want lots of shifts?  True, these days people don't use a lot of
tricky bit-packing, but you still find it in annoying places like device
registers.  (Read: interrupt service routines.  I feel The Need...)

For tricky messing about, it would be easier to rotate a bit field to
be in the least significant bits and bash on it using short immediate
constants, then rotate it back, than to use lots of long immediate masks.

(Admittedly, most device registers are 8 bits, so short immediate constants
on the 29000 will do fine, but I still think a case can be made.)
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."

mahar@weitek.UUCP (Mike Mahar) (02/28/89)

In article <24618@amdcad.AMD.COM> rpw3@amdcad.amd.com (Rob Warnock) writes:
>Took me a little while to get around to it, but below is a fast (4 cycle)
>byte-swap for the Am29000, to round out the mix. And here's what I've seem
>to date:
>
>	machine	who reported	language	cycles	notes
>	=======	============	========	======	=====
>	29k	rpw3@amd	C		10
>	MIPS	mashey@mips	C		9
>	ARM	rwilson@acorn	C		7
>	VAX	bimadre@kulcs	assembler	5	Instr., not cycles!
>	88k	klossner@tek	assembler	7	(2+5n, maybe?)
>	ARM	rwilson@acorn	assembler	4	1+3n
>	29k	rpw3@amd	assembler	4	1+3n
>

The Weitek XL-Series has an instruction "perfect exchange".  It is capable
of swaping just about any thing.  It does a half word swap, a byte swap, a
nybble swap, or a bit swap.

pexch rb,im5,ra

ra,rb - registers 
im5 - 5 bit immediate 

This instruction performs
a perfect exchange among the bits of the contents of register rb.
The result is placed in register ra. The bits  in  the  immediate
field  control  which  parts  of the word are exchanged. The most
significant bit, of the immediate field, causes the  upper  half-
word  to  be  exchanged  with the lower halfword. The second most
significant bit causes the lower two bytes to  be  exchanged  and
the upper two bytes to be exchanged. The least significant bit in
the immediate field causes each pair of adjacent bits to  be  ex-
changed.  
Assume the immediate field is 1 1 1 1 1.  
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F - original word
q r s t u v w x y z A B C D E F a b c d e f g h i j k l m n o p - bit 4 applied y z A B C D E F q r s t u v w x i j k l m n o p a b c d e f g h - bit 3 applied C D E F y z A B u v w x q r s t m n o p i j k l e f g h a b c d - bit 2 applied E F C D