[comp.arch] 88k end-for-end routine

andrew@frip.gwd.tek.com (Andrew Klossner) (11/19/88)

[An earlier posting under this title went out with some egregious bugs.
 I cancelled it; my apologies if you saw it.  The lesson: never post something
 when you're due in a meeting in five minutes.]

For comparison, here's an 88k routine to do the bitwise inversion using
a 256-byte lookup table.  If the entire routine and the entire table
are in cache (a reasonable assumption if the routine is heavily used;
the smallest I and D cache sizes are each 16k), then the routine takes
17 cycles, including the return-to-caller.

As my colleagues did, I point out that this was a ten-minute hack, and
I'd welcome suggestions for improvement ... we will actually need a
routine like this in a few months for graphics-intensive work.

; end-for-end routine

; Register usage:
; r2 = parameter (and return value)
; r3 -> 256-byte table of inverted bytes
; r4 = inversion of least significant parameter byte
; r5 = inversion of second parameter byte
; r6 = inversion of third parameter byte
; r7 = inversion of most significant parameter byte
; All of these registers are caller-saved.

	global	_end_for_end
_end_for_end:							; Cycle count

; Load address of inverted byte table.
	or.u	r3,r0,hi16(end_for_end_table)			; 1
	or	r3,r3,lo16(end_for_end_table)			; 2

; Start loading the inverse of the first byte.
	extu	r4,r2,8<0>					; 3
	ld.bu	r4,r3,r4					; 4

; Start loading the inverse of the second byte.
	extu	r5,r2,8<8>					; 5
	ld.bu	r5,r3,r5					; 6

; Start loading the inverse of the third byte.
	extu	r6,r2,8<16>					; 7
	ld.bu	r6,r3,r6					; 8

; Now the data pipeline's full.
; Compute the address of the fourth byte inversion.
	extu	r7,r2,8<24>					; 9

; Stall waiting for the first inverse to come in.
	mak	r2,r4,8<24>					; 10 (r4 stall)

; Start loading the inverse of the fourth byte.
	ld.bu	r7,r3,r7					; 11

; Stall on, then assemble the remaining bytes into the return value.
	mak	r5,r5,8<16>					; 12 (r5 stall)
	or	r2,r2,r5					; 13
	mak	r6,r6,8<8>					; 14 (r6 stall)
	or	r2,r2,r6					; 15
	jmp.n	r1						; 16
	or	r2,r2,r7					; 17 (r7 stall)
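
For anyone who wants to check the byte shuffling without an 88k handy,
here is a rough C sketch of the same table-driven approach.  The posting
does not show how end_for_end_table is filled, so init_end_for_end_table()
below is my assumption (entry i holds byte i with its bit order reversed),
and the C function names are illustrative only.  It says nothing about
cycle counts.

```c
#include <stdint.h>

/* Assumed layout of the 256-byte table: entry i is byte i with its
 * eight bits in reverse order (bit 0 <-> bit 7, bit 1 <-> bit 6, ...). */
static uint8_t end_for_end_table[256];

/* Hypothetical helper: fill the table once before first use. */
static void init_end_for_end_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (i & (1u << bit))
                r |= 0x80u >> bit;      /* mirror this bit within the byte */
        }
        end_for_end_table[i] = (uint8_t)r;
    }
}

/* Mirrors the assembly: look up each of the four bytes, then place the
 * results in the opposite byte positions of the return value. */
static uint32_t end_for_end(uint32_t x)
{
    uint32_t b0 = end_for_end_table[x & 0xffu];          /* like r4 */
    uint32_t b1 = end_for_end_table[(x >> 8) & 0xffu];   /* like r5 */
    uint32_t b2 = end_for_end_table[(x >> 16) & 0xffu];  /* like r6 */
    uint32_t b3 = end_for_end_table[(x >> 24) & 0xffu];  /* like r7 */

    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3;
}
```

Call init_end_for_end_table() once, then end_for_end(x) as often as
needed; applying it twice should give back the original word.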

  -=- Andrew Klossner   (uunet!tektronix!hammer!frip!andrew)    [UUCP]
                        (andrew%frip.gwd.tek.com@relay.cs.net)  [ARPA]