[comp.arch] HW v. SW

hjm@cernvax.UUCP (Hubert Matthews) (11/03/88)

The INMOS T800 has an instruction bitrevword, which turns a
little-endian word into a big-endian word, effectively doing a
reflection in the middle.  Great for FFT shuffle routines.  In
software, it takes quite some time.  In hardware it takes just over 
1 microsecond on a 30MHz part.  There is also a bitrevnbits which
reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz.
Again, not very quick in software.
-- 

	Hubert Matthews

tim@crackle.amd.com (Tim Olson) (11/08/88)

In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
| 
| 
| The INMOS T800 has an instruction bitrevword, which turns a
| little-endian word into a big-endian word, effectively doing a
| reflection in the middle.  Great for FFT shuffle routines.  In
| software, it takes quite some time.  In hardware it takes just over 
| 1 microsecond on a 30MHz part.  There is also a bitrevnbits which
| reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz.
| Again, not very quick in software.

This actually can be done *faster* in software.  1 microsecond @ 30MHz
== 30 cycles, but with an 8-bit table-lookup it can be easily done in
less than 22 cycles.


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

firth@sei.cmu.edu (Robert Firth) (11/08/88)

In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>The INMOS T800 has an instruction bitrevword, which turns a
>little-endian word into a big-endian word, effectively doing a
>reflection in the middle.  Great for FFT shuffle routines.  In
>software, it takes quite some time.  In hardware it takes just over 
>1 microsecond on a 30MHz part.

1 usec at 30MHz is about 30 cycles, I guess.

Here's my quick and dirty attempt at the same in software, on the
MIPS R2000.  We assume DEST := bitrevword(SRC), where both are
4-byte variables, and T is a 256-byte lookup table of reversed
bytes:

	la U0,SRC	; fetch address of SRC
	la U1,DEST	; fetch address of DEST
	la U2,T		; fetch address of T

	lb U3,3(U0)	; get byte 3 of SRC
	lb U4,2(U0)
	add U5,U3,U2	; index lookup table
	lb U6,0(U5)	; get translation
	add U7,U4,U2
	stb U6,0(U1)	; store in byte 0 of DEST
	lb U8,0(U7)
	lb U3,1(U0)
	stb U8,1(U1)
	add U5,U3,U2
	lb U4,0(U0)
	lb U6,0(U5)
	add U7,U4,U2
	lb U8,0(U7)
	stb U6,2(U1)
	stb U8,3(U1)

(the interleaving is necessary to avoid load delays)

The above is 19 cycles, in software.

jangr@microsoft.UUCP (Jan Gray) (11/08/88)

In article <7629@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>Here's my quick and dirty attempt at the same in software, on the
>MIPS R2000.  We assume DEST := bitrevword(SRC), where both are
>4-byte variables, and T is a 256-byte lookup table of reversed
>bytes:

You can avoid all memory references using this slightly slower approach:
	; r0 -- input	
	; r1 -- scratch
	; r2 -- 0x00FF00FF
	; r3 -- 0x0F0F0F0F
	; r4 -- 0x33333333
	; r5 -- 0x55555555
					;abcdefghijklmnopqrstuvwxyz123456
	srl	r1,r0,16 ; swap 16 bits	;----------------abcdefghijklmnop
	sll	r0,r0,16		;qrstuvwxyz123456----------------
	or	r0,r0,r1		;qrstuvwxyz123456abcdefghijklmnop

	and	r1,r0,r2 ; swap 8 bits	;--------yz123456--------ijklmnop
	xor	r0,r0,r1		;qrstuvwx--------abcdefgh--------
	srl	r0,r0,8			;--------qrstuvwx--------abcdefgh
	sll	r1,r1,8			;yz123456--------ijklmnop--------
	or	r0,r0,r1		;yz123456qrstuvwxijklmnopabcdefgh

	and	r1,r0,r3 ; swap 4 bits	;----3456----uvwx----mnop----efgh
	xor	r0,r0,r1		;yz12----qrst----ijkl----abcd----
	srl	r0,r0,4			;----yz12----qrst----ijkl----abcd
	sll	r1,r1,4			;3456----uvwx----mnop----efgh----
	or	r0,r0,r1		;3456yz12uvwxqrstmnopijklefghabcd

	and	r1,r0,r4 ; swap 2 bits	;--56--12--wx--st--op--kl--gh--cd
	xor	r0,r0,r1		;34--yz--uv--qr--mn--ij--ef--ab--
	srl	r0,r0,2			;--34--yz--uv--qr--mn--ij--ef--ab
	sll	r1,r1,2			;56--12--wx--st--op--kl--gh--cd--
	or	r0,r0,r1		;563412yzwxuvstqropmnklijghefcdab

	and	r1,r0,r5 ; swap 1 bits	;-6-4-2-z-x-v-t-r-p-n-l-j-h-f-d-b
	xor	r0,r0,r1		;5-3-1-y-w-u-s-q-o-m-k-i-g-e-c-a-
	srl	r0,r0,1			;-5-3-1-y-w-u-s-q-o-m-k-i-g-e-c-a
	sll	r1,r1,1			;6-4-2-z-x-v-t-r-p-n-l-j-h-f-d-b-
	or	r0,r0,r1		;654321zyxwvutsrqponmlkjihgfedcba

We could even save a couple of instructions if the Rx000 supplied an
"op rd,rs1,rs2<<n" form, a la Acorn or HP Precision Architecture.

According to an old preprint of Knuth chapter 7.1, *any* permutation of n
bits can be achieved in O(lg n) logical operations.

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080

bcase@cup.portal.com (Brian bcase Case) (11/08/88)

>The INMOS T800 has an instruction bitrevword, which turns a
>little-endian word into a big-endian word, effectively doing a
>reflection in the middle.  Great for FFT shuffle routines.  In
>software, it takes quite some time.  In hardware it takes just over 
>1 microsecond on a 30MHz part.  There is also a bitrevnbits which
>reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz.
>Again, not very quick in software.

It must be slimilar for other RISCs (the T800 is not my idea, but...),
but for the 29K, this takes at most 13 instructions or 0.520 microsecond
on a 25 MHz part.  I am sure that a better algorithm can be found that
the one that I thought of in 15 seconds.

daveh@cbmvax.UUCP (Dave Haynie) (11/09/88)

in article <866@cernvax.UUCP>, hjm@cernvax.UUCP (Hubert Matthews) says:

> The INMOS T800 has an instruction bitrevword, which turns a
> little-endian word into a big-endian word, effectively doing a
> reflection in the middle.  Great for FFT shuffle routines.  In
> software, it takes quite some time.  In hardware it takes just over 
> 1 microsecond on a 30MHz part.  

That's 30 cycles, which would be 30 instructions for a true instruction/cycle 
RISC machine.  The claim for the very simple ARM machine was something like 5 
instructions, which would total 10 cycles if the quoted 1.9 cycles per 
instruction is reasonable (and you'd of course have to imagine a 30MHz ARM
machine).  If a baby RISC machines does this 3x more efficiently than the
Transputer's hardwired instruction, I'd certainly expect a more grown-up 
RISC chip like a MIPS or 88K to do better.  I believe there are arguments for
implementing operations, even complex ones, in hardware.  This ain't one of
'em.

> 	Hubert Matthews
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

henry@utzoo.uucp (Henry Spencer) (11/09/88)

In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>The INMOS T800 has an instruction bitrevword, which turns a
>little-endian word into a big-endian word...
>In software, it takes quite some time.  In hardware it takes just over 
>1 microsecond on a 30MHz part.  There is also a bitrevnbits which
>reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz.
>Again, not very quick in software.

Don't be quite so quick to assume that things like this are impossibly
slow in software.  If you're willing to invest some memory in it, most
bit manipulations can be done *much* more quickly by table lookup than
by "doing it the hard way".  Bitrevnbits is mask, add, memory access,
and or; could be as little as 4 (not n+4) cycles on a 29000, for example.
Skip the or if you're only interested in those bits; skip the mask if
you've got them isolated in a register in the first place.  And you can
do any mapping, not just a reversal.

(This admittedly does not apply to bitrevword unless you're willing to
use a biiiiig table! :-))
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

hjm@cernvax.UUCP (Hubert Matthews) (11/09/88)

In article <866@cernvax.UUCP>, hjm@cernvax.UUCP (Hubert Matthews) (me!) says:
>
>> The INMOS T800 has an instruction bitrevword, which turns a
>> little-endian word into a big-endian word, effectively doing a
>> reflection in the middle.  Great for FFT shuffle routines.  In
>> software, it takes quite some time.  In hardware it takes just over 
>> 1 microsecond on a 30MHz part.  
>

OK, OK.  So I picked a lousy implementation of the bitrevword operation.
All I was trying to point out is that reversing the bits in a word can be
done quickly in hardware (should be one cycle), whereas as it takes an
order of magnitude longer in software.  INMOS's microcode isn't very good
on this one - I think it must be using 32 shifts to do it - but it they
used some hardware it would be 1 cycle.

I now realise that my example wasn't pedagogically suitable, so please
stop junking my mailbox :-)
-- 

	Hubert Matthews

dorn@fabscal.UUCP (Alan Dorn Hetzel Jr.) (11/10/88)

Well, when you get right down to it, bit-reversal is the sort of operation
which when done *properly* in hardware, should take *zero* cycles, being
the sort of thing you reduce to an addressing mode.

For example, the Motorola DSP-56000 has an address modifier mode called
"carry-reversal" which causes address increments to start on the high bit
and carry downwards.  This properly produces the order of addresses needed
for the FFT bit-reversed addressing mode, and does it with NO extra clock
cycles above the standard increment mode.

When you think about it, bit-reversal in hardware is little more than
twisting the data path one-half revolution!

Dorn
gatech.edu!fabscal!dorn

robert@hemingway.WEITEK.COM (Robert Plamondon) (11/10/88)

>In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>>The INMOS T800 has an instruction bitrevword, which turns a
>>little-endian word into a big-endian word, effectively doing a
>>reflection in the middle.  Great for FFT shuffle routines.  In
>>software, it takes quite some time.  In hardware it takes just over 
>>1 microsecond on a 30MHz part.

The WEITEK XL-Series RISC processors do this in one cycle with the
Perfect Exchange (pexch) instruction, which does field swaps on 2, 4,
8, 16, and 32-bit boundaries. An endian-conversion would be done with

	pexch source, 0b1100, dest # source and dest are registers

a bit-reverse operation (move bit 0 to bit 31, bit 1 to bit 30, etc.)
is done with:

	pexch source, 0b1111, dest

and so on.  Depending on speed grade, this will be done in 1/8 to 1/12
of a microsecond.

	-- Robert
-- 
    Robert Plamondon
    robert@weitek.COM
    "No Toon can resist the old 'Shave and a Hair-Cut'"

baum@Apple.COM (Allen J. Baum) (11/11/88)

[]
>In article <646@fabscal.UUCP> dorn@fabscal.UUCP (Alan Dorn Hetzel Jr.) writes:
>...For example, the Motorola DSP-56000 has an address modifier mode called
>"carry-reversal" which causes address increments to start on the high bit
>and carry downwards.  This properly produces the order of addresses needed
>for the FFT bit-reversed addressing mode, and does it with NO extra clock
>cycles above the standard increment mode.

While it takes no extra cycles, it probably causes the speed of the adder to 
increase, possibly increasing the overall cycle time. This is a classic
RISC-CISC tradeoff. Given the applications the 56000 was intended for, they
probably made the tradeoff correctly. Its a pretty clever hack.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

aglew@urbsdc.Urbana.Gould.COM (11/11/88)

>Well, when you get right down to it, bit-reversal is the sort of operation
>which when done *properly* in hardware, should take *zero* cycles, being
>the sort of thing you reduce to an addressing mode.

An addressing mode!!!!!  An array of wires, maybe, but man, addressing memory
is slow and getting slower. RISCs can get away with it by caching and 
prefetching, but caching and prefetching does not do you very much good
for large lookup tables when the inputs are uniformly distributed.

idall@augean.OZ (Ian Dall) (11/14/88)

In article <28200235@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>>Well, when you get right down to it, bit-reversal is the sort of operation
>>which when done *properly* in hardware, should take *zero* cycles, being
>>the sort of thing you reduce to an addressing mode.
>
>An addressing mode!!!!!  An array of wires, maybe, but man, addressing memory
>is slow and getting slower. RISCs can get away with it by caching and 
>prefetching, but caching and prefetching does not do you very much good
>for large lookup tables when the inputs are uniformly distributed.

If the main use for the bitreverse instruction is going to be for FFTs then
an addressing mode might be appropriate. If it is more generally useful then
perhaps a general instruction is the way to go. I can't think of any
application besides FFTs which would be significantly affected by the speed
of the bit reverse operation so an addressing mode might not be so silly.

FFT sequencing involves a fairly tight loop containing an operation like:

increment (offset)
load (base_address + bitreversal(offset))

On most conventional machines it is actually more efficient to do it
differently.

-- 
 Ian Dall           life (n). A sexually transmitted disease which afflicts
                              some people more severely than others.
idall@augean.oz