hjm@cernvax.UUCP (Hubert Matthews) (11/03/88)
The INMOS T800 has an instruction bitrevword, which turns a little-endian word into a big-endian word, effectively doing a reflection in the middle. Great for FFT shuffle routines. In software, it takes quite some time. In hardware it takes just over 1 microsecond on a 30MHz part. There is also a bitrevnbits which reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz. Again, not very quick in software. -- Hubert Matthews
tim@crackle.amd.com (Tim Olson) (11/08/88)
In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes: | | | The INMOS T800 has an instruction bitrevword, which turns a | little-endian word into a big-endian word, effectively doing a | reflection in the middle. Great for FFT shuffle routines. In | software, it takes quite some time. In hardware it takes just over | 1 microsecond on a 30MHz part. There is also a bitrevnbits which | reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz. | Again, not very quick in software. This actually can be done *faster* in software. 1 microsecond @ 30MHz == 30 cycles, but with an 8-bit table-lookup it can be easily done in less than 22 cycles. -- Tim Olson Advanced Micro Devices (tim@crackle.amd.com)
firth@sei.cmu.edu (Robert Firth) (11/08/88)
In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes: >The INMOS T800 has an instruction bitrevword, which turns a >little-endian word into a big-endian word, effectively doing a >reflection in the middle. Great for FFT shuffle routines. In >software, it takes quite some time. In hardware it takes just over >1 microsecond on a 30MHz part. 1 usec at 30MHz is about 30 cycles, I guess. Here's my quick and dirty attempt at the same in software, on the MIPS R2000. We assume DEST := bitrevword(SRC), where both are 4-byte variables, and T is a 256-byte lookup table of reversed bytes: la U0,SRC ; fetch address of SRC la U1,DEST ; fetch address of DEST la U2,T ; fetch address of T lb U3,3(U0) ; get byte 3 of SRC lb U4,2(U0) add U5,U3,U2 ; index lookup table lb U6,0(U5) ; get translation add U7,U4,U2 stb U6,0(U1) ; store in byte 0 of DEST lb U8,0(U7) lb U3,1(U0) stb U8,1(U1) add U5,U3,U2 lb U4,0(U0) lb U6,0(U5) add U7,U4,U2 lb U8,0(U7) stb U6,2(U1) stb U8,3(U1) (the interleaving is necessary to avoid load delays) The above is 19 cycles, in software.
jangr@microsoft.UUCP (Jan Gray) (11/08/88)
In article <7629@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes: >Here's my quick and dirty attempt at the same in software, on the >MIPS R2000. We assume DEST := bitrevword(SRC), where both are >4-byte variables, and T is a 256-byte lookup table of reversed >bytes: You can avoid all memory references using this slightly slower approach: ; r0 -- input ; r1 -- scratch ; r2 -- 0x00FF00FF ; r3 -- 0x0F0F0F0F ; r4 -- 0x33333333 ; r5 -- 0x55555555 ;abcdefghijklmnopqrstuvwxyz123456 srl r1,r0,16 ; swap 16 bits ;----------------abcdefghijklmnop sll r0,r0,16 ;qrstuvwxyz123456---------------- or r0,r0,r1 ;qrstuvwxyz123456abcdefghijklmnop and r1,r0,r2 ; swap 8 bits ;--------yz123456--------ijklmnop xor r0,r0,r1 ;qrstuvwx--------abcdefgh-------- srl r0,r0,8 ;--------qrstuvwx--------abcdefgh sll r1,r1,8 ;yz123456--------ijklmnop-------- or r0,r0,r1 ;yz123456qrstuvwxijklmnopabcdefgh and r1,r0,r3 ; swap 4 bits ;----3456----uvwx----mnop----efgh xor r0,r0,r1 ;yz12----qrst----ijkl----abcd---- srl r0,r0,4 ;----yz12----qrst----ijkl----abcd sll r1,r1,4 ;3456----uvwx----mnop----efgh---- or r0,r0,r1 ;3456yz12uvwxqrstmnopijklefghabcd and r1,r0,r4 ; swap 2 bits ;--56--12--wx--st--op--kl--gh--cd xor r0,r0,r1 ;34--yz--uv--qr--mn--ij--ef--ab-- srl r0,r0,2 ;--34--yz--uv--qr--mn--ij--ef--ab sll r1,r1,2 ;56--12--wx--st--op--kl--gh--cd-- or r0,r0,r1 ;563412yzwxuvstqropmnklijghefcdab and r1,r0,r5 ; swap 1 bits ;-6-4-2-z-x-v-t-r-p-n-l-j-h-f-d-b xor r0,r0,r1 ;5-3-1-y-w-u-s-q-o-m-k-i-g-e-c-a- srl r0,r0,1 ;-5-3-1-y-w-u-s-q-o-m-k-i-g-e-c-a sll r1,r1,1 ;6-4-2-z-x-v-t-r-p-n-l-j-h-f-d-b- or r0,r0,r1 ;654321zyxwvutsrqponmlkjihgfedcba We could even save a couple of instructions if the Rx000 supplied an "op rd,rs1,rs2<<n" form, a la Acorn or HP Precision Architecture. According to an old preprint of Knuth chapter 7.1, *any* permutation of n bits can be achieved in O(lg n) logical operations. Jan Gray uunet!microsoft!jangr Microsoft Corp., Redmond Wash. 206-882-8080
bcase@cup.portal.com (Brian bcase Case) (11/08/88)
>The INMOS T800 has an instruction bitrevword, which turns a >little-endian word into a big-endian word, effectively doing a >reflection in the middle. Great for FFT shuffle routines. In >software, it takes quite some time. In hardware it takes just over >1 microsecond on a 30MHz part. There is also a bitrevnbits which >reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz. >Again, not very quick in software. It must be slimilar for other RISCs (the T800 is not my idea, but...), but for the 29K, this takes at most 13 instructions or 0.520 microsecond on a 25 MHz part. I am sure that a better algorithm can be found that the one that I thought of in 15 seconds.
daveh@cbmvax.UUCP (Dave Haynie) (11/09/88)
in article <866@cernvax.UUCP>, hjm@cernvax.UUCP (Hubert Matthews) says: > The INMOS T800 has an instruction bitrevword, which turns a > little-endian word into a big-endian word, effectively doing a > reflection in the middle. Great for FFT shuffle routines. In > software, it takes quite some time. In hardware it takes just over > 1 microsecond on a 30MHz part. That's 30 cycles, which would be 30 instructions for a true instruction/cycle RISC machine. The claim for the very simple ARM machine was something like 5 instructions, which would total 10 cycles if the quoted 1.9 cycles per instruction is reasonable (and you'd of course have to imagine a 30MHz ARM machine). If a baby RISC machines does this 3x more efficiently than the Transputer's hardwired instruction, I'd certainly expect a more grown-up RISC chip like a MIPS or 88K to do better. I believe there are arguments for implementing operations, even complex ones, in hardware. This ain't one of 'em. > Hubert Matthews -- Dave Haynie "The 32 Bit Guy" Commodore-Amiga "The Crew That Never Rests" {uunet|pyramid|rutgers}!cbmvax!daveh PLINK: D-DAVE H BIX: hazy Amiga -- It's not just a job, it's an obsession
henry@utzoo.uucp (Henry Spencer) (11/09/88)
In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes: >The INMOS T800 has an instruction bitrevword, which turns a >little-endian word into a big-endian word... >In software, it takes quite some time. In hardware it takes just over >1 microsecond on a 30MHz part. There is also a bitrevnbits which >reverses the bottom n bits in a byte and it takes n+4 cycles at 30MHz. >Again, not very quick in software. Don't be quite so quick to assume that things like this are impossibly slow in software. If you're willing to invest some memory in it, most bit manipulations can be done *much* more quickly by table lookup than by "doing it the hard way". Bitrevnbits is mask, add, memory access, and or; could be as little as 4 (not n+4) cycles on a 29000, for example. Skip the or if you're only interested in those bits; skip the mask if you've got them isolated in a register in the first place. And you can do any mapping, not just a reversal. (This admittedly does not apply to bitrevword unless you're willing to use a biiiiig table! :-)) -- The Earth is our mother. | Henry Spencer at U of Toronto Zoology Our nine months are up. |uunet!attcan!utzoo!henry henry@zoo.toronto.edu
hjm@cernvax.UUCP (Hubert Matthews) (11/09/88)
In article <866@cernvax.UUCP>, hjm@cernvax.UUCP (Hubert Matthews) (me!) says: > >> The INMOS T800 has an instruction bitrevword, which turns a >> little-endian word into a big-endian word, effectively doing a >> reflection in the middle. Great for FFT shuffle routines. In >> software, it takes quite some time. In hardware it takes just over >> 1 microsecond on a 30MHz part. > OK, OK. So I picked a lousy implementation of the bitrevword operation. All I was trying to point out is that reversing the bits in a word can be done quickly in hardware (should be one cycle), whereas as it takes an order of magnitude longer in software. INMOS's microcode isn't very good on this one - I think it must be using 32 shifts to do it - but it they used some hardware it would be 1 cycle. I now realise that my example wasn't pedagogically suitable, so please stop junking my mailbox :-) -- Hubert Matthews
dorn@fabscal.UUCP (Alan Dorn Hetzel Jr.) (11/10/88)
Well, when you get right down to it, bit-reversal is the sort of operation which when done *properly* in hardware, should take *zero* cycles, being the sort of thing you reduce to an addressing mode. For example, the Motorola DSP-56000 has an address modifier mode called "carry-reversal" which causes address increments to start on the high bit and carry downwards. This properly produces the order of addresses needed for the FFT bit-reversed addressing mode, and does it with NO extra clock cycles above the standard increment mode. When you think about it, bit-reversal in hardware is little more than twisting the data path one-half revolution! Dorn gatech.edu!fabscal!dorn
robert@hemingway.WEITEK.COM (Robert Plamondon) (11/10/88)
>In article <866@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes: >>The INMOS T800 has an instruction bitrevword, which turns a >>little-endian word into a big-endian word, effectively doing a >>reflection in the middle. Great for FFT shuffle routines. In >>software, it takes quite some time. In hardware it takes just over >>1 microsecond on a 30MHz part. The WEITEK XL-Series RISC processors do this in one cycle with the Perfect Exchange (pexch) instruction, which does field swaps on 2, 4, 8, 16, and 32-bit boundaries. An endian-conversion would be done with pexch source, 0b1100, dest # source and dest are registers a bit-reverse operation (move bit 0 to bit 31, bit 1 to bit 30, etc.) is done with: pexch source, 0b1111, dest and so on. Depending on speed grade, this will be done in 1/8 to 1/12 of a microsecond. -- Robert -- Robert Plamondon robert@weitek.COM "No Toon can resist the old 'Shave and a Hair-Cut'"
baum@Apple.COM (Allen J. Baum) (11/11/88)
[] >In article <646@fabscal.UUCP> dorn@fabscal.UUCP (Alan Dorn Hetzel Jr.) writes: >...For example, the Motorola DSP-56000 has an address modifier mode called >"carry-reversal" which causes address increments to start on the high bit >and carry downwards. This properly produces the order of addresses needed >for the FFT bit-reversed addressing mode, and does it with NO extra clock >cycles above the standard increment mode. While it takes no extra cycles, it probably causes the speed of the adder to increase, possibly increasing the overall cycle time. This is a classic RISC-CISC tradeoff. Given the applications the 56000 was intended for, they probably made the tradeoff correctly. Its a pretty clever hack. -- baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
aglew@urbsdc.Urbana.Gould.COM (11/11/88)
>Well, when you get right down to it, bit-reversal is the sort of operation >which when done *properly* in hardware, should take *zero* cycles, being >the sort of thing you reduce to an addressing mode. An addressing mode!!!!! An array of wires, maybe, but man, addressing memory is slow and getting slower. RISCs can get away with it by caching and prefetching, but caching and prefetching does not do you very much good for large lookup tables when the inputs are uniformly distributed.
idall@augean.OZ (Ian Dall) (11/14/88)
In article <28200235@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes: > >>Well, when you get right down to it, bit-reversal is the sort of operation >>which when done *properly* in hardware, should take *zero* cycles, being >>the sort of thing you reduce to an addressing mode. > >An addressing mode!!!!! An array of wires, maybe, but man, addressing memory >is slow and getting slower. RISCs can get away with it by caching and >prefetching, but caching and prefetching does not do you very much good >for large lookup tables when the inputs are uniformly distributed. If the main use for the bitreverse instruction is going to be for FFTs then an addressing mode might be appropriate. If it is more generally useful then perhaps a general instruction is the way to go. I can't think of any application besides FFTs which would be significantly affected by the speed of the bit reverse operation so an addressing mode might not be so silly. FFT sequencing involves a fairly tight loop containing an operation like: increment (offset) load (base_address + bitreversal(offset)) On most conventional machines it is actually more efficient to do it differently. -- Ian Dall life (n). A sexually transmitted disease which afflicts some people more severely than others. idall@augean.oz