ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/16/91)
On the 68000 family, the DBcc loop instructions decrement the 16-bit count register and stop looping when it reaches -1 (or 65535), not 0. To compensate for this, you do *not* subtract one from the initial loop count. Instead, you enter the loop by branching to the DBcc at its end, rather than by falling in the top. Thus, here's an example sequence to copy a block of bytes: ; get source pointer into a0 ; get destination pointer into a1 ; get 16-bit count into d0 bra.s @2 ; enter loop at end @1 move.b (a0)+, (a1)+ @2 dbra d0, @1 The idea behind this is, if the loop count is initially zero, the DBcc will fall through without executing the loop at all--no need for a separate test. I just wondered if other CPU designers thought this was a neat concept as well, or whether they tended to go for the "decrement-until-zero" brand of loop instruction. Lawrence D'Oliveiro fone: +64-71-562-889 Computer Services Dept fax: +64-71-384-066 University of Waikato electric mail: ldo@waikato.ac.nz Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00 "[Word processing and spreadsheets] aren't very interesting to me." -- Bill Joy
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/22/91)
In article <1991Apr16.152438.3445@waikato.ac.nz> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >On the 68000 family, the DBcc loop instructions decrement the 16-bit count >register and stop looping when it reaches -1 (or 65535), not 0. Compiler writers dislike this instruction, but not because of the test semantics. The killer is that the count is 16 bits, on a machine where variables and expressions are naturally 32 bits. This means that the compiler can only use the instruction when it possesses range information. A sophisticated compiler (with optimization enabled) can often deduce ranges. However, when that machine came out, the instruction was essentially only usable by assembler programmers. -- Don D.C.Lindsay Carnegie Mellon Robotics Institute
dhinds@elaine18.Stanford.EDU (David Hinds) (04/22/91)
In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: >In article <1991Apr16.152438.3445@waikato.ac.nz> > ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >>On the 68000 family, the DBcc loop instructions decrement the 16-bit count >>register and stop looping when it reaches -1 (or 65535), not 0. > >Compiler writers dislike this instruction, but not because of the >test semantics. The killer is that the count is 16 bits, on a machine >where variables and expressions are naturally 32 bits. This means >that the compiler can only use the instruction when it possesses >range information. > >A sophisticated compiler (with optimization enabled) can often deduce >ranges. However, when that machine came out, the instruction was >essentially only usable by assembler programmers. I'm not familiar with the 68000 instruction set, but couldn't this instruction be adapted to 32-bit counts by just splitting the count into upper and lower half-words and using a nested pair of 16-bit loops? This would have minimal overhead - a few instructions in loop setup - but would have essentially the same performance as the 16 bit count for long loops. -David Hinds dhinds@cb-iris.stanford.edu
torek@elf.ee.lbl.gov (Chris Torek) (04/22/91)
>In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu >(Donald Lindsay) writes: >>Compiler writers dislike [the 68000 DBcc] instruction, but not because >>of the test semantics. The killer is that the count is 16 bits, on a >>machine where variables and expressions are naturally 32 bits. ... In article <1991Apr21.210031.16749@leland.Stanford.EDU> dhinds@elaine18.Stanford.EDU (David Hinds) writes: > I'm not familiar with the 68000 instruction set, but couldn't this >instruction be adapted to 32-bit counts by just splitting the count into >upper and lower half-words and using a nested pair of 16-bit loops? No splitting is necessary: instead of compiling do { ... } while (i-- != 0); /* i now dead, hence need not = -1 */ as jra Lloop Ltop: subql #1,d2 Lloop: ... tstl d2 jne Ltop one compiles it as (approximately): Lloop: ... dbra d2,Loop /* * at this point, low(d2) == 0xffff; * high(d2) is unchanged but should be decremented; * the loop is finished iff d2==-1 afterward */ subl #0x10000,d2 cmpl #-1,d2 jne Lloop However, it turns out that on the 68020 it is almost invariably faster to avoid DBcc anyway (bcopy, for instance, should be unrolled). Score 0 for fancy instructions :-) (Note that a dbra bcopy is the fastest available on the 68010, but not on the 68000!) -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
npw@eleazar.dartmouth.edu (Nicholas Wilt) (04/22/91)
In article <1991Apr21.210031.16749@leland.Stanford.EDU> dhinds@elaine18.Stanford.EDU (David Hinds) writes: >>Compiler writers dislike this instruction, but not because of the >>test semantics. The killer is that the count is 16 bits, on a machine >>where variables and expressions are naturally 32 bits. This means >>that the compiler can only use the instruction when it possesses >>range information. >> > I'm not familiar with the 68000 instruction set, but couldn't this >instruction be adapted to 32-bit counts by just splitting the count into >upper and lower half-words and using a nested pair of 16-bit loops? This >would have minimal overhead - a few instructions in loop setup - but would >have essentially the same performance as the 16 bit count for long loops. Unless the loop counts were very large (more range information the compiler would need), this would not really be worth the trouble. DBcc isn't _that_ much faster. Also, the extra register required might be better utilized elsewhere. > > -David Hinds > dhinds@cb-iris.stanford.edu --Nick npw@eleazar.dartmouth.edu
gtephx (Wild Rider) (04/24/91)
In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: >In article <1991Apr16.152438.3445@waikato.ac.nz> > ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >>On the 68000 family, the DBcc loop instructions decrement the 16-bit count >>register and stop looping when it reaches -1 (or 65535), not 0. > >Compiler writers dislike this instruction, but not because of the >test semantics. The killer is that the count is 16 bits, on a machine >where variables and expressions are naturally 32 bits. This means >that the compiler can only use the instruction when it possesses >range information. (sound of u.s.s. enterprise red-alert siren going off ... ) ("phasors locked on target, sir") uh, not exactly. are you a compiler writer, mr. lindsay? if not, where did you get your info? for your own edification, run any of sun's ancient c compilers on some looping code & check the generated output (hint: cc -O -S loops.c)... it's not at all difficult to use the dbcc instructions _in conjunction with subtract & bcc instructions following the dbcc_ to employ a full 32-bit counter in a loop that also takes advantage of the dbcc instruction for most of the work. remember, too, that this trick (cough, cough) was employed years ago with plain vanilla sun cc compilers; we're not talking cutting edge by _any_ stretch of the imagination. if gcc doesn't use this little shortcut, take the codegen people out back & put them out of their misery... >Don D.C.Lindsay Carnegie Mellon Robotics Institute cheers && happy compiling, wild rider -- Wallace Roberts, AG (formerly GTE) Communication Systems, Phoenix, AZ UUCP: ...!{ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!robertsw Internet: gtephx!robertsw@asuvax.eas.asu.edu Bike: '82 GS1100L Suz voice: (602)581-4555 fax: (602)582-7624 Cage: '89 Mustang GT
ts@cup.portal.com (Tim W Smith) (04/24/91)
Chris Torek says: > However, it turns out that on the 68020 it is almost invariably faster > to avoid DBcc anyway (bcopy, for instance, should be unrolled). Score If you unroll too far, don't you start to miss on the instruction cache? The optimum seems to be unrolled enough to lower loop overhead but rolled enough to fit the loop in the cache. I tried to calculate the proper amount of unrolling and came up with about 11 move.l instructions per dbra. This at first seemed somewhat low, but it seems that larger amounts of unrolling, while still being able to fit in the cache, can lose because the first time through the loop there are more cache misses. I don't really trust my calculations that much, but measurements of a 1K copy routine I had to write for an application on my Mac II showed that the best 2^N unrollings were 8 and 16, which tends to support the calculation of 11. Tim Smith
mlord@bwdls58.bnr.ca (Mark Lord) (04/25/91)
In article <1991Apr21.210031.16749@leland.Stanford.EDU> dhinds@elaine18.Stanford.EDU (David Hinds) writes: <In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: <>In article <1991Apr16.152438.3445@waikato.ac.nz> <>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count <>>register and stop looping when it reaches -1 (or 65535), not 0. <> <>Compiler writers dislike this instruction, but not because of the <>test semantics. The killer is that the count is 16 bits, on a machine <>where variables and expressions are naturally 32 bits. This means <>that the compiler can only use the instruction when it possesses <>range information. < < I'm not familiar with the 68000 instruction set, but couldn't this <instruction be adapted to 32-bit counts by just splitting the count into <upper and lower half-words and using a nested pair of 16-bit loops? This <would have minimal overhead - a few instructions in loop setup - but would <have essentially the same performance as the 16 bit count for long loops. This is exactly how I code it manually. Pretty easy for a compiler to do also. Note that the outer loop consists of a single DBcc instruction! -- MLORD@BNR.CA Ottawa, Ontario *** Personal views only *** begin 644 NOTSHARE.COM ; Free MS-DOS utility - use instead of SHARE.EXE MZQ.0@/P/=`J`_!9T!2[_+H``L/_/+HX&+`"T2<TAO@,!OX0`N1(`C,B.P/.DS <^K@A-<TAB1Z``(P&@@"ZA`"X(27-(?NZE@#-)P#-5 `` end
ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/25/91)
In article <1991Apr21.210031.16749@leland.Stanford.EDU>,
dhinds@elaine18.Stanford.EDU (David Hinds) asks:
"I'm not familiar with the 68000 instruction set, but couldn't [DBRA]
be adapted to 32-bit counts by just splitting the count into upper and
lower half-words and using a nested pair of 16-bit loops? This would
have minimal overhead - a few instructions in loop setup - but would
have essentially the same performance as the 16 bit count for long loops."
Yup, and here's how you do it:
; loop initialization, including loading 32-bit
; unsigned count into d0, comes here
bra.s @7 ; enter loop at bottom, as usual
@1 swap d0
@2 ; ... body of loop
@7 dbra d0, @2
swap d0
dbra d0, @1
My instruction timings indicate that DBRA is preferable to a simple SUBQ/BNE
on the 68000, 68010, 68020 and 68030. This is because DBRA is faster when
the branch is taken.
Lawrence D'Oliveiro fone: +64-71-562-889
Computer Services Dept fax: +64-71-384-066
University of Waikato electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
MADAME MACLUHAN, fortune-telling and back rubs. "The medium is the massage."
uad1077@dircon.co.uk (Ian Kemmish) (04/26/91)
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: >In article <1991Apr16.152438.3445@waikato.ac.nz> > ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >>On the 68000 family, the DBcc loop instructions decrement the 16-bit count >>register and stop looping when it reaches -1 (or 65535), not 0. >Compiler writers dislike this instruction, but not because of the >test semantics. The killer is that the count is 16 bits, on a machine >where variables and expressions are naturally 32 bits. This means >that the compiler can only use the instruction when it possesses >range information. >A sophisticated compiler (with optimization enabled) can often deduce >ranges. However, when that machine came out, the instruction was >essentially only usable by assembler programmers. >-- >Don D.C.Lindsay Carnegie Mellon Robotics Institute The Sun compiler seemed to have a hack that let people use that instruction. The sources to NeWS 1.0 included bits of the pixrect library. The *second* register variable in a proc. was a short, and was fudged so that loops ran to -1, and the grotty code that did this was always surrounded by comments to the effect that this ``let the compiler use dbra''..... And yes, it did seem to matter whether it was the *second* or not:-> -- Ian D. Kemmish Tel. +44 767 601 361 18 Durham Close uad1077@dircon.UUCP Biggleswade ukc!dircon!uad1077 Beds SG18 8HZ United Kingdom uad1077@dircon.co.uk
torek@elf.ee.lbl.gov (Chris Torek) (04/29/91)
>In article <12330@dog.ee.lbl.gov> I wrote: >>However, it turns out that on the 68020 it is almost invariably faster >>to avoid DBcc anyway (bcopy, for instance, should be unrolled). ... (You can, of course, run the unrolled loop using dbra.) In article <41612@cup.portal.com> ts@cup.portal.com (Tim W Smith) writes: >I tried to calculate the proper amount of unrolling and came up with >about 11 move.l instructions per dbra. Yes, this is about right, although it depends on your memory system as well. (If memory is very slow, smaller loops are better, since it takes less time to get the whole loop into the cache and the branch overhead `disappears' into the memory access time.) -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
donc@microsoft.UUCP (Don CORBITT) (04/30/91)
>In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: >>In article <1991Apr16.152438.3445@waikato.ac.nz> >> ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes: >>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count >>>register and stop looping when it reaches -1 (or 65535), not 0. >> >>Compiler writers dislike this instruction, but not because of the >>test semantics. The killer is that the count is 16 bits, on a machine >>where variables and expressions are naturally 32 bits. This means >>that the compiler can only use the instruction when it possesses >>range information. >> >>A sophisticated compiler (with optimization enabled) can often deduce >>ranges. However, when that machine came out, the instruction was >>essentially only usable by assembler programmers. [several readers post workarounds to 16-bittedness of this instruction] The problem is that you have to do these work-arounds. The compiler would like to generate the smallest and fastest code. If the DBcc instruction used all 32 bits of the register, the resulting code should be smallest, and fastest. Since you need the special case code to test for upper 16 bits == -1, this code will be larger than required, although it should still be fastest. [excuse non-Motorola syntax, it's been too many years] Today, you need to write something like this: again: // loop body DBRA D0, again ADD.L 1, D0 BRZ really_done SUB.L 1, D0 BRA again really_done: It would (obviously) much nicer to do: again: // loop body DBRA D0, again -- Don Corbitt Microsoft Windows Development
pplacewa@bbn.com (Paul W Placeway) (04/30/91)
ts@cup.portal.com (Tim W Smith) writes:
< Chris Torek says:
< > However, it turns out that on the 68020 it is almost invariably faster
< > to avoid DBcc anyway (bcopy, for instance, should be unrolled). Score
< If you unroll too far, don't you start to miss on the instruction
< cache? The optimum seems to be unrolled enough to lower loop overhead
< but rolled enough to fit the loop in the cache.
After hacking too many DSP things, all I have to say about loop
unrolling is that it's a good technique to make up for a bad
architecture.
What I want in my processor is a zero-overhead-per-loop down-counting
loop instruction. The TMS320 series, and Motorola DSP 56000 and 96000
have had this sort of thing for quite a while. For those of you who
happen to be unfamilar, the idea is that the PC addressing hardware
has a loop beginning, end, and count register and the hardware does a
decrement-branch-nonzero when the PC == end-of-loop, resetting it to
beginning-of-loop, while in the instruction fetch stage.
The 320C30 also has a repeat-single mode which just does the next
instruction N times. Unfortunately the processor also *ignores*
interrupts during the loop, so I can't use it, but if it wasn't broken
in that small way it would be very useful indeed.
Note that this could also be done in a bunch of different ways, like
for instance in a superscaler just doing a DBNZ in parallel with the
previous instruction. I don't care much how it actually works, but it
is quite fun to have things like a vector sum or vector-vector add or
an inline bcopy() that run at one instruction per operation.
Now if only the "real" processor in my workstation could do such
things.
-- Paul Placeway <pplaceway@bbn.com>
pegram@kira.UUCP (Robert B. Pegram) (05/01/91)
From article <63942@bbn.BBN.COM>, by pplacewa@bbn.com (Paul W Placeway): > ts@cup.portal.com (Tim W Smith) writes: > > < Chris Torek says: > < > However, it turns out that on the 68020 it is almost invariably faster > < > to avoid DBcc anyway (bcopy, for instance, should be unrolled). Score > > < If you unroll too far, don't you start to miss on the instruction > < cache? The optimum seems to be unrolled enough to lower loop overhead > < but rolled enough to fit the loop in the cache. > > After hacking too many DSP things, all I have to say about loop > unrolling is that it's a good technique to make up for a bad > architecture. > > What I want in my processor is a zero-overhead-per-loop down-counting > loop instruction. The TMS320 series, and Motorola DSP 56000 and 96000 > have had this sort of thing for quite a while. For those of you who > happen to be unfamilar, the idea is that the PC addressing hardware > has a loop beginning, end, and count register and the hardware does a > decrement-branch-nonzero when the PC == end-of-loop, resetting it to > beginning-of-loop, while in the instruction fetch stage. > Yup, that's very familiar, I see it in my Analog Devices 2100s also. The other nice thing is the modified Harvard architecture. It allows for two data fetches at one time - provided the instructions are in the cache, and some of the data is stored in the instruction memory space. Frankly, I think it's good for general purpose computer architects to get away from the idea that all programs are geared to the time frame of human beings - the optimizations necessary to maintain milli or microsecond responses can also be useful in more general purpose programs. > Note that this could also be done in a bunch of different ways, like > for instance in a superscaler just doing a DBNZ in parallel with the > previous instruction. I don't care much how it actually works, but it > is quite fun to have things like a vector sum or vector-vector add or > an inline bcopy() that run at one instruction per operation. > > Now if only the "real" processor in my workstation could do such > things. > > -- Paul Placeway <pplaceway@bbn.com> Amen to that. Bob Pegram pegram@griffin.uvm.edu or ...!uvm-gen!pegram
zalman@mips.com (Zalman Stern) (05/01/91)
In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes: [...] >After hacking too many DSP things, all I have to say about loop >unrolling is that it's a good technique to make up for a bad >architecture. > >What I want in my processor is a zero-overhead-per-loop down-counting >loop instruction. The TMS320 series, and Motorola DSP 56000 and 96000 [...] > -- Paul Placeway <pplaceway@bbn.com> The IBM RISC System/6000 (RIOS) has such an instruction. In fact, you can also fold a condition register bit into the loop test and so long as the condition is computed far enough in advance (three cycles), the branch is free. A simple decrement and branch loop instruction is always free. Of course there are tradeoffs to consider here. Setting up the loop count register takes longer than putting the value into a general purpose register (GPR). (I'm not sure what the latency of moving a GPR to the count register is. I'd guess two or three cycles. Anyone out there know?) If there are function calls in the loop, you have to save and restore the count register. If you need to use the loop variable as an index, you need an add in the loop anyway. (On the RIOS you usually finese this by using a load and update instruction to step an induction variable.) All in all, this is a win for the IBM machine since they have already decoupled the integer ALU and instruction fetching. Its not clear how this would apply to other RISC architectures though. A first criterion would be for the instruction to take a GPR for the counter. It would also have a delay slot. I doubt such an instruction requires an extra write port into the register file. (The branch instruction doesn't write anything so its writeback slot can be used to update the counter register.) It might require some extra ALU hardware since the count register gets decremented and an offset is added to the PC in one cycle. Its also a lot less important in certain superscalar implementations where the counter update can overlap something else in the loop. -- Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088 zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman (408) 524 8395 "Never rub another man's rhubarb" -- Pop Will Eat Itself
mshute@cs.man.ac.uk (Malcolm Shute) (05/10/91)
In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes: >What I want in my processor is a zero-overhead-per-loop down-counting >loop instruction. The TMS320 series, and Motorola DSP 56000 and 96000 >have had this sort of thing for quite a while. For those of you who >happen to be unfamilar, the idea is that the PC addressing hardware >has a loop beginning, end, and count register and the hardware does a >decrement-branch-nonzero when the PC == end-of-loop, resetting it to >beginning-of-loop, while in the instruction fetch stage. One interesting[*] idea which I saw once [[*]where "interesting" == "novel", "neat" and "mind-stimulating", but probably != "efficient", "commercially cost effective"] made the instruction set completely block structured: the fetch part of the instruction cycle worked with one stack (with the IR and PC as the top-of-stack, and next-on-stack) respectively; and the execute part of the instruction cycle worked with the other (i.e. arithmetic) stack (again with the top two items in registers... effectively the ACC and ARG registers of the machine). Every instruction had a repetition field which counted down to zero, while the op-code part of the instruction was repeatedly obeyed this number of times. Most arithmetic instructions would just carry a repetition count of one, and do their normal work on the arithmetic stack. Occasionally a PUSH instruction with a higher repetition count would be used to move a contiguous block of data around (block move). Subroutine calls simply involved pushing the new PC value on to the system stack, once the calling instruction (presumably with a repetition count of 1) had been POPed from the top (i.e had finished executing in the IR). Loops were then implemented using this subroutine-call mechanism, with the appropriate iteration count set in the calling instruction (some sort of bit twiddling of fields would be required if anything other than a compile-time constant number of iterations was required). Its designer (Prof Osmon of City University, in London) named it CAC (clean architecture computer -- due to its minimal and regular hardware) just before the RISC label became fashionable. Has any one any thoughts on how this could be put into practical use? Or, indeed, has any one ever seen a real example embodying any of these principles? -- Malcolm SHUTE. (The AM Mollusc: v_@_ ) Disclaimer: all
pkl@ee.mu.OZ.AU (Peter LAWREY) (05/17/91)
In article <2518@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes: >In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes: >>What I want in my processor is a zero-overhead-per-loop down-counting >>loop instruction. Say we have an instruction loop[m] N. Where n is the number of instuctions in the loop and N is the number of times the loop is executed. This is implimented using two counters. One to keep track of the number of instructions to execute and another to keep the loop counter. eg // A simplistic vector operation loop4 r7 // Variable length loop. load.32 r4=[r1++] load.32 r5=[r2++] fadd.s r6=r5+r4 store.32 [r3++]=r6 // Copy a byte character string. // Uses a break operator. loop3 r0 // Forever. load.8 r3=[r1++] store.8 [r2++]=r3 break r3==0 // Wonderous numbers. // r1 contains the number to test. // r2 contains the number of loops. loop6 r0 break r1==1 // 1 add r2=r2+1 // 2 and r3=r1 & 0x0001 // 3 bra even,r3=0 // 4 add r3=r1+r1 // 5 add r1=r1+r3 // 6 even: shl r1=r1,1 // 5 shl r0=r0,r0 // 6 // This however executes from the even label after finishing the loop // but it shows the versitility of the command. These loops will have trouble with traps etc. as it represents a special mode of execution.