[comp.arch] Loop instructions

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/16/91)

On the 68000 family, the DBcc loop instructions decrement the 16-bit count
register and stop looping when it reaches -1 (or 65535), not 0. To compensate
for this, you do *not* subtract one from the initial loop count. Instead,
you enter the loop by branching to the DBcc at its end, rather than by
falling in the top. Thus, here's an example sequence to copy a block
of bytes:

	; get source pointer into a0
	; get destination pointer into a1
	; get 16-bit count into d0
		bra.s	@2		; enter loop at end
	@1	move.b	(a0)+, (a1)+
	@2	dbra	d0, @1

The idea behind this is, if the loop count is initially zero, the
DBcc will fall through without executing the loop at all--no need for
a separate test.

I just wondered if other CPU designers thought this was a neat concept
as well, or whether they tended to go for the "decrement-until-zero"
brand of loop instruction.

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
"[Word processing and spreadsheets] aren't very interesting to me." -- Bill Joy

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/22/91)

In article <1991Apr16.152438.3445@waikato.ac.nz> 
	ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
>register and stop looping when it reaches -1 (or 65535), not 0.

Compiler writers dislike this instruction, but not because of the
test semantics. The killer is that the count is 16 bits, on a machine
where variables and expressions are naturally 32 bits. This means
that the compiler can only use the instruction when it possesses
range information. 

A sophisticated compiler (with optimization enabled) can often deduce
ranges. However, when that machine came out, the instruction was
essentially only usable by assembler programmers.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

dhinds@elaine18.Stanford.EDU (David Hinds) (04/22/91)

In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <1991Apr16.152438.3445@waikato.ac.nz> 
>	ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
>>register and stop looping when it reaches -1 (or 65535), not 0.
>
>Compiler writers dislike this instruction, but not because of the
>test semantics. The killer is that the count is 16 bits, on a machine
>where variables and expressions are naturally 32 bits. This means
>that the compiler can only use the instruction when it possesses
>range information. 
>
>A sophisticated compiler (with optimization enabled) can often deduce
>ranges. However, when that machine came out, the instruction was
>essentially only usable by assembler programmers.

    I'm not familiar with the 68000 instruction set, but couldn't this
instruction be adapted to 32-bit counts by just splitting the count into
upper and lower half-words and using a nested pair of 16-bit loops?  This
would have minimal overhead - a few instructions in loop setup - but would
have essentially the same performance as the 16 bit count for long loops.

 -David Hinds
  dhinds@cb-iris.stanford.edu

torek@elf.ee.lbl.gov (Chris Torek) (04/22/91)

>In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu
>(Donald Lindsay) writes:
>>Compiler writers dislike [the 68000 DBcc] instruction, but not because
>>of the test semantics. The killer is that the count is 16 bits, on a
>>machine where variables and expressions are naturally 32 bits. ...

In article <1991Apr21.210031.16749@leland.Stanford.EDU>
dhinds@elaine18.Stanford.EDU (David Hinds) writes:
>    I'm not familiar with the 68000 instruction set, but couldn't this
>instruction be adapted to 32-bit counts by just splitting the count into
>upper and lower half-words and using a nested pair of 16-bit loops?

No splitting is necessary: instead of compiling

	do {
		...
	} while (i-- != 0);
	/* i now dead, hence need not = -1 */

as

		jra	Lloop
	Ltop:
		subql	#1,d2
	Lloop:
		...
		tstl	d2
		jne	Ltop

one compiles it as (approximately):

	Lloop:
		...
		dbra	d2,Loop
		/*
		 * at this point, low(d2) == 0xffff;
		 * high(d2) is unchanged but should be decremented;
		 * the loop is finished iff d2==-1 afterward
		 */
		subl	#0x10000,d2
		cmpl	#-1,d2
		jne	Lloop

However, it turns out that on the 68020 it is almost invariably faster
to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score
0 for fancy instructions :-)  (Note that a dbra bcopy is the fastest
available on the 68010, but not on the 68000!)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

npw@eleazar.dartmouth.edu (Nicholas Wilt) (04/22/91)

In article <1991Apr21.210031.16749@leland.Stanford.EDU> dhinds@elaine18.Stanford.EDU (David Hinds) writes:
>>Compiler writers dislike this instruction, but not because of the
>>test semantics. The killer is that the count is 16 bits, on a machine
>>where variables and expressions are naturally 32 bits. This means
>>that the compiler can only use the instruction when it possesses
>>range information. 
>>
>    I'm not familiar with the 68000 instruction set, but couldn't this
>instruction be adapted to 32-bit counts by just splitting the count into
>upper and lower half-words and using a nested pair of 16-bit loops?  This
>would have minimal overhead - a few instructions in loop setup - but would
>have essentially the same performance as the 16 bit count for long loops.

Unless the loop counts were very large (more range information the compiler 
would need), this would not really be worth the trouble.  DBcc isn't _that_
much faster.  Also, the extra register required might be better utilized 
elsewhere.

>
> -David Hinds
>  dhinds@cb-iris.stanford.edu

--Nick
  npw@eleazar.dartmouth.edu

gtephx (Wild Rider) (04/24/91)

In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <1991Apr16.152438.3445@waikato.ac.nz> 
>	ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
>>register and stop looping when it reaches -1 (or 65535), not 0.
>
>Compiler writers dislike this instruction, but not because of the
>test semantics. The killer is that the count is 16 bits, on a machine
>where variables and expressions are naturally 32 bits. This means
>that the compiler can only use the instruction when it possesses
>range information. 

	(sound of u.s.s. enterprise red-alert siren going off ... )

	("phasors locked on target, sir")

	uh, not exactly.  are you a compiler writer, mr. lindsay?  if not,
	where did you get your info?  for your own edification, run any of
	sun's ancient c compilers on some looping code & check the generated
	output (hint: cc -O -S loops.c)... it's not at all difficult to use
	the dbcc instructions _in conjunction with subtract & bcc instructions
	following the dbcc_ to employ a full 32-bit counter in a loop that
	also takes advantage of the dbcc instruction for most of the work.
	remember, too, that this trick (cough, cough) was employed years ago
	with plain vanilla sun cc compilers; we're not talking cutting edge
	by _any_ stretch of the imagination.  if gcc doesn't use this little
	shortcut, take the codegen people out back & put them out of their misery...

>Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

	cheers && happy compiling,
	wild rider
-- 
Wallace Roberts, AG (formerly GTE) Communication Systems, Phoenix, AZ
UUCP: ...!{ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!robertsw
Internet: gtephx!robertsw@asuvax.eas.asu.edu    Bike: '82 GS1100L Suz
voice: (602)581-4555    fax: (602)582-7624      Cage: '89 Mustang  GT

ts@cup.portal.com (Tim W Smith) (04/24/91)

Chris Torek says:
> However, it turns out that on the 68020 it is almost invariably faster
> to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score

If you unroll too far, don't you start to miss on the instruction
cache?  The optimum seems to be unrolled enough to lower loop overhead
but rolled enough to fit the loop in the cache.

I tried to calculate the proper amount of unrolling and came up with
about 11 move.l instructions per dbra.  This at first seemed somewhat
low, but it seems that larger amounts of unrolling, while still being
able to fit in the cache, can lose because the first time through
the loop there are more cache misses.

I don't really trust my calculations that much, but measurements
of a 1K copy routine I had to write for an application on my Mac II
showed that the best 2^N unrollings were 8 and 16, which tends to
support the calculation of 11.

					Tim Smith

mlord@bwdls58.bnr.ca (Mark Lord) (04/25/91)

In article <1991Apr21.210031.16749@leland.Stanford.EDU> dhinds@elaine18.Stanford.EDU (David Hinds) writes:
<In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
<>In article <1991Apr16.152438.3445@waikato.ac.nz> 
<>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
<>>register and stop looping when it reaches -1 (or 65535), not 0.
<>
<>Compiler writers dislike this instruction, but not because of the
<>test semantics. The killer is that the count is 16 bits, on a machine
<>where variables and expressions are naturally 32 bits. This means
<>that the compiler can only use the instruction when it possesses
<>range information. 
<
<    I'm not familiar with the 68000 instruction set, but couldn't this
<instruction be adapted to 32-bit counts by just splitting the count into
<upper and lower half-words and using a nested pair of 16-bit loops?  This
<would have minimal overhead - a few instructions in loop setup - but would
<have essentially the same performance as the 16 bit count for long loops.

This is exactly how I code it manually.  Pretty easy for a compiler to do also.
Note that the outer loop consists of a single DBcc instruction!
-- 
MLORD@BNR.CA  Ottawa, Ontario *** Personal views only ***
begin 644 NOTSHARE.COM ; Free MS-DOS utility - use instead of SHARE.EXE
MZQ.0@/P/=`J`_!9T!2[_+H``L/_/+HX&+`"T2<TAO@,!OX0`N1(`C,B.P/.DS
<^K@A-<TAB1Z``(P&@@"ZA`"X(27-(?NZE@#-)P#-5
``
end

ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) (04/25/91)

In article <1991Apr21.210031.16749@leland.Stanford.EDU>,
dhinds@elaine18.Stanford.EDU (David Hinds) asks:

"I'm not familiar with the 68000 instruction set, but couldn't [DBRA]
be adapted to 32-bit counts by just splitting the count into upper and
lower half-words and using a nested pair of 16-bit loops?  This would
have minimal overhead - a few instructions in loop setup - but would
have essentially the same performance as the 16 bit count for long loops."

Yup, and here's how you do it:

	; loop initialization, including loading 32-bit
	; unsigned count into d0, comes here
	bra.s	@7		; enter loop at bottom, as usual
@1	swap	d0
@2	; ... body of loop
@7	dbra	d0, @2
	swap	d0
	dbra	d0, @1

My instruction timings indicate that DBRA is preferable to a simple SUBQ/BNE
on the 68000, 68010, 68020 and 68030. This is because DBRA is faster when
the branch is taken.

Lawrence D'Oliveiro                       fone: +64-71-562-889
Computer Services Dept                     fax: +64-71-384-066
University of Waikato            electric mail: ldo@waikato.ac.nz
Hamilton, New Zealand    37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00
MADAME MACLUHAN, fortune-telling and back rubs. "The medium is the massage."

uad1077@dircon.co.uk (Ian Kemmish) (04/26/91)

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:

>In article <1991Apr16.152438.3445@waikato.ac.nz> 
>	ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
>>register and stop looping when it reaches -1 (or 65535), not 0.

>Compiler writers dislike this instruction, but not because of the
>test semantics. The killer is that the count is 16 bits, on a machine
>where variables and expressions are naturally 32 bits. This means
>that the compiler can only use the instruction when it possesses
>range information. 

>A sophisticated compiler (with optimization enabled) can often deduce
>ranges. However, when that machine came out, the instruction was
>essentially only usable by assembler programmers.
>-- 
>Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute
The Sun compiler seemed to have a hack that let people use that
instruction.  The sources to NeWS 1.0 included bits of the
pixrect library.  The *second* register variable in a proc.
was a short, and was fudged so that loops ran to -1, and the
grotty code that did this was always surrounded by comments to
the effect that this ``let the compiler use dbra''.....

And yes, it did seem to matter whether it was the *second* or not:->

-- 
Ian D. Kemmish                    Tel. +44 767 601 361
18 Durham Close                   uad1077@dircon.UUCP
Biggleswade                       ukc!dircon!uad1077
Beds SG18 8HZ United Kingdom    uad1077@dircon.co.uk

torek@elf.ee.lbl.gov (Chris Torek) (04/29/91)

>In article <12330@dog.ee.lbl.gov> I wrote:
>>However, it turns out that on the 68020 it is almost invariably faster
>>to avoid DBcc anyway (bcopy, for instance, should be unrolled). ...

(You can, of course, run the unrolled loop using dbra.)

In article <41612@cup.portal.com> ts@cup.portal.com (Tim W Smith) writes:
>I tried to calculate the proper amount of unrolling and came up with
>about 11 move.l instructions per dbra.

Yes, this is about right, although it depends on your memory system as
well.  (If memory is very slow, smaller loops are better, since it
takes less time to get the whole loop into the cache and the branch
overhead `disappears' into the memory access time.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

donc@microsoft.UUCP (Don CORBITT) (04/30/91)

>In article <12739@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>>In article <1991Apr16.152438.3445@waikato.ac.nz> 
>>	ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) writes:
>>>On the 68000 family, the DBcc loop instructions decrement the 16-bit count
>>>register and stop looping when it reaches -1 (or 65535), not 0.
>>
>>Compiler writers dislike this instruction, but not because of the
>>test semantics. The killer is that the count is 16 bits, on a machine
>>where variables and expressions are naturally 32 bits. This means
>>that the compiler can only use the instruction when it possesses
>>range information. 
>>
>>A sophisticated compiler (with optimization enabled) can often deduce
>>ranges. However, when that machine came out, the instruction was
>>essentially only usable by assembler programmers.

[several readers post workarounds to 16-bittedness of this instruction]

The problem is that you have to do these work-arounds.  The compiler would
like to generate the smallest and fastest code.  If the DBcc instruction
used all 32 bits of the register, the resulting code should be smallest, and
fastest.  Since you need the special case code to test for upper 16 bits
== -1, this code will be larger than required, although it should still be
fastest.

[excuse non-Motorola syntax, it's been too many years]

Today, you need to write something like this:

again:
	// loop body
	DBRA	D0, again
	ADD.L	1, D0
	BRZ	really_done
	SUB.L	1, D0
	BRA	again
really_done:

It would (obviously) much nicer to do:

again:
	// loop body
	DBRA	D0, again

--
Don Corbitt
Microsoft Windows Development

pplacewa@bbn.com (Paul W Placeway) (04/30/91)

ts@cup.portal.com (Tim W Smith) writes:

< Chris Torek says:
< > However, it turns out that on the 68020 it is almost invariably faster
< > to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score

< If you unroll too far, don't you start to miss on the instruction
< cache?  The optimum seems to be unrolled enough to lower loop overhead
< but rolled enough to fit the loop in the cache.

After hacking too many DSP things, all I have to say about loop
unrolling is that it's a good technique to make up for a bad
architecture.

What I want in my processor is a zero-overhead-per-loop down-counting
loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
have had this sort of thing for quite a while.  For those of you who
happen to be unfamilar, the idea is that the PC addressing hardware
has a loop beginning, end, and count register and the hardware does a
decrement-branch-nonzero when the PC == end-of-loop, resetting it to
beginning-of-loop, while in the instruction fetch stage.

The 320C30 also has a repeat-single mode which just does the next
instruction N times.  Unfortunately the processor also *ignores*
interrupts during the loop, so I can't use it, but if it wasn't broken
in that small way it would be very useful indeed.

Note that this could also be done in a bunch of different ways, like
for instance in a superscaler just doing a DBNZ in parallel with the
previous instruction.  I don't care much how it actually works, but it
is quite fun to have things like a vector sum or vector-vector add or
an inline bcopy() that run at one instruction per operation.

Now if only the "real" processor in my workstation could do such
things.

		-- Paul Placeway <pplaceway@bbn.com>

pegram@kira.UUCP (Robert B. Pegram) (05/01/91)

From article <63942@bbn.BBN.COM>, by pplacewa@bbn.com (Paul W Placeway):
> ts@cup.portal.com (Tim W Smith) writes:
> 
> < Chris Torek says:
> < > However, it turns out that on the 68020 it is almost invariably faster
> < > to avoid DBcc anyway (bcopy, for instance, should be unrolled).  Score
> 
> < If you unroll too far, don't you start to miss on the instruction
> < cache?  The optimum seems to be unrolled enough to lower loop overhead
> < but rolled enough to fit the loop in the cache.
> 
> After hacking too many DSP things, all I have to say about loop
> unrolling is that it's a good technique to make up for a bad
> architecture.
> 
> What I want in my processor is a zero-overhead-per-loop down-counting
> loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
> have had this sort of thing for quite a while.  For those of you who
> happen to be unfamilar, the idea is that the PC addressing hardware
> has a loop beginning, end, and count register and the hardware does a
> decrement-branch-nonzero when the PC == end-of-loop, resetting it to
> beginning-of-loop, while in the instruction fetch stage.
> 

Yup, that's very familiar, I see it in my Analog Devices 2100s also.
The other nice thing is the modified Harvard architecture.  It
allows for two data fetches at one time - provided the instructions
are in the cache, and some of the data is stored in the instruction 
memory space.  Frankly, I think it's good for general purpose computer
architects to get away from the idea that all programs are geared to
the time frame of human beings - the optimizations necessary to
maintain milli or microsecond responses can also be useful in more
general purpose programs. 

> Note that this could also be done in a bunch of different ways, like
> for instance in a superscaler just doing a DBNZ in parallel with the
> previous instruction.  I don't care much how it actually works, but it
> is quite fun to have things like a vector sum or vector-vector add or
> an inline bcopy() that run at one instruction per operation.
> 
> Now if only the "real" processor in my workstation could do such
> things.
> 
> 		-- Paul Placeway <pplaceway@bbn.com>

Amen to that.

Bob Pegram

pegram@griffin.uvm.edu
	or
...!uvm-gen!pegram

zalman@mips.com (Zalman Stern) (05/01/91)

In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes:
[...]
>After hacking too many DSP things, all I have to say about loop
>unrolling is that it's a good technique to make up for a bad
>architecture.
>
>What I want in my processor is a zero-overhead-per-loop down-counting
>loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
[...]
>		-- Paul Placeway <pplaceway@bbn.com>

The IBM RISC System/6000 (RIOS) has such an instruction. In fact, you can
also fold a condition register bit into the loop test and so long as the
condition is computed far enough in advance (three cycles), the branch is
free. A simple decrement and branch loop instruction is always free.

Of course there are tradeoffs to consider here. Setting up the loop count
register takes longer than putting the value into a general purpose
register (GPR). (I'm not sure what the latency of moving a GPR to the count
register is. I'd guess two or three cycles. Anyone out there know?) If
there are function calls in the loop, you have to save and restore the
count register. If you need to use the loop variable as an index, you need
an add in the loop anyway. (On the RIOS you usually finese this by using a
load and update instruction to step an induction variable.)

All in all, this is a win for the IBM machine since they have already
decoupled the integer ALU and instruction fetching. Its not clear how this
would apply to other RISC architectures though. A first criterion would be
for the instruction to take a GPR for the counter. It would also have a
delay slot.  I doubt such an instruction requires an extra write port into
the register file. (The branch instruction doesn't write anything so its
writeback slot can be used to update the counter register.)  It might
require some extra ALU hardware since the count register gets decremented
and an offset is added to the PC in one cycle. Its also a lot less
important in certain superscalar implementations where the counter update
can overlap something else in the loop.
-- 
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman     (408) 524 8395
	 "Never rub another man's rhubarb" -- Pop Will Eat Itself

mshute@cs.man.ac.uk (Malcolm Shute) (05/10/91)

In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes:
>What I want in my processor is a zero-overhead-per-loop down-counting
>loop instruction.  The TMS320 series, and Motorola DSP 56000 and 96000
>have had this sort of thing for quite a while.  For those of you who
>happen to be unfamilar, the idea is that the PC addressing hardware
>has a loop beginning, end, and count register and the hardware does a
>decrement-branch-nonzero when the PC == end-of-loop, resetting it to
>beginning-of-loop, while in the instruction fetch stage.

One interesting[*] idea which I saw once
[[*]where "interesting" == "novel", "neat" and "mind-stimulating",
but probably != "efficient", "commercially cost effective"]
made the instruction set completely block structured:
the fetch part of the instruction cycle worked with one
stack (with the IR and PC as the top-of-stack, and next-on-stack)
respectively; and the execute part of the instruction cycle
worked with the other (i.e. arithmetic) stack (again with the
top two items in registers... effectively the ACC and ARG registers
of the machine).  Every instruction had a repetition field
which counted down to zero, while the op-code part of the instruction
was repeatedly obeyed this number of times.

Most arithmetic instructions would just carry a repetition count of one,
and do their normal work on the arithmetic stack.  Occasionally a PUSH
instruction with a higher repetition count would be used to move a
contiguous block of data around (block move).

Subroutine calls simply involved pushing the new PC value on to the
system stack, once the calling instruction (presumably with a repetition
count of 1) had been POPed from the top (i.e had finished executing in the IR).
Loops were then implemented using this subroutine-call mechanism, with
the appropriate iteration count set in the calling instruction (some sort
of bit twiddling of fields would be required if anything other than a
compile-time constant number of iterations was required).

Its designer (Prof Osmon of City University, in London) named it CAC
(clean architecture computer -- due to its minimal and regular hardware)
just before the RISC label became fashionable.

Has any one any thoughts on how this could be put into practical use?
Or, indeed, has any one ever seen a real example embodying any of these
principles?
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all

pkl@ee.mu.OZ.AU (Peter LAWREY) (05/17/91)

In article <2518@m1.cs.man.ac.uk> mshute@cs.man.ac.uk (Malcolm Shute) writes:
>In article <63942@bbn.BBN.COM> pplacewa@bbn.com (Paul W Placeway) writes:
>>What I want in my processor is a zero-overhead-per-loop down-counting
>>loop instruction. 

Say we have an instruction loop[m] N. Where n is the number of instuctions in
the loop and N is the number of times the loop is executed. This is implimented
using two counters. One to keep track of the number of instructions to execute
and another to keep the loop counter. eg

	// A simplistic vector operation
	loop4		r7		// Variable length loop.
	load.32  	r4=[r1++]
	load.32 	r5=[r2++]
	fadd.s		r6=r5+r4
	store.32 	[r3++]=r6

	// Copy a byte character string.
	// Uses a break operator.
	loop3		r0		// Forever.
	load.8		r3=[r1++]
	store.8		[r2++]=r3
	break		r3==0

	// Wonderous numbers.
	// r1 contains the number to test.
	// r2 contains the number of loops.
	loop6		r0
	break		r1==1		// 1
	add		r2=r2+1		// 2
	and		r3=r1 & 0x0001	// 3
	bra		even,r3=0	// 4
	add		r3=r1+r1	// 5
	add		r1=r1+r3	// 6
even:	shl		r1=r1,1		// 5
	shl		r0=r0,r0	// 6
	// This however executes from the even label after finishing the loop
	// but it shows the versitility of the command.

These loops will have trouble with traps etc. as it represents a special mode
of execution.