[comp.sys.amiga.programmer] New life for MOVEM!

smcgerty@vax1.tcd.ie (02/12/91)

Hi 68000 users!

Here's a little trick that someone might find useful:
(maybe it's common knowledge?)

Right, picture the problem: you want to move, say, 1200 bytes from A to B
QUICKLY, but you can't be bothered to get the Blitter to do it, or the Blitter
is busy, or you just don't know how to get the Blitter to do it.

So you do it like this:

	LEA	Source,A0
	LEA	Dest,A1
	MOVE.W	#300-1,D0	; 1200 bytes = 300 LWs; DBRA loops count+1 times
Loop:	MOVE.L	(A0)+,(A1)+
	DBRA	D0,Loop

How about this, which takes about 2/3 of the time of the above:

	LEA	Source,A0
	LEA	Dest,A1
	MOVE.W	#25-1,D0		;25*48=1200 bytes; DBRA loops count+1 times
Loop:	MOVEM.L	(A0)+,D1-D7/A2-A6	;12 LWs! = 48 bytes
	MOVEM.L	D1-D7/A2-A6,(A1)
	ADDA.L	#48,A1			;since MOVEM can't have (A1)+ as Dest. operand
	DBRA	D0,Loop

OK, so it's a little register intensive, but you can always save all the regs
before using the routine, and restore them later.
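
The "about 2/3" figure can be checked against the 68000 instruction timing
tables. What follows is only a rough model in portable C, not Amiga code: the
function names are made up for illustration, it assumes zero wait states and no
chip DMA contention, and it ignores the slightly longer final DBRA pass.

```c
#include <assert.h>

/* Cycle figures as listed in the 68000 instruction timing tables.
   Assumption: zero wait states and no bus contention from chip DMA. */
enum {
    MOVE_L_POSTINC = 20, /* MOVE.L (A0)+,(A1)+ */
    DBRA_TAKEN     = 10, /* DBRA with the branch taken */
    ADDA_L_IMM     = 16  /* ADDA.L #48,A1 */
};

/* MOVEM.L (A0)+,<list> costs 12 + 8 cycles per register loaded. */
static unsigned movem_load(unsigned nregs)  { return 12 + 8 * nregs; }

/* MOVEM.L <list>,(A1) costs 8 + 8 cycles per register stored. */
static unsigned movem_store(unsigned nregs) { return 8 + 8 * nregs; }

/* Cycles for the MOVE.L/DBRA loop to copy `bytes` bytes (4 per pass). */
static unsigned simple_copy_cycles(unsigned bytes)
{
    assert(bytes % 4 == 0);
    return (bytes / 4) * (MOVE_L_POSTINC + DBRA_TAKEN);
}

/* Cycles for the MOVEM loop to copy `bytes` bytes (4*nregs per pass). */
static unsigned movem_copy_cycles(unsigned bytes, unsigned nregs)
{
    unsigned block = 4 * nregs;
    assert(nregs >= 1 && bytes % block == 0);
    return (bytes / block)
         * (movem_load(nregs) + movem_store(nregs) + ADDA_L_IMM + DBRA_TAKEN);
}
```

Under these assumptions, 1200 bytes cost 9000 cycles with the MOVE.L loop
(7.5 cycles/byte) and 5950 cycles with the 12-register MOVEM loop (about 4.96
cycles/byte), a ratio of roughly 0.66.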

Just to get a bit more speed, you could unroll the loop: put, say, five copies
of the original loop body inside one loop, which saves you 4 DBRA instructions
for every 5 iterations. (I think that's almost 40 clock cycles!)
You may think that's trivial, but it all mounts up!
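
The unrolling saving can be estimated the same way. This is a sketch under the
same assumptions as the timing tables (no wait states, DBRA exit cost ignored);
`unrolled_cycles` is a hypothetical helper, and 228 is the sum of the two
MOVEM.Ls and the ADDA.L in one 48-byte pass.

```c
#include <assert.h>

enum {
    BLOCK_BODY = 228, /* two MOVEM.Ls (108 + 104) plus ADDA.L #48,A1 (16) */
    DBRA_TAKEN = 10   /* DBRA with the branch taken */
};

/* Total cycles to copy `blocks` 48-byte blocks when each loop pass
   holds `unroll` copies of the body followed by one DBRA. */
static unsigned unrolled_cycles(unsigned blocks, unsigned unroll)
{
    assert(unroll >= 1 && blocks % unroll == 0);
    return blocks * BLOCK_BODY + (blocks / unroll) * DBRA_TAKEN;
}
```

With these figures, five blocks cost 1190 cycles rolled and 1150 unrolled
five-fold, so the saving is 40 cycles per five blocks, or 200 cycles (a little
over 3%) across the whole 1200-byte copy.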

Anyone got any other tricks?

----------------------------------------------------------------------------
|  / T | /  Stephen John McGerty           |                     Amiga  // |
|  / | |/   smcgerty@vax1.tcd.ie (C.Sci.)  | "Hmm.. No, nothing."    \\//  |
|__________________________________________|_______________________________|

jesup@cbmvax.commodore.com (Randell Jesup) (02/19/91)

In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>Here's a little trick that someone might find useful:
>(maybe it's common knowledge?)

	Yes.

[example of movem-loop follows..]

	Or you could use CopyMem() (or CopyMemQuick() when you know the source
and destination are aligned).  They use movem-loops when possible.  (In
fact, under 2.0 CopyMem is adaptive to the processor in use).
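
The aligned/unaligned split can be illustrated in portable C. This is not the
exec.library code, only a sketch of the idea: CopyMemQuick() is documented as
requiring longword-aligned pointers and a size that is a multiple of 4, so a
caller (or a general routine) can test for that and fall back to a byte loop
otherwise. All function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fast path: whole 32-bit words, aligned pointers only. */
static void copy_longs(uint32_t *dst, const uint32_t *src, size_t nlongs)
{
    while (nlongs--)
        *dst++ = *src++;
}

/* Hypothetical slow path: any alignment, any length. */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Copies n bytes between non-overlapping regions.
   Returns 1 if the longword path was taken, 0 for the byte path. */
static int copy_dispatch(void *dst, const void *src, size_t n)
{
    uintptr_t low_bits = ((uintptr_t)dst | (uintptr_t)src | (uintptr_t)n) & 3u;

    if (low_bits == 0) {
        copy_longs((uint32_t *)dst, (const uint32_t *)src, n / 4);
        return 1;
    }
    copy_bytes(dst, src, n);
    return 0;
}
```

OR-ing the two addresses and the length together lets a single mask test cover
all three alignment conditions.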

	Surprising what you can do when you use the OS....

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

smcgerty@vax1.tcd.ie (02/21/91)

In article <19100@cbmvax.commodore.com>, jesup@cbmvax.commodore.com (Randell Jesup) writes:
> In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>>Here's a little trick that someone might find useful:
>>(maybe it's common knowledge?)
> 	Yes. 

Not judging by the response I got... Remember, there's always someone lower than
you on the learning curve....

> [example of movem-loop follows..] 
> 	Or you could use CopyMem() (or CopyMemQuick() when you know the source
> and destination are aligned).  They use movem-loops when possible.  (In
> fact, under 2.0 CopyMem is adaptive to the processor in use). 
> 	Surprising what you can do when you use the OS....
> -- 
> Randell Jesup, Keeper of AmigaDos, Commodore Engineering.

Hey, I don't doubt the OS is very fast and neat; we all use it quite often, and
it's great, etc. However, as far as giving people a deeper understanding of
68000 programming is concerned, an example of a movem-loop in assembly is a
bit better than a recommendation to use an OS routine.

By writing my example, I wasn't really trying to fulfill someone's desire to
have a fast-copy-memory routine, but instead I wanted to stimulate an interest
in the techniques of using the 68000 efficiently. 

If everyone purely relied on OS routines, without knowing how they worked, then
there would be a lot more ignorance about the nitty-gritty techniques of
programming the Amiga. 

Re-inventing the wheel is often the best way of educating yourself. I find it
helpful, and I reckon others do too.


dillon@overload.Berkeley.CA.US (Matthew Dillon) (02/23/91)

In article <1991Feb21.115145.7828@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>In article <19100@cbmvax.commodore.com>, jesup@cbmvax.commodore.com (Randell Jesup) writes:
>> In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>>...
>
>Hey, I don't doubt the OS is very fast and neat; we all use it quite often, and
>it's great, etc. However, as far as giving people a deeper understanding of
>68000 programming is concerned, an example of a movem-loop in assembly is a
>bit better than a recommendation to use an OS routine.
>
>By writing my example, I wasn't really trying to fulfill someone's desire to
>have a fast-copy-memory routine, but instead I wanted to stimulate an interest
>in the techniques of using the 68000 efficiently.
>
>Re-inventing the wheel is often the best way of educating yourself. I find it
>helpful, and I reckon others do too.
>...

    I generally post this about once a year when the question comes up..
    here is a fully working MOVMEM() call that optimizes via MOVEM:

					-Matt

    Matthew Dillon	    dillon@Overload.Berkeley.CA.US
    891 Regal Rd.	    uunet.uu.net!overload!dillon
    Berkeley, Ca. 94708
    USA


		;   MOVMEM.A
		;
		;   (c)Copyright 1990, Matthew Dillon, All Rights Reserved

		section text,code

		;   movmem(src, dst, len)   (ANSI)
		;   bcopy(src, dst, len)    (UNIX)
		;	    A0	A1   D0     DICE-REG
		;	    A0	A1   D0     internal
		;	 4(sp) 8(sp) 12(sp)
		;
		;   The memory move algorithm is somewhat more of a mess
		;   since we must do it either ascending or descending.

		xdef	_movmem
		xdef	_bcopy	    ; UNIX
		xdef	@movmem
		xdef	@bcopy	    ; UNIX


_bcopy:
_movmem:	move.l	4(sp),A0
		move.l	8(sp),A1
		move.l	12(sp),D0
@bcopy:
@movmem:
		cmp.l	A0,A1		;move to self
		beq	xbmend
		bls	xbmup
xbmdown 	adda.l	D0,A0		;descending copy
		adda.l	D0,A1
		move.w	A0,D1		;CHECK WORD ALIGNED
		lsr.l	#1,D1
		bcs	xbmdown1
		move.w	A1,D1
		lsr.l	#1,D1
		bcs	xbmdown1
		cmp.l	#259,D0 	    ;chosen by calculation.
		bcs	xbmdown8

		move.l	D0,D1		    ;overhead for bmd44: ~360
		divu	#44,D1
		bvs	xbmdown8	    ;quotient > 16 bits (len >= 2,883,584)
		movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
		move.l	#44,D0
		bra	xbmd44b
xbmd44a 	sub.l	D0,A0		    ;8		total 214/44bytes
		movem.l (A0),D2-D7/A2-A6    ;12 + 8*11  4.86 cycles/byte
		movem.l D2-D7/A2-A6,-(A1)   ; 8 + 8*11
xbmd44b 	dbf	D1,xbmd44a	    ;10
		swap	D1		    ;D0<31:16> already contain 0 (D0 = 44)
		move.w	D1,D0		    ;D0 = remainder
		movem.l (sp)+,D2-D7/A2-A6

xbmdown8	move.w	D0,D1		    ;D1<2:0> = #bytes left later
		lsr.l	#3,D0		    ;divide by 8
		bra	xbmd8b
xbmd8a		move.l	-(A0),-(A1)         ;20         total 50/8bytes
		move.l	-(A0),-(A1)         ;20         = 6.25 cycles/byte
xbmd8b		dbf	D0,xbmd8a	    ;10
		sub.l	#$10000,D0
		bcc	xbmd8a
		move.w	D1,D0		    ;D0 = 0 to 7 bytes
		and.l	#7,D0
		bne	xbmdown1
xbmend
		move.l	8(sp),D0
		rts

xbmd1a		move.b	-(A0),-(A1)         ;12         total 22/byte
xbmdown1				    ;		= 22 cycles/byte
xbmd1b		dbf	D0,xbmd1a	    ;10
		sub.l	#$10000,D0
		bcc	xbmd1a
		move.l	8(sp),D0
		rts

xbmup		move.w	A0,D1		    ;CHECK WORD ALIGNED
		lsr.l	#1,D1
		bcs	xbmup1
		move.w	A1,D1
		lsr.l	#1,D1
		bcs	xbmup1
		cmp.l	#259,D0 	    ;chosen by calculation
		bcs	xbmup8

		move.l	D0,D1		    ;overhead for bmu44: ~360
		divu	#44,D1
		bvs	xbmup8		    ;quotient > 16 bits (len >= 2,883,584)
		movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
		move.l	#44,D0
		bra	xbmu44b
xbmu44a 	movem.l (A0)+,D2-D7/A2-A6   ;12 + 8*11  ttl 214/44bytes
		movem.l D2-D7/A2-A6,(A1)    ;8  + 8*11  4.86 cycles/byte
		add.l	D0,A1		    ;8
xbmu44b 	dbf	D1,xbmu44a	    ;10
		swap	D1		    ;D0<31:16> already contain 0 (D0 = 44)
		move.w	D1,D0		    ;D0 = remainder
		movem.l (sp)+,D2-D7/A2-A6

xbmup8		move.w	D0,D1		    ;D1<2:0> = #bytes left later
		lsr.l	#3,D0		    ;divide by 8
		bra	xbmu8b
xbmu8a		move.l	(A0)+,(A1)+         ;20         total 50/8bytes
		move.l	(A0)+,(A1)+         ;20         = 6.25 cycles/byte
xbmu8b		dbf	D0,xbmu8a	    ;10
		sub.l	#$10000,D0
		bcc	xbmu8a
		move.w	D1,D0		    ;D0 = 0 to 7 bytes
		and.l	#7,D0
		bne	xbmup1
		move.l	8(sp),D0
		rts

xbmu1a		move.b	(A0)+,(A1)+
xbmup1
xbmu1b		dbf	D0,xbmu1a
		sub.l	#$10000,D0
		bcc	xbmu1a
		move.l	8(sp),D0
		rts

		END
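
The decisions the listing makes (copy direction, alignment test, and the split
into 44-byte, 8-byte and 1-byte loops) may be easier to follow in C. The code
below is a hypothetical model for illustration, not a translation of the
assembly: `plan_movmem`, `move_bytes` and `struct copy_plan` are invented
names, and addresses are treated as plain integers.

```c
#include <stddef.h>
#include <stdint.h>

struct copy_plan {
    int    descending; /* 1 if the copy must run from the top down */
    size_t blocks44;   /* passes of the 11-register MOVEM loop */
    size_t chunks8;    /* passes of the two-MOVE.L loop */
    size_t bytes1;     /* passes of the MOVE.B loop */
};

/* Model of the routine's decisions for a given source, dest and length. */
static struct copy_plan plan_movmem(uintptr_t src, uintptr_t dst, size_t len)
{
    struct copy_plan p = {0, 0, 0, 0};

    if (src == dst)
        return p;                      /* move to self: nothing to do */
    p.descending = dst > src;
    if (p.descending) {                /* start from one past the end */
        src += len;
        dst += len;
    }
    if ((src | dst) & 1u) {            /* odd address: bytes only */
        p.bytes1 = len;
        return p;
    }
    /* DIVU leaves a 16-bit quotient; larger counts skip the MOVEM loop. */
    if (len >= 259 && len / 44 <= 0xFFFFu) {
        p.blocks44 = len / 44;
        len %= 44;
    }
    p.chunks8 = len / 8;
    p.bytes1 = len % 8;
    return p;
}

/* Overlap-safe byte copy using the same direction rule. */
static void move_bytes(unsigned char *dst, const unsigned char *src, size_t len)
{
    if ((uintptr_t)dst <= (uintptr_t)src) {
        while (len--)
            *dst++ = *src++;
    } else {
        dst += len;
        src += len;
        while (len--)
            *--dst = *--src;
    }
}
```

For the 1200-byte case from the start of the thread, with both addresses even,
this model gives 27 MOVEM passes (1188 bytes), one 8-byte pass and 4 single
bytes.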

dej@qpoint.amiga.ocunix.on.ca (David Jones) (02/23/91)

>In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>How about this, which takes about 2/3 of the time of the above:
>
>	LEA	Source,A0
>	LEA	Dest,A1
>	MOVE.W	#25-1,D0           ;25*48=1200 bytes; DBRA loops count+1 times
>Loop:	MOVEM.L (A0)+,D1-D7/A2-A6  ;12 LWs! = 48 bytes
>	MOVEM.L D1-D7/A2-A6,(A1)
>	ADDA.L	#48,A1        ;since MOVEM can't have (A1)+ as Dest. operand	
>        DBRA	D0,Loop
>
>Anyone got any other tricks?

Ya.  Save yourself some code.  Check out CopyMem() in exec.library
(V33 or greater).  Disassemble it.  Essentially, it is the above code.

--
 
 |    The Q-Point                  David Jones
 |\   Amiga S/W development        UUCP:  dej@qpoint.amiga.ocunix.on.ca
 | \                               Fido:  1:163/109.8                  
 |  \
 |   \      "I can understand why someone would want to go out, get drunk
 |   -\----  and wake up the next morning with a splitting headache and
 |  /  \     absolutely no memory of the night before, but I *cannot*
 | /    \    understand why anyone would want to do that more than once."
 |/      \
 +----------                                   - Don Elgee

hughesmp@vax1.tcd.ie (03/02/91)

In article <dej.0456@qpoint.amiga.ocunix.on.ca>, dej@qpoint.amiga.ocunix.on.ca (David Jones) writes:
>>In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>>How about this, which takes about 2/3 of the time of the above:
>>
>>[..usage of movem deleted..]
>>
>>Anyone got any other tricks?
> 
> Ya.  Save yourself some code.  Check out CopyMem() in exec.library
> (V33 or greater).  Disassemble it.  Essentially, it is the above code.

Hey cmon man, he doesn't want to hear about supplied software. Often you
find stuff written by someone else, particularly the OS, sucks. You want
one thing quick. It wants something else slow. So you write it _yourself_.
At least that way you know exactly what's going on, how fast, and everyone
will be able to use it. Not just people with V33 or greater, whatever
that is. He asks (if you read the posting) if anyone else has any tricks.
He wants to know if there are any other ways of squeezing more out of what
is basically a not-very-fast-processor. One byte per 4 cycles stinks, so
what'd it be like without movem? Are there any other ways of doing something
else faster; try and get summat out of the machine, if you don't want to
waste your money on a bigger chip in the series? Don't say find out about
the OS, because it is a heap of it. You want _real_optimisation_ for the
specific problem, for which some general ideas may help. Movem is one. The
OS is not. Matt Dillon's program is very nice, coping with non-word
boundaries and everything, but if you want _everything_ out of the machine,
forget those checks. Align your data, and use the plain movems. Shove the
loop in a cupboard, and in-line the code. On a processor running at the
speed of a low 68000, those cycles count. Save them. Don't give a damn about
memory. Remember, only a heartless fiend can get the true max out of the
machine. Work everything to the bloody stumps, and waste everything else.

T.

SICK - the Slightly Intelligent Crazy Rosebi -
We came. We saw. We went away again.
#! r

lkoop@pnet01.cts.com (Lamonte Koop) (03/04/91)

hughesmp@vax1.tcd.ie writes:
>In article <dej.0456@qpoint.amiga.ocunix.on.ca>, dej@qpoint.amiga.ocunix.on.ca (David Jones) writes:
>>>In article <1991Feb11.160212.7749@vax1.tcd.ie> smcgerty@vax1.tcd.ie writes:
>>>How about this, which takes about 2/3 of the time of the above:
>>>
>>>[..usage of movem deleted..]
>>>
>>>Anyone got any other tricks?
>> 
>> Ya.  Save yourself some code.  Check out CopyMem() in exec.library
>> (V33 or greater).  Disassemble it.  Essentially, it is the above code.
>
>Hey cmon man, he doesn't want to hear about supplied software. Often you
>find stuff written by someone else, particularly the OS, sucks. You want

Not in my experience.  Just because the OS is "supplied" or written by 
someone else, it doesn't mean you have to go about re-inventing the wheel
because you feel "it sucks"...a feeling which I strongly disagree with.  Yes,
the OS has its problems, but it has quite a few excellent points to it as
well.

>one thing quick. It wants something else slow. So you write it _yourself_.
>At least that way you know exactly what's going on, how fast, and everyone
>will be able to use it. Not just people with V33 or greater, whatever
>that is. He asks (if you read the posting) if anyone else has any tricks.
>He wants to know if there are any other ways of squeezing more out of what
>is basically a not-very-fast-processor. One byte per 4 cycles stinks, so
>what'd it be like without movem? Are there any other ways of doing something
>else faster; try and get summat out of the machine, if you don't want to
>waste your money on a bigger chip in the series? Don't say find out about
>the OS, because it is a heap of it. You want _real_optimisation_ for the
>specific problem, for which some general ideas may help. Movem is one. The
>OS is not. Matt Dillon's program is very nice, coping with non-word
>boundaries and everything, but if you want _everything_ out of the machine,
>forget those checks. Align your data, and use the plain movems. Shove the
>loop in a cupboard, and in-line the code. On a processor running at the
>speed of a low 68000, those cycles count. Save them. Don't give a damn about
>memory. Remember, only a heartless fiend can get the true max out of the
>machine. Work everything to the bloody stumps, and waste everything else.

From your comments, I have just a few of my own:  First of all, I have
absolutely nothing against optimizing code...in fact I am all for it, and any
ideas pertaining to it.  However, your attitude seems to be quite hostile
towards the OS...which is NOT "full of it".  In fact, you seem to be the sort
who would write code which crashes just about every machine except a
particular model.  This may not be the case, but I don't see how you would get
decently multitasking-friendly applications when you avoid the OS.
 
Second of all, how do you propose to get anything done...when you insist on
reinventing everything?

>
>T.
>
>SICK - the Slightly Intelligent Crazy Rosebi -
>We came. We saw. We went away again.
>#! r


                             LaMonte Koop
 Internet: lkoop@pnet01.cts.com         ARPA: crash!pnet01!lkoop@nosc.mil
           UUCP: {hplabs!hp-sdd ucsd nosc}!crash!pnet01!lkoop
 "It's a dog-eat-dog world...and I'm wearing Milk Bone underwear"--Norm

jesup@cbmvax.commodore.com (Randell Jesup) (03/05/91)

In article <1991Mar2.042511.7894@vax1.tcd.ie> hughesmp@vax1.tcd.ie writes:
>In article <dej.0456@qpoint.amiga.ocunix.on.ca>, dej@qpoint.amiga.ocunix.on.ca (David Jones) writes:
>> Ya.  Save yourself some code.  Check out CopyMem() in exec.library
>> (V33 or greater).  Disassemble it.  Essentially, it is the above code.
>
>Hey cmon man, he doesn't want to hear about supplied software. Often you
>find stuff written by someone else, particularly the OS, sucks. You want
...
>will be able to use it. Not just people with V33 or greater, whatever
>that is.

	V33 is 1.2.  Anyone who is running anything earlier than 1.2 deserves
10 lashes with a wet noodle (since 1.0 and 1.1 were only available on A1000s,
and they can upgrade in a snap - almost all modern stuff requires 1.2).

>waste your money on a bigger chip in the series? Don't say find out about
>the OS, because it is a heap of it. You want _real_optimisation_ for the
>specific problem, for which some general ideas may help. Movem is one. The
>OS is not. Matt Dillon's program is very nice, coping with non-word
>boundaries and everything, but if you want _everything_ out of the machine,
>forget those checks. Align your data, and use the plain movems. Shove the
>loop in a cupboard, and in-line the code.

	Guess what: what you suggest is exactly what's in the OS.  There's
CopyMem(), for non-aligned data (a la Matt's), and CopyMemQuick(), for
aligned data.  It can't inline the code, but if you're transferring enough
data for movem-loops to make a difference, the cycles for a single
subroutine call to start it is WAY down in the noise (plus you win in that
on a chip-only machine, ROM access can be far faster than ram access,
depending on video mode).
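
How far "down in the noise" the call is can be estimated roughly. The sketch
below assumes a library call costs a JSR through A6, the JMP in the library
jump table and an RTS (timing-table figures), and that the copy itself runs at
the 238 cycles per 48 bytes of the MOVEM loop posted earlier in the thread. The
real CopyMem() setup checks and inner loop are not modelled, and
`call_overhead_permille` is a made-up name.

```c
#include <assert.h>

enum {
    JSR_D16_AN   = 18,  /* JSR offset(A6): enter the library jump table */
    JMP_ABS_L    = 12,  /* JMP in the jump table to the routine itself */
    RTS_CYCLES   = 16,  /* RTS back to the caller */
    BLOCK_CYCLES = 238  /* assumed cost of one 48-byte MOVEM pass */
};

/* Call overhead as parts per thousand of the total cycle count. */
static unsigned call_overhead_permille(unsigned bytes)
{
    unsigned call = JSR_D16_AN + JMP_ABS_L + RTS_CYCLES;
    unsigned copy;

    assert(bytes > 0 && bytes % 48 == 0);
    copy = (bytes / 48) * BLOCK_CYCLES;
    return (1000 * call) / (call + copy);
}
```

On these assumptions the call accounts for under 1% of a 1200-byte copy, but
about 16% of a single 48-byte block, which is where in-lining would start to
matter.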

	And if you happen to run your code on 2.0 with an '020 or better,
suddenly your copies get even quicker, since we have separate copy loops
for different processors.


sschaem@starnet.uucp (Stephan Schaem) (03/06/91)

 Talking about people who don't like their OS functions...
 If you think something doesn't fit, create your own: why be stuck
 with other people's way of thinking?!
 I'm not saying replace the OS, but add to it or extend it.
 I don't use Intuition extensively (mostly just screens), since I have
 other needs and have fought too much to get things done the Intuition
 way.
 The previous example was text: there should be different ways to
 handle text, and I find that neither FF nor the 2.0 'emulation' is at
 the text display 'peak'. So when I need a special text feature I use
 my own library.
 Always using the OS is not ALWAYS the best solution, and should not be
 the only way to make things work...