[comp.sys.atari.st] ASSEMBLY MOVE/CLEAR/SET/COMPARE ROUTINES

dillon@CORY.BERKELEY.EDU (Matt Dillon) (08/16/88)

>I think this problem could be solved quite simply by using one of the reserved
>fields in the GEMDOS executable program header.  One of these fields could be
>used for telling the OS that it does not have to clear the TPA before executing
	
	Half of this conversation is silly.  Since when is clearing memory
slow?  A properly written memory-set/clear function will use, say, 12 or 13
registers filled with the pattern and then loop on a movem.l instruction.

	Needless to say, this is *fast*.  Very fast, in fact.
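
A minimal C sketch of the fill strategy described above, for readers who do not read 68K assembly. The names expand_byte and fill_block are hypothetical, and a 12-word array stands in for the 12 registers. It fills ascending for simplicity, whereas the posted bset.asm works down from the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all four bytes of a 32-bit word. */
static uint32_t expand_byte(unsigned char b)
{
    uint32_t w = b;      /* 000000xx */
    w |= w << 8;         /* 0000xxxx */
    w |= w << 16;        /* xxxxxxxx */
    return w;
}

enum { CHUNK_WORDS = 12, CHUNK_BYTES = CHUNK_WORDS * 4 };

/* Hypothetical helper: fill len bytes at buf with b, largest units first. */
static void fill_block(unsigned char *buf, size_t len, unsigned char b)
{
    uint32_t chunk[CHUNK_WORDS];     /* stands in for the 12 registers */
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < CHUNK_WORDS; i++)
        chunk[i] = expand_byte(b);

    while (len >= CHUNK_BYTES) {     /* bulk: 48 bytes per store */
        memcpy(buf, chunk, CHUNK_BYTES);
        buf += CHUNK_BYTES;
        len -= CHUNK_BYTES;
    }
    while (len >= 4) {               /* then by 4's */
        memcpy(buf, chunk, 4);
        buf += 4;
        len -= 4;
    }
    while (len > 0) {                /* then by 1's */
        *buf++ = b;
        len--;
    }
}
```

Every byte of the chunk array is identical, so the sketch does not depend on byte order.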

	And before some of the less sophisticated start blabbering about
special cases, following is some *GENERAL* 68K code for clearing, moving,
and comparing memory.

	All routines work on arbitrary boundaries, and optimize according to
the block size and alignment, all the way to using multiple-register moves
to accomplish the goal.  bmov() will do either an ascending or descending
copy accordingly, allowing for overlapped moves.
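
The direction choice that makes overlapped moves safe can be sketched in C as follows. This is only an illustration under assumptions: move_block is a hypothetical name, it copies a byte at a time, and it keeps the (src, dst, len) argument order of BMOV. The rule is the one the assembly uses: copy ascending when the destination is at or below the source, descending otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: overlap-safe copy of len bytes from src to dst. */
static void move_block(const unsigned char *src, unsigned char *dst, size_t len)
{
    assert(len == 0 || (src != NULL && dst != NULL));

    if ((uintptr_t)dst <= (uintptr_t)src) {
        /* dst at or below src: ascending copy never overwrites unread source */
        while (len--)
            *dst++ = *src++;
    } else {
        /* dst above src: start at the end and work downwards */
        src += len;
        dst += len;
        while (len--)
            *--dst = *--src;
    }
}
```

Copying the first six bytes of "abcdefgh" two places up with this helper leaves "ababcdef"; an ascending copy in that case would smear "ab" across the buffer.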

SPECIAL NOTES:
	-All calls take 32 bit quantities for any pointers or integers.  Note
	 especially that the BSET() function takes a long for the fill
	 character even though only a char is used.  Simply modify the code
	 to fill your needs.

	-This assembles under the Aztec assembler.  Some modifications may
	 be required to work on other assemblers.  However, the code is
	 COMPLETELY self contained.

	-Code is set up for being called from C, with arguments pushed on
	 the stack in reverse argument order (i.e. first arg is 4(sp), second
	 is 8(sp), and third is 12(sp) on entry to the call)

	-D0/D1/A0/A1 are all assumed to be scratch and are not saved.

						-Matt

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#	bcmp.asm
#	bmov.asm
#	bset.asm
# This archive created: Mon Aug 15 20:36:56 1988
export PATH; PATH=/bin:/usr/bin:$PATH
echo shar: "extracting 'bcmp.asm'" '(772 characters)'
if test -f 'bcmp.asm'
then
	echo shar: "will not over-write existing file 'bcmp.asm'"
else
cat << \!Funky!Stuff! > 'bcmp.asm'

		public	_bcmp	    ; compare two blocks of memory

		;   BCMP(src, dst, len)
		;   char *src, *dst;
		;   long len;

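		;   Returns 1 if the blocks are equal, 0 if they differ.

_bcmp:		move.l	4(sp),A0
		move.l	8(sp),A1
		move.l	12(sp),D0
		move.l	A0,D1
		sub.l	A1,D1
		btst.l	#0,D1	    ;pointers differ in parity?
		bne	.bcbyte     ;yes: cmpm.l would fault, use bytes
		move.l	A0,D1
		btst.l	#0,D1	    ;both even?
		beq	.bclong
		tst.l	D0	    ;both odd: compare one byte to align
		beq	.bcok
		subq.l	#1,D0
		cmpm.b	(A0)+,(A1)+
		bne	.bcfail
.bclong 	move.l	D0,D1
		lsr.l	#2,D1	    ;# of longwords to compare
		cmp.w	D0,D0	    ;force Z bit
		bra	.bc11
.bc10		cmpm.l	(A0)+,(A1)+
.bc11		dbne	D1,.bc10
		bne	.bcfail
		sub.l	#$10000,D1  ;dbne counts 16 bits, D1 holds 32
		bcc	.bc10
		and.l	#3,D0	    ;remaining bytes to compare
.bcbyte 	cmp.w	D0,D0	    ;force Z bit
		bra	.bc21
.bc20		cmpm.b	(A0)+,(A1)+
.bc21		dbne	D0,.bc20
		bne	.bcfail
		sub.l	#$10000,D0  ;same 32 bit count trick for bytes
		bcc	.bc20
.bcok		moveq.l #1,D0	    ;success!
		rts
.bcfail 	moveq.l #0,D0	    ;failure!
		rts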

!Funky!Stuff!
fi  # end of overwriting check
echo shar: "extracting 'bmov.asm'" '(2664 characters)'
if test -f 'bmov.asm'
then
	echo shar: "will not over-write existing file 'bmov.asm'"
else
cat << \!Funky!Stuff! > 'bmov.asm'

		;   BMOV(src, dst, len)
		;
		;   char *src, *dst;
		;   long len;
		;
		;   The memory move algorithm is somewhat more of a mess
		;   since we must do it either ascending or descending.

		public	_bmov
_bmov:		move.l	4(sp),A0
		move.l	8(sp),A1
		move.l	12(sp),D0
		cmp.l	A0,A1		;move to self
		beq	.bmend
		bls	.bmup
.bmdown 	adda.l	D0,A0		;descending copy
		adda.l	D0,A1
		move.w	A0,D1		;CHECK WORD ALIGNED
		btst.l	#0,D1
		bne	.bmdown1
		move.w	A1,D1
		btst.l	#0,D1
		bne	.bmdown1
		cmp.l	#259,D0 	    ;chosen by calculation.
		blo	.bmdown8

		move.l	D0,D1		    ;overhead for bmd44: ~360
		divu	#44,D1
		bvs	.bmdown8	    ;too big (> 2,883,540)
		movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
		move.l	#11*4,D0
		bra	.bmd44b
.bmd44a 	sub.l	D0,A0		    ;8		total 214/44bytes
		movem.l (A0),D2-D7/A2-A6    ;12 + 8*11  4.86 cycles/byte
		movem.l D2-D7/A2-A6,-(A1)   ; 8 + 8*11
.bmd44b 	dbf	D1,.bmd44a	    ;10
		swap	D1		    ;D0<31:16> already contain 0
		move.w	D1,D0		    ;D0 = remainder
		movem.l (sp)+,D2-D7/A2-A6

.bmdown8	move.w	D0,D1		    ;D1<2:0> = #bytes left later
		lsr.l	#3,D0		    ;divide by 8
		bra	.bmd8b
.bmd8a		move.l	-(A0),-(A1)         ;22         total 54/8bytes
		move.l	-(A0),-(A1)         ;22         = 6.75 cycles/byte
.bmd8b		dbf	D0,.bmd8a	    ;10
		sub.l	#$10000,D0
		bcc	.bmd8a
		move.w	D1,D0		    ;D0 = 0 to 7 bytes
		and.l	#7,D0
		bne	.bmdown1
		rts

.bmd1a		move.b	-(A0),-(A1)         ;14         total 24/byte
.bmdown1				    ;		= 24 cycles/byte
.bmd1b		dbf	D0,.bmd1a	    ;10
		sub.l	#$10000,D0
		bcc	.bmd1a
		rts

.bmup		move.w	A0,D1		    ;CHECK WORD ALIGNED
		btst.l	#0,D1
		bne	.bmup1
		move.w	A1,D1
		btst.l	#0,D1
		bne	.bmup1
		cmp.l	#259,D0 	    ;chosen by calculation
		blo	.bmup8

		move.l	D0,D1		    ;overhead for bmu44: ~360
		divu	#44,D1
		bvs	.bmup8		    ;too big (> 2,883,540)
		movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
		move.l	#11*4,D0
		bra	.bmu44b
.bmu44a 	movem.l (A0)+,D2-D7/A2-A6   ;12 + 8*11  ttl 214/44bytes
		movem.l D2-D7/A2-A6,(A1)    ;8  + 8*11  4.86 cycles/byte
		add.l	D0,A1		    ;8
.bmu44b 	dbf	D1,.bmu44a	    ;10
		swap	D1		    ;D0<31:16> already contain 0
		move.w	D1,D0		    ;D0 = remainder
		movem.l (sp)+,D2-D7/A2-A6

.bmup8		move.w	D0,D1		    ;D1<2:0> = #bytes left later
		lsr.l	#3,D0		    ;divide by 8
		bra	.bmu8b
.bmu8a		move.l	(A0)+,(A1)+         ;20         total 50/8bytes
		move.l	(A0)+,(A1)+         ;20         = 6.25 cycles/byte
.bmu8b		dbf	D0,.bmu8a	    ;10
		sub.l	#$10000,D0
		bcc	.bmu8a
		move.w	D1,D0		    ;D0 = 0 to 7 bytes
		and.l	#7,D0
		bne	.bmup1
		rts

.bmu1a		move.b	(A0)+,(A1)+
.bmup1
.bmu1b		dbf	D0,.bmu1a
		sub.l	#$10000,D0
		bcc	.bmu1a
.bmend		rts


!Funky!Stuff!
fi  # end of overwriting check
echo shar: "extracting 'bset.asm'" '(1702 characters)'
if test -f 'bset.asm'
then
	echo shar: "will not over-write existing file 'bset.asm'"
else
cat << \!Funky!Stuff! > 'bset.asm'

		public	_bzero		; Zero a block of memory
		public	_bset		; Set a block of memory to (byte val)

		;   BSET(buffer, len, byte)
		;   BZERO(buffer, len)
		;
		;   char *buffer;
		;   long len;
		;   long byte;	 (must be passed as a long though only a byte)


_bzero: 	moveq.l #0,D1
		bra	.bz0
_bset:		move.b	12+3(sp),D1
.bz0		move.l	4(sp),A0
		move.l	8(sp),D0

		add.l	D0,A0	    ; start at end of address
		cmp.l	#40,D0	    ; unscientifically chosen
		bls	.bs2
		bra	.bs10
.bs1		move.b	D1,-(A0)    ; any count < 65536
.bs2		dbf	D0,.bs1
		rts

				    ; count (D0) is > 40 here
.bs10		movem.l D2-D7/A2-A6,-(sp)   ;any count > 40
		move.l	A0,D2
		btst.l	#0,D2	    ; is it aligned?
		beq	.bs22
		move.b	D1,-(A0)    ; no, copy one byte
		subq.l	#1,D0

.bs22		andi.l	#$FF,D1     ; expand data D1.B -> D2-D7/A1-A6
		move.l	D1,D2	    ; D1 000000xx   D2 000000xx
		asl.w	#8,D2	    ;		       0000xx00
		or.w	D2,D1	    ;	 0000xxxx
		move.w	D1,D2	    ;	 0000xxxx      0000xxxx
		swap	D2	    ;	 0000xxxx      xxxx0000
		or.l	D1,D2	    ; D2.L
		move.l	D2,D3
		move.l	D2,D4
		move.l	D2,D5
		move.l	D2,D6
		move.l	D2,D7
		move.l	D2,A1
		move.l	D2,A2
		move.l	D2,A3
		move.l	D2,A4
		move.l	D2,A5
		move.l	D2,A6	    ; D2-D7/A1-A6   (12 registers)
		move.l	#12*4,D1    ; bytes per transfer (48)
.bs30		sub.l	D1,D0	    ; pre subtract
		bmi	.bs40
.bs31		movem.l D2-D7/A1-A6,-(A0)
		sub.l	D1,D0
		bpl	.bs31
.bs40		add.w	D1,D0	    ; less than 48 bytes remaining

		move.w	#4,D1	    ; by 4's
		sub.w	D1,D0
		bmi	.bs50
.bs41		move.l	D2,-(A0)
		sub.w	D1,D0
		bpl	.bs41
.bs50		add.w	D1,D0
		bra	.bs52
.bs51		move.b	D2,-(A0)    ; by 1's
.bs52		dbf	D0,.bs51
		movem.l (sp)+,D2-D7/A2-A6
		rts

!Funky!Stuff!
fi  # end of overwriting check
exit 0
#	End of shell archive
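
For checking the assembled routines from C, a portable model of their contracts may help. The sketch below is an assumption-laden stand-in, not the posted code: the model_ names are hypothetical and the loops are deliberately naive. Two details are taken from the listings: BCMP returns 1 when the blocks are equal and 0 when they differ (the reverse of BSD bcmp, which returns 0 for identical blocks), and BSET takes (buffer, len, byte) with the fill value passed as a long (unlike memset, which takes the value second).

```c
#include <assert.h>
#include <stddef.h>

/* Model of BCMP(src, dst, len): 1 when equal, 0 when not. */
static long model_bcmp(const unsigned char *a, const unsigned char *b, long len)
{
    long i;

    assert(len >= 0);
    for (i = 0; i < len; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Model of BSET(buffer, len, byte): only the low 8 bits of byte are stored. */
static void model_bset(unsigned char *buf, long len, long byte)
{
    long i;

    assert(len >= 0);
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(byte & 0xFF);
}

/* Model of BZERO(buffer, len). */
static void model_bzero(unsigned char *buf, long len)
{
    model_bset(buf, len, 0L);
}
```

A test harness could run each assembly routine and its model over the same buffers, at several alignments and lengths, and compare the results.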

apratt@atari.UUCP (Allan Pratt) (08/17/88)

dillon@cory.berkeley.edu says something about clearing memory being fast
when using movem and lots of registers.  He is right, but on a 4-meg
machine it still takes appreciable time.  It takes almost a second,
which is a long time if you do it a lot for short utilities loaded off
hard disk:

	At 4 clocks per longword to do the memory access, it takes
	4*3.5MB = 13,107,200 clocks, or ~0.82 seconds at 16MHz to
	clear 3.5MB, not counting instruction fetching and looping
	overhead.  This also doesn't count the fact that video
	memory cycles are interleaved with processor memory cycles;
	I don't know what that impact will be. 
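
The estimate above can be reproduced with a few lines of C. This is a sketch under stated assumptions: clear_clocks and clear_seconds are hypothetical names, and instruction fetch, loop overhead and video contention are ignored just as in the post. The quoted 13,107,200 clocks is what 4 clocks per byte gives over 3,276,800 bytes (3.125 x 1,048,576); taking 3.5MB as 3.5 x 1,048,576 bytes would give 14,680,064 clocks instead.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks needed to touch every byte once at a fixed cost per byte. */
static uint64_t clear_clocks(uint64_t bytes, unsigned clocks_per_byte)
{
    return bytes * clocks_per_byte;
}

/* Convert a clock count to seconds at the given CPU clock rate in Hz. */
static double clear_seconds(uint64_t clocks, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)clocks / clock_hz;
}
```

For example, clear_seconds(clear_clocks(3276800, 4), 16e6) comes out at roughly 0.82 seconds, matching the figure in the post.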

The clearing code in the 11/20 ROMs (pre-Mega ROMs) is stupid and
slow, which is why it takes appreciable time to clear 1 Meg.  The
Mega ROMs fixed this, but clearing 3.5 Meg still takes a long time,
and future machines may possibly have up to 10 Meg -- clearing
9.5 Meg will not be pleasant:

	It would take 38,273,024 clocks (~2.4 sec) to clear
	9.5MB at 16MHz, but fortunately any machine with that
	much memory will probably run faster than 16MHz.

These numbers are back-of-the-napkin computations, and I may have
some major flaw in my arithmetic, but .82 seconds sounds right to me.

You can speed up memory clearing on a Mega 4 by installing a 2MB RAMdisk...

============================================
Opinions expressed above do not necessarily	-- Allan Pratt, Atari Corp.
reflect those of Atari Corp. or anyone else.	  ...ames!atari!apratt

egisin@watmath.waterloo.edu (Eric Gisin) (08/17/88)

Why do so many people think movem is faster than the
more straight-forward loop of move.l's?

copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
 takes 242 cycles on the 68000,
while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
(timings derived from Motorola's 68000 manual)
when copying fewer words, move.l is even better.

the move.l approach does not require saving and restoring all
your registers, and can be coded in C with decent compilers.

Thomas_E_Zerucha@cup.portal.com (08/18/88)

There is a program called "TopDown" available from Eidco Resources
(written by John Eidsvoog) which allocates all the memory at the beginning,
causing all the staying programs to load at the bottom, and will speed things
up appreciably.  It is only $20, but solves the slowness of the Mega 4
loading, and also some minor compatibility problems with some programs
that expect to be loaded near the top.  Eidco Resources/POBox 4336/N Hollywood
CA 91807.  It does some other nice things to work around problems (like rez
changes).  I have no affiliation with Eidco except as a satisfied customer.

dillon@CORY.BERKELEY.EDU (Matt Dillon) (08/18/88)

:copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
: takes 242 cycles on the 68000,
:while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
:(timings derived from Motorola's 68000 manual)
:when copying fewer words, move.l is even better.
:
:the move.l approach does not require saving and restoring all
:your registers, and can be coded in C with decent compilers.

				* BULL SHIT *

	The motorola manual states the following for movem.l:

	movem M->R, long:  (An)+:   12 + 8n clock cycles
	movem R->M, long:  (An):     8 + 8n clock cycles
	add.l D0,A1:		     8      clock cycles

	Total clocks to move 12 longwords of data from source to destination
	in an ascending copy is:  220 clock cycles

					---


	move.l	(A0)+,(A1)+	:	20 clock cycles

	Total clocks to move 12 longwords of data from source to destination
	in an ascending copy is:  240 clock cycles

	move.l -(A0),-(A1)	:	22 clock cycles

	Total clocks to move 12 longwords of data from source to destination
	in a descending copy is:  264 clock cycles

	the comparison is valid because in both cases you still need the outer
	DBF loop.
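
The totals above follow directly from the quoted formulas, as this small C sketch shows. The function names are hypothetical, and the figures assume a 68000 with no wait states, which is what the manual tables describe.

```c
#include <assert.h>

/* movem.l (An)+,regs : 12 + 8n clocks, as quoted from the manual */
static unsigned movem_load_cycles(unsigned nregs)
{
    return 12 + 8 * nregs;
}

/* movem.l regs,(An) : 8 + 8n clocks, as quoted from the manual */
static unsigned movem_store_cycles(unsigned nregs)
{
    return 8 + 8 * nregs;
}

/* One load, one store, plus the 8-clock add.l D0,A1 pointer bump. */
static unsigned movem_copy_cycles(unsigned nregs)
{
    assert(nregs >= 1 && nregs <= 16);
    return movem_load_cycles(nregs) + movem_store_cycles(nregs) + 8;
}

/* nlongs successive move.l instructions at a fixed cost each. */
static unsigned movel_copy_cycles(unsigned nlongs, unsigned cycles_per_move)
{
    return nlongs * cycles_per_move;
}
```

With 12 registers this gives 220 clocks for the movem pair against 240 (ascending) or 264 (descending) for move.l. With 11 registers plus a 10-clock dbf it gives the 214 clocks per 44 bytes noted in the bmov.asm comments.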

NOW WHERE IN THE HELL DID YOU GET *YOUR* INFORMATION ?????

					-Matt

wes@obie.UUCP (Barnacle Wes) (08/21/88)

% The clearing code in the 11/20 ROMs (pre-Mega ROMs) is stupid and
% slow, which is why it takes appreciable time to clear 1 Meg.  The
% Mega ROMs fixed this, but clearing 3.5 Meg still takes a long time,
% and future machines may possibly have up to 10 Meg -- clearing
% 9.5 Meg will not be pleasant:
% 
% 	It would take 38,273,024 clocks (~2.4 sec) to clear
% 	9.5MB at 16MHz, but fortunately any machine with that
% 	much memory will probably run faster than 16MHz.

Why clear all of the memory in the machine?  The executable file has the
size of the bss segment in it, why not just clear the bss?  You could
also clear each Malloced block as it is allocated, if you really want to
preserve the idea that Malloced blocks should be cleared.
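
As a rough illustration of reading the BSS size from the file, the C sketch below assumes the commonly documented 28-byte GEMDOS program header: a $601A magic word followed by the text, data, BSS and symbol-table lengths as big-endian longs. The function name and offsets are assumptions to be checked against your own GEMDOS documentation, not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PRG_HEADER_SIZE = 28, PRG_MAGIC = 0x601A, PRG_BSS_OFFSET = 10 };

/* Read a big-endian 16-bit value. */
static unsigned be16(const unsigned char *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}

/* Read a big-endian 32-bit value. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: store the BSS length from a program header.
   Returns 0 on success, -1 if the buffer is too short or the magic
   word does not match. */
static int prg_bss_length(const unsigned char *hdr, size_t len, uint32_t *bss_len)
{
    assert(hdr != NULL && bss_len != NULL);
    if (len < PRG_HEADER_SIZE || be16(hdr) != PRG_MAGIC)
        return -1;
    *bss_len = be32(hdr + PRG_BSS_OFFSET);
    return 0;
}
```

A loader following the suggestion above would clear only that many bytes after the data segment rather than the whole TPA.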
-- 
                     {hpda, uwmcsd1}!sp7040!obie!wes
           "Happiness lies in being privileged to work hard for
           long hours in doing whatever you think is worth doing."
                         -- Robert A. Heinlein --

leo@philmds.UUCP (Leo de Wit) (08/22/88)

In article <20383@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes:
>Why do so many people think movem is faster than the
>more straight-forward loop of move.l's?

Because they may be right to do so 8-)

>copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
> takes 242 cycles on the 68000,
>while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
>(timings derived from Motorola's 68000 manual)

I dunno how you got your timings, but I got the following:

    movem.l  (a0)+,d1-d7/a2-a6    112 cycles
    movem.l  d1-d7/a2-a6,(a1)     108 cycles
    adda.l   d0,a1                  8 cycles
    ----------------------------------------
    TOTAL                         228 cycles

    move.l   (a0)+,(a1)+           24 cycles
    ----------------------------------------
    TOTAL (12 times)              288 cycles

so that the movem.l construct gains you 60 cycles here: about 25% faster
(it seems you've done a bit of cycle stealing 8-).
These timings are derived by ACTUALLY TIMING them (repeating them a lot
of times and measuring the time taken). I've got a nice little program
that times a series of hexadecimal codes, anyone interested?
B.T.W. I use movem.l because movem.w only moves the lower words of the
regs (after having extended them). The adda.l may just as well be an
adda.w because it is sign extended anyway.
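
The measure-by-repetition approach can be sketched in portable C. This is not the program mentioned above, only an illustration of the idea under assumptions: seconds_per_call and sample_work are hypothetical names, and clock() resolution on the host limits the precision.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical harness: average CPU seconds per call of fn over reps calls. */
static double seconds_per_call(void (*fn)(void), unsigned long reps)
{
    clock_t start, stop;
    unsigned long i;

    assert(fn != NULL && reps > 0);
    start = clock();
    for (i = 0; i < reps; i++)
        fn();
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}

/* Example workload; volatile keeps the compiler from removing it. */
static volatile unsigned long sink;

static void sample_work(void)
{
    sink += 1;
}
```

Timing an empty function the same way and subtracting gives an estimate of the loop and call overhead, and multiplying seconds by the CPU clock rate converts the result to cycles.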

Note that timings on the Motorola cannot be derived from a manual,
unless you know what timing is involved with memory access in your
special case (an ST, now ain't that special). In this it differs from
e.g. the Z80 where times must be synchronous; the bus does not wait (I
hope my explanation isn't too bad, not being a hardware expert and
so). On the Motorola bus the driver is more polite 8-).

>when copying fewer words, move.l is even better.

Of course, i.e. less worse. The gain in the movem approach is just that
you do much less instruction fetching (see above: 3, be it somewhat
longer instructions, against 12). No wonder movem.l is faster.

>the move.l approach does not require saving and restoring all
>your registers, and can be coded in C with decent compilers.

The overhead of saving and restoring registers is very little, because
we can use the fast movem.l instructions for that 8-)!!
No, seriously, if you move a large chunk, this overhead is quickly
gained back. And of course, it can be coded in assembler with decent
assemblers. As it can be in C with decent compilers (although this
involves either #asm's or executing data).


         Leo.

P.S. Some people may wonder why my smiley is most of the time a 8-).
This is simply because I wear glasses 8-).
And if not so, I forgot to put 'em on, or I'm just cleaning them :-).
*-) Oops, seems I broke'em :-(

egisin@watmath.waterloo.edu (Eric Gisin) (08/23/88)

In article <8808180442.AA08547@cory.Berkeley.EDU>, dillon@CORY.BERKELEY.EDU (Matt Dillon) writes:
> In an article I write:
> :copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
> : takes 242 cycles on the 68000,
> :while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
> :(timings derived from Motorola's 68000 manual)
> :when copying fewer words, move.l is even better.
> 
> 				* BULL SHIT *
> 
> 	The motorola manual states the following for movem.l:
> 
> 	movem M->R, long:  (An)+:   12 + 8n clock cycles
> 	movem R->M, long:  (An):     8 + 8n clock cycles
> 	add.l D0,A1:		     8      clock cycles
> 
> 	Total clocks to move 12 longwords of data from source to destination
> 	in an ascending copy is:  220 clock cycles
>  [asks where I got my information]

Third edition of the Motorola 16-bit microprocessor user's manual.
It is dated 1982. On page 203, movem r->m (long) is 8+10n cycles.
I guess my sources are obsolete.

Anyone know when this changed to 8+8n cycles?

mitch@Stride.COM (Thomas Mitchell) (08/26/88)

In article <20499@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes:
>In article <8808180442.AA08547@cory.Berkeley.EDU>, dillon@CORY.BERKELEY.EDU (Matt Dillon) writes:
>> In an article I write:
>> :copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
>> 	The motorola manual states the following for movem.l:
>> 
>> 	movem M->R, long:  (An)+:   12 + 8n clock cycles
>> 	movem R->M, long:  (An):     8 + 8n clock cycles
>> 	add.l D0,A1:		     8      clock cycles
>> 
>> 	Total clocks to move 12 longwords of data from source to destination
>> 	in an ascending copy is:  220 clock cycles
>
>Third edition of the Motorola 16-bit microprocessor user's manual.
>It is dated 1982. On page 203, movem r->m (long) is 8+10n cycles.
>I guess my sources are obsolete.
>
>Anyone know when this changed to 8+8n cycles?
             ^^^^^^^^^^^^^^^^^^^^^^^^^

This is a better question than it first looked!

From A PRELIMINARY copy Original Issue: Sept 1, 1979
Of the MC68000 16-bit Microprocessor Users Manual page D-6
 	movem M->R, long:  (An)+:    6 + 4n clock cycles**
 	movem R->M, long:  (An):     4 + 4n clock cycles**
	** internal cycles (1 internal=2 clock input cycles)

From the third edition 1982
Of the MC68000 16-bit Microprocessor Users Manual page 203
 	movem M->R, long:  (An)+:   12 + 8n clock cycles
 	movem R->M, long:  (An):     8 +10n clock cycles

From the fourth edition 1984
Of the MC68000 16/32-bit Microprocessor Users Manual page 214
 	movem M->R, long:  (An)+:   12 + 8n clock cycles
 	movem R->M, long:  (An):     8 + 8n clock cycles

A Dec 1982 manual on the 68010 and an Oct 1985 manual on the 68000
hold with movem.l R->M (An):     8 + 8n clock cycles.
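
To see what the differing editions mean for the 12-longword copy discussed in this thread, the figures above can be put in a small C table. The struct, the array and copy_cycles are hypothetical names. The 1979 values are the internal-cycle figures doubled to clock cycles, and the 8-clock pointer add is carried over from the earlier posts.

```c
#include <assert.h>
#include <stddef.h>

struct movem_timing {
    const char *edition;
    unsigned load_base, load_per_reg;    /* movem.l (An)+,regs */
    unsigned store_base, store_per_reg;  /* movem.l regs,(An)  */
};

static const struct movem_timing editions[] = {
    { "1979 preliminary",    2 * 6, 2 * 4, 2 * 4, 2 * 4 },
    { "1982 third edition",  12,    8,     8,     10    },
    { "1984 fourth edition", 12,    8,     8,     8     },
};

/* Clocks for one movem load, one movem store and an 8-clock add.l. */
static unsigned copy_cycles(const struct movem_timing *t, unsigned nregs)
{
    assert(t != NULL);
    return t->load_base  + t->load_per_reg  * nregs +
           t->store_base + t->store_per_reg * nregs + 8;
}
```

For 12 registers the 1979 and 1984 figures both give 220 clocks, while the 1982 figures give 244, which is close to, though not exactly, the 242 quoted earlier in the thread.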


mitch
-- 
Thomas P. Mitchell (mitch@stride1.Stride.COM)
Phone: (702)322-6868	TWX: 910-395-6073	FAX: (702)322-7975
MicroSage Computer Systems Inc.
Opinions expressed are probably mine.