dillon@CORY.BERKELEY.EDU (Matt Dillon) (08/16/88)
>I think this problem could be solved quite simply by using one of the reserved
>fields in the GEMDOS executable program header. One of these fields could be
>used for telling the OS that it does not have to clear the TPA before executing

    Half of this conversation is silly. Since when is clearing memory
slow? A properly written memory-set/clear function will use, say, 12 or 13
registers filled with the pattern and then loop on a movem.l instruction.
Needless to say, this is *fast*. Very fast, in fact.

    And before some of the less sophisticated start blabbering about
special cases, following is some *GENERAL* 68K code for clearing, moving,
and comparing memory. All routines work on arbitrary boundaries, and
optimize according to the block size and alignment, all the way to using
multiple-register moves to accomplish the goal. bmov() will do either an
ascending or descending copy accordingly, allowing for overlapped moves.

    SPECIAL NOTES:

    -All calls take 32 bit quantities for any pointers or integers. Note
     especially that the BSET() function takes a long for the fill character
     even though only a char is used. Simply modify the code to fill your
     needs.

    -This assembles under the Aztec assembler. Some modifications may be
     required to work on other assemblers. However, the code is COMPLETELY
     self-contained.

    -Code is set up for being called from C, with arguments pushed on the
     stack in reverse argument order (i.e. first arg is 4(sp), second is
     8(sp), and third is 12(sp) on entry to the call)

    -D0/D1/A0/A1 are all assumed to be scratch and are not saved.

                    -Matt

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create:
#       bcmp.asm
#       bmov.asm
#       bset.asm
# This archive created: Mon Aug 15 20:36:56 1988
export PATH; PATH=/bin:/usr/bin:$PATH
echo shar: "extracting 'bcmp.asm'" '(772 characters)'
if test -f 'bcmp.asm'
then
    echo shar: "will not over-write existing file 'bcmp.asm'"
else
cat << \!Funky!Stuff! > 'bcmp.asm'

            public  _bcmp           ; compare two blocks of memory

            ;   BCMP(src, dst, len)
            ;   char *src, *dst;
            ;   long len;

_bcmp:      move.l  4(sp),A0
            move.l  8(sp),A1
            move.l  12(sp),D0
            move.w  D0,D1           ;longword align address
            neg.w   D1
            and.w   #3,D1
            cmp.w   D0,D0           ;force Z bit
            bra     .bc2
.bc1        cmpm.b  (A0)+,(A1)+
.bc2        dbne    D1,.bc1
            bne     .bcfail
            move.l  D0,D1
            lsr.l   #2,D1           ;# of longwords to compare
            cmp.w   D0,D0           ;force Z bit
            bra     .bc11
.bc10       cmpm.l  (A0)+,(A1)+
.bc11       dbne    D1,.bc10
            bne     .bcfail
            sub.l   #$10000,D0
            bcc     .bc10
            and.w   #3,D0           ;remaining bytes to compare
            cmp.w   D0,D0           ;force Z bit
            bra     .bc21
.bc20       cmpm.b  (A0)+,(A1)+
.bc21       dbne    D0,.bc20
            bne     .bcfail
            moveq.l #1,D0           ;success!
            rts
.bcfail     moveq.l #0,D0           ;failure!
            rts
!Funky!Stuff!
fi  # end of overwriting check
echo shar: "extracting 'bmov.asm'" '(2664 characters)'
if test -f 'bmov.asm'
then
    echo shar: "will not over-write existing file 'bmov.asm'"
else
cat << \!Funky!Stuff! > 'bmov.asm'

            ;   BMOV(src, dst, len)
            ;
            ;   char *src, *dst;
            ;   long len;
            ;
            ;   The memory move algorithm is somewhat more of a mess
            ;   since we must do it either ascending or descending.

            public  _bmov

_bmov:      move.l  4(sp),A0
            move.l  8(sp),A1
            move.l  12(sp),D0
            cmp.l   A0,A1           ;move to self
            beq     .bmend
            bls     .bmup
.bmdown     adda.l  D0,A0           ;descending copy
            adda.l  D0,A1
            move.w  A0,D1           ;CHECK WORD ALIGNED
            btst.l  #0,D1
            bne     .bmdown1
            move.w  A1,D1
            btst.l  #0,D1
            bne     .bmdown1
            cmp.l   #259,D0         ;chosen by calculation.
            blo     .bmdown8
            move.l  D0,D1           ;overhead for bmd44: ~360
            divu    #44,D1
            bvs     .bmdown8        ;too big (> 2,883,540)
            movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
            move.l  #11*4,D0
            bra     .bmd44b
.bmd44a     sub.l   D0,A0               ;8          total 214/44bytes
            movem.l (A0),D2-D7/A2-A6    ;12 + 8*11  4.86 cycles/byte
            movem.l D2-D7/A2-A6,-(A1)   ; 8 + 8*11
.bmd44b     dbf     D1,.bmd44a          ;10
            swap    D1              ;D0<15:7> already contain 0
            move.w  D1,D0           ;D0 = remainder
            movem.l (sp)+,D2-D7/A2-A6
.bmdown8    move.w  D0,D1           ;D1<2:0> = #bytes left later
            lsr.l   #3,D0           ;divide by 8
            bra     .bmd8b
.bmd8a      move.l  -(A0),-(A1)     ;20     total 50/8bytes
            move.l  -(A0),-(A1)     ;20     = 6.25 cycles/byte
.bmd8b      dbf     D0,.bmd8a       ;10
            sub.l   #$10000,D0
            bcc     .bmd8a
            move.w  D1,D0           ;D0 = 0 to 7 bytes
            and.l   #7,D0
            bne     .bmdown1
            rts
.bmd1a      move.b  -(A0),-(A1)     ;12     total 22/byte
.bmdown1                            ;       = 22 cycles/byte
.bmd1b      dbf     D0,.bmd1a       ;10
            sub.l   #$10000,D0
            bcc     .bmd1a
            rts

.bmup       move.w  A0,D1           ;CHECK WORD ALIGNED
            btst.l  #0,D1
            bne     .bmup1
            move.w  A1,D1
            btst.l  #0,D1
            bne     .bmup1
            cmp.l   #259,D0         ;chosen by calculation
            blo     .bmup8
            move.l  D0,D1           ;overhead for bmu44: ~360
            divu    #44,D1
            bvs     .bmup8          ;too big (> 2,883,540)
            movem.l D2-D7/A2-A6,-(sp)   ;use D2-D7/A2-A6 (11 regs)
            move.l  #11*4,D0
            bra     .bmu44b
.bmu44a     movem.l (A0)+,D2-D7/A2-A6   ;12 + 8*11  ttl 214/44bytes
            movem.l D2-D7/A2-A6,(A1)    ;8 + 8*11   4.86 cycles/byte
            add.l   D0,A1               ;8
.bmu44b     dbf     D1,.bmu44a          ;10
            swap    D1              ;D0<15:7> already contain 0
            move.w  D1,D0           ;D0 = remainder
            movem.l (sp)+,D2-D7/A2-A6
.bmup8      move.w  D0,D1           ;D1<2:0> = #bytes left later
            lsr.l   #3,D0           ;divide by 8
            bra     .bmu8b
.bmu8a      move.l  (A0)+,(A1)+     ;20     total 50/8bytes
            move.l  (A0)+,(A1)+     ;20     = 6.25 cycles/byte
.bmu8b      dbf     D0,.bmu8a       ;10
            sub.l   #$10000,D0
            bcc     .bmu8a
            move.w  D1,D0           ;D0 = 0 to 7 bytes
            and.l   #7,D0
            bne     .bmup1
            rts
.bmu1a      move.b  (A0)+,(A1)+
.bmup1
.bmu1b      dbf     D0,.bmu1a
            sub.l   #$10000,D0
            bcc     .bmu1a
.bmend      rts
!Funky!Stuff!
fi  # end of overwriting check
echo shar: "extracting 'bset.asm'" '(1702 characters)'
if test -f 'bset.asm'
then
    echo shar: "will not over-write existing file 'bset.asm'"
else
cat << \!Funky!Stuff! > 'bset.asm'

            public  _bzero          ; Zero a block of memory
            public  _bset           ; Set a block of memory to (byte val)

            ;   BSET(buffer, len, byte)
            ;   BZERO(buffer, len)
            ;
            ;   char *buffer;
            ;   long len;
            ;   long byte;  (must be passed as a long though only a byte)

            public  _bset
            public  _bzero

_bzero:     moveq.l #0,D1
            bra     .bz0
_bset:      move.b  12+3(sp),D1
.bz0        move.l  4(sp),A0
            move.l  8(sp),D0
            add.l   D0,A0           ; start at end of address
            cmp.l   #40,D0          ; unscientifically chosen
            bls     .bs2
            bra     .bs10
.bs1        move.b  D1,-(A0)        ; any count < 65536
.bs2        dbf     D0,.bs1
            rts

            ; at least 2 bytes in count (D0)

.bs10       movem.l D2-D7/A2-A6,-(sp)   ;any count > 4
            move.l  A0,D2
            btst.l  #0,D2           ; is it aligned?
            beq     .bs22
            move.b  D1,-(A0)        ; no, copy one byte
            subq.l  #1,D0
.bs22       andi.l  #$FF,D1         ; expand data D1.B -> D2-D7/A1-A6
            move.l  D1,D2           ; D1 000000xx   D2 000000xx
            asl.w   #8,D2           ;                  0000xx00
            or.w    D2,D1           ;    0000xxxx
            move.w  D1,D2           ;    0000xxxx      0000xxxx
            swap    D2              ;    0000xxxx      xxxx0000
            or.l    D1,D2           ; D2.L
            move.l  D2,D3
            move.l  D2,D4
            move.l  D2,D5
            move.l  D2,D6
            move.l  D2,D7
            move.l  D2,A1
            move.l  D2,A2
            move.l  D2,A3
            move.l  D2,A4
            move.l  D2,A5
            move.l  D2,A6           ; D2-D7/A1-A6 (12 registers)
            move.l  #12*4,D1        ; bytes per transfer (48)
.bs30       sub.l   D1,D0           ; pre subtract
            bmi     .bs40
.bs31       movem.l D2-D7/A1-A6,-(A0)
            sub.l   D1,D0
            bpl     .bs31
.bs40       add.w   D1,D0           ; less than 48 bytes remaining
            move.w  #4,D1           ; by 4's
            sub.w   D1,D0
            bmi     .bs50
.bs41       move.l  D2,-(A0)
            sub.w   D1,D0
            bpl     .bs41
.bs50       add.w   D1,D0
            bra     .bs52
.bs51       move.b  D2,-(A0)        ; by 1's
.bs52       dbf     D0,.bs51
            movem.l (sp)+,D2-D7/A2-A6
            rts
!Funky!Stuff!
fi  # end of overwriting check
exit 0
#       End of shell archive
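For readers who do not follow 68K assembly, the direction choice that bmov() makes for overlapped moves can be sketched in C. This is only an illustration under assumptions: the name `bmov_sketch` is hypothetical, it copies one byte at a time, and it omits the alignment checks and movem.l fast paths of the posted routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of BMOV(src, dst, len): copy len bytes,
 * picking the copy direction so that overlapping blocks survive. */
void bmov_sketch(const void *src, void *dst, size_t len)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    assert(src != NULL && dst != NULL);

    if ((uintptr_t)d == (uintptr_t)s || len == 0)
        return;                     /* move to self: nothing to do */

    if ((uintptr_t)d < (uintptr_t)s) {
        /* destination below source: ascending copy */
        while (len--)
            *d++ = *s++;
    } else {
        /* destination above source: start at the end, descending copy */
        s += len;
        d += len;
        while (len--)
            *--d = *--s;
    }
}
```

Calling `bmov_sketch(buf, buf + 2, 8)` on a buffer shifts eight bytes up by two without the source being overwritten before it is read, which is the case the descending branch exists for.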
apratt@atari.UUCP (Allan Pratt) (08/17/88)
dillon@cory.berkeley.edu says something about clearing memory being fast
when using movem and lots of registers. He is right, but on a 4-meg
machine it still takes appreciable time. It takes almost a second, which is
a long time if you do it a lot for short utilities loaded off hard disk:

    At 4 clocks per longword to do the memory access, it takes
    4*3.5MB = 13,107,200 clocks, or ~0.82 seconds at 16MHz to
    clear 3.5MB, not counting instruction fetching and looping overhead.
    This also doesn't count the fact that video memory cycles are
    interleaved with processor memory cycles; I don't know what that
    impact will be.

The clearing code in the 11/20 ROMs (pre-Mega ROMs) is stupid and
slow, which is why it takes appreciable time to clear 1 Meg. The
Mega ROMs fixed this, but clearing 3.5 Meg still takes a long time,
and future machines may possibly have up to 10 Meg -- clearing
9.5 Meg will not be pleasant:

    It would take 38,273,024 clocks (~2.4 sec) to clear
    9.5MB at 16MHz, but fortunately any machine with that
    much memory will probably run faster than 16MHz.

These numbers are back-of-the-napkin computations, and I may have some
major flaw in my arithmetic, but .82 seconds sounds right to me.

You can speed up memory clearing on a Mega 4 by installing a 2MB RAMdisk...

============================================
Opinions expressed above do not necessarily    -- Allan Pratt, Atari Corp.
reflect those of Atari Corp. or anyone else.      ...ames!atari!apratt
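Estimates like the ones above depend heavily on the assumed cost per memory access and on the clock rate, so it can help to make those assumptions explicit parameters. The following is a minimal sketch; the function names are hypothetical, it counts only the memory accesses themselves (no instruction fetch, loop overhead, or video contention), and it does not attempt to reproduce the totals posted above.

```c
#include <assert.h>

/* Clocks spent on memory accesses alone when clearing `bytes` bytes,
 * assuming `clocks_per_long` clocks for each 4-byte longword. */
unsigned long clear_clocks(unsigned long bytes, unsigned long clocks_per_long)
{
    return (bytes / 4UL) * clocks_per_long;
}

/* The same estimate expressed in seconds for a CPU clock of clock_hz. */
double clear_seconds(unsigned long bytes, unsigned long clocks_per_long,
                     double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)clear_clocks(bytes, clocks_per_long) / clock_hz;
}
```

For example, `clear_seconds(3670016UL, 4UL, 16e6)` evaluates the 3.5MB case under a literal reading of "4 clocks per longword at 16MHz"; changing any of the three arguments shows how sensitive the answer is to each assumption.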
egisin@watmath.waterloo.edu (Eric Gisin) (08/17/88)
Why do so many people think movem is faster than the
more straight-forward loop of move.l's?

copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
    takes 242 cycles on the 68000,
while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
(timings derived from Motorola's 68000 manual)
when copying fewer words, move.l is even better.

the move.l approach does not require saving and restoring all
your registers, and can be coded in C with decent compilers.
Thomas_E_Zerucha@cup.portal.com (08/18/88)
There is a program called "TopDown" available from Eidco Resources (written by John Eidsvoog) which allocates all the memory at the beginning, causing all the staying programs to load at the bottom, and will speed things up appreciably. It is only $20, but solves the slowness of the Mega 4 loading, and also some minor compatibility problems with some programs that expect to be loaded near the top. Eidco Resources/POBox 4336/N Hollywood CA 91807. It does some other nice things to work around problems (like rez changes). I have no affiliation with Eidco except as a satisfied customer.
dillon@CORY.BERKELEY.EDU (Matt Dillon) (08/18/88)
:copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
:    takes 242 cycles on the 68000,
:while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
:(timings derived from Motorola's 68000 manual)
:when copying fewer words, move.l is even better.
:
:the move.l approach does not require saving and restoring all
:your registers, and can be coded in C with decent compilers.

    * BULL SHIT *

    The Motorola manual states the following for movem.l:

    movem   M->R, long:  (An)+:   12 + 8n clock cycles
    movem   R->M, long:  (An):     8 + 8n clock cycles
    add.l   D0,A1:                 8      clock cycles

    Total clocks to move 12 longwords of data from source to destination
    in an ascending copy is:      220 clock cycles

    ---

    move.l  (A0)+,(A1)+ :         20 clock cycles

    Total clocks to move 12 longwords of data from source to destination
    in an ascending copy is:      240 clock cycles

    move.l  -(A0),-(A1) :         22 clock cycles

    Total clocks to move 12 longwords of data from source to destination
    in a descending copy is:      264 clock cycles

    The comparison is valid because in both cases you still need the outer
    DBF loop.

    NOW WHERE IN THE HELL DID YOU GET *YOUR* INFORMATION ?????

                    -Matt
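The totals quoted in this post follow directly from the per-instruction formulas it cites, and the arithmetic can be checked with a few lines of C. The function names below are made up for illustration; the constants are the ones given in the post, not independently verified against a manual.

```c
#include <assert.h>

/* movem.l (An)+,regs for n registers, as quoted: 12 + 8n clocks. */
unsigned movem_to_regs(unsigned n) { return 12u + 8u * n; }

/* movem.l regs,(An) for n registers, as quoted: 8 + 8n clocks. */
unsigned movem_to_mem(unsigned n) { return 8u + 8u * n; }

/* One movem load, one movem store, plus the 8-clock add.l that
 * advances the destination pointer. */
unsigned movem_copy_clocks(unsigned n)
{
    assert(n >= 1u && n <= 16u);    /* the 68000 has 16 registers */
    return movem_to_regs(n) + movem_to_mem(n) + 8u;
}

/* n separate move.l instructions at per_move clocks each
 * (20 ascending, 22 descending, per the post). */
unsigned movel_copy_clocks(unsigned n, unsigned per_move)
{
    return n * per_move;
}
```

With these figures, `movem_copy_clocks(12)` gives 220 against 240 for twelve ascending move.l instructions and 264 for twelve descending ones. Under the same formulas the two approaches tie at seven longwords (140 clocks each), and movem pulls ahead from eight onward.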
wes@obie.UUCP (Barnacle Wes) (08/21/88)
% The clearing code in the 11/20 ROMs (pre-Mega ROMs) is stupid and
% slow, which is why it takes appreciable time to clear 1 Meg. The
% Mega ROMs fixed this, but clearing 3.5 Meg still takes a long time,
% and future machines may possibly have up to 10 Meg -- clearing
% 9.5 Meg will not be pleasant:
%
%     It would take 38,273,024 clocks (~2.4 sec) to clear
%     9.5MB at 16MHz, but fortunately any machine with that
%     much memory will probably run faster than 16MHz.

Why clear all of the memory in the machine? The executable file has the
size of the bss segment in it, so why not just clear the bss? You could also
clear each Malloced block as it is allocated, if you really want to preserve
the idea that Malloced blocks should be cleared.
--
    {hpda, uwmcsd1}!sp7040!obie!wes
    "Happiness lies in being privileged to work hard for long hours in doing
    whatever you think is worth doing." -- Robert A. Heinlein --
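The suggestion above, clearing only the bss rather than all free memory, can be sketched in C. Everything here is an assumption made for illustration: `struct seg_sizes` is a hypothetical, simplified stand-in for the segment sizes recorded in an executable header (it is not the real GEMDOS header layout), and the loaded image is assumed to be text, then data, then bss, contiguous from `base`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified segment sizes taken from a program header. */
struct seg_sizes {
    size_t text_len;
    size_t data_len;
    size_t bss_len;
};

/* Zero only the bss region of an image loaded at `base` into a block of
 * `avail` bytes. Returns the number of bytes cleared, or 0 if the image
 * would not fit in the block. */
size_t clear_bss_only(unsigned char *base, size_t avail,
                      const struct seg_sizes *h)
{
    size_t bss_off;

    assert(base != NULL && h != NULL);

    bss_off = h->text_len + h->data_len;
    if (bss_off > avail || h->bss_len > avail - bss_off)
        return 0;                   /* does not fit: clear nothing */

    memset(base + bss_off, 0, h->bss_len);
    return h->bss_len;
}
```

The work done is proportional to the program's declared bss size instead of the size of installed RAM, which is the point of the proposal; whether existing programs rely on the rest of the TPA being zero is a separate question this sketch does not address.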
leo@philmds.UUCP (Leo de Wit) (08/22/88)
In article <20383@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes:
>Why do so many people think movem is faster than the
>more straight-forward loop of move.l's?

Because they may be right to do so 8-)

>copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
>    takes 242 cycles on the 68000,
>while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
>(timings derived from Motorola's 68000 manual)

I dunno how you got your timings, but I got the following:

    movem.l (a0)+,d1-d7/a2-a6      112 cycles
    movem.l d1-d7/a2-a6,(a1)       108 cycles
    adda.l  d0,a1                    8 cycles
    ----------------------------------------
    TOTAL                          228 cycles

    move.l  (a0)+,(a1)+             24 cycles
    ----------------------------------------
    TOTAL (12 times)               288 cycles

so that the movem.l construct gains you 60 cycles here: about 25% faster (it
seems you've done a bit of cycle stealing 8-).

These timings are derived by ACTUALLY TIMING them (repeating them a lot of
times and measuring the time taken). I've got a nice little program that
times a series of hexadecimal codes, anyone interested?

B.T.W. I use movem.l because movem.w only moves the lower words of the regs
(after having extended them). The adda.l may just as well be an adda.w because
it is sign extended anyway.

Note that timings on the Motorola cannot be derived from a manual, unless you
know what timing is involved with memory access in your special case (an ST,
now ain't that special). In this it differs from e.g. the Z80 where times
must be synchronous; the bus does not wait (I hope my explanation isn't too
bad, not being a hardware expert and so). On the Motorola bus the driver
is more polite 8-).

>when copying fewer words, move.l is even better.

Of course, i.e. less worse. The gain in the movem approach is just that you
do much less instruction fetching (see above: 3, be it somewhat longer
instructions, against 12). No wonder movem.l is faster.

>the move.l approach does not require saving and restoring all
>your registers, and can be coded in C with decent compilers.

The overhead of saving and restoring registers is very little, because we can
use the fast movem.l instructions for that 8-)!! No, seriously, if you move a
large chunk, this overhead is quickly gained back. And of course, it can be
coded in assembler with decent assemblers. As it can be in C with decent
compilers (although this involves either #asm's or executing data).

    Leo.

P.S. Some people may wonder why my smiley is most of the time a 8-).
This is simply because I wear glasses 8-). And if not so, I forgot to
put 'em on, or I'm just cleaning them :-).
*-) Oops, seems I broke'em :-(
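The measuring approach described in this post (repeat the operation many times and time the whole run) can be sketched as a small C harness. This is not the poster's program, which timed raw opcode sequences on an ST; the names below are hypothetical, `clock()` resolution is coarse, and an optimizing compiler on a modern machine may transform the loops, so any numbers it produces say nothing about 68000 cycle counts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(long *dst, const long *src, size_t nlongs);

/* Plain loop of longword moves, the C analogue of repeated move.l. */
void copy_longs(long *dst, const long *src, size_t nlongs)
{
    while (nlongs--)
        *dst++ = *src++;
}

/* Library block copy, standing in for a tuned routine. */
void copy_lib(long *dst, const long *src, size_t nlongs)
{
    memcpy(dst, src, nlongs * sizeof *dst);
}

/* Run fn `reps` times and return the elapsed CPU time in seconds. */
double time_copy(copy_fn fn, long *dst, const long *src,
                 size_t nlongs, unsigned long reps)
{
    clock_t start, end;
    unsigned long i;

    assert(fn != NULL && reps > 0);

    start = clock();
    for (i = 0; i < reps; i++)
        fn(dst, src, nlongs);
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the result of `time_copy(copy_longs, dst, src, 12, reps)` by `reps` gives a per-call figure that can be compared against `copy_lib` on the same buffers, in the same spirit as the 228 versus 288 cycle measurement above.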
egisin@watmath.waterloo.edu (Eric Gisin) (08/23/88)
In article <8808180442.AA08547@cory.Berkeley.EDU>, dillon@CORY.BERKELEY.EDU (Matt Dillon) writes:
> In an article I write:
> :copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"
> :    takes 242 cycles on the 68000,
> :while 12 successive "move.l (a0)+, (a1)+" takes 240 cycles.
> :(timings derived from Motorola's 68000 manual)
> :when copying fewer words, move.l is even better.
>
>     * BULL SHIT *
>
>     The Motorola manual states the following for movem.l:
>
>     movem   M->R, long:  (An)+:   12 + 8n clock cycles
>     movem   R->M, long:  (An):     8 + 8n clock cycles
>     add.l   D0,A1:                 8      clock cycles
>
>     Total clocks to move 12 longwords of data from source to destination
>     in an ascending copy is:      220 clock cycles
> [asks where I got my information]

Third edition of the Motorola 16-bit microprocessor user's manual.
It is dated 1982. On page 203, movem r->m (long) is 8+10n cycles.
I guess my sources are obsolete.

Anyone know when this changed to 8+8n cycles?
mitch@Stride.COM (Thomas Mitchell) (08/26/88)
In article <20499@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes:
>In article <8808180442.AA08547@cory.Berkeley.EDU>, dillon@CORY.BERKELEY.EDU (Matt Dillon) writes:
>> In an article I write:
>> :copying 12 long words with "movem (a0)+,regs; movem regs,(a1); add.l Rn,a1"

>>     The Motorola manual states the following for movem.l:
>>
>>     movem   M->R, long:  (An)+:   12 + 8n clock cycles
>>     movem   R->M, long:  (An):     8 + 8n clock cycles
>>     add.l   D0,A1:                 8      clock cycles
>>
>>     Total clocks to move 12 longwords of data from source to destination
>>     in an ascending copy is:      220 clock cycles
>
>Third edition of the Motorola 16-bit microprocessor user's manual.
>It is dated 1982. On page 203, movem r->m (long) is 8+10n cycles.
>I guess my sources are obsolete.
>
>Anyone know when this changed to 8+8n cycles?
 ^^^^^^^^^^^^^^^^^^^^^^^^^
This is a better question than it first looked!

From A PRELIMINARY copy, Original Issue: Sept 1, 1979,
of the MC68000 16-bit Microprocessor User's Manual, page D-6:

    movem   M->R, long:  (An)+:   6 + 4n clock cycles**
    movem   R->M, long:  (An):    4 + 4n clock cycles**
    ** internal cycles (1 internal = 2 clock input cycles)

From the third edition, 1982,
of the MC68000 16-bit Microprocessor User's Manual, page 203:

    movem   M->R, long:  (An)+:   12 + 8n clock cycles
    movem   R->M, long:  (An):     8 +10n clock cycles

From the fourth edition, 1984,
of the MC68000 16/32-bit Microprocessor User's Manual, page 214:

    movem   M->R, long:  (An)+:   12 + 8n clock cycles
    movem   R->M, long:  (An):     8 + 8n clock cycles

A Dec 1982 manual on the 68010 and an Oct 1985 manual on the 68000 hold with
movem.l R->M (An): 8 + 8n clock cycles.

mitch
--
Thomas P. Mitchell (mitch@stride1.Stride.COM)
Phone: (702)322-6868  TWX: 910-395-6073  FAX: (702)322-7975
MicroSage Computer Systems Inc.
Opinions expressed are probably mine.
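To see how much the edition matters for the 12-longword copy argued over in this thread, the three sets of figures listed above can be put side by side in C. The enum and function names are hypothetical; the formulas are the ones quoted in this post, with the 1979 preliminary figures doubled because they are given in internal cycles, and the 8-clock pointer add taken from the earlier post.

```c
#include <assert.h>

enum manual_edition { PRELIM_1979, THIRD_1982, FOURTH_1984 };

/* movem.l (An)+,regs for n registers, in input clock cycles. */
unsigned movem_load_clocks(enum manual_edition ed, unsigned n)
{
    if (ed == PRELIM_1979)
        return 2u * (6u + 4u * n);  /* internal cycles x 2 */
    return 12u + 8u * n;
}

/* movem.l regs,(An) for n registers, in input clock cycles. */
unsigned movem_store_clocks(enum manual_edition ed, unsigned n)
{
    switch (ed) {
    case PRELIM_1979: return 2u * (4u + 4u * n);
    case THIRD_1982:  return 8u + 10u * n;
    default:          return 8u + 8u * n;
    }
}

/* Load + store + the 8-clock add.l quoted earlier in the thread. */
unsigned copy_clocks(enum manual_edition ed, unsigned n)
{
    assert(n >= 1u && n <= 16u);
    return movem_load_clocks(ed, n) + movem_store_clocks(ed, n) + 8u;
}
```

For n = 12 this gives 220 clocks under both the 1979 preliminary and the 1984 fourth-edition figures, and 244 under the 1982 third-edition figure. That is close to, but not exactly, the 242 quoted earlier in the thread; the thread does not say where the two-clock difference comes from.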