gt0t+@andrew.cmu.edu (Gregory Ross Thompson) (12/07/89)
I'm working on a small ML program that does some SHR stuff in bank $00, just to prep the screen, and stuff like that.  I need to move all this stuff into bank $E1 (obviously).  Will the move routine at $FE20 move memory across banks?

Also, is there an easy way to store with STA into bank $E1?  Pardon my ignorance, but I can only afford one GS ref manual, which doesn't go into any detail, and Toolbox 1, which tells me nothing about this...

			-Greg T.
dlyons@Apple.COM (David A. Lyons) (12/07/89)
In article <kZTLjCG00WB7Q=4bJa@andrew.cmu.edu> gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>I'm working on a small ML program that does some SHR stuff in bank $00,
>just to prep the screen, and stuff like that.  I need to move all this
>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>memory across banks?
>
>Also, is there an easy way to store with STA into bank E1?

$FE20 is not a supported entry point into ROM (see the GS Firmware Reference, page 250).  $FE2C is, but it will not move memory across banks.  I recommend the BlockMove toolbox call, documented in the Memory Manager chapter of TB Reference, Volume 1.

The $8F opcode is STA $aabbcc; you can store anywhere in addressable memory with that.

You should be sure to allocate the super-hires screen using the memory manager (NewHandle) before storing to it (or start up QuickDraw, which allocates it for you).
-- 
--David A. Lyons, Apple Computer, Inc.   |   DAL Systems
  Apple II Developer Technical Support   |   P.O. Box 875
  America Online: Dave Lyons             |   Cupertino, CA 95015-0875
  GEnie: D.LYONS2 or DAVE.LYONS      CompuServe: 72177,3233
  Internet/BITNET: dlyons@apple.com  UUCP: ...!ames!apple!dlyons

  My opinions are my own, not Apple's.
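[For concreteness, the $8F suggestion as a hedged assembler sketch; the target address and data value here are made up for illustration, and some assemblers spell the long-store mnemonic STAL:]

```asm
        sep #$20        ;8-bit accumulator
        lda #$FF
        sta $E12000     ;opcode $8F: 24-bit "long" STA into bank $E1
        rep #$20        ;back to 16-bit
```

The indexed form (opcode $9F, STA long,X) works the same way if you need to step through the screen.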
rnf@shumv1.uucp (Rick Fincher) (12/07/89)
In article <kZTLjCG00WB7Q=4bJa@andrew.cmu.edu> gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>
> I'm working on a small ML program that does some SHR stuff in bank $00,
>just to prep the screen, and stuff like that.  I need to move all this
>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>memory across banks?
>
> Also, is there an easy way to store with STA into bank E1?
>

You can move the data by turning shadowing on, then LDA and STA each word back to its original location.  This puts the data in bank $E1 and is faster than the memory moves you were talking about, if you keep your loop overhead low.

Rick Fincher
rnf@shumv1.ncsu.edu
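[A rough sketch of this approach.  Hedged: the Shadow register bit is from memory, so verify it against the Hardware Reference, and note that SHR shadowing watches bank $01, so the image has to have been built there:]

```asm
        sep #$20
        lda $C035       ;Shadow register
        and #$F7        ;clear bit 3 = enable super hi-res shadowing
        sta $C035
        rep #$30        ;16-bit A, X, Y
        ldx #$7FFE      ;$8000 bytes of SHR data at $01/2000-$9FFF
Loop    lda $012000,x   ;read a word from bank $01...
        sta $012000,x   ;...and write it back; hardware echoes it to $E1
        dex
        dex
        bpl Loop
```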
brianw@microsoft.UUCP (Brian Willoughby) (12/15/89)
rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>>
>> I'm working on a small ML program that does some SHR stuff in bank $00,
>>just to prep the screen, and stuff like that.  I need to move all this
>>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>>memory across banks?
>>
>> Also, is there an easy way to store with STA into bank E1?
>
>You can move the data by turning shadowing on then LDA and STA each
>word back to its original location.  This puts the data in bank E1
>and is faster than the memory moves you were talking about, if you
>keep your loop overhead low.

Hey, this has to be a GS if you are using SHR, right?  (Unless you have a Video Overlay card.)  Why not just use the 24-bit address features of the 65C816?

There are a couple of ways of doing this.  You could reload the Data Bank register before doing the bank $00 prep, and then the stuff would already be in bank $E1.  I think you would do LDA #$01 (or #$E1), PHA, PLB.  After selecting a new data bank, code still executes from the current bank, and data accesses go to the Data Bank, with the normal 16-bit address supplying the least significant bits.

Also, remember LDA (zp)?  The 65C816 has LDA [dp] and LDA [dp],y.  These allow pseudo address registers in the direct page to use 24-bit addresses, with the LSB first and the MSB in the third byte.  You could use these indirect pointers either to create the image directly in the alternate video bank, OR to copy between banks after setting up full 24-bit pointers.

Thirdly (did I say there were only a couple of ways?  shame on me), you could use the VERY fast MVP instruction, which is as fast as DMA (for a given memory speed) if you are willing to move <= 64K in one shot.  The MVP instruction uses all three 16-bit registers (A, X, Y) for length of move, source address, and destination address (NOT respectively; don't trust my memory - look it up).
Since you need to specify the full 24-bit address, MVP has two bytes of operands: the source bank and destination bank.  You would probably use MVP 00,01.

Actually, I have a 65C802 in my ][ Plus, so I haven't copied between banks.  But I have used a lot of 65C802-specific instructions when I really need speed.

Brian Willoughby
UUCP:		...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:	microsoft!brianw@uunet.UU.NET
  or:		microsoft!brianw@Sun.COM
Bitnet		brianw@microsoft.UUCP
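[Hedged sketches of the three approaches described above, in roughly Merlin syntax.  The addresses and the direct-page offset $80 are illustrative only, and the MVN/MVP operand order varies between assemblers, so check yours:]

```asm
* 1. Change the data bank register, then use plain absolute stores:
        sep #$20
        lda #$E1
        pha
        plb             ;DBR = $E1; STA $2000 now writes to $E1/2000
        rep #$20

* 2. A 24-bit pointer in the direct page ([dp] addressing):
        lda #$2000
        sta $80         ;low 16 bits of the pointer
        sep #$20
        lda #$E1
        sta $82         ;bank byte
        rep #$20
        ldy #$0000
        lda [$80],y     ;loads from $E1/2000

* 3. A block move with MVN:
        ldx #$2000      ;X = source address
        ldy #$2000      ;Y = destination address
        lda #$7FFF      ;A = byte count minus 1 ($8000 bytes)
        mvn $00,$E1     ;source bank $00, destination bank $E1
```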
rnf@shumv1.uucp (Rick Fincher) (12/16/89)
In article <9542@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>>You can move the data by turning shadowing on then LDA and STA each
>>word back to its original location.  This puts the data in bank E1
>>and is faster than the memory moves you were talking about, if you
>>keep your loop overhead low.
>
>Hey, this has to be a GS if you are using SHR?  (Unless you have a Video
>Overlay card)  Why not just use the 24 bit address features of the 65C816?
>
>There are a couple of ways of doing this.  You could reload the Data Bank
>register before doing the bank $00 prep, and then the stuff would already be
>in bank $E1.  I think you would do LDA #01 (or $E1), PHA, PLB.  After selecting
>
> [several other suggestions follow]

If you write directly into $E1 you do so at 1 MHz.  The MVN instruction is fast, but because writes to $E1 are slowed to 1 MHz, I think it is still faster to just read a word into a 16-bit register and write it back to the same location.  No bank boundaries are crossed, so no extra cycles are added for that, and shadowing lets the hardware do the actual copies.  I think the Apple guys added up all of the cycles and determined that this was the fastest way to do this (Matt, Dave?).

Rick Fincher
rnf@shumv1.ncsu.edu
shankar@SRC.Honeywell.COM (Subash Shankar) (12/16/89)
In article <9542@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>Thirdly (did I say there were only a couple of ways? shame on me), you could
>use the VERY fast MVP instruction, which is as fast as DMA (for a given memory
>speed) if you are willing to move <= 64K in one shot

Is this really true?  MVP takes 7 cycles per byte, and my understanding was that DMA takes only one cycle per byte (perhaps two, since the address and data lines are shared).

---
Subash Shankar             Honeywell Systems & Research Center
voice: (612) 782 7558      US Snail: 3660 Technology Dr., Minneapolis, MN 55418
shankar@src.honeywell.com  srcsip!shankar
nicholaA@batman.moravian.EDU (Andy Nicholas) (12/17/89)
In article <9542@microsoft.UUCP>, brianw@microsoft.UUCP (Brian Willoughby) writes:
> Thirdly (did I say there were only a couple of ways? shame on me), you could
> use the VERY fast MVP instruction, which is as fast as DMA (for a given memory
> speed) if you are willing to move <= 64K in one shot.

I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How is that as fast as DMA, which is supposed to be (at least what I've always been told) 1 cycle per byte moved?

Generally, MVN/MVP is sort of a slow way to do things... or at least that's what most of the GS graphics gurus will tell you.  :-)

andy
-- 
Andy Nicholas                     GEnie, AM-Online: shrinkit
Box 435, Moravian College         CompuServe: 70771,2615
Bethlehem, PA  18018              InterNET: shrinkit@moravian.edu
ericmcg@pro-generic.cts.com (Eric Mcgillicuddy) (12/21/89)
In-Reply-To: message from nicholaA@batman.moravian.EDU

DMA controllers take 3 cycles/word (8 bits, 16, whatever), plus 7+ cycles for setup.  This surprised me; maybe newer controllers are faster.

BTW, I didn't know the GS used DMA for memory access.  How do you access it?  I.e., where are the control registers mapped?
brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/22/89)
shankar@src.honeywell.com (Subash Shankar) writes:
>MVP takes 7 cycles per byte, and my understanding was that DMA only
>takes one cycle per byte (perhaps two since the address and data lines
>are shared).

I may have to review the W65C802/816 data sheets, but I thought the 7 cycles occurs ONCE, for setting up the MVP instruction.  This includes fetching the opcode and the two bank bytes, as well as a few internal setup cycles.  Then you have one cycle for each byte access until the move is complete.

Also, concerning cycles per byte moved, you're thinking of single-direction DMA - i.e. peripheral to memory or memory to peripheral.  If you want memory-to-memory DMA, you'll need two accesses per byte: one to read from the source address and one to write to the destination address.  These two addresses can be different.

The 6502 memory access cycle only uses (reads or writes) data at the very end of the cycle.  The final clock edge is used to latch the data into RAM or into the CPU, depending upon the direction of data transfer.  The first half of the cycle is used for address setup, and the second half is used to allow the data lines to settle.  Thus the new sharing of the data lines to extend the address bus to 24 bits does not lengthen the memory access cycle (in fact it is shorter, since the processor now runs at higher clock rates than before).  On a related note, the Apple ][ only used half of the 1 MHz cycle time for CPU accesses: 50% of the 1 MHz clock was devoted to video address and video data.  Moral: there is plenty of time - until you get up to 13 MHz, that is.

Brian Willoughby
UUCP:		...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:	microsoft!brianw@uunet.UU.NET
  or:		microsoft!brianw@Sun.COM
Bitnet		brianw@microsoft.UUCP
brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/22/89)
rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>brianw@microsoft.UUCP (Brian Willoughby) writes:
>>rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>>>You can move the data by turning shadowing on then LDA and STA each
>>>word back to its original location.  This puts the data in bank E1
>>>and is faster than the memory moves you were talking about, if you
>>>keep your loop overhead low.
>>
>>There are a couple of ways of doing this.  You could reload the Data Bank
>>register before doing the bank $00 prep, and then the stuff would already be
>>in bank $E1.  I think you would do LDA #01 (or $E1), PHA, PLB.
>>
>> [mention of using the MVN instruction]
>
>If you write directly into $E1 you do so at 1 MHz.  The MVN instruction is
>fast, but because writes to $E1 are slowed to 1 MHz, I think it is still
>faster to just read a word into a 16-bit register and write it back to the
>same location.  No bank boundaries are crossed, so no extra cycles are added
>for that, and shadowing lets the hardware do the actual copies.  I think the
>Apple guys added up all of the cycles and determined that this was the
>fastest way to do this (Matt, Dave?).

Nope, nothing comes for free.  Writes (but not reads) to banks $00 or $01 occur at the same speed as writes to $E0/$E1 as long as shadowing is on (provided that you are accessing the addresses set aside for video).  The "Apple guys" only allowed shadowing so that ][+ and //e programs would still function, even though these programs are unaware that video memory has been moved to $E0/$E1.  Thus, it was a compatibility issue, not a speed issue.  I don't think that there is a case (for a GS-specific program) where shadowing allows faster execution times.  For a non-GS program, it just wouldn't work without shadowing.  Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be slowed.
If you still prefer shadowing, then you could save time by having the MVN instruction move the block back to the same location (source == destination).  A hand-coded loop will always be slower than MVN, except for cases where a different kind of move is needed, such as an I/O move where you keep reading/writing the same address from/to a memory buffer (i.e. reading from a single SCSI port address into a memory buffer).  Thus, the only limitation of MVN (or MVP) is that BOTH the source and destination addresses must be changing.

EXPLANATION: Only one cycle of any direct write to $E1 is at 1 MHz; the rest of the cycles for that instruction are at full speed.  This is a limitation because the video circuitry is using the $E0/$E1 RAM banks at 1 MHz for 50% of the time, and the CPU can only "get in" at regular intervals during the other 50%.  (The Mac suffers from the same limitation - except for the SE/30, which has dual-port RAM.  OK Apple, when do we see this technology in a ][?)  There is hardware in the GS to "stretch" any cycle which accesses the video memory, based on the address generated by the CPU.

Fortunately there are two sets of RAM banks, so it is possible to write to both at the same time with shadowing on.  Here is the catch: if you have shadowing on, then you are technically writing into video memory, and the CPU still slows down for that cycle.  There is no magical way of sneaking past this requirement, because the whole system must synchronize to the video memory.  If the hardware didn't wait for the video write to complete, then there would be a possibility that the CPU would do a 16-bit write at 2.8 MHz to bank $01 with shadowing on, and the second byte would have nowhere to go, because at 2.8 MHz the first byte would not yet be written to the 1 MHz video memory.
1 MHz clock:

 |  Video read  | Write byte 1 |  Video read  | Write byte 2 |  Video read ...
                    (actual)                      (actual)

2 MHz clock (I didn't want to try to illustrate 2.8 MHz!):

 | Write 1 | Write 2 |          |          |          | ...

The first write attempt conflicts with video access to $Ex, and so it is delayed.  The second write is impossible unless the 2 MHz CPU clock is stretched to sync up with the 1 MHz video timing.

P.S. Hey Rick, do you remember that we met at the NCSU Computing Center back when you used to work there?  I was attending NCSU at the time, and it was my first exposure to the GS.

Brian Willoughby
UUCP:		...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:	microsoft!brianw@uunet.UU.NET
  or:		microsoft!brianw@Sun.COM
Bitnet		brianw@microsoft.UUCP
mek4_ltd@uhura.cc.rochester.edu (Mark Kern) (12/23/89)
In article <10041@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>If you still prefer shadowing, then you could save time by causing the MVN
>instruction to move back to the same location (source == destination).  A
>hand-coded loop will always be slower than MVN, except for cases where a
>different kind of move is needed, such as an I/O move where you keep
>read/writing the same address from/to a memory buffer.  (i.e. reading from a
>single SCSI port address into a memory buffer.)  Thus, the only limitation of
>MVN (or MVP) is that BOTH the source and destination addresses must be
>changing.

If MVN were the fastest way to move data, GS games would not be moving at half the speed they are now.  MVN takes 7 cycles per byte, not per word.  An unrolled LDA/STA loop is slightly faster, taking roughly 12 cycles per word, depending on the addressing mode used.  MVN might be faster when the write slowdown occurs, but this is something I'm unsure of.

The way many GS games shuttle memory from bank $01 to $E1 in a hurry is by mapping the stack onto the SHR screen, setting DP at the SHR, then PEI'ing the screen to itself, which then gets shadowed over to $E1.  So far, this is one of the fastest ways to do it.  It is much faster than the MVN method.

Info on the slow/fast cycle times for instructions when writing to Mega II-controlled RAM can be found in Apple Tech Note #70 (fast graphics hints) and Tech Note #68 (tips for I/O expansion slot card design).

Mark E. Kern
-- 
=========================================================================
Mark Edward Kern, mek4_ltd@uhura.cc.rochester.edu   A.Online: Markus
Quagmire Studios U.S.A.   "We not only hear you, we feel you !"
=========================================================================
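[The PEI trick described above, in very rough outline.  This is only the shape of the inner loop; the shadowing setup, interrupt handling, and the labels StkSave/LineTop/SrcLine are hypothetical, and Tech Note #70 is the real reference:]

```asm
        php
        sei             ;no interrupts while S points at the screen
        rep #$30
        tsc
        sta StkSave     ;save the real stack pointer
        lda #LineTop    ;point S at the end of one shadowed SHR line
        tcs
        lda #SrcLine    ;point D at the matching source data
        tcd
        pei $9E         ;push words $9E, $9C, ... $00 of the 160-byte line:
        pei $9C         ;each PEI copies one word through the stack,
        pei $9A         ;and shadowing echoes the write into bank $E1
*       (...76 more PEIs, down to...)
        pei $00
        lda StkSave
        tcs             ;restore the stack
        plp
```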
brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/25/89)
nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
>is that as fast as DMA which is supposed to be (at least what I've always
>been told) 1 cycle per byte moved?

Have you compared the speeds in an actual coding situation?  As soon as I figured out how to assemble 16-bit opcodes using Merlin macros, the first 16-bit program I wrote to use my new W65C802 was a full HGR screen move using each of the available methods.  I had an 8-bit move loop, a 16-bit move loop (which used X and Y as sixteen-bit pointers into memory), and an MVN instruction.  I repeated each move 16 times, so that my slow human perception could get a handle on how long the process was taking.  Using alternating full screens of black and white, it was VERY easy to see that MVN was clearly the fastest.

I coded the fastest 16-bit move I could think of, using LDA 00,X - with X as a 16-bit offset, the actual address was not in the Zero Page, but using the Zero Page (now Direct Page) addressing mode shaved an extra cycle off of every loop iteration.  There was no mistaking it: the MVN was just as much an improvement over the 16-bit move loop as the 16-bit move was over the 8-bit move.  This was on a ][ Plus, but after I got a TransWarp I was faced with the same slow video cycles as the GS.  Still the MVN method won.

>Generally, MVN/MVP is sort of a slow way to do things... or at least thats
>what most of the GS graphics gurus will tell you. :-)

Well, for generating graphics screens from multiple smaller images (instead of moving the entire graphics screen as a single unit), MVN doesn't offer many advantages.  Then again, neither does the standard DMA move (as if it were available on an Apple :-).  This is because writing a shape - or a window, or any object smaller than the width of the graphics screen - to the video memory is not a simple move with a single start address and length.
What you always end up with is several shorter moves, one to each individual scan line.  With moves that are shorter than 40 bytes (using the HGR screen as an example), the advantage of MVN or MVP is not so great - and besides, there is so much room for optimization in video routines that the static MVN instruction is just not flexible enough.  Add to this the consideration that many plotting routines might need to rotate bits within a byte in order to plot at different locations, and the MVN becomes even less useful.

I believe that you have *graphics* gurus telling you that MVN/MVP is slow for *their* purposes, but these instructions are faster than a loop-based move algorithm for simple block moves of large areas of memory.  Do you think that the Western Design Center engineers had nothing better to do one day than to create a totally useless instruction?  They could have left these two opcodes open for future expansion.  The 7 cycles is instruction setup time - the move occurs at a rate of 1 cycle per byte.

Side note: the video DMA circuitry in the Amiga has a start address, a length, AND a scan-line pitch value (the address difference between two pixels located at the same X position on the screen).  For the Amiga, moving rectangular areas on the video screen (like, say, windows) is super fast.  Plus, their bit-blitter does the bit rotations that make Apple graphics programmers choose hand-coded loops over block moves.  This is the kind of hardware I'd like to see in the GS!

Brian Willoughby
UUCP:		...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:	microsoft!brianw@uunet.UU.NET
  or:		microsoft!brianw@Sun.COM
Bitnet		brianw@microsoft.UUCP
ruzun@pro-sol.cts.com (Roger Uzun) (12/27/89)
In-Reply-To: message from brianw@microsoft.UUCP

I used MVN/MVP in a program I wrote a few years ago for PBI Software called SoundKeys.  It is pretty good for block moves, but it does take 7 cycles/byte.  The Amiga blitter is very handy, and the //gs should have had such a device from the start, IMHO.

-Roger Uzun
usenet@orstcs.CS.ORST.EDU (Usenet programs owner) (12/27/89)
From: throoph@jacobs.CS.ORST.EDU (Henry Throop)

In article <10071@microsoft.UUCP> brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
<nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
<>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
<>is that as fast as DMA which is supposed to be (at least what I've always
<>been told) 1 cycle per byte moved?
<
<Have you compared the speeds in an actual coding situation?
< [...]
<
<I believe that you have *graphics* gurus telling you that MVN/MVP is slow for
<*their* purposes, but these instructions are faster than a loop based move
<algorithm for simple block moves of large areas of memory.  Do you think that
<the Western Design Center engineers had nothing better to do one day than to
<create a totally useless instruction?  They could have left these two opcodes
<open for future expansion.  The 7 cycles is instruction setup time - the move
<occurs at a rate of 1 cycle per byte.

No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at 452100 +/- 50 microseconds, which comes out to (at 1.023 MHz on my GS) 7.06 cycles per byte: 452100 us * 1.023 cycles/us / 65536 bytes = 7.06.  Considering that there was probably a bit of overhead at the start or end, and maybe a few interrupts, it looks like 7 to me.

---
Henry Throop
Internet: throoph@jacobs.cs.orst.edu
stout@hpscdc.scd.hp.com (Tim Stoutamore) (12/28/89)
Sorry for the inadvertent reposting of Brian's message.  I am still just learning the notes system.

Memory-to-memory moves, whether by DMA or MVN/MVP, are constrained to at least two memory cycles per word.  This is because one cycle is needed to put the source address on the bus and one cycle is needed to put the destination address on the bus.  The only time that DMA controllers can perform one-word-per-cycle moves is when the transfer is between memory and I/O.
brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/30/89)
throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes:
>brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
><Have you compared the speeds in an actual coding situation?
>
>No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
>452100 +/- 50 microseconds, which comes out to (at 1.023 MHz on my GS) 7.06
>cycles per byte.  Considering that there was probably a bit of overhead
>at the start or end, and maybe a few interrupts, it looks like 7 to me.

On the Apple, the *average* clock speed is 1.020484 MHz, when you consider that at the end of each video line the final clock cycle is stretched by two periods of the 14.31818 MHz clock.  In order to keep the video data in sync with the phase of the colorburst signal, the Apple can't use a constant-frequency clock.  Pick up the Sams manual called "The Apple II Circuit Description" for more details.

I should have known that I would have to eat my words if I posted before checking the docs.  Straight from WDC: 7 cycles per byte.  That's the bad news.  The good news is that MVN/MVP is still the fastest *generic* move, where you have total freedom over length, source address, and destination address.  Any method of moving data faster than MV* is necessarily limited in either source address, destination address, or BOTH.  Too bad that WDC hasn't designed a standard 6502-bus DMA controller chip yet.

After grabbing the documentation, I looked at how many different ways I could move the 8192 bytes that make up the hires screen.
What follows is a summary (hopefully not too boring) of several different approaches to moving data with the 65C8xx : move 8192 bytes: Method #cycles #bytes of code Stack & Direct Page 65536 19 MVN 57344 12 Partially Expanded Loop 57047 28 variation of above 45143 648 Expanded (no loop) 40960 24576 EXPANDED (NO LOOP) Subash Shankar pointed out that many graphics moves are not looped - they have a separate instruction for each word moved. Using 4096 LDA/STA pairs, the 65C8xx can moves 8192 bytes in only 40960 cycles, but this code occupies 24576 bytes of memory! I think that this is the absolute fastest way to move that many unknown (i.e. non-constant) bytes, without DMA. STACK AND DIRECT PAGE MOVES Looking at the number of cycles needed for each addressing mode would help: LDA STA PHA 3 6 (d,x) 6 PLA 4 5 * (d),y 6 PEA 5 a 5 (d) 5 PER 6 pc+a 4 d,s 4 PEI 6 or 7 (d) 7 (d,s),y 7 3 d 3 4 d,x 4 6 [d] 6 6 [d],y 6 2 # - 4 * a,y 5 4 a 4 4 * a,x 5 5 al 5 5 al,x 5 All instructions take an extra cycle to move a word instead of a single byte. The addressing moves with an asterisk * after the LDA timing take an extra cycle if you are using 16 bit indexing. Since you can only move 256 bytes with 8 bit indexing, this extra cycle will have to be considered. For example: LDA a,X takes 6 cycles when reading a word from memory with X set to 16 bits. The quickest mode is d, or direct page, but you can't move more than 256 bytes to the direct page. The modes d,x and d,s are the next fastest, and give a hint that the stack might be useful in faster moves. Looking at PHA and PLA show just 3 or 4 cycles respectively. Someone mentioned a graphics hacker who used PEI. It looks like the fastest operation would be PEA, which only takes 5 cycles to place a CONSTANT word on the stack. If you were plotting a static shape to the screen, and you first set S to the highest address, then the quickest way to change sequential bytes would be PEI. 
But for each horizontal line, you would need to update the stack pointer (unless the shape occupied the full width of the screen). Curiously, using 16 bit index registers does NOT add an extra cycle when LoaDing the Accumulator from direct page indexed memory. This actually allows accesses outside the 256 byte direct page to be faster than the normal absolute addressing modes. The fastest short loop I could design using the above knowledge was as follows: lda #Length ;use $1FFE for hires tax clc adc #Dest ;use $2000 tas lda #Source tad Loop lda $00,x ;5 cycles pha ;4 dex ;2 dex ;2 bpl Loop ;3 this limits Length to a maximum of $8000 This code uses 65536 cycles to move 8192 bytes. It is too slow because of the 7 cycles of loop overhead to decrement x and loop back again. PARTIALLY EXPANDED LOOP I figured that there had to be a compromise between the fast non-looped move which used 25K of memory and the short loop which took longer than MVN. How about a very long loop? This would make the time for loop overhead have a smaller effect in comparison to the time for actually moving the data. To avoid hard-coding BOTH the source AND destination addresses, the direct page indexed mode could be used for exactly one address, and the Direct Page Register could be changed to point to the right part of memory just before the move. Using 16 bit index registers, I figured the longest stretch of code would be 256 bytes before the direct page was exhausted and the index register would need to be changed to access more memory. Source can be anywhere, but in this example Destination is hard-coded. You could easily use this same algorithm with a fixed Source and variable Dest by rewriting it. N refers to the number of LDA/STA pairs that are repeated before the loop restarts: ldx #Length ;$1FFE - N*2 lda #Source tad ; Dest is hard-coded as an absolute address Loop lda $00,x ;5 cycles sta $2000,x ;6 lda $02,x sta $2002,x ... 
                        ;repeat LDA/STA pair for a total of N times
       lda  $00+N-2,x
       sta  $2000+N-2,x
       txa              ;2 cycles
       sec              ;2
       sbc  #LoopSize   ;3  (LoopSize = N*2)
       tax              ;2
       bpl  Loop        ;3

  this limits Length to $8000 or less

The only choice left is to find N, the number of times to repeat the
LDA/STA pairs to gain cycle time efficiency.  The limiting maximum would be
N = 128, because the direct page is only 256 bytes long, and we are moving
word data.  There is a simple formula for the resulting code size and cycle
time.  Counting the number of bytes per opcode:

  Size of code = 5N + 8

  Number of cycles = (11N + 12)*I

  where I = number of iterations of the loop.  I'll use 8192 bytes again:
  I = 4096 words/N

  cycles = (11N + 12)*4096/N ~= 44759 + 49152/N

Using the maximum N = 128:  cycles = 45143, size = 648 bytes

The minimum code size, using MVN as a limit for cycles, can be found as
follows:

  cycles = 44759 + 49152/N < MVN = 57344

therefore N > 3.9056.  N has to be a whole number, so any value 4 or
greater yields a loop that is faster than MVN.

Using N = 4:  cycles = 57047, size = 28 bytes

What does this mean?  I've just proven to myself that you can write a
rather limited move loop that is faster than MVN, and only takes slightly
more than twice the code.  But it is not nearly as flexible.  In other
words, you couldn't use this algorithm in a Memory Manager subroutine of
the Operating System.

I hope that a few of these coding algorithms prove useful to someone else.

Brian Willoughby
UUCP:     ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet: microsoft!brianw@uunet.UU.NET
      or: microsoft!brianw@Sun.COM
Bitnet:   brianw@microsoft.UUCP
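The cycle arithmetic in the article above can be replayed in a few lines. The following Python sketch uses the per-instruction cycle counts and the 44759 closed-form constant exactly as given in the post; the function names are illustrative only:

```python
# Replaying the 65C816 move-timing arithmetic from the article above,
# for an 8192-byte (4096-word) move.

def expanded_cycles(words=4096):
    """Hard-coded LDA/STA absolute pairs: 5 + 5 cycles per 16-bit word."""
    return words * (5 + 5)            # also 3 + 3 bytes of code per word

def mvn_cycles(nbytes=8192):
    """MVN moves one byte per 7 cycles."""
    return nbytes * 7

def stack_loop_cycles(words=4096):
    """Short loop: lda d,x (5) + pha (4) + dex (2) + dex (2) + bpl (3)."""
    return words * 16

def partial_loop_cycles(n):
    """Partially expanded loop with N LDA/STA pairs per iteration,
    using the closed form as printed in the article."""
    return 44759 + 49152 // n

def partial_loop_size(n):
    """Code size in bytes: 5 bytes per LDA/STA pair plus 8 of overhead."""
    return 5 * n + 8

if __name__ == "__main__":
    print(expanded_cycles(), mvn_cycles(), stack_loop_cycles())
    print(partial_loop_cycles(128), partial_loop_size(128))   # 45143 648
    print(partial_loop_cycles(4), partial_loop_size(4))       # 57047 28
```

Running it reproduces the summary table: 40960 cycles for the fully expanded move, 57344 for MVN, 65536 for the short stack loop, and the N = 128 / N = 4 figures for the partially expanded loop.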
dlyons@Apple.COM (David A. Lyons) (01/03/90)
In article <10041@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>[...] Writes (but not reads) to banks $00 or $01 occur
>at the same speed as writes to $E0/$E1 as long as shadowing is on (provided
>that you are accessing the addresses set aside for video).  The "Apple guys"
>only allowed shadowing so that ][+ and //e programs would still function, even
>though these programs are unaware that video memory has been moved to $E0/$E1.
>Thus, it was a compatibility issue, not a speed issue.  I don't think that
>there is a case (for a GS-specific program) where shadowing allows faster
>execution times.  For a non-GS program it just wouldn't work without shadowing.

The first part of that is the key: reads from 0 and 1 are fast.  Consider
scrolling the text screen, for example: the reads and writes are all to
banks 0 and 1, so the scrolling is *faster* than if the reads and writes
were to banks $E0 and $E1--so shadowing *is* partially a speed issue.

>Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be
>slowed.

I read that wrong on my first try--to clarify, access to banks $E0 and $E1
is always slow, but access to non-shadowed areas of banks 0 and 1 is fast,
and all reads from 0 and 1 are fast.
-- 
 --David A. Lyons, Apple Computer, Inc.   |   DAL Systems
   Apple II Developer Technical Support   |   P.O. Box 875
   America Online: Dave Lyons             |   Cupertino, CA 95015-0875
   GEnie: D.LYONS2 or DAVE.LYONS      CompuServe: 72177,3233
   Internet/BITNET: dlyons@apple.com  UUCP: ...!ames!apple!dlyons
   My opinions are my own, not Apple's.
brianw@microsoft.UUCP (Brian WILLOUGHBY) (01/04/90)
In article <37569@apple.Apple.COM> dlyons@Apple.COM (David A. Lyons) writes:
>The first part of that is the key: reads from 0 and 1 are fast.  Consider
>scrolling the text screen, for example: the reads and writes are all to
>banks 0 and 1, so the scrolling is *faster* than if the reads and writes
>were to banks $E0 and $E1--so shadowing *is* partially a speed issue.

Quite true.  In fact, the TransWarp on my II Plus takes advantage of
similar shadowing because it only slows to 1 MHz when *writing* to video
memory, and then also shadows the data to the 48K RAM to update the screen.
Reads (except to slot memory) are always at 3.58 MHz.

>brianw@microsoft.UUCP (Brian Willoughby) writes:
>>Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be
>>slowed.
>
>I read that wrong on my first try--to clarify, access to banks $E0 and
>$E1 is always slow, but access to non-shadowed areas of banks 0 and 1
>is fast, and all reads from 0 and 1 are fast.

I seem to have trouble with my wording.  What I should have said was that a
write to bank $00 or $01 at any address that is not in a video area is not
slowed, because these writes do not need to be synched to the real video
memory.  I was trying to make the point that shadowing slows down some, but
not all, writes to the first two banks.

If you didn't need video (don't ask me why), you could treat the entire
first two banks as contiguous RAM.  Then, with shadowing turned *off*,
*all* of the accesses would be full speed.  Shadowing, therefore, reduces
performance in certain cases (but admittedly, these are rare cases).

Brian Willoughby
UUCP:     ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet: microsoft!brianw@uunet.UU.NET
      or: microsoft!brianw@Sun.COM
Bitnet:   brianw@microsoft.UUCP
kadickey@phoenix.Princeton.EDU (Kent Andrew Dickey) (01/08/90)
In article <10100@microsoft.UUCP> brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
>throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes:
>>
>>brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
>><nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
>><>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
>><>is that as fast as DMA which is supposed to be (at least what I've always
>><>been told) 1 cycle per byte moved?
>><
>><Have you compared the speeds in an actual coding situation?
>>
>>No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
>>452100 +/- 50 us, which comes out to (at 1.023 MHz on my gs) 7.06 cycles
>>per byte.  Considering that there was probably a bit of overhead at the
>>start or end, and maybe a few interrupts, it looks like 7 to me.
>
>On the Apple, the *average* clock speed is 1.020484 MHz when you consider
>that at the end of each video line the final clock cycle is shortened by one
>period of the 14.31818 MHz clock.  In order to keep the video data in sync
>with the phase of the colorburst signal, the Apple can't use a constant
>frequency clock.  Pick up the SAMs manual called "The Apple II Circuit
>Description" for more details.
>
>>>Brian Willoughby
>>
>>Henry Throop

First, Woz himself wrote an article in Byte on how to calculate e to 36,000
decimal places on an Apple II--and he gave the effective clock speed of the
Apple II as .99 MHz or so (I don't have the issue handy, but I remember it
clearly being under 1.0).  1.00 is fairly accurate to use for most purposes.

But, as to fast memory moves, PEI is the answer.  If you want to move from
bank $05 to bank $07, MVN and MVP are the handiest way to go (not the best,
but the speed improvement is not that much).  But, as I explain in my LONG
article in the SHR.DEMO.SHK archive, PEI is very fast for screen moves, and
fast for general memory moves too.

PEI is a new 65816 instruction--PEI $00 pushes the two bytes at $00 and $01
onto the stack.
Therefore, by moving the direct page around we can change the source
address, and moving the stack pointer changes the destination address--but
we are always stuck in bank $00.

No problem--//e technology to the rescue.  For the //e to access its
auxiliary memory area, there are softswitches to set which basically swap
out the low 48K area of memory for the auxiliary memory.  But, we can set
the switches so that we READ from bank $00 and WRITE to bank $01.  Then,
set shadowing on, and our writes to bank $01 will show up on the SHR
screen.

And here's the key point--PEI moves 2 bytes in 6 cycles--that's 3 cycles
per byte.  Twice as fast as MVN and MVP.

And, more subtly (and described in more detail in the archive file), PEI
times better with the slow memory write.  The processor has to wait up to
2.5 cycles for it to synchronize with the slow video memory.  It then takes
2.5 cycles (fast cycles) to write to this slow memory.  PEI just so happens
to execute its writes and reads in such a way that the synchronization time
is nearly zero.  That is, of the 6 cycles it would normally take to
operate, 2 cycles are for the actual memory write, and 4 cycles are for
other stuff.  Those 2 write cycles would be expanded to 5 cycles to write
to slow memory.  But my timings show that a PEI memory move to slow memory
can occur as fast as 10 cycles/2 bytes.  That means an average of just .5
cycle is wasted to synch up with slow memory.  MVN and MVP, on the other
hand, each take an average of 1 fast cycle to synch up--so MVN to slow
memory would occur at 6+2.5+1 = 9.5 cycles/byte.  Still almost twice as
slow as PEI.

For more information on this, either pick up my file somewhere, or send me
mail.

					Kent Dickey
kadickey@phoenix.Princeton.EDU
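The PEI-versus-MVN timing argument above is just a few lines of arithmetic. Here is a Python restatement; all cycle figures come from the post, and the split of MVN's 7 cycles into 6 non-write cycles plus one write cycle is the decomposition implied by the "6+2.5+1" line:

```python
# PEI vs. MVN/MVP cycle arithmetic, as argued in the post above.

# To fast memory: PEI pushes a 2-byte word in 6 cycles.
pei_fast = 6 / 2                 # 3.0 cycles per byte
mvn_fast = 7                     # MVN/MVP: 7 cycles per byte

# To slow (shadowed video) memory: PEI was measured at 10 cycles/2 bytes.
pei_slow = 10 / 2                # 5.0 cycles per byte
# MVN: 6 non-write cycles + write stretched to 2.5 + ~1 cycle average sync.
mvn_slow = 6 + 2.5 + 1           # 9.5 cycles per byte

print(pei_fast, mvn_fast)        # fast memory: 3.0 and 7
print(pei_slow, mvn_slow)        # slow memory: 5.0 and 9.5
print(mvn_slow / pei_slow)       # MVN is ~1.9x slower to slow memory
```

The 1.9x ratio to slow memory is what the post summarizes as "almost twice as slow as PEI."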
sb@pro-generic.cts.com (Stephen Brown) (01/14/90)
In-Reply-To: message from kadickey@phoenix.Princeton.EDU

In this message, it is claimed that calling the Apple IIe clock speed 1.00
MHz is good enough for most purposes.  Well, not really, and certainly not
if you're doing any timing (say, timed loops in which you're changing video
modes).

The frequency would be 1.022727 MHz (the 14.31818 MHz master clock divided
by 14) if life were simple.  Life is not simple.  One cycle out of every 65
is longer than the rest.  The long cycle frequency is 0.89488625 MHz,
bringing the composite frequency to 1.02048432 MHz.  If a loop is written
in a multiple of 65 cycles, then it will always take the same time to
execute.  If not, then the loop time may vary by 140 ns.

PAL (phase alternating line), or European, Apple IIs run at a slightly
different frequency because there are a greater number of horizontal scans
and fewer frames.  I believe PAL motherboards' composite frequency is
1.015625 MHz.

Excuse my sloppiness with significant figures!

UUCP: crash!pro-generic!sb
ARPA: crash!pro-generic!sb@nosc.mil
INET: sb@pro-generic.cts.com
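The frequencies quoted above all follow from the 14.31818 MHz master crystal, given the 64-short-cycles-plus-1-long-cycle structure described in the post. A small Python check (the 14-period normal cycle and 16-period long cycle are the standard NTSC Apple II timing assumed by the post):

```python
# NTSC Apple II clock arithmetic, following the post above.
MASTER = 14.31818                          # MHz, master crystal

simple = MASTER / 14                       # ~1.022727 MHz, if every cycle were equal
long_cycle = MASTER / 16                   # the 1-in-65 stretched cycle: 0.89488625 MHz

# 65 CPU cycles span 64*14 + 16 = 912 master-clock periods.
composite = 65 * MASTER / (64 * 14 + 16)   # ~1.02048432 MHz

# The long cycle is 2 master periods longer than a normal one (in ns).
jitter_ns = 2 / MASTER * 1000              # ~140 ns

print(round(simple, 6), long_cycle, round(composite, 8), round(jitter_ns))
```

This reproduces the post's 1.022727 MHz "simple" figure, the 0.89488625 MHz long-cycle frequency, the 1.02048432 MHz composite, and the 140 ns loop-time variation.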