[comp.sys.apple] Really small question

gt0t+@andrew.cmu.edu (Gregory Ross Thompson) (12/07/89)

  I'm working on a small ML program that does some SHR stuff in bank $00,
just to prep the screen, and stuff like that.  I need to move all this
stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
memory across banks?

  Also, is there an easy way to store with STA into bank E1?

  Pardon my ignorance, but I can only afford one GS ref manual that
doesn't go into any detail, and Toolbox 1 which tells me nothing for
this...

		-Greg T.

dlyons@Apple.COM (David A. Lyons) (12/07/89)

In article <kZTLjCG00WB7Q=4bJa@andrew.cmu.edu> gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>I'm working on a small ML program that does some SHR stuff in bank $00,
>just to prep the screen, and stuff like that.  I need to move all this
>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>memory across banks?
>
>Also, is there an easy way to store with STA into bank E1?

$FE20 is not a supported entry point into ROM (see the GS Firmware
Reference, page 250).  $FE2C is, but it will not move memory across
banks.

I recommend the BlockMove toolbox call, documented in the Memory Manager
chapter of TB Reference, Volume 1.

The $8F opcode is STA $aabbcc; you can store anywhere in addressable
memory with that.  You should be sure to allocate the super-hires
screen using the memory manager (NewHandle) before storing to it (or
start up QuickDraw, which allocates it for you).
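As a small illustration of the $8F opcode mentioned above, here is a sketch of how STA $aabbcc is laid out in memory. The helper name encode_sta_long is made up for this example; the byte order (opcode, then address low byte first, bank byte last) is the standard 65816 absolute long encoding.

```python
def encode_sta_long(address: int) -> bytes:
    """Return the 4-byte machine code for STA $aabbcc (absolute long)."""
    if not 0 <= address <= 0xFFFFFF:
        raise ValueError("address must fit in 24 bits")
    # Opcode $8F, then the 24-bit address: low byte, high byte, bank byte.
    return bytes([0x8F]) + address.to_bytes(3, "little")


# Store to the first byte of the super-hires screen at $E1/2000.
print(encode_sta_long(0xE12000).hex(" "))  # 8f 00 20 e1
```

So a store into bank $E1 costs one extra operand byte compared with a plain absolute STA, and needs no change to the Data Bank register.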
-- 

 --David A. Lyons, Apple Computer, Inc.      |   DAL Systems
   Apple II Developer Technical Support      |   P.O. Box 875
   America Online: Dave Lyons                |   Cupertino, CA 95015-0875
   GEnie: D.LYONS2 or DAVE.LYONS         CompuServe: 72177,3233
   Internet/BITNET:  dlyons@apple.com    UUCP:  ...!ames!apple!dlyons
   
   My opinions are my own, not Apple's.

rnf@shumv1.uucp (Rick Fincher) (12/07/89)

In article <kZTLjCG00WB7Q=4bJa@andrew.cmu.edu> gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>
>  I'm working on a small ML program that does some SHR stuff in bank $00,
>just to prep the screen, and stuff like that.  I need to move all this
>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>memory across banks?
>
>  Also, is there an easy way to store with STA into bank E1?
>

You can move the data by turning shadowing on then LDA and STA each 
word back to its original location.  This puts the data in bank E1
and is faster than the memory moves you were talking about, if you
keep your loop overhead low.

Rick Fincher
rnf@shumv1.ncsu.edu

brianw@microsoft.UUCP (Brian Willoughby) (12/15/89)

rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>>
>>  I'm working on a small ML program that does some SHR stuff in bank $00,
>>just to prep the screen, and stuff like that.  I need to move all this
>>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>>memory across banks?
>>
>>  Also, is there an easy way to store with STA into bank E1?
>
>You can move the data by turning shadowing on then LDA and STA each 
>word back to its original location.  This puts the data in bank E1
>and is faster than the memory moves you were talking about, if you
>keep your loop overhead low.

Hey, this has to be a GS if you are using SHR? (Unless you have a Video
Overlay card)  Why not just use the 24 bit address features of the 65C816?

There are a couple of ways of doing this.  You could reload the Data Bank
register before doing the bank $00 prep, and then the stuff would already be
in bank $E1.  I think you would do LDA #$01 (or #$E1), PHA, PLB, with the
accumulator set to 8 bits.  After selecting a new data bank, code still
executes from the current program bank, and data accesses go to the Data Bank
with the normal 16 bit address supplying the least significant bits.

Also, remember LDA (zp) ?  The 65C816 has LDA [dp] and/or LDA [dp],y
These allow pseudo address registers in the direct page to use 24 bit
addresses, with the LSB first and the MSB in the third byte.  You could use
these indirect pointers to either create the image directly in the alternate
video bank OR you could copy between banks after setting up full 24 bit
pointers.

Thirdly (did I say there were only a couple of ways? shame on me), you could
use the VERY fast MVP instruction, which is as fast as DMA (for a given memory
speed) if you are willing to move <= 64K in one shot.  The MVP instruction
uses all three 16 bit registers (A, X, Y) for length of move, source address
and destination address (NOT respectively, don't trust my memory - look it up).
Since you need to specify the full 24 bit address, MVP has two bytes of
operands: the source Bank and destination Bank.  You would probably use
MVP 00,01
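For anyone who does not have the data sheet handy, the register convention documented by WDC is: A holds the byte count minus one, X the source offset and Y the destination offset. MVN counts the offsets upward from the start of each block; MVP counts downward from the end. The sketch below models that behaviour on a flat bytearray. The function name block_move and the flat memory layout are assumptions made for the example only.

```python
def block_move(mem, src_bank, dst_bank, a, x, y, decrement=False):
    """Model MVN (decrement=False) or MVP (decrement=True).

    mem is a bytearray indexed by 24-bit address.
    a = byte count minus one, x = source offset, y = destination offset.
    Returns the final (a, x, y) register values.
    """
    step = -1 if decrement else 1
    while True:
        mem[(dst_bank << 16) | y] = mem[(src_bank << 16) | x]
        x = (x + step) & 0xFFFF  # offsets wrap inside their own bank
        y = (y + step) & 0xFFFF
        a = (a - 1) & 0xFFFF
        if a == 0xFFFF:          # the move ends when A wraps past zero
            return a, x, y


# Two 64K banks; copy 4 bytes from $00/2000 to $01/2000, MVN style.
mem = bytearray(2 << 16)
mem[0x2000:0x2004] = b"\x11\x22\x33\x44"
regs = block_move(mem, 0x00, 0x01, a=3, x=0x2000, y=0x2000)
print(mem[0x12000:0x12004].hex(), regs)
```

Because A is a 16 bit count minus one, a single instruction can move at most 65536 bytes, which is the "<= 64K in one shot" limit mentioned above.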

Actually I have a 65C802 in my ][ Plus, so I haven't copied between banks.
But I have used a lot of 65C802 specific instructions when I really need
speed.

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

rnf@shumv1.uucp (Rick Fincher) (12/16/89)

In article <9542@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>>gt0t+@andrew.cmu.edu (Gregory Ross Thompson) writes:
>>>
>>>  I'm working on a small ML program that does some SHR stuff in bank $00,
>>>just to prep the screen, and stuff like that.  I need to move all this
>>>stuff into bank $E1 (obviously).  Will the move routine at $FE20 move
>>>memory across banks?
>>>
>>>  Also, is there an easy way to store with STA into bank E1?
>>
>>You can move the data by turning shadowing on then LDA and STA each 
>>word back to its original location.  This puts the data in bank E1
>>and is faster than the memory moves you were talking about, if you
>>keep your loop overhead low.
>
>Hey, this has to be a GS if you are using SHR? (Unless you have a Video
>Overlay card)  Why not just use the 24 bit address features of the 65C816?
>
>There are a couple of ways of doing this.  You could reload the Data Bank
>register before doing the bank $00 prep, and then the stuff would already be
>in bank $E1.  I think you would do LDA #01 (or $E1), PHA, PLB.  After selecting

[several other suggestions follow]

If you write directly into $E1 you do so at 1 MHz.  The MVN instruction is
fast, but because writes to $E1 are slowed to 1 MHz I think it is still
faster to just read a word into a 16 bit register and write it back to the
same location.  No bank boundaries are crossed, so no extra cycles are added
for that, and shadowing lets the hardware do the actual copies.  I think the
Apple guys added up all of the cycles and determined that this was the
fastest way to do this (Matt, Dave?).

Rick Fincher
rnf@shumv1.ncsu.edu

shankar@SRC.Honeywell.COM (Subash Shankar) (12/16/89)

In article <9542@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:

>Thirdly (did I say there were only a couple of ways? shame on me), you could
>use the VERY fast MVP instruction, which is as fast as DMA (for a given memory
>speed) if you are willing to move <= 64K in one shot	 

Is this really true?
MVP takes 7 cycles per byte, and my understanding was that DMA only
takes one cycle per byte (perhaps two since the address and data lines
are shared).  

---
Subash Shankar             Honeywell Systems & Research Center
voice: (612) 782 7558      US Snail: 3660 Technology Dr., Minneapolis, MN 55418
shankar@src.honeywell.com  srcsip!shankar

nicholaA@batman.moravian.EDU (Andy Nicholas) (12/17/89)

In article <9542@microsoft.UUCP>, brianw@microsoft.UUCP (Brian Willoughby) writes:

> Thirdly (did I say there were only a couple of ways? shame on me), you could
> use the VERY fast MVP instruction, which is as fast as DMA (for a given memory
> speed) if you are willing to move <= 64K in one shot. 

I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
is that as fast as DMA which is supposed to be (at least what I've always
been told) 1 cycle per byte moved?

Generally, MVN/MVP is sort of a slow way to do things... or at least that's
what most of the GS graphics gurus will tell you.  :-)

andy

-- 
Andy Nicholas             GEnie, AM-Online: shrinkit
Box 435, Moravian College       CompuServe: 70771,2615
Bethlehem, PA  18018              InterNET: shrinkit@moravian.edu

ericmcg@pro-generic.cts.com (Eric Mcgillicuddy) (12/21/89)

In-Reply-To: message from nicholaA@batman.moravian.EDU

DMA controllers take 3 cycles/word (8 bits, 16, whatever), plus 7+ cycles for
setup.  This surprised me; maybe newer controllers are faster.
BTW, I didn't know the GS used DMA for memory access.  How do you access it,
i.e., where are the control registers mapped?

brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/22/89)

shankar@src.honeywell.com (Subash Shankar) writes:
>MVP takes 7 cycles per byte, and my understanding was that DMA only
>takes one cycle per byte (perhaps two since the address and data lines
>are shared).  

I may have to review the W65C802/816 data sheets, but I thought the 7 cycles
occur ONCE for setting up the MVP instruction.  This includes fetching the
opcode and the two bank bytes as well as a few internal setup cycles.  Then
you have one cycle for each byte access until the move is complete.

Also, concerning cycles per byte moved, you're thinking of single direction DMA
- i.e. peripheral to memory or memory to peripheral.  If you want a memory to
memory DMA, you'll need two accesses per byte, one to read from the source
address and one to write to the destination address.  These two addresses can
be different.

The 6502 memory access cycle only uses (reads or writes) data at the very end
of the cycle.  The final clock edge is used to latch the data into RAM or into
the CPU depending upon the direction of data transfer.  The first half of the
cycle is used for address setup, and the second half is used to allow the data
lines to settle.  Thus the new sharing of the data lines to extend the address
bus to 24 bits does not lengthen the memory access cycle (in fact it is shorter
since the processor is now running at higher clock rates than before).  On a
related note, the Apple ][ only used half of the 1 MHz cycle time for CPU
accesses.  50% of the 1 MHz clock was devoted to video address and video data.
Moral, there is plenty of time - until you get up to 13 MHz, that is.

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/22/89)

rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>brianw@microsoft.UUCP (Brian Willoughby) writes:
>>rnf@shumv1.ncsu.edu (Rick Fincher) writes:
>>>You can move the data by turning shadowing on then LDA and STA each 
>>>word back to its original location.  This puts the data in bank E1
>>>and is faster than the memory moves you were talking about, if you
>>>keep your loop overhead low.
>>
>>There are a couple of ways of doing this.  You could reload the Data Bank
>>register before doing the bank $00 prep, and then the stuff would already be
>>in bank $E1.  I think you would do LDA #01 (or $E1), PHA, PLB. After selecting
>>
>> [mention using MVN instruction]
>
>If you write directly into $E1 you do so at 1mhz.  The mvn instruction is
>fast but because of the way the writes to $E1 are slowed to 1mhz I think
>it is still faster to just read a word into a 16 bit register and write
>it back to the same location.  No bank boundaries are crossed, so no extra cycles
>are added for that, and shadowing lets the hardware do the actual copies.  I
>think the Apple guys added up all of the cycles and determined that this was
>the fastest way to do this (Matt, Dave?).

Nope, nothing comes for free.  Writes (but not reads) to banks $00 or $01 occur
at the same speed as writes to $E0/$E1 as long as shadowing is on (provided
that you are accessing the addresses set aside for video).  The "Apple guys"
only allowed shadowing so that ][+ and //e programs would still function, even
though these programs are unaware that video memory has been moved to $E0/$E1.
Thus, it was a compatibility issue, not a speed issue.  I don't think that
there is a case (for a GS-specific program) where shadowing allows faster
execution times.  For a non-GS program it just wouldn't work without shadowing.
Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be
slowed.

If you still prefer shadowing, then you could save time by causing the MVN
instruction to move back to the same location (source == destination).  A
hand-coded loop will always be slower than MVN, except for cases where a
different kind of move is needed, such as an I/O move where you keep
read/writing the same address from/to a memory buffer.  (i.e.  reading from a
single SCSI port address into a memory buffer.)  Thus, the only limitation of
MVN (or MVP) is that BOTH the source and destination addresses must be
changing.

EXPLANATION:
Only one cycle of any direct write to $E1 is at 1 MHz, the rest of the cycles
for that instruction are at full speed.  This is a limitation because the video
circuitry is using the $E0/E1 RAM banks at 1 MHz for 50% of the time, and the
CPU can only "get in" on regular intervals during the other 50%.  The Mac also
suffers from the same limitation (except for the SE/030 which has dual port
RAM.  OK Apple, when do we see this technology in a ][?).

There is hardware in the GS to "stretch" any cycle which accesses the video
memory, based on the address generated by the CPU.  Fortunately there are two
sets of RAM banks, so it is possible to write to both at the same time with
shadowing on.

Here is the catch: if you have shadowing on, then you are technically writing
into video memory and the CPU still slows down for that cycle.  There is no
magical way of sneaking past this requirement because the whole system must
synchronize to the video memory.  If the hardware didn't wait for the video
write to complete, then there would be a possibility that the CPU would do a
16 bit write at 2.8 MHz to bank $01 with shadowing on, and the second byte
would have nowhere to go because at 2.8 MHz the first byte would not yet be
written to the 1 MHz video memory.

1 MHz clock
  -----------------  actual       -----------------  actual       ------------
  | Video read    | Write byte 1  | Video read    | Write byte 2  | Video read  
---               -----------------               -----------------

2 MHz (I didn't want to try to illustrate 2.8 MHz!)
  ---------       ---------       ---------       ---------       ---------
  | Write | 1     | Write | 2     |       |       |       |       |       |
---       ---------       ---------       ---------       ---------       ----

The first write attempt conflicts with video access to $Ex, and so it is
delayed.  The second write is impossible unless the 2 MHz CPU clock is
stretched to sync up with the 1 MHz video timing.

P.S. Hey Rick, do you remember we met at the NCSU Computing Center back when
you used to work there?  I was attending NCSU at the time and it was my first
exposure to the GS.

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

mek4_ltd@uhura.cc.rochester.edu (Mark Kern) (12/23/89)

In article <10041@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:

>If you still prefer shadowing, then you could save time by causing the MVN
>instruction to move back to the same location (source == destination).  A
>hand-coded loop will always be slower than MVN, except for cases where a
>different kind of move is needed, such as an I/O move where you keep
>read/writing the same address from/to a memory buffer.  (i.e.  reading from a
>single SCSI port address into a memory buffer.)  Thus, the only limitation of
>MVN (or MVP) is that BOTH the source and destination addresses must be
>changing.

        If MVN were the fastest way to move data, GS games would be moving
at half the speed they do now.  MVN takes 7 cycles per byte, not per word.
An unrolled lda/sta loop is slightly faster, taking roughly 12 cycles per
word, depending on the addressing mode used.  MVN might be faster when the
slowdown in writing occurs, but this is something I'm unsure of.
        The way many GS games shuttle memory from bank $01 to $E1 in a hurry
is by mapping the stack onto the SHR screen, setting DP at the SHR, then
PEI'ing the screen to itself, which then gets shadowed over to $E1.  So far,
this is one of the fastest ways to do it.  It is much faster than the MVN
method.
        Info on the slow/fast cycle times for instructions when writing to
Mega II controlled RAM can be found in Apple IIGS Tech Note #70 (Fast
Graphics Hints) and Tech Note #68 (Tips for I/O Expansion Slot Card
Design).
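To put rough numbers on the comparison above, here is a back-of-the-envelope sketch for one full super-hires screen of pixel data (200 lines x 160 bytes = 32,000 bytes). It uses the per-byte and per-word figures quoted in this post plus 6 cycles per PEI, which assumes a page-aligned direct page. It deliberately ignores the stretched 1 MHz write cycles and the overhead of repositioning the stack and direct page, so treat the totals as lower bounds rather than measurements.

```python
SHR_BYTES = 32_000  # 200 scan lines x 160 bytes of pixel data

# Cycles per byte moved, before any slow-side write penalty.
CYCLES_PER_BYTE = {
    "MVN": 7.0,                   # 7 cycles per byte
    "unrolled LDA/STA": 12 / 2,   # roughly 12 cycles per 16 bit word
    "PEI": 6 / 2,                 # 6 cycles per 16 bit word pushed
}


def screen_cycles(method: str) -> int:
    """Idealised cycle count to move one SHR screen with the given method."""
    return int(SHR_BYTES * CYCLES_PER_BYTE[method])


for method in CYCLES_PER_BYTE:
    print(f"{method:18s}{screen_cycles(method):>8d} cycles")
```

Under these assumptions the PEI approach needs less than half the cycles of MVN for the same screen.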

					Mark E. Kern


-- 
=========================================================================
   Mark Edward Kern, mek4_ltd@uhura.cc.rochester.edu  A.Online: Markus
      Quagmire Studios U.S.A. "We not only hear you, we feel you !"
=========================================================================

brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/25/89)

nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
>is that as fast as DMA which is supposed to be (at least what I've always
>been told) 1 cycle per byte moved?

Have you compared the speeds in an actual coding situation?

As soon as I figured out how to assemble 16 bit opcodes using Merlin macros,
the first 16 bit program I wrote to use my new W65C802 was a full HGR screen
move in each of the available methods.  I had an 8 bit move loop, a 16 bit move
loop (which used X and Y as sixteen bit pointers into memory), and a MVN
instruction.  I repeated each move 16 times, so that my slow human perception
could get a handle on how long the process was taking.  Using alternating full
screens of black and white, it was VERY easy to see that MVN was clearly the
fastest.

I coded the fastest 16 bit move I could think of, using LDA 00,X - with X as a
16 bit offset, the actual address was not in the Zero Page, but using the Zero
Page (now Direct Page) addressing mode shaved an extra cycle off of every loop
iteration.

There was no mistaking it, the MVN was just as much an improvement over the 16
bit move loop as the 16 bit move was over the 8 bit move.  This is on a Plus,
but after I got a TransWarp I was faced with the same slow video cycles as the
GS.  Still the MVN method won.

>Generally, MVN/MVP is sort of a slow way to do things... or at least that's
>what most of the GS graphics gurus will tell you.  :-)

Well, for generating graphics screens from multiple smaller images (instead of
moving the entire graphics screen as a single unit), MVN doesn't offer many
advantages.  Then again, neither does the standard DMA move (as if it were
available on an Apple :-).  This is because writing a shape - or a window, or
any object smaller than the width of the graphics screen - to the video memory
is not a simple move with a single start address and length.  What you always
end up with is several shorter moves, one to each individual scan line.  With
moves that are shorter than 40 bytes (using the HGR screen as an example), the
advantages of MVN or MVP are not so great - and besides, there is so much room
for optimization in video routines that the static MVN instruction is just not
flexible enough.  Add to this the consideration that many plotting routines
might need to rotate bits within a byte in order to plot at different
locations, and the MVN becomes even less useful.

I believe that you have *graphics* gurus telling you that MVN/MVP is slow for
*their* purposes, but these instructions are faster than a loop based move
algorithm for simple block moves of large areas of memory.  Do you think that
the Western Design Center engineers had nothing better to do one day than to
create a totally useless instruction?  They could have left these two opcodes
open for future expansion.  The 7 cycles is instruction setup time - the move
occurs at a rate of 1 cycle per byte.

Side note: the video DMA circuitry in the Amiga has a start address, length
AND a scan line pitch value (address difference between two pixels located at
the same X position on the screen).  For the Amiga, moving square areas on the
video screen (like, say, windows) is super fast.  Plus, their bit-blitter does
the bit rotations that make Apple graphics programmers choose hand-coded loops
over block moves.  This is the kind of hardware I'd like to see in the GS!

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

ruzun@pro-sol.cts.com (Roger Uzun) (12/27/89)

In-Reply-To: message from brianw@microsoft.UUCP

I used MVN/MVP in a program I wrote a few years ago for PBI Software called
SoundKeys.  It is pretty good for block moves, but it does take 7 cycles/byte.
The Amiga Blitter is very handy, and the //gs should have had such a device
from the start, IMHO.
-Roger Uzun

usenet@orstcs.CS.ORST.EDU (Usenet programs owner) (12/27/89)

From: throoph@jacobs.CS.ORST.EDU (Henry Throop)

In article <10071@microsoft.UUCP> brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
<nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
<>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
<>is that as fast as DMA which is supposed to be (at least what I've always
<>been told) 1 cycle per byte moved?
<
<Have you compared the speeds in an actual coding situation?
<

[...]
<
<I believe that you have *graphics* gurus telling you that MVN/MVP is slow for
<*their* purposes, but these instructions are faster than a loop based move
<algorithm for simple block moves of large areas of memory.  Do you think that
<the Western Design Center engineers had nothing better to do one day than to
<create a totally useless instruction?  They could have left these two opcodes
<open for future expansion.  The 7 cycles is instruction setup time - the move
<occurs at a rate of 1 cycle per byte.

No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
452100 +/- 50 us, which comes out to (at 1.023 MHz on my GS) 7.06 cycles
per byte.  Considering that there was probably a bit of overhead
at the start or end, and maybe a few interrupts, it looks like 7 to me.
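The arithmetic behind that figure can be checked in a few lines. This sketch assumes the 452100 reading is in microseconds (about 0.45 seconds for the whole bank), which is the only unit that gives a result near 7; the function name is made up for the example.

```python
def cycles_per_byte(elapsed_us: float, clock_hz: float, nbytes: int) -> float:
    """Convert a measured move time into CPU cycles per byte moved."""
    total_cycles = elapsed_us * 1e-6 * clock_hz
    return total_cycles / nbytes


# One full bank (65536 bytes) in 452100 microseconds at 1.023 MHz.
print(round(cycles_per_byte(452_100, 1.023e6, 65_536), 2))  # 7.06
```

A one-time 7 cycle setup followed by 1 cycle per byte would have taken about 0.064 seconds for the same move, so the measurement clearly rules that reading out.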


>Brian Willoughby


---
Henry Throop
Internet: throoph@jacobs.cs.orst.edu

stout@hpscdc.scd.hp.com (Tim Stoutamore) (12/28/89)

[Verbatim repost of Brian Willoughby's 12/25/89 article omitted]

stout@hpscdc.scd.hp.com (Tim Stoutamore) (12/28/89)

Sorry for the inadvertent reposting of Brian's message.  I am still just
learning the notes system.

Inter-memory moves, whether DMA or MVN/MVP, are constrained to at least two
memory cycles.  This is because one cycle is needed to put the source address
on the bus and one cycle is needed to put the destination address on the bus.
The only time that DMA controllers can perform one word per cycle moves is
when the transfer is between memory and I/O.

brianw@microsoft.UUCP (Brian WILLOUGHBY) (12/30/89)

throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes:
>
>brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
><nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
><>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
><>is that as fast as DMA which is supposed to be (at least what I've always
><>been told) 1 cycle per byte moved?
><
><Have you compared the speeds in an actual coding situation?
>
>No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
>452100 +/- 50 us, which comes out to (at 1.023 MHz on my GS) 7.06 cycles
>per byte.  Considering that there was probably a bit of overhead
>at the start or end, and maybe a few interrupts, it looks like 7 to me.

On the Apple, the *average* clock speed is 1.020484 MHz when you consider
that at the end of each video line the final clock cycle is lengthened by two
periods of the 14.31818 MHz clock.  In order to keep the video data in sync
with the phase of the colorburst signal, the Apple can't use a constant
frequency clock.  Pick up the Sams book called "The Apple II Circuit
Description" for more details.
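The 1.020484 MHz average can be reproduced from the master clock. The sketch assumes the usual Apple II line timing: 65 CPU cycles per video line, of which 64 last 14 periods of the 14.31818 MHz master clock and one stretched cycle lasts 16 periods. The names below are for illustration only.

```python
MASTER_HZ = 14.31818e6  # 4 x the NTSC colorburst frequency


def average_cpu_hz(cycles_per_line=65, normal_ticks=14, long_ticks=16):
    """Average CPU clock when one cycle per video line is stretched."""
    ticks_per_line = (cycles_per_line - 1) * normal_ticks + long_ticks
    return MASTER_HZ * cycles_per_line / ticks_per_line


print(round(average_cpu_hz()))  # 1020484
```

Plugging this average into the timing quoted above instead of 1.023 MHz lowers the measured MVN cost only from about 7.06 to about 7.04 cycles per byte, so the conclusion does not change.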

>>Brian Willoughby
>
>Henry Throop

I should have known that I would have to eat my words if I posted before
checking the docs.  Straight from WDC: 7 cycles per byte.  That's the bad news.

The good news is that MVN/MVP is still the fastest *generic* move, where you
have total freedom over length and source and destination addresses.  Any
method of moving data faster than MV* is necessarily limited by either source
address, destination address, or BOTH.  Too bad that WDC hasn't designed a
standard 6502 bus DMA controller chip, yet.

After grabbing the documentation, I looked at how many different ways I could
move the 8192 bytes that make up the hires screen.  What follows is a summary
(hopefully not too boring) of several different approaches to moving data with
the 65C8xx :

move 8192 bytes:
Method                  #cycles  #bytes of code
Stack & Direct Page      65536   19
MVN                      57344   12
Partially Expanded Loop  57047   28
  variation of above     45143   648
Expanded (no loop)       40960   24576

EXPANDED (NO LOOP)

Subash Shankar pointed out that many graphics moves are not looped - they have
a separate instruction for each word moved.  Using 4096 LDA/STA pairs, the
65C8xx can move 8192 bytes in only 40960 cycles, but this code occupies 24576
bytes of memory!  I think that this is the absolute fastest way to move that
many unknown (i.e. non-constant) bytes, without DMA.
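Those two figures follow directly from the instruction timings, as this small sketch shows. It assumes absolute addressing for both LDA and STA (4 cycles each, plus 1 for a 16 bit operand, and 3 bytes of code each); the names are illustrative.

```python
LDA_ABS_WORD_CYCLES = 4 + 1  # LDA a, plus one cycle for a 16 bit load
STA_ABS_WORD_CYCLES = 4 + 1  # STA a, plus one cycle for a 16 bit store
BYTES_PER_PAIR = 3 + 3       # opcode + 16 bit address, twice


def expanded_move(nbytes: int):
    """Return (cycles, code bytes) for a fully unrolled LDA/STA move."""
    pairs = nbytes // 2
    cycles = pairs * (LDA_ABS_WORD_CYCLES + STA_ABS_WORD_CYCLES)
    return cycles, pairs * BYTES_PER_PAIR


print(expanded_move(8192))  # (40960, 24576)
```

That works out to 5 cycles per byte moved, at a cost of 3 bytes of code for every byte of data.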

STACK AND DIRECT PAGE MOVES

Looking at the number of cycles needed for each addressing mode would help:

LDA   Mode      STA
6     (d,x)     6
5 *   (d),y     6
5     (d)       5
4     d,s       4
7     (d,s),y   7
3     d         3
4     d,x       4
6     [d]       6
6     [d],y     6
2     #         -
4 *   a,y       5
4     a         4
4 *   a,x       5
5     al        5
5     al,x      5

Stack instructions:
PHA   3
PLA   4
PEA   5        a
PER   6        pc+a
PEI   6 or 7   (d)

All instructions take an extra cycle to move a word instead of a single byte.
The addressing modes with an asterisk * after the LDA timing take an extra
cycle if you are using 16 bit indexing.  Since you can only move 256 bytes
with 8 bit indexing, this extra cycle will have to be considered.  For example:
LDA a,X takes 6 cycles when reading a word from memory with X set to 16 bits.

The quickest mode is d, or direct page, but you can't move more than 256 bytes
to the direct page.  The modes d,x and d,s are the next fastest, and give a
hint that the stack might be useful in faster moves.  PHA and PLA take just
3 and 4 cycles respectively.

Someone mentioned a graphics hacker who used PEI.  It looks like the fastest
operation would be PEA, which only takes 5 cycles to place a CONSTANT word on
the stack.  If you were plotting a static shape to the screen, and you first
set S to the highest address, then the quickest way to change sequential bytes
would be PEI.  But for each horizontal line, you would need to update the stack
pointer (unless the shape occupied the full width of the screen).

Curiously, using 16 bit index registers does NOT add an extra cycle when
LoaDing the Accumulator from direct page indexed memory.  This actually allows
accesses outside the 256 byte direct page to be faster than the normal absolute
addressing modes.

The fastest short loop I could design using the above knowledge was as follows:

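     lda #Length ;use $1FFE for hires (16 bit A and X, native mode)
     tax
     clc
     adc #Dest+1 ;Dest = $2000; a 16 bit PHA stores at S-1 and S
     tas         ;TCS in WDC mnemonics
     lda #Source
     tad         ;TCD in WDC mnemonics
Loop lda $00,x   ;5 cycles
     pha         ;4
     dex         ;2
     dex         ;2
     bpl Loop    ;3  this limits Length to a maximum of $8000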

This code uses 65536 cycles to move 8192 bytes.  It is too slow because of the
7 cycles of loop overhead to decrement x and loop back again.
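
The cycle total above can be re-derived from the per-instruction comments in
the listing.  This is only a sketch: the function names are made up for
illustration, the 7 cycles/byte figure for MVN is the one used elsewhere in
this thread, and the final not-taken branch is ignored as it is in the text.

```python
# Per-iteration cycle counts, copied from the comments in the listing above.
SHORT_LOOP_CYCLES = {
    "lda $00,x": 5,
    "pha": 4,
    "dex (first)": 2,
    "dex (second)": 2,
    "bpl Loop": 3,
}

MVN_CYCLES_PER_BYTE = 7  # figure quoted elsewhere in the thread


def short_loop_cycles(length_bytes):
    """Total cycles for the short loop, which moves one word per iteration."""
    words = length_bytes // 2
    return words * sum(SHORT_LOOP_CYCLES.values())


def loop_overhead_cycles():
    """Cycles per iteration spent on DEX/DEX/BPL rather than moving data."""
    return sum(v for k, v in SHORT_LOOP_CYCLES.items() if k.startswith(("dex", "bpl")))


def mvn_cycles(length_bytes):
    """Total cycles for an MVN block move of the same length."""
    return length_bytes * MVN_CYCLES_PER_BYTE


if __name__ == "__main__":
    n = 8192
    print("short loop:", short_loop_cycles(n))
    print("MVN:       ", mvn_cycles(n))
    print("overhead per iteration:", loop_overhead_cycles())
```

With 16 cycles per word the loop costs 8 cycles per byte, against 7 for MVN,
which is why the loop overhead has to be spread over more data.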

PARTIALLY EXPANDED LOOP

I figured that there had to be a compromise between the fast non-looped move
which used 25K of memory and the short loop which took longer than MVN.  How
about a very long loop?  This would make the time for loop overhead have a
smaller effect in comparison to the time for actually moving the data.  To
avoid hard-coding BOTH the source AND destination addresses, the direct page
indexed mode could be used for exactly one address, and the Direct Page
Register could be changed to point to the right part of memory just before the
move.  Using 16 bit index registers, I figured the longest stretch of code
would be 256 bytes before the direct page was exhausted and the index register
would need to be changed to access more memory.

Source can be anywhere, but in this example Destination is hard-coded.
You could easily use this same algorithm with a fixed Source and variable Dest
by rewriting it.  N refers to the number of LDA/STA pairs that are repeated
before the loop restarts:

     ldx #Length   ;$2000 - N*2, the offset of the last N-word chunk
     lda #Source
     tad
; Dest is hard-coded as an absolute address
Loop lda $00,x     ;5 cycles
     sta $2000,x   ;6
     lda $02,x
     sta $2002,x
  ...              ;repeat LDA/STA pair for a total of N times
     lda $00+N*2-2,x   ;last pair: byte offset N*2-2 (N words = N*2 bytes)
     sta $2000+N*2-2,x

     txa           ;2 cycles
     sec           ;2
     sbc #LoopSize ;3  LoopSize = N*2
     tax           ;2
     bpl Loop      ;3  this limits Length to $8000 or less

The only choice left is to find N, the number of times to repeat the LDA/STA
pairs to gain cycle time efficiency.  The limiting maximum would be N = 128,
because the direct page is only 256 bytes long, and we are moving word data.

There is a simple formula for the resulting code size and cycle time.

Counting the number of bytes per opcode: Size of code = 5N + 8

Number of cycles = (11N + 12)I
where I = Number of iterations of loop, I'll use 8192 bytes again
I = 4096 words/N
cycles = (11N + 12)*4096/N = 45056 + 49152/N

Using the maximum N = 128,  cycles = 45440, size = 648 bytes

The minimum code size, using MVN as a limit for cycles, can be found as
follows:

cycles = 45056 + 49152/N < MVN = 57344
therefore N > 4.  N = 4 only ties MVN at exactly 57344 cycles, so N has to be
5 or greater to yield a loop that is faster than MVN.
Using N = 5,  cycles = about 54886, size = 33 bytes
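
The size and cycle formulas above are easy to evaluate for any N.  The sketch
below does that with exact fractions; the function names are hypothetical,
and it idealizes the case where N does not divide the word count evenly (a
real loop would need a remainder pass).

```python
from fractions import Fraction

MVN_CYCLES_PER_BYTE = 7  # figure used in the thread


def code_size(n):
    """Bytes of code: N pairs of LDA d,x (2) + STA a,x (3), plus 8 of loop overhead."""
    return 5 * n + 8


def loop_cycles(n, total_bytes=8192):
    """Cycles = (11N + 12) * I, with I = words / N iterations (kept exact)."""
    iterations = Fraction(total_bytes // 2, n)
    return (11 * n + 12) * iterations


def smallest_n_beating_mvn(total_bytes=8192, n_max=128):
    """Smallest N whose cycle count is strictly below MVN's, or None."""
    limit = MVN_CYCLES_PER_BYTE * total_bytes
    for n in range(1, n_max + 1):
        if loop_cycles(n, total_bytes) < limit:
            return n
    return None


if __name__ == "__main__":
    for n in (4, 5, 8, 128):
        print(n, code_size(n), float(loop_cycles(n)))
    print("smallest N faster than MVN:", smallest_n_beating_mvn())
```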

What does this mean?  I've just proven to myself that you can write a rather
limited move loop that is faster than MVN, and only takes slightly more than
twice the code.  But it is not nearly as flexible.  In other words, you
couldn't use this algorithm in a Memory Manager subroutine of the Operating
System.

I hope that a few of these coding algorithms prove useful to someone else.

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

dlyons@Apple.COM (David A. Lyons) (01/03/90)

In article <10041@microsoft.UUCP> brianw@microsoft.UUCP (Brian Willoughby) writes:
>[...]  Writes (but not reads) to banks $00 or $01 occur
>at the same speed as writes to $E0/$E1 as long as shadowing is on (provided
>that you are accessing the addresses set aside for video).  The "Apple guys"
>only allowed shadowing so that ][+ and //e programs would still function, even
>though these programs are unaware that video memory has been moved to $E0/$E1.
>Thus, it was a compatibility issue, not a speed issue.  I don't think that
>there is a case (for a GS-specific program) where shadowing allows faster
>execution times.  For a non-GS program it just wouldn't work without shadowing.

The first part of that is the key:  reads from 0 and 1 are fast.  Consider
scrolling the text screen, for example:  the reads and writes are all to
banks 0 and 1, so the scrolling is *faster* than if the reads and writes
were to banks $E0 and $E1--so shadowing *is* partially a speed issue.

>Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be
>slowed.

I read that wrong on my first try--to clarify, access to banks $E0 and
$E1 is always slow, but access to non-shadowed areas of banks 0 and 1
is fast, and all reads from 0 and 1 are fast.
-- 

 --David A. Lyons, Apple Computer, Inc.      |   DAL Systems
   Apple II Developer Technical Support      |   P.O. Box 875
   America Online: Dave Lyons                |   Cupertino, CA 95015-0875
   GEnie: D.LYONS2 or DAVE.LYONS         CompuServe: 72177,3233
   Internet/BITNET:  dlyons@apple.com    UUCP:  ...!ames!apple!dlyons
   
   My opinions are my own, not Apple's.

brianw@microsoft.UUCP (Brian WILLOUGHBY) (01/04/90)

In article <37569@apple.Apple.COM> dlyons@Apple.COM (David A. Lyons) writes:
>The first part of that is the key:  reads from 0 and 1 are fast.  Consider
>scrolling the text screen, for example:  the reads and writes are all to
>banks 0 and 1, so the scrolling is *faster* than if the reads and writes
>were to banks $E0 and $E1--so shadowing *is* partially a speed issue.

Quite true.  In fact, the TransWarp on my II Plus takes advantage of similar
shadowing because it only slows to 1 MHz when *writing* to video memory and
then also shadows the data to the 48K RAM to update the screen.  Reads (except
to slot memory) are always at 3.58 MHz.

>brianw@microsoft.UUCP (Brian Willoughby) writes:
>>Fortunately, shadowing doesn't cause writes OUTSIDE of the video areas to be
>>slowed.
>
>I read that wrong on my first try--to clarify, access to banks $E0 and
>$E1 is always slow, but access to non-shadowed areas of banks 0 and 1
>is fast, and all reads from 0 and 1 are fast.

I seem to have trouble with my wording.  What I should have said was that a
write to Bank $00 or $01 at any address that is not in a video area is not
slowed, because these writes do not need to be synched to the real video
memory.  I was trying to make the point that shadowing slows down some, but
not all, writes to the first two Banks.  If you didn't need video (don't ask
me why), you could treat the entire first two banks as contiguous RAM memory.
Then, with shadowing turned *off*, *all* of the accesses would be full speed.
Shadowing, therefore, reduces performance in certain cases (but admittedly,
these are rare cases).
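
The rules discussed in this exchange can be written down as a small
predicate.  This is a rough sketch only: the list of shadowed ranges is a
simplified assumption for illustration (it is not taken from this thread),
and I/O space and other special cases are ignored.

```python
# Assumed, simplified shadowed video ranges: (bank, first address, last address).
ASSUMED_SHADOWED = [
    (0x00, 0x0400, 0x07FF),  # text page 1 (assumption)
    (0x00, 0x2000, 0x3FFF),  # hi-res page 1 (assumption)
    (0x01, 0x2000, 0x9FFF),  # super hi-res area (assumption)
]


def is_slow(bank, addr, is_write, shadowing_on=True, shadowed=ASSUMED_SHADOWED):
    """Return True if the access runs at slow speed under the rules above.

    - Any access to banks $E0/$E1 is slow.
    - Reads from banks $00/$01 are always fast.
    - Writes to banks $00/$01 are slow only when shadowing is on and the
      address falls inside a shadowed video area.
    """
    if bank in (0xE0, 0xE1):
        return True
    if bank in (0x00, 0x01) and is_write and shadowing_on:
        return any(b == bank and lo <= addr <= hi for b, lo, hi in shadowed)
    return False


if __name__ == "__main__":
    print(is_slow(0x01, 0x2000, is_write=True))                      # shadowed write
    print(is_slow(0x01, 0x2000, is_write=False))                     # read
    print(is_slow(0x01, 0x2000, is_write=True, shadowing_on=False))  # shadowing off
```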

Brian Willoughby
UUCP:           ...!{tikal, sun, uunet, elwood}!microsoft!brianw
InterNet:       microsoft!brianw@uunet.UU.NET
  or:           microsoft!brianw@Sun.COM
Bitnet          brianw@microsoft.UUCP

kadickey@phoenix.Princeton.EDU (Kent Andrew Dickey) (01/08/90)

In article <10100@microsoft.UUCP> brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
>throoph@jacobs.CS.ORST.EDU.UUCP (Henry Throop) writes:
>>
>>brianw@microsoft.UUCP (Brian WILLOUGHBY) writes:
>><nicholaA@batman.moravian.EDU (Andy Nicholas) writes:
>><>I thought the cycle times on MVN/MVP were 7 cycles per byte moved.  How
>><>is that as fast as DMA which is supposed to be (at least what I've always
>><>been told) 1 cycle per byte moved?
>><
>><Have you compared the speeds in an actual coding situation?
>>
>>No, it's 7 cycles per byte moved.  I timed an MVN moving one bank (64K) at
>>452100 +/- 50 us, which comes out to (at 1.023 MHz on my GS) 7.06 cycles
>>per byte.  Considering that there was probably a bit of overhead
>>at the start or end, and maybe a few interrupts, it looks like 7 to me.
>
>On the Apple, the *average* clock speed is 1.020484 MHz when you consider
>that at the end of each video line the final clock cycle is lengthened by two
>periods of the 14.31818 MHz clock.  In order to keep the video data in sync with
>the phase of the colorburst signal, the Apple can't use a constant frequency
>clock.  Pick up the SAMs manual called "The Apple II Circuit Description" for
>more details.
>
>>>Brian Willoughby
>>
>>Henry Throop

First, Woz himself wrote an article in Byte on how to calculate e to
36,000 decimal places on an Apple II--and he gave the effective clock
speed of the Apple II as .99 MHz or so (I don't have the issue handy,
but I remember it clearly being under 1.0).  1.00 is fairly accurate to
use for most purposes.

But, as to fast memory moves, PEI is the answer.  If you want to move
from bank $05 to bank $07, MVN and MVP are the handiest way to go (not
the best, but the speed improvement is not that much).

But, in my LONG article in my SHR.DEMO.SHK archive, I explain why PEI is
so fast for screen moves, and fast for general memory moves too.

PEI is a new 65816 instruction--PEI $00 pushes the two bytes at $00 and
$01 onto the stack.  Therefore, by moving the direct page around we can
change the source address, and moving the stack pointer changes the
destination address--but we are always stuck in bank $00.

No problem--//e technology to the rescue. For the //e to access its
auxiliary memory area, there are softswitches to set which basically
swap out the low 48K area of memory for the auxiliary memory.  But, we
can set the switches so that we READ from bank $00 and WRITE to bank
$01.  Then, set shadowing on, and our write to bank $01 will show up on
the SHR screen.

And here's the key point--PEI moves 2 bytes in 6 cycles--that's 3 cycles
per byte.  More than twice as fast as MVN and MVP.

And, more subtly (and described in more detail in the archive file), PEI
times better with the slow memory write.  The processor has to wait up
to 2.5 cycles for it to synchronize up with the slow video memory.  It
then takes 2.5 cycles (fast cycles) to write to this slow memory.  PEI
just so happens to execute its writes and reads in such a way that the
synchronization time is nearly zero.  That is, the 6 cycles it would
normally take to operate are 2 cycles for the actual memory write, and 4
cycles for other stuff.  Those 2 write cycles would be expanded to 5
cycles to write to slow memory.  But my timings show that a PEI memory
move to slow memory can occur as fast as 10 cycles/2 bytes.  That means
an average of just .5 cycle is wasted to synch up with slow memory.

MVN and MVP on the other hand, each take an average of 1 fast cycle to
synch up--so MVN to slow memory would occur at 6+2.5+1 = 9.5
cycles/byte.  Still almost twice as slow as PEI.
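
The slow-memory arithmetic above can be laid out as a small calculation.
This is a sketch using only the figures quoted in this post (2.5 fast cycles
per slow write, an average sync penalty of 0.5 cycle per write for PEI and
1 cycle for MVN); the function and parameter names are made up.

```python
SLOW_WRITE_CYCLES = 2.5  # fast cycles needed for one write to slow memory


def slow_move_cycles(other_cycles, writes, sync_per_write):
    """Fast cycles for one instruction (or one MVN byte) writing to slow memory."""
    return other_cycles + writes * (SLOW_WRITE_CYCLES + sync_per_write)


# PEI: 6 cycles normally, of which 2 are writes and 4 are "other stuff".
pei_per_word = slow_move_cycles(other_cycles=4, writes=2, sync_per_write=0.5)
pei_per_byte = pei_per_word / 2

# MVN: 7 cycles per byte normally, of which 1 is the write and 6 are other.
mvn_per_byte = slow_move_cycles(other_cycles=6, writes=1, sync_per_write=1.0)

if __name__ == "__main__":
    print("PEI:", pei_per_word, "cycles/word,", pei_per_byte, "cycles/byte")
    print("MVN:", mvn_per_byte, "cycles/byte")
    print("MVN/PEI ratio:", mvn_per_byte / pei_per_byte)
```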

For more information on this, either pick up my file somewhere, or send
me mail.

			Kent Dickey
kadickey@phoenix.Princeton.EDU

sb@pro-generic.cts.com (Stephen Brown) (01/14/90)

In-Reply-To: message from kadickey@phoenix.Princeton.EDU

In this message, it is claimed that calling the Apple IIe clock speed 1.00 Mhz
is good enough for most purposes. Well, not really, and certainly not if
you're doing any timing (say timed loops in which you're changing video
modes).

The frequency would be 1.022727 MHz (14.31818 MHz master clock divided by 14) if
life were simple. Life is not simple. One cycle out of every 65 cycles is
longer than the rest.  The long cycle frequency is 0.89488625 Mhz, bringing
the composite frequency to 1.02048432 Mhz.

If a loop is written in a multiple of 65 cycles, then it will always take the
same time to execute. If not, then the loop time may vary by 140 ns.
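
For anyone who wants to check these figures, here is a short sketch of the
arithmetic.  It assumes 64 normal cycles of 14 master-clock periods plus one
long cycle of 16 periods per 65-cycle line, which is what the frequencies
above imply.  The 14.25 MHz PAL master clock is purely an assumption, picked
because it reproduces the 1.015625 MHz figure mentioned below.

```python
NTSC_MASTER_HZ = 14_318_180         # 14.31818 MHz master clock
PAL_MASTER_ASSUMED_HZ = 14_250_000  # assumption, not stated in this post


def composite_hz(master_hz, cycles_per_line=65, normal_periods=14, long_periods=16):
    """Average CPU clock when one cycle per line is stretched."""
    periods_per_line = (cycles_per_line - 1) * normal_periods + long_periods
    return master_hz * cycles_per_line / periods_per_line


def stretch_ns(master_hz, normal_periods=14, long_periods=16):
    """How much longer the long cycle is than a normal one, in nanoseconds."""
    return (long_periods - normal_periods) / master_hz * 1e9


if __name__ == "__main__":
    print("normal cycle:", NTSC_MASTER_HZ / 14, "Hz")
    print("long cycle:  ", NTSC_MASTER_HZ / 16, "Hz")
    print("composite:   ", composite_hz(NTSC_MASTER_HZ), "Hz")
    print("stretch:     ", stretch_ns(NTSC_MASTER_HZ), "ns")
    print("PAL (assumed master clock):", composite_hz(PAL_MASTER_ASSUMED_HZ), "Hz")
```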

PAL (phase alternating line) or European Apple II's run at a slightly
different frequency because PAL video has more scan lines per frame and
fewer frames per second. I believe PAL motherboards' composite frequency is
1.015625 MHz.

Excuse my sloppiness for significant figures!

UUCP: crash!pro-generic!sb
ARPA: crash!pro-generic!sb@nosc.mil
INET: sb@pro-generic.cts.com