[comp.sys.amiga.tech] Hard disks, DMA vs Non-DMA

GORRIEDE@UREGINA1.BITNET (Dennis Robert Gorrie) (11/15/89)

I know this has been discussed a lot already, but it's still not completely
clear to me.  I would appreciate any information regarding this subject.

The story goes, DMA is faster.  But, as many people point out, it sometimes
is slower than non-DMA, when there is contention for the bus.  Case in point
is the hi-res interlace screen situation where co-processors and your DMA
hard disk device are contending for cycles on the coprocessor bus.

Then someone says a 'proper' DMA device is faster than non-DMA, even for the
situation above.  How is this so?  What is a 'proper' DMA device?

The current solution, for DMA controllers with slow loads during the above
situation, from my understanding, is to limit the DMA transfer to small
block-sized transfers.  It seems that this is basically just like a non-DMA
transfer.

What about other solutions?  Like dynamically allocating DMA transfer sizes
based on device priority, contention, etc.  Isn't that what Bus Mastering
is all about?

+-----------------------------------------------------------------------+
|Dennis Gorrie                 'Chain-Saw Tag...                        |
|GORRIEDE AT UREGINA1.BITNET                    Try It, You'll Like It!'|
+-----------------------------------------------------------------------+

ckp@grebyn.com (Checkpoint Technologies) (11/16/89)

In article <8911150430.AA24506@jade.berkeley.edu> GORRIEDE@UREGINA1.BITNET (Dennis Robert Gorrie) writes:
>
>I know this has been discussed a lot already, but it's still not completely
>clear to me.  I would appreciate any information regarding this subject.
>
>   [The Universal Question (condensed): Which is faster, DMA or non-DMA? Why?]

	All this talk of DMA vs non-DMA *has* to be because of the
Commodore A2090(A) disk controller, which is a DMA device that performs
poorly under high-chip-RAM-contention situations. So it brought DMA into
a bad light.

	First of all, "contention" is when two or more devices want to
use the same bus at the same time. Only one device at a time may use a
given bus, so when several want it, somebody has to wait. Every Amiga
has about 26 devices which want the chip RAM bus; 25 of them are in the
custom chips, and one is the 68000.

	There are many ways in which contention can be resolved, and all
of them involve choosing one device which "wins" the bus. In the Amiga,
there are assigned priorities to chip RAM, and the device with the
highest priority wins. First comes the carefully-programmed cycles which
are never contended (bit plane, DRAM refresh, audio, sprites, floppy
disk), then comes the Copper, then the Blitter. The 68000 is last; it
only gets the chip RAM bus when the custom chips don't want it.
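The fixed-priority scheme just described can be sketched in a few lines of Python (a toy model; device names and granularity are illustrative, and real arbitration happens per bus cycle in hardware):

```python
# Toy model of fixed-priority bus arbitration for the chip RAM bus:
# each cycle, the highest-priority requester wins and everyone else waits.
CHIP_BUS_PRIORITY = ["bitplane", "copper", "blitter", "cpu"]

def arbitrate(requests):
    """Grant one bus cycle to the highest-priority requester, or None."""
    for device in CHIP_BUS_PRIORITY:
        if device in requests:
            return device
    return None
```

With this ordering, the 68000 only wins a cycle when no custom chip wants it, which is exactly the behavior described above.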

	Now, when you plug in a card which performs DMA, it takes its
place in the priority scheme. In fact, bus cards will win over the
CPU, but will lose to the custom chips. This is only for the chip RAM
bus, however; when contending for fast RAM, a DMA device will always win
over the CPU which is the only other contender (unless there are other
DMA devices; then Buster picks a winner based on slot number, I think).

	There are times when the custom chips want the chip RAM bus a
LOT; high res overscanned 4 bit plane with the Blitter and Copper
running, for example, in which case you may not win the bus for a WHOLE
FRAME, which is about a sixtieth of a second. That's a LONG TIME. If
your DMA device is a disk, then the data comes off the disk at the rate
that the disk is spinning. The disk will *not* stop and wait for the bus
to become available. So the DMA device has to be able to do something
else with the data.

	The A2090(A) has a 64 byte FIFO, which is a place to hold 64
bytes of a transfer while waiting for the bus. This is good, but not
good enough, since a whole sector is 512 bytes; at the speed of an ST506
disk, 512 bytes will arrive in less than a thousandth of a second. So
the A2090(A) will *lose* that sector. The device driver is smart enough
to try again, after the disk revolves around again, but this takes
another 60th of a second. This is the real reason that the A2090(A) will
run soooo slooooow when in a bad situation like this.
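The arithmetic behind this is easy to check. A quick sketch, assuming the standard 5 Mbit/s ST506 data rate (the post itself only gives the FIFO and sector sizes):

```python
# Back-of-envelope check of the timing claims above, assuming the
# standard 5 Mbit/s ST506 transfer rate.
ST506_BITS_PER_SEC = 5_000_000

def arrival_time_sec(num_bytes):
    """Seconds for the disk to deliver num_bytes at the ST506 rate."""
    return num_bytes * 8 / ST506_BITS_PER_SEC

sector_time = arrival_time_sec(512)  # a whole sector: ~0.82 ms
fifo_time = arrival_time_sec(64)     # filling the A2090's FIFO: ~102 us
frame_time = 1 / 60                  # one video frame: ~16.7 ms
```

So a 64-byte FIFO fills more than a hundred times faster than a frame-long bus stall lasts, which is why the A2090(A) loses the sector.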

	Now, a properly designed DMA device will be able to handle the
situation where it can't get the bus for a LONG time, and to do this
it'll have a FIFO big enough to handle at least one sector (512 bytes).
More FIFO is better. When it does win the bus, it can then transfer
everything into RAM quickly (DMA's forte is fast transfers). The
Microbotics HardFrame disk controller is designed this way; therefore it
is the fastest available Amiga hard disk controller, and does not have
problems with chip RAM contention.

	Another note: SCSI disk drives have FIFOs built into them, big
enough for a sector. This is good enough to handle the problems of DMA
contention. Commodore's new A2091 DMA controller actually has a
*smaller* FIFO than the A2090(A), and they depend on the SCSI
disk drive's FIFO to keep them out of trouble. There are "handshaking"
wires on a SCSI connector that let the computer tell the SCSI disk
drive how fast the computer can accept data. Why then does the
A2090(A) have contention problems with SCSI disk drives, just like it
has with ST506 drives? Because they botched the A2090's SCSI adapter. It
doesn't properly use the available handshaking to slow the SCSI disk
drive down when its FIFO fills. They have fixed this on the A2091, and
so they don't need such a large FIFO. (Incidentally, the A2091 doesn't
have an ST506 controller.)

	On non-DMA controllers: These devices (like the GVP controller)
move disk drive data to a built-in RAM chip first. This movement
does not need the Amiga's bus at all, and so is never affected by the
chip RAM bus. When the built-in RAM is filled, the 68000 CPU copies
that RAM to the place in system RAM where the data's really needed.

	Now, with this design, the contention problem is solved.
However, it's using your CPU to move the data, and while it's doing that,
it's *not* running your applications. So you may get acceptable IO
performance, but your Amiga's CPU is being penalized. I dislike this,
personally; I think it goes against the Amiga's high-performance IO
philosophies.
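That copy step can be pictured as a trivial loop (a sketch; all names here are illustrative, not any real driver's code):

```python
# Sketch of the non-DMA path described above: the controller has already
# filled its local RAM off-bus; the CPU then moves every word itself.
def cpu_copy(local_ram, system_ram, offset):
    # Each word costs the CPU one read (controller RAM) and one write
    # (system RAM) -- cycles not spent running applications.
    for i, word in enumerate(local_ram):
        system_ram[offset + i] = word
    return len(local_ram)  # words moved by the CPU itself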
 
	Well, I hope this helps everyone in their understanding of the
Amiga and DMA. To sum up: the A2090(A) has given DMA a bad name, which
it does not deserve. The A2090(A) deserves the bad name; DMA is still
the fastest possible way to perform IO, even on the Amiga.
 

swarren@eugene.uucp (Steve Warren) (11/16/89)

In article <8911150430.AA24506@jade.berkeley.edu> GORRIEDE@UREGINA1.BITNET (Dennis Robert Gorrie) writes:
>
>I know this has been discussed a lot already, but it's still not completely
>clear to me.  I would appreciate any information regarding this subject.
>
>The story goes, DMA is faster.  But, as many people point out, it sometimes
>is slower than non-DMA, when there is contention for the bus.  Case in point
>is the hi-res interlace screen situation where co-processors and your DMA
>hard disk device are contending for cycles on the coprocessor bus.

No matter where the request comes from, the same bus cycle will have to
occur to read or write to the chip ram.  The non-DMA transfer to chip
ram will experience the same contention that the DMA transfer would.
If the DMA device has trouble where the CPU doesn't then it just isn't
glued into the bus properly.

>Then someone says a 'proper' DMA device is faster than non-DMA, even for the
>situation above.  How is this so?  What is a 'proper' DMA device?

The only thing I can think of is that maybe the DMA device is held off so
long that it overflows its buffer and has to wait for another disk rotation
to get the rest of the data.  In this case a bigger buffer would have fixed
it.  Otherwise the two devices (DMA & CPU) should both see the same contention
and suffer the same constraints.

>The current solution, for DMA controllers with slow loads during the above
>situation, from my understanding, is to limit the DMA transfer to small
>block-sized transfers.  It seems that this is basically just like a non-DMA
>transfer.

That sounds like a work-around for a too-small buffer on the disk interface
card.

The only time I think you might find non-DMA faster is on 32-bit ram on
a coprocessor (where DMA is 16-bit), and you still take a performance
hit, because you can't use your cpu cycles for other tasks while you are
doing the transfer.  Of course, if the DMA device uses 100% of the bus
bandwidth then you couldn't use those cycles anyway, unless the CPU is
playing in chip-land.

>What about other solutions?  Like dynamically allocating DMA transfer sizes
>based on device priority, contention, etc.  Isn't that what Bus Mastering
>is all about?

How about emergency-dumping into unused fast-ram whenever contention
threatens to overrun the device buffer, then transferring from fast-to-chip
when possible.  This might save a disk rotation, if that is the problem.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM

cmcmanis%pepper@Sun.COM (Chuck McManis) (11/16/89)

Dennis Robert Gorrie writes:
> The story goes, DMA is faster.  But, as many people point out, it sometimes
> is slower than non-DMA, when there is contention for the bus.  Case in point
> is the hi-res interlace screen situation where co-processors and your DMA
> hard disk device are contending for cycles on the coprocessor bus.

DMA == Direct Memory Access. If a peripheral does DMA it will always be faster
than a non-DMA device for one simple reason: the path between the peripheral
and its destination is shorter. Consider the following diagram :

              +---------------+           +-----+            +--------+
	      |               |           |     |            |        |
Device >----->+ I/O Interface +--+    +-->+ CPU +--+    +--->+ Memory |
	      |               |  |    |   |     |  |    |    |        |
              +---------------+  |    |   +-----+  |    |    +--------+
	                         V    ^            V    ^
			/--------+----+------------+----+-----------/
		       /     (1) +----+            +----+          /
		      /      (2) +----------------------+         /
		     /             Computer Backplane            /
		    /-------------------------------------------/
(1) is the path of Non-DMA data, it is read by the CPU from the Peripheral
    interface, then it is written by the CPU into the destination memory 
    address. This involves one Read and one Write cycle on the main bus.

(2) is the path of the DMA data, it is written directly to memory by the 
    peripheral interface. This involves a single write cycle on the main
    bus. 
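The cycle counts for paths (1) and (2) can be written out directly (a simplification that ignores arbitration overhead and wait states, counting one bus cycle per crossing):

```python
# Rough main-bus cycle count for the two paths in the diagram above,
# assuming one bus cycle per word crossing.
def main_bus_cycles(words, dma):
    # Path (2), DMA: one write per word.
    # Path (1), non-DMA: a CPU read plus a CPU write per word.
    return words if dma else 2 * words
```

For a 512-byte (256-word) transfer the non-DMA path needs twice the main-bus cycles of the DMA path, before any contention is even considered.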

Many things compete for the bus, in the Amiga these can be the CPU and 
other peripherals. Further the bus is sometimes blocked by the CPU waiting
on information to come out of "Chip" memory. This is particularly true 
during high overscan situations. For any given set of bus cycles the
DMA device will transfer more data over the bus than the CPU would be
able to move.

Some people have been led to believe that DMA devices are sometimes slow
because the 2090 interface is sometimes slow in transferring data and other
non-DMA drives seem to be faster. This is not due to non-DMA being faster;
rather, it is due to the 2090 being incapable of dealing with bus-unavailable
conditions. Its internal FIFO overflows and it must abort and restart the
entire transfer. This is a bug in the 2090 and you will notice that the 
2091 (and A590) don't have this problem. 

>Then someone says a 'proper' DMA device is faster than non-DMA, even for the
>situation above.  How is this so?  What is a 'proper' DMA device?

For any number of cycles, DMA will be faster than non DMA because it can
transfer data at full bus speeds. A 'proper' DMA device is one that can
operate even when the bus has become unavailable for relatively long 
periods of time. This implies some flow control on the peripheral interface
itself, or sufficient buffering to allow for the maximum bus latency delay.
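The "sufficient buffering" alternative reduces to a one-line sizing rule, sketched here with illustrative numbers (a 5 Mbit/s = 625 KB/s disk, one-frame worst-case latency):

```python
import math

# Sizing rule implied above: with no flow control, the buffer must absorb
# the incoming data stream for the worst-case bus latency.
def min_fifo_bytes(bytes_per_sec, max_latency_sec):
    return math.ceil(bytes_per_sec * max_latency_sec)

needed = min_fifo_bytes(625_000, 1 / 60)  # surviving a frame-long stall
```

A frame-long stall at that data rate needs roughly 10 KB of buffering, far more than one sector, which is why flow control on the peripheral interface is the more practical of the two options.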

>The current solution, for DMA controllers with slow loads during the above
>situation, from my understanding, is to limit the DMA transfer to small
>block-sized transfers.  It seems that this is basically just like a non-DMA
>transfer.

That is the 2090 solution: by limiting the transfers to sizes that are less
likely to overflow its inadequate FIFO, it maintains its DMA performance
advantage. Other controllers, such as the HardFrame and the 2091, do not
have this problem.

>What about other solutions?  Like dynamically allocating DMA transfer sizes
>based on device priority, contention, etc.  Isn't that what Bus Mastering
>is all about?

Again, if you design your board to be able to deal with long bus latency
times as many people have, then you don't have any problem. You are 
apparently confusing a weakness in the 2090 design with a problem in
the concept of DMA. They are not related at all.


--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@Eng.Sun.COM
These opinions are my own and no one elses, but you knew that didn't you.
"If it didn't have bones in it, it wouldn't be crunchy now would it?!"

daveh@cbmvax.UUCP (Dave Haynie) (11/16/89)

in article <8911150430.AA24506@jade.berkeley.edu>, GORRIEDE@UREGINA1.BITNET (Dennis Robert Gorrie) says:

> The story goes, DMA is faster.  

You first have to look at the problem you're trying to solve.  The
problem, in this case, is data transfer from a hard disk controller
to the Amiga's main memory.  Except for this transfer mechanism,
there's nothing intrinsically different between DMA and non-DMA 
devices.

For the DMA transfer, a device of some kind requests the Amiga's bus
and transfers a number of words of data to or from the Amiga's main
memory.  Once this transfer is complete, it will probably have to
involve the main CPU, at least to tell the main CPU that it's done.
But the transfer is very efficient, because the CPU isn't involved
during the transfer (eg, no interrupts, no need to push and pop
stacks, etc.), and there's only one bus crossing per word transfer;
data flows only between main memory and the DMA device.

For a non-DMA transfer, the CPU is involved to some degree or another.
At the worst, it works like the Mac's hard disk interface, where the CPU
is required to talk directly to a SCSI chip, and must basically sit and
wait for each byte to be available.  Much better is the GVP approach,
where the SCSI device itself transfers a whole block (or possibly several
blocks) into local memory.  At this point, the CPU is called upon to
transfer that data to or from this local memory.  This transfer requires
two bus crossings for each word; data flows between the main memory
and the CPU, then between the CPU and the local memory (or vice versa).

> But, as many people point out, it sometimes is slower than non-DMA, 
> when there is contention for the bus.  Case in point is the hi-res 
> interlace screen situation where co-processors and your DMA hard disk 
> device are contending for cycles on the coprocessor bus.

DMA to chip memory, unless you really need it, is a bad idea with any
kind of controller, since you can be kept out of chip memory for an
extended period of time.  In order to even start a transfer from the
hard disk controller, you can't have the CPU waiting on chip memory,
for either the DMA or non-DMA controller.  Assuming DMA, the controller
will request the bus from the CPU.  The CPU can grant the bus right
away, but the DMA device can't actually take over the bus until the
CPU finishes its current cycle.  When waiting for chip bus access in
a high-activity display mode, this can be a long wait.  For the non-DMA
device, the CPU will get an interrupt signaling it's needed for a
transfer.  However, it can't service that interrupt until the current
instruction is complete, which of course can't complete until the CPU
has chip bus access.  So in either case, when the CPU's involved in a
delayed access to the chip bus, you have to wait.  As long as the actual
transfer goes to fast memory, you'll only have this initial lag (or
possibly a few of them if the transfer is done in several pieces, as it
often is with FIFO based controllers), and you won't see too much DMA
slowdown.  If the transfer is into chip memory, you'll of course see
a rather noticeable slowdown.

> Then someone says a 'proper' DMA device is faster than non-DMA, even for the
> situation above.  How is this so?  What is a 'proper' DMA device?

The main problem you can get into in this situation is essentially a flow control
problem.  You have data coming from a hard disk which needs to get stuck into
memory somewhere.  You can have an undetermined length of time to wait for
access to that memory.  If the controller is capable of stopping the flow of
data into the device based on its success at getting data out of the device,
everything's cool.  Some, like the GVP controller, do this by only dealing in
whole disk blocks.  Others, like the A2091, do this by starting and stopping
the data transfer from the SCSI device itself.  The problem that's been seen
is when a device, be it DMA or non-DMA, isn't capable of starting and stopping
this data flow.  The A2090 is an example of such a device, at least as 
supported by its current software.  When it can't get access to the bus within 
a certain amount of time, its FIFO overruns, and it has to attempt the
transfer all over again.  If it could tell the SCSI device to stop sending
as its FIFO fills up, there'd be no problem (in fact, the A2091 has a smaller
FIFO but works much better, because it can start and stop the data flow).
I'm told that part of this problem is the A2090 support for ST-506.  SCSI is
a rather high-level protocol, with intelligent drives, and it can support
things like start and stop.  ST-506 is a low-level, dumb protocol that must
transfer whole blocks in a fixed amount of time.  You start a transfer
from the disk into the FIFO, then start DMA out of the FIFO.  If the DMA
is held off too long, the FIFO overruns, and you have to start over
again.
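The difference between the two protocols can be shown with a toy cycle-by-cycle model (purely illustrative; the FIFO size and stall length are made-up parameters):

```python
# Toy model of the overrun described above: a drive that cannot be paused
# (ST-506-style) overruns a small FIFO during a long bus stall, while one
# that honors stop/start flow control (SCSI-style) simply waits.
def run_transfer(fifo_size, stall_cycles, can_pause):
    fifo = 0
    for _ in range(stall_cycles):    # bus unavailable: nothing drains
        if fifo == fifo_size:
            if can_pause:
                continue             # drive told to stop sending
            return "overrun"         # data lost; retry next revolution
        fifo += 1                    # one more byte arrives off the disk
    return "ok"
```

With a dumb, non-pausable source, the only defenses are a bigger FIFO or a shorter worst-case stall; with flow control, even a small FIFO survives.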

You have to take a look at the particular controller in question.  Any
modern review of a DMA controller should include its performance with
a full bandwidth screen up (eg, 640 across, 4 bitplanes, overscan if you
like).  Modern DMA controllers like the A2091 and the Microbotics
HardFrame have no trouble with this situation.

> +-----------------------------------------------------------------------+
> |Dennis Gorrie                 'Chain-Saw Tag...                        |
> |GORRIEDE AT UREGINA1.BITNET                    Try It, You'll Like It!'|
> +-----------------------------------------------------------------------+
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough

33014-18@sjsumcs.sjsu.edu (Eduardo Horvath) (11/17/89)

	Can you DMA directly into FAST RAM, or is it necessary to go through
	CHIP RAM?  If a controller DMA'd into FAST RAM, wouldn't that solve
	the problem of contention with the custom chips?


===============================================================================
         //x                                    =	
        /// \	Try:  33014-18@sjsumcs.SJSU.EDU =	Early to bet
       ///   \                                  =	And early to raise
      ///     \		Eduardo Horvath		=	Makes a man poor
\\\  ///=======\ 				=	In a gambling craze!
 \\\///         \				=		-me
  \xxx           \miga. The computer for the corruptive mind.
===============================================================================

pds@quintus.UUCP (Peter Schachte) (11/17/89)

In article <14035@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
[Severely edited]
->There are "handshaking" wires on a SCSI connector that
->let the computer tell the SCSI disk drive how fast it can accept data....
->the A2090's SCSI adapter doesn't properly use the available handshaking to
->slow the SCSI disk drive down when its FIFO fills....  fixed on the A2091...

Does the A590 have this problem, or is it more like the A2091 than the
A2090A?
-- 
-Peter Schachte
pds@quintus.uucp
...!sun!quintus!pds

steveb@cbmvax.UUCP (Steve Beats) (11/17/89)

In article <1284@quintus.UUCP> pds@quintus.UUCP (Peter Schachte) writes:
>In article <14035@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
>[Severely edited]
>->the A2090's SCSI adapter doesn't properly use the available handshaking to
>->slow the SCSI disk drive down when its FIFO fills....  fixed on the A2091...
>
>Does the A590 have this problem, or is it more like the A2091 than the
>A2090A?
>-- 
No, the A590 uses the same DMA chip as the A2091; the problem is fixed.  I
have tested the driver with multiple fast SCSI drives and a 4 plane overscanned
hi-res screen.  There is some slowdown, but not much.  Data is certainly
never lost.

	Steve

daveh@cbmvax.UUCP (Dave Haynie) (11/18/89)

in article <1989Nov16.185706.29328@sjsumcs.sjsu.edu>, 33014-18@sjsumcs.sjsu.edu (Eduardo Horvath) says:

> 	Can you DMA directly into FAST RAM, 

Yes.  In fact, it's greatly preferred and recommended.

> 	If a controller DMA'd into FAST RAM, wouldn't that solve the problem 
>	of contention with the custom chips?

It can solve most of the problem.  There are two components to the DMA transfer.
DMA to Fast memory will solve the second, which is the basic transfer rate for 
whatever block size the controller transfers in one chunk.  The first problem is
what I call "DMA lag", or how long it takes from the time your controller asks
for the bus to when it actually gets the bus.  In order to acquire the bus, the
CPU must be finished with its current bus cycle.  If the CPU is in wait states,
waiting for access to the chip bus, the DMA controller will have to wait for
the CPU to finish its cycle (eg, wait for the chip bus to be free) before it
can take
over the bus.  DMA controllers often transfer a whole block (512 bytes) in
several DMA passes, so it's actually possible to incur this lag several times for
each block, if your CPU is doing lots of stuff with video memory.  Also, if you
have an autoboot controller of any kind that copies its code to RAM before 
using it, you get slowdowns if your autoboot card is the first one in the machine,
since that code will get copied into chip memory.  So, unless you know your code
is running from ROM, or you have something like an A2620/A2630 that puts autoconfig
RAM in before your device is configured, it's best to put a memory card in before
your device.  Hopefully all-in-one memory/disk cards autoconfig the memory before
the disk.
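The per-pass cost of that DMA lag adds up in an obvious way (a sketch; all numbers below are made up for the example):

```python
# Illustrative accounting of the "DMA lag" described above: if a block is
# moved in several DMA passes, the bus-acquisition lag is paid per pass.
def total_lag_us(block_bytes, bytes_per_pass, lag_per_pass_us):
    passes = -(-block_bytes // bytes_per_pass)  # ceiling division
    return passes * lag_per_pass_us
```

A 512-byte block moved in 64-byte passes pays the lag eight times, versus once for a controller that can move the whole block in a single pass.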


>       ///     \		Eduardo Horvath		=	Makes a man poor


-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                    Too much of everything is just enough

swarren@eugene.uucp (Steve Warren) (11/18/89)

In article <1989Nov16.185706.29328@sjsumcs.sjsu.edu> 33014-18@sjsumcs.SJSU.EDU (Eduardo Horvath) writes:
>
>	Can you DMA directly into FAST RAM, or is it necessary to go through
>	CHIP RAM?  If a controller DMA'd into FAST RAM, wouldn't that solve
>	the problem of contention with the custom chips?

Yeah, but when the data is needed in the chip space (sound data or
graphics data) or when the machine is all chip ram, then you still
need to move the bytes into contention-land.

When the destination is fast ram there isn't a problem.

On the same subject, has anyone seen the misleading product sheet
put out for the KRONOS SCSI controller?  It claims that DMA is
fundamentally flawed when trying to access chip ram.  This is
absolutely false.  Certain controllers were not designed to handle
contention gracefully, but the fact that they were DMA was
irrelevant.  The only reason the KRONOS is fast is that the
controller-to-memory path is 16-bits wide.  If they would just
say it like it is I would be more impressed with their product.
I am suspicious when they erect straw men.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM