[comp.sys.amiga.tech] 68030 Questions

stever@videovax.Tek.COM (Steven E. Rice, P.E.) (04/01/88)

In article <3507@cbmvax.UUCP>, Dave Haynie (daveh@cbmvax.UUCP) writes:

> in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
>> 
>> Dave Haynie's (daveh@cbmvax) most recent article was number
>> <3394@cbmvax.UUCP>.  In it, he cast aspersions on the poor, struggling
>> LANCE and suggested that real systems do 32-bit DMA.  Well, maybe --
>> but if you want to use Ethernet, the LANCE is about the only way to
>> go, slow or no!
> 
> Calm down!  That's not what I said.  I said that very high
> bandwidth-consuming operations, such as hard disk interfacing, where the
> transfer between an I/O device and CPU addressable main memory can be sent
> in large atoms, are best served by DMA, even in a 68020 or 68030 system.  I
> also said that in systems where transfers must occur in small atoms or at
> relatively slow speed (like perhaps networks or things which must be
> highly interactive), the I/O scheme to shared CPU memory was a good idea.

I think there is still some misunderstanding here.  When I mention dual-
ported memories, I am speaking of memory that is "CPU addressable main
memory"!  It just happens to also be shared (on a cycle-by-cycle basis)
with some other device, which could be an I/O device or another CPU.

The Amiga implements a form of "shared" memory -- chip memory.  The
CPU gets access to chip memory on a shared basis, arbitrated cycle
by cycle.  Another form of "shared" memory is seen on the A2620 (?)
card -- the 68020 CPU card.  The 68020 will have 2 or 4 megabytes of 32-bit
wide memory which no one can deny it access to.  Thus, if DMA is
occurring to "main" memory, the 68020 may not be blocked at all.  Carrying
the idea one step further simply removes more limitations from the system,
giving the CPU unrestricted access to the system bus and immediate access
to any memory that is not in use during that memory cycle.
 
>> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and 
>> fast-as-a-speeding-bullet burst transfers would be possible.  In real
>> life, we have to make do with what we can buy.  (Commodore can build
>> what it needs; the economics in the Television Test and Measurement
>> market are different than those in the personal computer market.)
> 
> That's true, Commodore can build what it needs for those cases.  The 16 bit 
> wide DMA driven hard disk controller on the 16 bit bus delivers around 625K
> bytes/second with the Fast FileSystem.  Fast FileSystem allows DMA from the
> hard disk directly to the target memory, with no intermediate buffers used.  I
> believe that any peripheral going this fast wants DMA.  It's fully extensible
> to a 32 bit machine, though a _conservative_ 32 bit machine rates
> 2.5 megabytes/second throughput (not even getting to things like burst
> transfers, which are ideally suited to DMA transfers).  If your LAN is only
> going 2.5 megabits/sec, that's certainly overkill and extra cost.

Ethernet is 10 megabits/sec.

> Which seems to make sense even today; most Amiga hard drives are DMA driven,
> most Amiga LANs are CPU driven via shared RAM.

In the case of Ethernet I/O, transmissions are packetized with quite
a bit of protocol overhead.  Thus, the data to be transmitted must be
broken into chunks no larger than the largest legitimate packet and
shipped out one packet at a time.  To do this, the CPU is going to have
to move the data anyway -- it has to configure it in a form the I/O
device can use.  In this case, the copy from what you might consider
"main" memory to "shared" memory is free.

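To make the packetizing point concrete, here is a rough sketch in C.
(MAX_PAYLOAD, send_packet, and transmit are invented names for
illustration -- this is the shape of the argument, not a real LANCE
driver.)

    #include <stddef.h>
    #include <string.h>

    #define MAX_PAYLOAD 1500  /* Ethernet's maximum data field, in bytes */

    static char shared_buf[MAX_PAYLOAD];  /* stands in for dual-ported RAM */

    /* Hand one packet to the controller: copy the payload into shared
     * memory and (in real life) build headers and start the transmitter.
     */
    static void send_packet(const char *data, size_t len)
    {
        memcpy(shared_buf, data, len);  /* the copy we get "for free" */
        /* ... build headers, kick the controller ... */
    }

    /* Break a message into packet-sized chunks.  The CPU must walk the
     * data in such pieces no matter where the buffer lives, which is why
     * the copy into shared memory costs nothing extra.
     */
    static void transmit(const char *buf, size_t len)
    {
        while (len > 0) {
            size_t chunk = len > MAX_PAYLOAD ? MAX_PAYLOAD : len;
            send_packet(buf, chunk);
            buf += chunk;
            len -= chunk;
        }
    }

    int main(void)
    {
        static char msg[4000];     /* a dummy 4000-byte message...  */
        transmit(msg, sizeof msg); /* ...goes out as three packets  */
        return 0;
    }
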
Starting with the FFS rate of 625K bytes/second and doubling that for a 
32-bit bus gives 1.25 megabytes/second.  This translates to a 10
megabit/second transfer rate, which is the same as the Ethernet.  Using
your figure of 2.5 megabytes per second gives 20 megabits/second
throughput.  But our CPU bus bandwidth is about 100 megabits/second
(approximately 330 nsec main memory cycle time [not *access* time --
*cycle* time]).  Thus, a 2.5 megabyte/second disk transfer would occupy
only 20% of the bus bandwidth.

If the disk DMA is transferring into unshared main memory, the CPU will
just have to wait.  At 2.5 megabytes/second (assuming 32-bit transfers),
the disk will request one memory access every 1.6 microseconds.
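To keep the arithmetic honest, here it is in compilable form; the
constants are just the figures quoted in this thread, and the 97 and
20.6 it prints are what I am rounding to 100 and 20:

    #include <stdio.h>

    int main(void)
    {
        double ffs_rate   = 2 * 625e3;        /* FFS doubled for 32 bits, B/s */
        double disk_rate  = 2.5e6;            /* conservative 32-bit DMA, B/s */
        double cycle_time = 330e-9;           /* main memory *cycle* time     */
        double bus_rate   = 4.0 / cycle_time; /* 32-bit bus, bytes/second     */

        printf("doubled FFS rate:   %.1f megabits/second\n", ffs_rate*8/1e6);
        printf("bus bandwidth:      %.0f megabits/second\n", bus_rate*8/1e6);
        printf("disk share of bus:  %.1f%%\n", disk_rate/bus_rate * 100);
        printf("one DMA access per: %.1f microseconds\n", 4.0/disk_rate * 1e6);
        return 0;
    }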

One possibility is to arbitrate for the bus for each transfer.  Looking
at the timing diagrams in the Motorola 68020 manual, one finds that
there is a minimum of 1/2 clock period and a maximum of 1 clock period
from the end of clock state S5 until Bus Grant* is asserted.  There is
also a note in paragraph 5.2.7.4 which says that "all asynchronous
inputs to the MC68020 are internally synchronized in a maximum of two
cycles of the system clock."  This implies that the minimum to resume
processing is 1 clock cycle.  There is probably one additional cycle
needed for the CPU to resume driving the address and data lines.

Assuming a memory cycle time of 330 ns (which is what ours is) with
240 ns read or write access time, each 32-bit word transferred would
hold the CPU bus for one arbitration time (1/2 to 1 clock cycles, or
30 to 60 ns in a 16.7 MHz system) plus one transfer time (240 ns) plus
one bus relinquishment time (1 to 2 clock cycles, or 60 to 120 ns)
plus one driver turnon time (1 clock cycle, or 60 ns).  The minimum
time required would be 390 ns, the maximum time would be 480 ns, and
the mean time would be 435 ns.
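The same figures in compilable form, as a sketch using the clock and
access times assumed above:

    #include <stdio.h>

    int main(void)
    {
        double clk  = 60e-9;    /* one clock period at 16.7 MHz    */
        double xfer = 240e-9;   /* one 32-bit read or write access */

        /* arbitrate + transfer + relinquish + driver turn-on */
        double t_min = 0.5*clk + xfer + 1.0*clk + 1.0*clk;
        double t_max = 1.0*clk + xfer + 2.0*clk + 1.0*clk;
        double t_avg = (t_min + t_max) / 2;

        printf("min %.0f ns, max %.0f ns, mean %.0f ns\n",
               t_min*1e9, t_max*1e9, t_avg*1e9);
        printf("mean share of each 1.6 us: %.1f%%\n",
               t_avg/1.6e-6 * 100);
        return 0;
    }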

435 ns out of 1.6 us is 27.2% of the bus bandwidth occupied.  But not
only is 27.2% of the bus bandwidth occupied, the CPU is denied the
bus 27.2% of the time!  This translates directly into throughput
reduction.

Another possibility is to block the data into (e.g.) 512 byte blocks and
then arbitrate for the bus once per block.  This drops the bus bandwidth
occupation to 20% (since one arbitration is insignificant compared to the
time to transfer 512 bytes as 128 32-bit words).  But the CPU is still
denied the bus 20% of the time.
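Again in compilable form, assuming back-to-back words each take a full
330 ns memory cycle, and charging the mean arbitration overhead (the
435 ns mean above less the 240 ns transfer, or 195 ns) once per block:

    #include <stdio.h>

    int main(void)
    {
        double cycle  = 330e-9;          /* per 32-bit word, back to back  */
        double arb    = 195e-9;          /* mean arbitration, once a block */
        double words  = 128.0;           /* 512 bytes as 32-bit words      */
        double period = words * 1.6e-6;  /* the disk fills a block this    */
                                         /* often at 2.5 megabytes/second  */

        double held = arb + words*cycle; /* bus time consumed per block    */
        printf("bus held %.1f%% of the time\n", held/period * 100);
        return 0;
    }

The 20.7% it prints is the "20%" figure above; the arbitration term
contributes only about 0.1% of it.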

If, however, the disk data is DMAed into dual-ported memory, it can deny
an access to the CPU a *maximum* of 20% of the time, and then only if
the CPU is fetching all of its instructions from the shared memory!  In
actual operation, it is likely to be much less than that.  There is also
no reason the receiving process cannot use the data directly from the
dual-ported memory, although in many cases there will be at least one
copy between initial transfer and use of the data.

>> There is another thought, too -- if you have only one DMA device, you
>> could argue that it shouldn't make much difference if it DMAs into
>> system RAM or into a dual-ported buffer.  If you have more than one
>> device contending for the system bus, however, multiple dual-ported
>> buffers are a clear win.
> 
> Not unless you have multiple CPUs to read them.

Given just a single hard disk transfer as you have described it, DMA into
a dual-port buffer avoids losing 20% of the CPU's processing capability.
That seems worthwhile to me!

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

daveh@cbmvax.UUCP (Dave Haynie) (04/12/88)

in article <4937@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:

> Another possibility is to block the data into (e.g.) 512 byte blocks and
> then arbitrate for the bus once per block.  This drops the bus bandwidth
> occupation to 20% (since one arbitration is insignificant compared to the
> time to transfer 512 bytes as 128 32-bit words).  But the CPU is still
> denied the bus 20% of the time.

First of all, with a better bus design (e.g., not the current Amiga bus, but
perhaps a future version that's 32 bits wide), there's zero or very near
zero arbitration time; the bus's owner is determined dynamically on a 
cycle by cycle basis.

Secondly, since the 68020 with cache running only wants the bus 50% or so
of the time, on average, you take your 20% figure and immediately reduce it 
to 10%, on average.  It could be as bad as 20%, it could be as good as
0%, depending on what the CPU is doing.
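In arithmetic form: if you assume the CPU's requests and the DMA
device's requests land independently on the cycle grid (a
simplification, but fair on average), the blocked fraction is just the
product of the two demands:

    #include <stdio.h>

    int main(void)
    {
        double cpu_wants = 0.50;  /* 68020 with cache: ~half the cycles */
        double dma_wants = 0.20;  /* disk DMA during an actual transfer */

        /* The CPU is blocked only when both want the same cycle. */
        printf("expected CPU stall: %.0f%% of all cycles\n",
               cpu_wants * dma_wants * 100);
        return 0;
    }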

Now we add a priority scheme.  If the CPU operation is more important, it
gets the bus for any cycles it needs, and the DMA device gets whatever it
wants from the remaining 50% of the bus.  And that's assuming that the bus
is limited to CPU bus speeds.  It's pretty simple to make DMA devices run
nybble or page mode cycles that the CPU can't keep up with, and most
memory systems can be designed with this in mind nearly for free.  So with
DMA doing nybble transfers, you're now down to less than 5% of the
bus bandwidth for that transfer.  VME and non-Apple NuBus both do things
like this.
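Rough numbers on that, taking a page or nybble mode cycle to be five
times faster than an ordinary word cycle (an assumed factor, purely
for illustration):

    #include <stdio.h>

    int main(void)
    {
        double word_cycle = 330e-9;          /* ordinary memory cycle     */
        double fast_cycle = word_cycle / 5;  /* assumed page-mode speedup */
        double period     = 1.6e-6;          /* one word per 1.6 us at    */
                                             /* 2.5 megabytes/second      */

        printf("ordinary cycles: %.1f%% of the bus\n",
               word_cycle/period * 100);
        printf("page/nybble:     %.1f%% of the bus\n",
               fast_cycle/period * 100);
        return 0;
    }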

> Given just a single hard disk transfer as you have described it, DMA into
> a dual-port buffer avoids losing 20% of the CPU's processing capability.
> That seems worthwhile to me!

But you're still missing the point.  The CPU has to stop what it's doing to
transfer the data by hand.  If it did that JUST as efficiently as the DMA
device, you'd still be losing whatever CPU time you claim is being eaten
by the DMA transfer, 20% or whatever (keep in mind this 20% figure only
applies during an actual transfer).  If the DMA transfer happens twice as
fast as the CPU could transfer the data, then I'm gaining in CPU speed,
even though I'm kicking the CPU off the bus for awhile.  DMA transfers on
the Amiga bus with a 68020 go twice as fast as the 68020 could possibly
transfer them.  68000 based CPU transfers are more like 1/4th the speed of
the DMA device.  My point is that someone has to do the work of transfer
unless you can live with the data exactly where it's dumped in your
shared memory scheme.  If you know there's no transfer required, share the
memory, but if there is, and especially if the data can be used as is
once it reaches its destination (like NewFS), DMA wins.
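Putting rough numbers on that trade -- the rates and the 20% stall
figure are the ones from this thread, the rest is illustrative:

    #include <stdio.h>

    int main(void)
    {
        double bytes    = 512e3;         /* amount of data to move     */
        double dma_rate = 2.5e6;         /* DMA, bytes/second          */
        double cpu_rate = dma_rate / 2;  /* 68020 copying by hand      */
        double stall    = 0.20;          /* CPU cycles lost during DMA */

        printf("CPU seconds lost, DMA copy: %.3f\n",
               bytes/dma_rate * stall);
        printf("CPU seconds lost, CPU copy: %.3f\n",
               bytes/cpu_rate);
        return 0;
    }

Under those assumptions the CPU copy costs ten times the CPU time,
even before you credit the DMA device with any overlap.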

There's actually a test case of this available in the Amiga world.  As I've
already mentioned, the A2090 controller uses a FIFO and DMA to complete its
transfer, and achieves about 625K Bytes/Second.  There's a new SCSI 
controller out there, from a company called Great Valley Peripherals, that
uses an I/O chip doing DMA to shared RAM (4K of static RAM on-board, so once
you're in sync I suspect there will rarely be a collision between the
CPU and the peripheral chip).  I don't have any benchmarks on this new board,
but I guarantee it'll be slower.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"