[comp.sys.amiga] 68030 Questions

alex@xicom.UUCP (Alex Laney) (01/19/88)

[Don't have any reference for the '030 yet]

Since CSA has an '030 card coming out (which sounds great), does anyone
know if the '030 boots up in a manner compatible with the '020? What I'm
wondering is: is the MMU on the Commodore 68020 card going to constrain the
internals of Exec/Dos so they can't use the MMU on the '030? I doubt it's
compatible, but hopefully it does a subset, at least. It sounds like another
round of, well you can upgrade to the new CPU, but you can't use any of
the features of it. You know, turning off caches, etc., seems to be
mandatory with the '020. And the '020 really only speeds up with 32-bit
memory, as well. I suppose the rumored (:-) Unix from Commodore probably won't
run on an '030, because of the MMU issue. I assume that if CSA is releasing an
'030 card, the current Exec/Dos runs on it.

I hope that some of the lag time until the next OS release is used to look
at these issues. I think everyone wants, over time, to upgrade to at least the
'020, then the '030 [of course, the chips have to go down to $5 each :-)], so
let's try to make the path less bumpy!

I hope this is a change of pace from piracy/virus/mtask flaming!

-- 
Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
Xicom Technologies, 205-1545 Carling Av., Ottawa, Ontario, Canada
We may have written the SNA software you use.
The opinions are my own.

daveh@cbmvax.UUCP (Dave Haynie) (01/26/88)

in article <494@xicom.UUCP>, alex@xicom.UUCP (Alex Laney) says:
> Summary: What about the new '030 Card?

> [Don't have any reference for the '030 yet]
> 
> ...is the MMU on the Commodore 68020 card going to constrain the
> internals of Exec/Dos so they can't use the MMU on the '030? I doubt it's
> compatible, but hopefully it does a subset, at least. 

The 68030's MMU is in fact a subset of the 68851 on the A2620 card.

> It sounds like another round of, well you can upgrade to the new CPU, but
> you can't use any of the features of it. You know, turning off caches, etc.,
> seems to be mandatory with the '020. 

Not at all.  The Amiga OS does in fact turn on the 68020 cache when it
starts up, and most things run without any trouble with the cache on.  A
few games are apparently running self-modifying code or something that
causes a problem with the cache, but most Amiga software runs fine.
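
To make the hazard concrete, here's a hypothetical fragment (not from any
real game) showing the pattern that breaks.  The opcodes are genuine 68000
instructions, but the example is purely illustrative, and assumes, as on
the Amiga, that data space is executable:

    unsigned short code[] = {
        0x7001,                 /* moveq #1,d0 */
        0x4E75                  /* rts         */
    };

    int run_patched(void)
    {
        int (*f)(void) = (int (*)(void))code;
        int a, b;

        a = f();                /* first call loads the I-cache      */
        code[0] = 0x7002;       /* patch the opcode: moveq #2,d0     */
        b = f();                /* 68000: b is 2.  A 68020 with the  */
                                /* cache on may still execute the    */
                                /* stale moveq #1 out of the cache.  */
        return a + b;
    }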

> And the '020 really only speeds up with 32-bit memory, as well. 

Well, the A2620 runs our production line test a bit faster than the 68000
with its on-board memory disabled.  It has the cache going for it, and
the fact that anything that's happening on-chip is still going to happen
at 14.3 MHz.  And 68881 performance goes up with the 68020's real coprocessor
interface.  But for plain old integer operations, the fast 32-bit memory
helps a lot.

> I hope that some of the lag time until the next OS release is used to look
> at these issues. I think everyone wants, over time, to upgrade to at least the
> '020, then the '030 [of course, the chips have to go down to $5 each :-)], so
> let's try to make the path less bumpy!

The '030 running with data cache enabled introduces the most problems.  Still,
many of these can be solved in hardware if the designer takes the time;
you really do want to use that data cache!

> Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

harald@ccicpg.UUCP ( Harald Milne) (01/28/88)

In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> The '030 running with data cache enabled introduces the most problems.  Still,
> many of these can be solved in hardware if the designer takes the time;
> you really do want to use that data cache!
	
	Amen!

	The only problem I can imagine at this point is DMA to FAST RAM.
Would anybody be silly enough to do this? This will kill the 680x0.

	Well, kind of. After all, this is still the Amiga! You get kinda jaded,
knowing how many burdens the coprocessors remove.

	I'm really curious how AudioMaster can play 11 minutes of digitized
sound (obviously with 9.5 meg of RAM). How about an entire Compact Disk off an
Ethernet connection! I figure about 50 meg! Beats .6 giga.

> Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
>    {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
> 		"I can't relax, 'cause I'm a Boinger!"

	I've got a sneaking suspicion that, to get a few thousand miles away, you
were at AmiExpo! Darn.

-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business! Home of the CCI POWER 6/32)
UUCP: uunet!ccicpg!harald

harald@ccicpg.UUCP ( Harald Milne) (01/28/88)

In article <2044@antique.UUCP>, cjp@antique.UUCP (Charles Poirier) writes:
> In article <494@xicom.UUCP> alex@xicom.UUCP writes:
> >Since CSA has an '030 card coming out (which sounds great), does anyone
> >know if the '030 boots up in a manner compatible with the '020? What I'm
> 
> A friend who works for a competitor of CSA (so apply grains of salt to
> taste) says to be wary of the CSA 68030 board.  Supposedly they did not
> do a "real" design at all, but rather used verbatim the circuit from
> Motorola's application notes that basically says "Here's how to get the
> 68030 to work exactly like a 68020."  I.e., they cripple it.  End quote.

	This wouldn't surprise me. Being 68020-compatible is a safe move:
no OS issues. It would also be crippled.

/* Personal opinion follows */

	I liked the fact that CSA pursued Amiga performance in terms of
hardware.

	BUT, I think CSA assumes we want to pay IBM and Mac II prices!

	I think CSA said, "Performance at all cost", and I think that is a
bad engineering decision.

	The goal should have been the best price/performance ratio.

	Brute force is not always a win, especially with a cost-is-no-object
frame of mind.

	That's why I sighed in relief, seeing at least the Hurricane board.
Competition!

	And now with the A2620 from CBM, we may even achieve economy of scale!

	I'm waiting.

/* End of my bullshit opinion */

> 	Charles Poirier   (decvax,ihnp4,attmail)!vax135!cjp
> 
>    "Docking complete...       Docking complete...       Docking complete..."


-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business! Home of the CCI POWER 6/32)
UUCP: uunet!ccicpg!harald

daveh@cbmvax.UUCP (Dave Haynie) (02/02/88)

in article <10170@ccicpg.UUCP>, harald@ccicpg.UUCP ( Harald Milne) says:
> In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>> The '030 running with data cache enabled introduces the most problems.  Still,
>> many of these can be solved in hardware if the designer takes the time;
>> you really do want to use that data cache!

> 	Amen!

> 	The only problem I can imagine at this point is DMA to FAST RAM.
> Would anybody be silly enough to do this? This will kill the 680x0.

This happens all the time with things like hard disk drives.  It sure does
hurt the 68000's speed, but consider the alternative.  You've got to get
that disk data into memory somehow.  If you make the 68000 go and read it
from an I/O port somewhere, you're running several memory cycles per data
transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
RAM, instruction fetch, test and branch, something like that.  Once a DMA
driven controller is set up (simple, nothing like setting up the blitter),
you have a bus arbitration, then one word transferred by the controller per
memory cycle.  If you're a 68020, you may even run a little from cache after
the arbitration.  So this is much faster than possible without DMA.
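
To illustrate, a minimal sketch of the programmed-I/O loop in question; the
port argument is hypothetical rather than any real controller's register,
and a real driver would likely be assembly, but the bus traffic per word is
the point:

    /* Programmed I/O: per 16-bit word the CPU issues its own
     * instruction fetches plus an I/O read plus a RAM write, i.e.
     * several bus cycles per word moved.  A DMA controller, once
     * set up, uses one bus cycle per word. */
    void pio_read(volatile unsigned short *port,
                  unsigned short *dst, int nwords)
    {
        while (nwords--)
            *dst++ = *port;     /* I/O fetch + RAM write, plus the   */
                                /* loop's own fetches and the branch */
    }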

> 	I got a sneaky suspicion, that to get a few thousand miles away, you
> were at AmiExpo! Darn.

No, actually, Paradise Island, The Bahamas.

Didn't make AmiExpo.  Woulda been nice too, but I had all this work piled
up here.

> Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
>       Irvine, CA (RISCy business! Home of the CCI POWER 6/32)
> UUCP: uunet!ccicpg!harald
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

peter@sugar.UUCP (Peter da Silva) (02/03/88)

In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
in article <494@xicom.UUCP>, alex@xicom.UUCP (Alex Laney) says:
> > you can't use any of the features of it. You know, turning off caches, etc.,
> > seems to be mandatory with the '020. 

> few games are apparently running self-modifying code or something that
> causes a problem with the cache, but most Amiga software runs fine.

Well, I'd think that you'd probably want to invalidate the cache when you
LoadSeg() something... just in case it's LoadSegging it at the same address
as something that's already in the cache. It's a real long-shot, but it's
almost certain it's gonna hit someone sometime.
-- 
-- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
-- Disclaimer: These U aren't mere opinions... these are *values*.

alex@xicom.UUCP (Alex Laney) (02/04/88)

In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> 
> The 68030's MMU is in fact a subset of the 68851 on the A2620 card.

Welllll, the fact that the A2620 card is using the Motorola MMU is news to
me. What I had read before is that Commodore was intending to use a custom
MMU. So, I'm happy! [I know that some people don't care for Motorola MMUs
based on past experience, but it's too late for that.]

Is there a release date other than RSN? Or even a release date on specs, etc.,
that may include a release date of the board? Just wondering ...

-- 
Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
Xicom Technologies, 205-1545 Carling Av., Ottawa, Ontario, Canada
We may have written the SNA software you use.
The opinions are my own.

stever@videovax.Tek.COM (Steven E. Rice, P.E.) (02/05/88)

In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:

[ discussion of, among other things, DMA to fast ram ]

> This happens all the time with things like hard disk drives.  It sure does
> hurt the 68000's speed, but consider the alternative.  You've got to get
> that disk data into memory somehow.  If you make the 68000 go and read it
> from an I/O port somewhere, you're running several memory cycles per data
> transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
> RAM, instruction fetch, test and branch, something like that.  Once a DMA
> driven controller is set up (simple, nothing like setting up the blitter),
> you have a bus arbitration, then one word transferred by the controller per
> memory cycle.  If you're a 68020, you may even run a little from cache after
> the arbitration.  So this is much faster than possible without DMA.

This is true for a 68000 or 68010, and perhaps even for a 68020 or 68030 on
a 16-bit-wide bus.  However, for best performance you want to put the DMA
peripherals on one side of a dual-ported memory and let the CPU do the
data moving.  Why?  The reasons are as follows:

  1. Most DMA peripherals are incredibly sluggish.  An example is the
     LANCE, an Ethernet interface chip.  It transfers data in blocks of
     eight 16-bit words.  The *minimum* time to perform this transfer is
     4.8 microseconds, with no-wait-state memory.  Add arbitration time
     to this and it becomes more like 5.1 microseconds.  And if you can't
     complete a memory cycle in less than 105 nanoseconds, each cycle
     (remember, there are eight of them!) gets longer in 100-nanosecond
     steps.

     To keep up with the Ethernet, the LANCE will arbitrate for the
     bus about every 12.8 microseconds, tying it up for 5.1 microseconds
     minimum.  This is about 40% of the bus bandwidth.

  2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
     instructions have been loaded into the cache, the only thing on the
     bus will be (32-bit) data transfers.  Even with reasonably slow
     memory (180-nanosecond access, 300-nanosecond cycle time), this means
     that the 68020 can transfer data twice as fast as a LANCE running
     on 100-nanosecond access memory.

If you dual-port the LANCE memory properly (32 bits wide to the 68020,
16 bits wide to the LANCE), you can move the data from the dual-ported
memory *while* the LANCE is transferring other data into it, thus
achieving an effective doubling of the transfer rate and freeing the
bus for other purposes the rest of the time.

The same thing applies to hard disks, too.  The 68020 can sustain a
48 Mbit/second transfer rate.  Typical hard disks run at 5 to 10 Mbit/
second rates.  Unless the hard disk interface is fast as greased
lightning *and* 32 bits wide, the 68020 or 68030 can move the data
faster!
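
A quick numeric check of points 1 and 2 above, as a sketch in C; the
inputs are only the figures quoted in this post, not datasheet values:

    #include <stdio.h>

    int main(void)
    {
        double cpu   = 32.0 / (2 * 300e-9);   /* point 2: ~53 Mbit/s */
        double lance = 16.0 / 600e-9;         /* point 2: ~27 Mbit/s */

        /* point 1: the bus is held 5.1 us out of every 12.8 us */
        printf("LANCE bus occupancy: %.0f%%\n", 100.0 * 5.1 / 12.8);

        /* point 2: 32 bits per read+write pair of 300 ns cycles,
         * vs. 16 bits per 600 ns LANCE cycle */
        printf("68020 vs. LANCE: %.1fx\n", cpu / lance);   /* 2.0x */
        return 0;
    }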

So, for maximum performance, hide your peripherals behind dual-ported
memory, and then mark those pages as "non-cacheable."

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

daveh@cbmvax.UUCP (Dave Haynie) (02/05/88)

in article <1431@sugar.UUCP>, peter@sugar.UUCP (Peter da Silva) says:
> 
> In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>> in article <494@xicom.UUCP>, alex@xicom.UUCP (Alex Laney) says:
>> > you can't use any of the features of it. You know, turning off caches, etc.,
>> > seems to be mandatory with the '020. 

> Well, I'd think that you'd probably want to invalidate the cache when you
> LoadSeg() something... just in case it's LoadSegging it at the same address
> as something that's already in the cache. It's a real long-shot, but it's
> almost certain it's gonna hit someone sometime.

True.  Unless you can be certain that the LoadSeg function itself fills up
the cache.  That would be the simplest way to handle this without having to
make the 68020 a special case.  Certainly in the future, when data caches
and larger instruction caches are being used, the cache will have to be
explicitly dumped in such cases.  I'm not sure whether LoadSeg actually does
this, but the cache is only 64 longwords long; it doesn't take long
to overrun it.  So I expect that implicit cache clearing works in this
case.  Any OS gurus out there know fer shure?
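
For reference, the explicit flush on a 68020 amounts to setting the C bit
in the CACR register, which invalidates the whole instruction cache.  MOVEC
is privileged, so under the Amiga OS this would have to run in supervisor
mode (e.g. via Exec's Supervisor()); the inline assembly below is modern
GNU syntax, shown only as a sketch of the idea:

    /* Sketch: 68020 instruction cache flush.  CACR bit 3 (C) clears
     * the entire cache; bit 0 (E) keeps the cache enabled.  Must be
     * executed in supervisor mode. */
    static void clear_icache(void)
    {
        unsigned long cacr = 0x0009;    /* C | E */
        __asm__ __volatile__("movec %0,%%cacr" : : "r" (cacr));
    }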

> -- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
> -- Disclaimer: These U aren't mere opinions... these are *values*.
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

harald@leo.UUCP ( Harald Milne) (02/05/88)

In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> in article <10170@ccicpg.UUCP>, harald@ccicpg.UUCP ( Harald Milne) says:
> > 	The only problem I can imagine at this point is DMA to FAST RAM.
> > Would anybody be silly enough to do this? This will kill the 680x0.
> 
> This happens all the time with things like hard disk drives.  It sure does
> hurt the 68000's speed, but consider the alternative. 

	I'm painfully aware of the alternative. My question was a bit rhetorical.
I have an A1000 at home with an HD and 1.75 meg, and an A2000 at work with
Ethernet and 3 meg. You are right, you have to DMA to get reasonable
performance. By 680x0 I meant the 68030, 68020, and 68000, and the best
solution overall. More specifically, the 68030.

	I think we have hit the hard spot.

	Solutions? Hmm....

	Somehow, for the 68030 at least, you have to invalidate these entries
in the cache, or prevent them from ever appearing.

	Invalidating them is a software solution.

	Preventing them from appearing could be done by hardware, or even
by a combination of hardware and software via the MMU.

	The real question is which yields the most performance while maintaining
compatibility.

	Hmmm.... I have to think about this a bit. This gets even tougher when you
consider all the possible configurations, memory timings, etc. Ack!

	Looks like your prophecy in AC comes true after all, "And on the '020 vs.
'030 question, we may have a surprise or two you aren't considering."

	This sure gives me enough to chew on.

> Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
>    {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
> 		"I can't relax, 'cause I'm a Boinger!"
-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business! Home of Regulus and hamiga)
UUCP: uunet!ccicpg!leo!harald

daveh@cbmvax.UUCP (Dave Haynie) (02/10/88)

in article <4822@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
Summary: DMA is still *FAST*er
> Summary: DMA is the *SLOW* way to go!

> In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:

>> This happens all the time with things like hard disk drives.  It sure does
>> hurt the 68000's speed, but consider the alternative.  You've got to get
>> that disk data into memory somehow.  If you make the 68000 go and read it
>> from an I/O port somewhere, you're running several memory cycles per data
>> transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
>> RAM, instruction fetch, test and branch, something like that.  Once a DMA
>> driven controller is set up (simple, nothing like setting up the blitter),
>> you have a bus arbitration, then one word transferred by the controller per
>> memory cycle.  If you're a 68020, you may even run a little from cache after
>> the arbitration.  So this is much faster than possible without DMA.

> This is true for a 68000 or 68010, and perhaps even for a 68020 or 68030 on
> a 16-bit-wide bus.  However, for best performance you want to put the DMA
> peripherals on one side of a dual-ported memory and let the CPU do the
> data moving.  

No, what you want is intelligently designed peripherals.

> Why?  The reasons are as follows:

>   1. Most DMA peripherals are incredibly sluggish...

>      To keep up with the Ethernet, the LANCE will arbitrate for the
>      bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>      minimum.  This is about 40% of the bus bandwidth.

This is why we have things like FIFOs.  Even the 68020 running with cache 
enabled typically uses only around 50% of the bus bandwidth.  That's not
a bad thing, though; it's a good argument for DMA.

>   2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>      instructions have been loaded into the cache, the only thing on the
>      bus will be (32-bit) data transfers.  Even with reasonably slow
>      memory (180-nanosecond access, 300-nanosecond cycle time), this means
>      that the 68020 can transfer data twice as fast as a LANCE running
>      on 100-nanosecond access memory.

Like I said, intelligently designed peripherals.  Let's look at a hard disk
controller with FIFO.  The Amiga 2090 controller is such a beast.  Though
only a 16 bit device, the same principals work in 32 bit land. 

So my hard disk controller is chugging away, fetching data from the relatively
slow hard disk and stuffing it into the FIFO.  It sees the FIFO filling up, 
and interrupts the 68020.  The '020 springs into action, the disk being run
by a high-priority task that was just waiting on this interrupt.  So far
we have to do this whether the disk controller is DMA or shared memory.

Now let's consider the shared memory.  Say we've got 512 bytes to move.  You
jump into a block move routine, where the cache immediately gets set up with
the move code after the first loop pass.  You've got one memory cycle to read
the data from shared RAM, one memory cycle to stuff it into your destination
RAM.  So you get 256 memory cycles, plus maybe 2 extra for cache setup.

Now we go to the DMA controller, moving the same 512 bytes.  We have to set 
up the controller with the destination RAM address, that should take maybe
3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next,
maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we
already have the data at hand, so all the controller has to do is stuff it
in memory.  That's 128 memory cycles.  And another to re-arbitrate.

So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
by itself.
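
The two estimates above boil down to a little arithmetic; as a sketch (the
setup costs are my rough estimates, not measured figures):

    #include <stdio.h>

    int main(void)
    {
        int longwords = 512 / 4;
        int cpu = 2 * longwords + 2;         /* one read + one write   */
                                             /* each, plus cache setup */
        int dma = 3 + 3 + 1 + longwords + 1; /* setup, "go", arbitrate,*/
                                             /* transfer, re-arbitrate */
        printf("68020 copy: %d cycles\n", cpu);   /* 258 */
        printf("DMA:        %d cycles\n", dma);   /* 136 */
        return 0;
    }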

> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
> 16 bits wide to the LANCE), you can move the data from the dual-ported
> memory *while* the LANCE is transferring other data into it, thus
> achieving an effective doubling of the transfer rate and freeing the
> bus for other purposes the rest of the time.

I get the exact same effect with my FIFO, only through use of DMA I'm tying
up the bus much less.

And you don't really free the bus the rest of the time, unless you've got some
screaming RAM in that dual port section.  Maybe you can use some true
dual-ported SRAM, or a FIFO like what we've got on this hard disk controller,
but if you're talking DRAM, forget it; the 68020's going to eat all the
available time on anything in the 80ns or slower range.

> So, for maximum performance, hide your peripherals behind dual-ported
> memory, and then mark those pages as "non-cacheable."

There's no question that having a peripheral device dump to shared RAM
is much better than directly banging it with the CPU, Macintosh style.  And
for very small transfer situations, it's better than DMA; a DMA controller has
a fixed setup time.  But if you're transferring more than a few bytes at a
time, DMA is a win.  And unless you're dealing with something that needs
immediate response (eg, you can't wait until you've got 64 or 512 or 
whatever bytes to block transfer), DMA is still a win on a 68020 system,
if done correctly.  The 68020 at 32 bits/transfer will tie a 16 bit DMA
device in transfer rate, and it's got less setup, so you definitely want
that DMA to be 32 bits wide.

Finally, in a decent system, you can have DMA on your backplane going at
the same time you've got CPU access going on your local bus, so the DMA
won't always kick the CPU off the bus.  Amigas aren't doing it this way,
yet.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

hah@mipon3.intel.com (Hans Hansen) (02/13/88)

In article <3291@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
$in article <4822@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
$> Summary: DMA is the *SLOW* way to go!
$Summary: DMA is still *FAST*er

What both of you are overlooking is the fact that the system w/o DMA must
do a task switch each time it goes to the well (~50us/68000, ~30us/68020,
~20us/68030).  As the transfer data is coming in sloooooooowly, the processor
is constantly switching tasks to service the "deadhead port" instead of
being left alone to calculate the next iteration of the ray tracing, setting
up the next pretty picture, or balancing YOUR checkbook (sic).

Hans

stever@videovax.Tek.COM (Steven E. Rice, P.E.) (02/20/88)

Hmmmm. . .  I expressed my belief that (at least in a 32-bit wide 68020
system) "DMA is the *SLOW* way to go!"  In article <3291@cbmvax.UUCP>,
Dave Haynie (daveh@cbmvax.UUCP) replied:

> Summary: DMA is still *FAST*er

Now, don't get me wrong -- I'm not suggesting that we go back to the bad
old days of "programmed data transfers" (i.e., interrupt-per-byte transfers,
with the CPU stacking and unstacking its entire context for each byte that
comes in or goes out).  Long, long ago, in a galaxy far, far away, I did
that with a 6800 (our options were limited).  Maximum data transfer rate
was about 20K bytes per second, using every CPU cycle that was available.

However, I will continue to insist that there are some things that are
not fit for genteel company, and should be relegated to an appropriate
closet.  And right at the top of my list of such things is DMA I/O!!!

In my previous article, I suggested:

>>              . . .  However, for best performance you want to put the DMA
>> peripherals on one side of a dual-ported memory and let the CPU do the
>> data moving.

Dave disagreed:

> No, what you want is intelligently designed peripherals.

(AMD may be bent out of shape at such calumnies!!)  But I would suggest
that the reasons I gave are valid:

>> Why?  The reasons are as follows:
> 
>>   1. Most DMA peripherals are incredibly sluggish...
> 
>>      To keep up with the Ethernet, the LANCE will arbitrate for the
>>      bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>>      minimum.  This is about 40% of the bus bandwidth.
> 
> This is why we have things like FIFOs.  Even the 68020 running with cache 
> enabled typically uses only around 50% of the bus bandwidth.  That's not
> a bad thing, though; it's a good argument for DMA.

I guess I wasn't being quite as explicit as I should have been!  First, the
LANCE contains its own FIFO (they call it a "SILO").  Second, when I was
talking about the LANCE taking up 40% of the bus bandwidth, I didn't 
relate it to the transfer efficiency.  So, let me give an example I know
well -- our system:

  -- 16.67 MHz 68020 on a 32-bit wide bus.

  -- Actual memory access time about 240 nsec (from assertion of AS' to
     the CPU responding to DSACKx' by un-asserting AS').  Full memory cycle
     time about 330 nsec (provides RAS' precharge time).  Memory is
     asynchronous to the processor.

  -- LANCE Ethernet interface behind a 128K byte dual-ported memory which
     is organized as 32K x 32 bits from the 68020's perspective and
     64K x 16 bits from the LANCE's perspective.

The LANCE (along with its companion, the SIA) is an integrated solution
to Ethernet interfacing.  The LANCE manages its own "rings" of input
and output buffers, discriminates against messages that aren't intended
for it (it recognizes when it is addressed), and performs all the
housekeeping functions associated with Ethernet packet creation and
validation.  Thus, the LANCE can receive and store a complete (maximum
length 1536 octet) Ethernet packet before it pulls the CPU's chain.

For all that it interfaces to a fast bus (Ethernet is 10 Mbits/sec data
transfer rate), the LANCE has some disadvantages.  It has a minimum 600
nsec data transfer time with 100 nsec memory.  With our memory, which
responds in about 240 nsec, the LANCE would have an 800 nsec nominal
data transfer cycle.  Thus, the LANCE would transfer 8, 16-bit words
(one SILO full) every 12.8 microseconds, tying up the CPU bus for about
6.7 microseconds, which is 52% of the available CPU bus bandwidth.

The LANCE can transfer only 16 bits with each memory cycle.  Thus, its
data transfer rate, during the time it is using the bus, is:

    (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second

On the other hand, in our system the 68020 has an effective data transfer
rate (once the cache is loaded with the instructions) of:

    (1 long word) * (32 bits) / (330 nanoseconds) =  96 Mbits/second

If you cut that in half to reflect the fact that the 68020 has to both
pick the (32-bit long word) up and store it away, it still has a data
transfer rate of 48 Mbits/sec, which is over twice that of the LANCE.
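
Those rates are easy to verify; here's the arithmetic as a quick sketch,
using only the figures above:

    #include <stdio.h>

    int main(void)
    {
        double lance = (8 * 16.0) / 6.7e-6;  /* 8 16-bit words/6.7 us */
        double cpu   = 32.0 / 330e-9;        /* one longword/330 ns   */

        printf("LANCE:      %4.1f Mbit/s\n", lance / 1e6); /* ~19.1  */
        printf("68020 raw:  %4.1f Mbit/s\n", cpu / 1e6);   /* ~97,   */
                                             /* rounded to 96 above  */
        printf("68020 copy: %4.1f Mbit/s\n", cpu / 2e6);   /* ~48    */
        return 0;
    }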

>>   2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>>      instructions have been loaded into the cache, the only thing on the
>>      bus will be (32-bit) data transfers.  Even with reasonably slow
>>      memory (180-nanosecond access, 300-nanosecond cycle time), this means
>>      that the 68020 can transfer data twice as fast as a LANCE running
>>      on 100-nanosecond access memory.
> 
> Like I said, intelligently designed peripherals.  Let's look at a hard disk
> controller with FIFO.  The Amiga 2090 controller is such a beast.  Though
> only a 16 bit device, the same principals work in 32 bit land.

Most principals work in schools. . .

> So my hard disk controller is chugging away, fetching data from the
> relatively slow hard disk and stuffing it into the FIFO.  It sees the FIFO
> filling up, and interrupts the 68020.  The '020 springs into action, the
> disk being run by a high-priority task that was just waiting on this
> interrupt.  So far we have to do this whether the disk controller is
> DMA or shared memory.
> 
> Now let's consider the shared memory.  Say we've got 512 bytes to move.  You
> jump into a block move routine, where the cache immediately gets set up with
> the move code after the first loop pass.  You've got one memory cycle to read
> the data from shared RAM, one memory cycle to stuff it into your destination
> RAM.  So you get 256 memory cycles, plus maybe 2 extra for cache setup.
> 
> Now we go to the DMA controller, moving the same 512 bytes.  We have to set 
> up the controller with the destination RAM address, that should take maybe
> 3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next,
> maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we
> already have the data at hand, so all the controller has to do is stuff it
> in memory.  That's 128 memory cycles.  And another to re-arbitrate.
> 
> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
> by itself.

Now, let's come back down to earth!  We (Tektronix Television Systems) have
a 68020-based professional television measurement instrument (the VM700)
that is just about ready to ship to customers.  It is 32 bits wide all over
the place, for maximum data transfer rate consistent with reasonable cost.
While I will admit that the principle of the A2090 would work just fine if
one could only do it 32 bits wide, in fact it is not (reasonably) possible
for us to do it 32 bits wide!

Why?  Well, we will probably ship about as many instruments in one year
as Commodore ships Amigas in a week.  (Not bad for an instrument with a
sticker price of $16,495!)  So, I can't afford to go out and generate a
32-bit wide DMA chip with a 512-byte onboard FIFO.  I have to use what I
can buy from Motorola or Hitachi or whomever.

Believe me, we did look at DMA chips before making the basic system
design decisions -- and the DMA chips are nearly as bad as the LANCE!
Minimum DMA cycle time I found was 500 nsec, again assuming nearly
instantaneous memory response.  And the best of them were only 16 bits
wide.

>> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
>> 16 bits wide to the LANCE), you can move the data from the dual-ported
>> memory *while* the LANCE is transferring other data into it, thus
>> achieving an effective doubling of the transfer rate and freeing the
>> bus for other purposes the rest of the time.
> 
> I get the exact same effect with my FIFO, only through use of DMA I'm tying
> up the bus much less.
> 
> And you don't really free the bus the rest of the time, unless you've got
> some screaming RAM in that dual port section.  Maybe you can use some true
> dual-ported SRAM, or a FIFO like what we've got on this hard disk controller,
> but if you're talking DRAM, forget it; the 68020's going to eat all the
> available time on anything in the 80ns or slower range.

Remember, our system memory access is about 240 nsec (asynchronous).  The
dual-ported RAM on the LAN card is made of 4, 32K x 8 bit static RAM chips,
and a boatload of SSI, MSI, and PALs.  The static parts are garden-variety,
150 nsec parts, but the actual memory access time is about 240 nsec,
because there is clock-driven, no-deadlock, positive arbitration logic to
ensure that one and only one customer gets the memory at a time [it works,
too! 8^) ].  (Signetics now has a chip that allows you to do the same thing
with dynamic RAMs -- it even takes care of the refresh!)

Because of this, the LANCE can access memory once per 800 nsec (or so),
and the 68020 can get one or two 32-bit accesses in between each of the
LANCE's 16-bit accesses.  Remember, too, that while the LANCE has the
bus, its effective data rate is about 19.1 Mbits/second.  Thus, even
with the 68020 having to read the data from the dual-port RAM on one
memory cycle and write it to system memory on the next memory cycle,
the effective data transfer bandwidth for the 68020 is 48 Mbits/second.

Thus, even without the rest of the argument, my conclusion is still:

>> So, for maximum performance, hide your peripherals behind dual-ported
>> memory, and then mark those pages as "non-cacheable."

Consider something else, though.  When you read from or write to your
hard disk, the CPU is going to have to copy the data at least once.  On
a read from the disk, you do a getchr() (or whatever), which stimulates
the system to go read a sector into a buffer of its own.  Then (and only
then) it passes a byte back to you.

If the disk DMA transfer occurs on the system bus, the data moves over
that bus *twice* before it gets to the user.  On the other hand, if the
hard disk controller board has its own (dual-ported) memory, which is
accessible to the CPU, the DMA can transfer into dual-ported memory
without disturbing the CPU at all.  When the data is passed to the user,
it moves over the system bus only once.

> There's no question that having a peripheral device dump to shared RAM
> is much better than directly banging it with the CPU, Macintosh style.  And
> for very small transfer situations, it's better than DMA; a DMA controller
> has a fixed setup time.  But if you're transferring more than a few bytes at
> a time, DMA is a win.  And unless you're dealing with something that needs
> immediate response (eg, you can't wait until you've got 64 or 512 or 
> whatever bytes to block transfer), DMA is still a win on a 68020 system,
> if done correctly.  The 68020 at 32 bits/transfer will tie a 16 bit DMA
> device in transfer rate, and it's got less setup, so you definitely want
> that DMA to be 32 bits wide.

Agreed that I want the DMA to be 32 bits wide.  That is just very
difficult for those of us that cannot crank up a silicon foundry whenever
we get the itch. . .

Note again, that in real life the processor is going to have to copy the
data somewhere else (to the ultimate consumer) once it is DMA-ed into the
system disk buffer.  There will be fewer transfers over the system bus
(and thus more cycles available to the CPU) if the DMA moves data from the
disk into dual-ported memory, so it must only pass over the system bus
once.

> Finally, in a decent system, you can have DMA on your backplane going at
> the same time you've got CPU access going on your local bus, so the DMA
> won't always kick the CPU off the bus.  Amigas aren't doing it this way,
> yet.

But Amigas will, I hope, I hope, I hope. . . 8^)

(By the way, if you've followed what I was saying, that's what we have
in the VM700 -- except the DMA runs on its own private "bus," and the
CPU *always* has the system bus available to it!)

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

daveh@cbmvax.UUCP (Dave Haynie) (03/01/88)

in article <4853@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
> Keywords: DMA, closet
> Summary: DMA is great -- in its proper place. . .

> Hmmmm. . .  I expressed my belief that (at least in a 32-bit wide 68020
> system) "DMA is the *SLOW* way to go!"  In article <3291@cbmvax.UUCP>,
> Dave Haynie (daveh@cbmvax.UUCP) replied:

>> Summary: DMA is still *FAST*er

> In my previous article, I suggested:
> 
>>>              . . .  However, for best performance you want to put the DMA
>>> peripherals on one side of a dual-ported memory and let the CPU do the
>>> data moving.

Thus, re-creating a situation very much like the way the chip bus works.  Your
design forces memory typing (MEMF_CHIP, MEMF_LAN, MEMF_HARDDISK, etc.).

> I guess I wasn't being quite as explicit as I should have been!  ...
> Thus, the LANCE would transfer 8, 16-bit words (one SILO full) every 12.8
> microseconds, tying up the CPU bus for about 6.7 microseconds, which is
> 52% of the available CPU bus bandwidth.
> 
> The LANCE can transfer only 16 bits with each memory cycle.  

Here we go again with what I meant by intelligently designed peripherals.  If
you're on a 32 bit bus, your DMA should be 32 bits wide.  And you should use a
larger FIFO, like maybe 64-128 bytes.  If you can't do either or both of these,
then, as I showed before, you'll get better performance from a 68020 move.

> Thus, its data transfer rate, during the time it is using the bus, is:

>     (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second

No intelligence here!  Why would you take over the bus and then just 
sit there?  If you are only transferring 16 bits at a time, this should
give you half the 68020 rate, 48 Mbits/second, once arbitration has
taken place.  A big enough FIFO makes the arbitration time negligible.
Extend this to 32 bits wide and you're at twice the 68020 rate.  If this can't
be done from a circuit point of view, either redesign the LAN chip to make
effective use of DMA, or admit that it's a bad design.  If there are other
reasons, like the software or the user can't handle buffering delays, then
this isn't a good application for DMA, and we can turn our attention to
problems that are well suited to DMA, like hard disk controllers.  But don't
pan DMA because it doesn't fit an arbitrary case on an arbitrary chip.

>> [Timing analysis removed]
>> 
>> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
>> by itself.

> Now, let's come back down to earth!  

Naa, that's where IBM does their design work.

> While I will admit that the principle of the A2090 would work just fine if
> one could only do it 32 bits wide, in fact it is not (reasonably) possible
> for us to do it 32 bits wide!

> Why?  Well, we will probably ship about as many instruments in one year
> as Commodore ships Amigas in a week.  (Not bad for an instrument with a
> sticker price of $16,495!)  So, I can't afford to go out and generate a
> 32-bit wide DMA chip with a 512-byte onboard FIFO.  I have to use what I
> can buy from Motorola or Hitachi or whomever.

OK, but again, you shouldn't blast the concept of DMA just because you can't
use it in your particular situation.  We make lots of Amigas, and lots of 
custom chips.  Like the DMA chip on the A2090 card.  That's only 16 bit in
this case, but we're only dealing with a 16 bit bus.

> Remember, our system memory access is about 240 nsec (asynchronous).  The
> dual-ported RAM on the LAN card is made of 4, 32K x 8 bit static RAM chips,
> and a boatload of SSI, MSI, and PALs.  The static parts are garden-variety,
> 150 nsec parts, but the actual memory access time is about 240 nsec,
> because there is clock-driven, no-deadlock, positive arbitration logic to
> ensure that one and only one customer gets the memory at a time [it works,
> too! 8^) ].  (Signetics now has a chip that allows you to do the same thing
> with dynamic RAMs -- it even takes care of the refresh!)

Well, I make the memory cycle time of a 16.67 MHz 68020 at just under 180ns.
So you're slowing down already.  But obviously a DMA device has to follow the
same rules as the 68020.  Now we have this dual ported memory.  I certainly 
believe you can build an arbiter that'll allow access to the RAM by only one
customer at a time.  But what happens when they both want it?  It appears to
me that one of them is getting wait stated.  That's what I meant by having
very FAST memory there.  The FIFO scheme starts DMA before the FIFO is 
completely filled, so that it fills just a bit before the transfer is complete.
You get your chunk of memory DMAed at full bus speed, and you get it from the
disk as fast as it could be received.  Now with the dual port scheme, you can
start filling the shared RAM early, too, since your data isn't coming in at
full bus speeds.  But eventually you want the transfer to start.  If stuff is
still coming into that memory, your transfer is going to suffer unless the RAM
is very fast.  The Amiga's CHIP RAM, for instance, is twice the speed of the
68000 memory cycle, so once you're synced up with it, there are no wait states
in normal operation (eg, blitter's well behaved, graphics are medium 
resolutions).  So this is a good scheme.  If I ran memory at the same speed
as the 68000 memory cycle, I'd hit wait states all the time trying to access
CHIP RAM.  What you're describing would only work well if the shared memory
has relatively little truly shared access.

> Consider something else, though.  When you read from or write to your
> hard disk, the CPU is going to have to copy the data at least once.  On
> a read from the disk, you do a getchr() (or whatever), which stimulates
> the system to go read a sector into a buffer of its own.  Then (and only
> then) it passes a byte back to you.

No.  The latest Amiga DOS software is set up to read data directly into its
final destination.  From C language or whatever, you may get double buffering
if you use character-by-character I/O, but if you make a direct
OS call, the DMA device can use the given buffers directly.
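
As a sketch of what that direct path looks like from the application side
(one dos.library Read() into the data's final home; the proto header name
is from later SDKs, and whether a given controller actually DMAs straight
into the buffer depends on the driver, as described above):

    #include <exec/types.h>
    #include <proto/dos.h>

    /* One Read() into the final buffer.  Under the Fast FileSystem
     * with a DMA controller, the transfer can land directly in buf,
     * with no intermediate copy. */
    LONG read_direct(BPTR file, APTR buf, LONG len)
    {
        return Read(file, buf, len);    /* data crosses the bus once */
    }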

> Agreed that I want the DMA to be 32 bits wide.  That is just very
> difficult for those of us that cannot crank up a silicon foundry whenever
> we get the itch. . .

Oh, well, I guess some of you will always have to live like that :-).

> Note again, that in real life the processor is going to have to copy the
> data somewhere else (to the ultimate consumer) once it is DMA-ed into the
> system disk buffer.  

No it isn't.  The only time the shared memory scheme wins is if the final
destination happens to be in the area of shared memory, in MEMF_HARDDISK so
to speak.  Otherwise, you'll have to do a CPU copy to the final destination,
whereas the DMA device could have put it directly there, since it can 
address all of memory.  I guess you can always tune your system software to
take advantage of the hardware, and perhaps the other way 'round too.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

stever@videovax.Tek.COM (Steven E. Rice, P.E.) (03/11/88)

Dave Haynie's (daveh@cbmvax) most recent article was number
<3394@cbmvax.UUCP>.  In it, he cast aspersions on the poor, struggling
LANCE and suggested that real systems do 32-bit DMA.  Well, maybe --
but if you want to use Ethernet, the LANCE is about the only way to
go, slow or no!

In a perfect world, 32-bit DMA with a 512-byte assembly buffer and 
fast-as-a-speeding-bullet burst transfers would be possible.  In real
life, we have to make do with what we can buy.  (Commodore can build
what it needs; the economics in the Television Test and Measurement
market are different than those in the personal computer market.)

There is another thought, too -- if you have only one DMA device, you
could argue that it shouldn't make much difference if it DMAs into
system RAM or into a dual-ported buffer.  If you have more than one
device contending for the system bus, however, multiple dual-ported
buffers are a clear win.

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

daveh@cbmvax.UUCP (Dave Haynie) (03/25/88)

in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
> 
> Dave Haynie's (daveh@cbmvax) most recent article was number
> <3394@cbmvax.UUCP>.  In it, he cast aspersions on the poor, struggling
> LANCE and suggested that real systems do 32-bit DMA.  Well, maybe --
> but if you want to use Ethernet, the LANCE is about the only way to
> go, slow or no!

Calm down!  That's not what I said.  I said that very high 
bandwidth-consuming operations, such as hard disk interfacing, where the
transfer between an I/O device and CPU-addressable main memory can be sent
in large atoms, are best served by DMA, even in a 68020 or 68030 system.  I
also said that in systems where transfers must occur in small atoms or at
relatively slow speed (like perhaps networks or things which must be
highly interactive), the I/O scheme to shared CPU memory was a good idea.

> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and 
> fast-as-a-speeding-bullet burst transfers would be possible.  In real
> life, we have to make do with what we can buy.  (Commodore can build
> what it needs; the economics in the Television Test and Measurement
> market are different than those in the personal computer market.)

That's true, Commodore can build what it needs for those cases.  The 16 bit 
wide DMA driven hard disk controller on the 16 bit bus delivers around 625K
bytes/second with the Fast FileSystem.  The Fast FileSystem allows DMA from the
hard disk directly to the target memory; no intermediate buffers are used.  I
believe that any peripheral going this fast wants DMA.  It's fully extensible
to a 32 bit machine, and at _conservative_ 32 bit machine rates that's 
2.5 megabytes/second throughput (not even getting to things like burst 
transfers, which are ideally suited to DMA).  If your LAN is only
going 2.5 megabits/sec, that's certainly overkill and extra cost.

Which seems to make sense even today; most Amiga hard drives are DMA driven,
most Amiga LANs are CPU driven via shared RAM.

> There is another thought, too -- if you have only one DMA device, you
> could argue that it shouldn't make much difference if it DMAs into
> system RAM or into a dual-ported buffer.  If you have more than one
> device contending for the system bus, however, multiple dual-ported
> buffers are a clear win.

Not unless you have multiple CPUs to read them.


> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

stever@videovax.Tek.COM (Steven E. Rice, P.E.) (04/01/88)

In article <3507@cbmvax.UUCP>, Dave Haynie (daveh@cbmvax.UUCP) writes:

> in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
>> 
>> Dave Haynie's (daveh@cbmvax) most recent article was number
>> <3394@cbmvax.UUCP>.  In it, he cast aspersions on the poor, struggling
>> LANCE and suggested that real systems do 32-bit DMA.  Well, maybe --
>> but if you want to use Ethernet, the LANCE is about the only way to
>> go, slow or no!
> 
> Calm down!  That's not what I said.  I said that very high 
> bandwidth-consuming operations, such as hard disk interfacing, where the
> transfer between an I/O device and CPU-addressable main memory can be sent
> in large atoms, are best served by DMA, even in a 68020 or 68030 system.  I
> also said that in systems where transfers must occur in small atoms or at
> relatively slow speed (like perhaps networks or things which must be
> highly interactive), the I/O scheme to shared CPU memory was a good idea.

I think there is still some misunderstanding here.  When I mention dual-
ported memories, I am speaking of memory that is "CPU addressable main
memory"!  It just happens to also be shared (on a cycle-by-cycle basis)
with some other device, which could be an I/O device or another CPU.

The Amiga implements a form of "shared" memory -- chip memory.  The
CPU gets access to chip memory on a shared basis, arbitrated cycle
by cycle.  A different arrangement is seen on the A2620 (?) card: its
68020 will have 2 or 4 megabytes of 32-bit wide local memory which no one
can deny it access to.  Thus, if DMA is
occurring to "main" memory, the 68020 may not be blocked at all.  Carrying
the idea one step further simply removes more limitations from the system,
giving the CPU unrestricted access to the system bus and immediate access
to any memory that is not in use during that memory cycle.
 
>> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and 
>> fast-as-a-speeding-bullet burst transfers would be possible.  In real
>> life, we have to make do with what we can buy.  (Commodore can build
>> what it needs; the economics in the Television Test and Measurement
>> market are different than those in the personal computer market.)
> 
> That's true, Commodore can build what it needs for those cases.  The 16 bit 
> wide DMA driven hard disk controller on the 16 bit bus delivers around 625K
> bytes/second with the Fast FileSystem.  The Fast FileSystem allows DMA from
> the hard disk directly to the target memory; no intermediate buffers are
> used.  I believe that any peripheral going this fast wants DMA.  It's fully
> extensible to a 32 bit machine, and at _conservative_ 32 bit machine rates
> that's 2.5 megabytes/second throughput (not even getting to things like
> burst transfers, which are ideally suited to DMA).  If your LAN is only
> going 2.5 megabits/sec, that's certainly overkill and extra cost.

Ethernet is 10 megabits/sec.

> Which seems to make sense even today; most Amiga hard drives are DMA driven,
> most Amiga LANs are CPU driven via shared RAM.

In the case of Ethernet I/O, transmissions are packetized with quite
a bit of protocol overhead.  Thus, the data to be transmitted must be
broken into chunks no larger than the largest legitimate packet and
shipped out one packet at a time.  To do this, the CPU is going to have
to move the data anyway -- it has to configure it in a form the I/O
device can use.  In this case, the copy from what you might consider
"main" memory to "shared" memory is free.

Starting with the FFS rate of 625K bytes/second and doubling that for a 
32-bit bus gives 1.25 megabytes/second.  This translates to a 10
megabit/second transfer rate, which is the same as the Ethernet.  Using
your figure of 2.5 megabytes per second gives 20 megabits/second
throughput.  But our CPU bus bandwidth is about 100 megabits/second
(approximately 330 nsec main memory cycle time [not *access* time --
*cycle* time]).  Thus, a 2.5 megabyte/second disk transfer would occupy
only 20% of the bus bandwidth.
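
Checking that percentage as a sketch (330 ns per 32-bit longword gives the
~100 Mbit/s bus figure):

    #include <stdio.h>

    int main(void)
    {
        double bus  = 32.0 / 330e-9;    /* ~97 Mbit/s, quoted as ~100 */
        double disk = 2.5e6 * 8.0;      /* 2.5 Mbyte/s = 20 Mbit/s    */
        printf("disk occupies %.1f%% of the bus\n",
               100.0 * disk / bus);     /* ~20.6%, i.e. about 20% */
        return 0;
    }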

If the disk DMA is transferring into unshared main memory, the CPU will
just have to wait.  At 2.5 megabytes/second (assuming 32-bit transfers),
the disk will request one memory access every 1.6 microseconds.

One possibility is to arbitrate for the bus for each transfer.  Looking
at the timing diagrams in the Motorola 68020 manual, one finds that
there is a minimum of 1/2 clock period and a maximum of 1 clock period
from the end of clock state S5 until Bus Grant* is asserted.  There is
also a note in paragraph 5.2.7.4 which says that "all asynchronous
inputs to the MC68020 are internally synchronized in a maximum of two
cycles of the system clock."  This implies that the minimum to resume
processing is 1 clock cycle.  There is probably one additional cycle
needed for the CPU to resume driving the address and data lines.

Assuming a memory cycle time of 330 ns (which is what ours is) with
240 ns read or write access time, each 32-bit word transferred would
hold the CPU bus for one arbitration time (1/2 to 1 clock cycles, or
30 to 60 ns in a 16.7 MHz system) plus one transfer time (240 ns) plus
one bus relinquishment time (1 to 2 clock cycles, or 60 to 120 ns)
plus one driver turnon time (1 clock cycle, or 60 ns).  The minimum
time required would be 390 ns, the maximum time would be 480 ns, and
the mean time would be 435 ns.

435 ns out of 1.6 us is 27.2% of the bus bandwidth occupied.  But not
only is 27.2% of the bus bandwidth occupied, the CPU is denied the
bus 27.2% of the time!  This translates directly into throughput
reduction.
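
The same arithmetic in code, with all the times being the estimates above
(one clock at 16.7 MHz is about 60 ns):

    #include <stdio.h>

    int main(void)
    {
        double min  = 30e-9 + 240e-9 + 60e-9 + 60e-9;   /* 390 ns */
        double max  = 60e-9 + 240e-9 + 120e-9 + 60e-9;  /* 480 ns */
        double mean = (min + max) / 2.0;                /* 435 ns */

        printf("mean bus hold: %.0f ns = %.1f%% of 1.6 us\n",
               mean * 1e9, 100.0 * mean / 1.6e-6);      /* ~27.2% */
        return 0;
    }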

Another possibility is to block the data into (e.g.) 512 byte blocks and
then arbitrate for the bus once per block.  This drops the bus bandwidth
occupation to 20% (since one arbitration is insignificant compared to the
time to transfer 512 bytes as 128 32-bit words).  But the CPU is still
denied the bus 20% of the time.

If, however, the disk data is DMAed into dual-ported memory, it can deny
an access to the CPU a *maximum* of 20% of the time, and then only if
the CPU is fetching all of its instructions from the shared memory!  In
actual operation, it is likely to be much less than that.  There is also
no reason the receiving process cannot use the data directly from the
dual-ported memory, although in many cases there will be at least one
copy between initial transfer and use of the data.

>> There is another thought, too -- if you have only one DMA device, you
>> could argue that it shouldn't make much difference if it DMAs into
>> system RAM or into a dual-ported buffer.  If you have more than one
>> device contending for the system bus, however, multiple dual-ported
>> buffers are a clear win.
> 
> Not unless you have multiple CPUs to read them.

Given just a single hard disk transfer as you have described it, DMA into
a dual-port buffer avoids losing 20% of the CPU's processing capability.
That seems worthwhile to me!

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever

daveh@cbmvax.UUCP (Dave Haynie) (04/12/88)

in article <4937@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:

> Another possibility is to block the data into (e.g.) 512 byte blocks and
> then arbitrate for the bus once per block.  This drops the bus bandwidth
> occupation to 20% (since one arbitration is insignificant compared to the
> time to transfer 512 bytes as 128 32-bit words).  But the CPU is still
> denied the bus 20% of the time.

First of all, with a better bus design (eg, not the current Amiga bus, but
perhaps a future version that's 32 bits wide), there's zero or very near
zero arbitration time; the bus's owner is determined dynamically on a 
cycle by cycle basis.

Secondly, since the 68020 with cache running only wants the bus 50% or so
of the time, on average, you take your 20% figure and immediately reduce it 
to 10%, on average.  It could be as bad as 20%, it could be as good as
0%, depending on what the CPU is doing.

Now we add a priority scheme.  If the CPU operation is more important, it
gets the bus for any cycles it needs, and the DMA device gets whatever it
wants from the remaining 50% of the bus.  And that's assuming that the bus
is limited to CPU bus speeds.  It's pretty simple to make DMA devices run
nybble or page mode cycles that the CPU can't keep up with, and most
memory systems can be designed with this in mind nearly for free.  So with
DMA going with a nybble transfer, you're now down to less than 5% of the
bus bandwidth for that transfer.  VME and non-Apple NuBus both do things
like this.

> Given just a single hard disk transfer as you have described it, DMA into
> a dual-port buffer avoids losing 20% of the CPU's processing capability.
> That seems worthwhile to me!

But you're still missing the point.  The CPU has to stop what it's doing to
transfer the data by hand.  If it did that JUST as efficiently as the DMA
device, you'd still be losing whatever CPU time you claim is being eaten
by the DMA transfer, 20% or whatever (keep in mind this 20% figure only
applies during an actual transfer).  If the DMA transfer happens twice as
fast as the CPU could transfer the data, then I'm gaining in CPU speed,
even though I'm kicking the CPU off the bus for awhile.  DMA transfers on
the Amiga bus with a 68020 go twice as fast as the 68020 could possibly
transfer them.  68000-based CPU transfers are more like 1/4th the speed of
the DMA device.  My point is that someone has to do the work of transfer
unless you can live with the data exactly where it's dumped in your
shared memory scheme.  If you know there's no transfer required, share the
memory, but if there is, and especially if the memory can be used as is
once it reaches its destination (like NewFS), DMA wins.

There's actually a test case of this available in the Amiga world.  As I've
already mentioned, the A2090 controller uses a FIFO and DMA to complete its
transfers, and achieves about 625K Bytes/Second.  There's a new SCSI 
controller out there, from a company called Great Valley Peripherals, that
uses an I/O chip DMAing to shared RAM (4K of static RAM on-board, so once
you're in sync I suspect there will rarely be a collision between the
CPU and the peripheral chip).  I don't have any benchmarks on this new board,
but I guarantee it'll be slower.

> 					Steve Rice
-- 
Dave Haynie  "The B2000 Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"