[comp.sys.amiga.tech] DMA or polling

lphillips@lpami.wimsey.bc.ca (Larry Phillips) (08/09/89)

In <8908092130.AA23369@jade.berkeley.edu>, 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>>  (Valentin Pepelea) writes:
>>
>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.
>> >Unfortunately this means that there are two transfers occurring, a slow
>> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
>> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
>> >directly from the hard drive into internal memory, thus tying up your CPU
>> >much longer.
>>
>>       Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period. Unless of course your DMA circuitry is
>> totally braindead.
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel.

No, it's you who do not understand. Clearly, you don't have the faintest idea
of how the 2090A works. The speed of the data coming off the disk is the same
(given the same disk drive) in all cases, and given a sufficiently small amount
of data so that things like MaxTransfer do not come into play.

>That is why DMAing directly from the hard disk to internal memory is a losing
>proposition.  GVP provides a cache into which it reads from the disk while
>leaving the Amiga's 680x0 alone.  Only then does it transfer the data from the
>cache into internal memory at full speed, without having to wait for the
>mechanical limitations of the hard disk.

The 2090 waits for the mechanical limitations of the hard disk, yes, but it most
definitely does NOT hang on to the bus for the entire time. It has a FIFO that
fills from the HD, and that empties in small bursts. The effect of this is that
the data comes into the FIFO at a fixed rate, that being exactly the same rate
as the cache on the GVP, and it is in the unloading of the data into main
memory where the 2090 comes out a resounding winner over the GVP. The only time
it does not come out a clear winner is in the case where multiple sectors are
being transferred and there is a lot of contention from other DMA sources. In
this case, the 2090 has to retry because of data overruns.
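
To put rough numbers on it -- and these rates are assumed for the sake of
illustration, not measured from a 2090 -- a few lines of C show how little of
the total transfer time the bus is actually held when a FIFO is emptied in
bursts:

/* Back-of-the-envelope model (assumed rates, not 2090 measurements):      */
/* data trickles into the FIFO at the disk's media rate, but the bus is    */
/* only owned during the short full-speed bursts that empty the FIFO.      */
#include <stdio.h>

int main(void)
{
    double media_rate = 625000.0;    /* bytes/sec off the platter (assumed)    */
    double burst_rate = 3500000.0;   /* bytes/sec during a DMA burst (assumed) */
    long   nbytes     = 34L * 512L;  /* one track's worth of data              */

    double disk_time = nbytes / media_rate;  /* time for the data to arrive   */
    double bus_time  = nbytes / burst_rate;  /* time the bus is actually held */

    printf("data arrives over %.2f ms\n", disk_time * 1000.0);
    printf("bus held for only %.2f ms (%.0f%% of the transfer)\n",
           bus_time * 1000.0, 100.0 * bus_time / disk_time);
    return 0;
}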

I have yet to play with or study a HardFrame, so I cannot vouch for its method
of transfer, but since it is as fast or faster than the 2090, I can only assume
they did it right.

>Perhaps it should then DMA from its cache into internal memory, but that is
>another question. Even if it did that, it still would get lower diskperf's than
>the A2090 or HardFrame. The improvement would be rather limited, and the cost
>would be higher. The GVP controller is expensive enough as it is.

Yes, perhaps they should, but they don't. That, and the necessity for
specifying a low MaxTransfer value are what makes the thing a comparative slug.
Check out the 2090A and the HardFrame first, then repost your perceptions.
There will be a quiz. :-)

>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.
>
>Obviously some EE types are better than others. Good luck on your '030
>accelerator design.

A rather unfortunate choice of parting shot, wouldn't you say? You've been a
tad cranky lately Valentin. Mellow out willya?

-larry

--
"So what the hell are we going to do with a Sun?" - Darlene Phillips -
+-----------------------------------------------------------------------+ 
|   //   Larry Phillips                                                 |
| \X/    lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips |
|        COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com        |
+-----------------------------------------------------------------------+

atheybey@lcs.mit.edu (Andrew Heybey) (08/09/89)

In article <120232@sun.Eng.Sun.COM> raz%kilowatt@Sun.COM (Steve -Raz- Berry) writes:
	   Let's look at a typical bus cycle for the GVP or any polled
   device. First, your device driver has to find out that data is waiting
   to be transferred; in either a DMA or polled transfer this is likely to
   be a similar amount of overhead. Secondly the data must be transferred.
   To do this a polled device has to perform at least three bus cycles.
   One to fetch the data, two to transfer the data to its new
   destination and three to decrement and branch to the top of the loop
   again. This of course is the absolute minimum for the loop.

Sounds like a good argument to me.  That said, I've got a GVP and as
soon as I can scrape together the cash to buy a drive, I'll even have
it installed :-(.  *If* GVP's software has this hypothetical tight
loop to transfer data, I should be able to win big by installing a
68010, no?

Am I all wet?  Has anyone disassembled their GVP driver to find out
what's going on in there?

andrew

--
------------
Andrew Heybey, atheybey@ptt.lcs.mit.edu, uunet!ptt.lcs.mit.edu!atheybey
MIT Laboratory for Computer Science
Room 509, 545 Technology Square, Cambridge, MA  02139    (617) 253-6011

451061@UOTTAWA.BITNET (Valentin Pepelea) (08/10/89)

Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>

> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>  (Valentin Pepelea) writes:
>
> >The net result is that the processor therefore spends less time on the data
> >transfer and is available more often for other concurrent tasks.
> >Unfortunately this means that there are two transfers occurring, a slow
> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
> >directly from the hard drive into internal memory, thus tying up your CPU
> >much longer.
>
>       Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
> Logically, if you look at the time to complete a given task, based only
> on the number of bus cycles it takes to transfer a given block of data,
> DMA will always win. Period. Unless of course your DMA circuitry is
> totally braindead.

Clearly you don't understand, or perhaps I did not explain well. The bottleneck
here is the speed at which the hard disk turns, and therefore the rate at which
data is available to the DMA channel. That is why DMAing directly from the hard
disk to internal memory is a losing proposition. GVP provides a cache into
which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
does it transfer the data from the cache into internal memory at full speed,
without having to wait for the mechanical limitations of the hard disk.

Perhaps it should then DMA from its cache into internal memory, but that is
another question. Even if it did that, it still would get lower diskperf's than
the A2090 or HardFrame. The improvement would be rather limited, and the cost
would be higher. The GVP controller is expensive enough as it is.

> Sorry, this is one EE type that
> just won't believe it. The Amiga is a DMA machine, that is part of
> what gives it its amazing speed for graphics and sound.

Obviously some EE types are better than others. Good luck on your '030
accelerator design.

> (this is part of my effort to insure a "kinder and gentler" netdome)

Perhaps you meant "thunder.net.dome". Two men enter, one flamed leaves.

Valentin
_________________________________________________________________________
"An  operating  system  without         Name:   Valentin Pepelea
 virtual memory is an operating         Phonet: (613) 231-7476
 system without virtue."                Bitnet: 451061@Uottawa.bitnet
                                        Usenet: Use cunyvm.cuny.edu gate
         - Ancient Inca Proverb         Planet: 451061@acadvm1.UOttawa.CA

raz%kilowatt@Sun.COM (Steve -Raz- Berry) (08/10/89)

In article <ATHEYBEY.89Aug9093829@allspice.lcs.mit.edu> atheybey@lcs.mit.edu (Andrew Heybey) writes:
>In article <120232@sun.Eng.Sun.COM> raz%kilowatt@Sun.COM (Steve -Raz- Berry) writes:
> [I delete my own diatribe in comparing and contrasting DMA vs. polling]
>
>Sounds like a good argument to me.  That said, I've got a GVP and as
>soon as I can scrape together the cash to buy a drive, I'll even have
>it installed :-(.  *If* GVP's software has this hypothetical tight
>loop to transfer data, I should be able to win big by installing a
>68010, no?

The three word instruction loop will definitely help your performance
*if* GVP wrote their software that way. BTW, the 3 bus cycle figure
is more than likely wrong for a plain jane 68K, mainly because I didn't
figure in the instruction fetches. Probably more like 6 or so.
So for a 68010, you only have to count the data transfers. I'd guess
a bus cycle for data fetch from the drive, and a bus cycle for storing
to the destination memory. You still have some latency waiting for the
CPU to decrement and branch to the start of the loop, but this is still
not too bad.

start:
	move.w (a0),(a1)+	;move data from fifo to destination.
	dbra   d0,start 	;decrement counter and loop. (dbra -- dbeq would quit early on a zero word.)

That should fit into the three word '010 instruction cache.

Of course you still win bigger with a DMA card.

>Am I all wet?  Has anyone disassembled their GVP driver to find out
>what's going on in there?

I'd be curious to find out too.

---
Steve -Raz- Berry     Disclaimer: It wasn't me! I was volatilizing my esters.
UUCP: sun!kilowatt!raz                   ARPA: raz%kilowatt.EBay@sun.com
KILOWATT: sun!kilowatt!archive-server    archive-server%kilowatt.EBay@sun.com

raz%kilowatt@Sun.COM (Steve -Raz- Berry) (08/10/89)

In article <8908092130.AA23369@jade.berkeley.edu> 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>>  (Valentin Pepelea) writes:
>>
> [old argument deleted]
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel. That is why DMAing directly from the hard
>disk to internal memory is a losing proposition. GVP provides a cache into
>which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
>does it transfer the data from the cache into internal memory at full speed,
>without having to wait for the mechanical limitations of the hard disk.

The Hardframe also provides a FIFO (call it a cache if you like) on
board. I would think that if a DMA controller operated slower than the
bus it's connected to, then that would fall under the category of
brain-dead.

>Perhaps it should then DMA from its cache into internal memory, but that is
>another question. Even if it did that, it still would get lower diskperf's than
>the A2090 or HardFrame. The improvement would be rather limited, and the cost
>would be higher. The GVP controller is expensive enough as it is.

How do you justify that? DMA means running at bus speeds, full tilt, all
out gangbusters etc. You are using every cycle to transfer data, up to
the limit imposed by the device driver. The only increased cost that I see
is in the engineering time put into it. The Hardframe goes for $299, I
don't see that as astronomical, especially when you can probably get
$50 off of that price mail order. {Computer Mart has it for $257}

>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.

>Obviously some EE types are better than others. Good luck on your '030
>accelerator design.

I'll take it. I *know* it's going to be a bitch.

>> (this is part of my effort to insure a "kinder and gentler" netdome)

>Perhaps you meant "thunder.net.dome". Two men enter, one flamed leaves.
>Valentin

I told you via email, I mean no malice. If you refuse to accept that,
that's up to you.

---
Steve -Raz- Berry     Disclaimer: It wasn't me! I was volatilizing my esters.
UUCP: sun!kilowatt!raz                   ARPA: raz%kilowatt.EBay@sun.com
KILOWATT: sun!kilowatt!archive-server    archive-server%kilowatt.EBay@sun.com

daveh@cbmvax.UUCP (Dave Haynie) (08/11/89)

in article <8908092130.AA23369@jade.berkeley.edu>, 451061@UOTTAWA.BITNET (Valentin Pepelea) says:

> Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>

>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>>  (Valentin Pepelea) writes:

>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.

>>       Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period. 

> Clearly you don't understand, or perhaps I did not explain well. The 
> bottleneck here is the speed at which the hard disk turns, and therefore 
> the rate at which data is available to the DMA channel. 

>> Sorry, this is one EE type that just won't believe it. 

> Obviously some EE types are better than others. 

Well, you all know me as an EE type.  

I think there's confusion here because the problem hasn't been properly
decomposed.  There are two transfers going on in most hard drive systems --
from the drive to the controller, and from the controller to system memory.
It's always a losing proposition to transfer data directly to system memory
as it comes off the drive, regardless of whether you go via a CPU read
method or a DMA method.  Fortunately, it's almost impossible as
well, unless you're dealing with direct manipulation of an ST-506 interface.

Assuming a SCSI device, you really don't have any idea how the data is 
handled between the physical hard drive and the SCSI channel.  Still, the best
a direct asynchronous SCSI read or DMA can do is significantly less than any
buffering scheme you might come up with.  The Apple Macintosh is a good example
of what happens when you don't buffer up your SCSI, if for no other reason than
to convert the SCSI byte stream to a word stream before travelling between the
controller and the system memory.  So let's agree not to take any simple,
stupid approaches -- all the mentioned controllers, GVP, Commodore, and
Microbotics, take a much more intelligent approach.

GVP is the simplest in concept.  It sucks up a whole block into local RAM,
then transfers this at memory-to-memory speeds across the bus, from its
local RAM to its final destination.  On a 68000, even with some cleverly
designed copy loops like CopyMemQuick() or similar, you'll still have over
two bus crossings per word transferred -- one from the local RAM to the 
68000, one from the 68000 to the system RAM, and occasional stops to fetch
opcodes.  With a 68010 or better, you can basically ignore the opcode fetch
time, but you still have the two complete bus crossings per word.  With a
68020 or 68030 and some 32 bit memory, you can reduce this to two slow and
one fast bus crossings per longword, which comes pretty close to one bus
crossing per word, but not quite.

The Commodore controllers are all DMA driven and backed by a FIFO.  The 2090
will read from the SCSI controller into its FIFO, and when the FIFO starts to
fill, it'll take the bus, dump 32 words across at full speed, and then give
back the bus.  This results in one bus crossing per word, plus a small bus
arbitration time.  Most other DMA driven controllers work very similarly.
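
Putting the above in rough numbers (a model only, with the per-word costs
taken straight from the description above; the opcode-fetch overhead is an
assumed figure for a well-unrolled copy loop):

/* Bus crossings per 16-bit word moved, per the description above.         */
#include <stdio.h>

int main(void)
{
    double cpu_68000 = 2.0 + 0.2;       /* read + write + amortized opcode fetches (0.2 assumed) */
    double cpu_68020 = 3.0 / 2.0;       /* two slow + one fast crossing per longword             */
    double dma_fifo  = 1.0 + 1.0/32.0;  /* one per word + one arbitration per 32-word burst      */

    printf("68000 programmed copy  : %.2f crossings/word\n", cpu_68000);
    printf("'020/'030 + 32 bit RAM : %.2f crossings/word\n", cpu_68020);
    printf("DMA through a FIFO     : %.2f crossings/word\n", dma_fifo);
    return 0;
}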

The main idea here is that the fastest a non-DMA controller will ever run is
approximately the same as the normal speed of a DMA controller.  Without a
68020 or 68030 and some 32 bit RAM, the DMA controller is always a win.  You
can, of course, pick a bad DMA controller and compare it to a good programmed
controller, or vice versa, to accentuate the point of YOUR particular
religious views, but I'm dealing in science here.

There is one situation where a non-DMA device will run faster than a DMA device
in Amiga systems.  If you have a 68020 or 68030 system with 32 bit memory above
the 24 bit address space of the 68000, a good non-DMA device like GVP's will go
faster under FFS.  The deal here is that the programmed transfer doesn't have 
any 24 bit limits, while the DMA transfer does.   Plus, with a 32 bit card, the
non-DMA transfer is already approaching the speed of the DMA transfer (the
difference with a fast '030 card may be as much software overhead as hardware
differences).  So while the non-DMA transfer works normally, the DMA device must
dump its data to a temporary RAM buffer, and then run a CPU driven copy to the
final destination.  That copy is likely about as fast as the non-DMA transfer,
so in this situation, the non-DMA device may be around twice as fast as the
DMA transfer.  This situation will disappear with full 32 bit DMA devices, but
you won't be having them on the A2000 bus.
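
In pseudo-driver terms, the double handling looks something like this (a
sketch only -- start_disk_dma() is a stand-in for the real DMA engine, not
how any shipping driver is actually written):

/* A 24 bit DMA controller can't reach 32 bit memory above 16MB, so the    */
/* data lands in a bounce buffer first and the CPU copies it the rest of   */
/* the way -- which is about as much work as a programmed transfer.        */
#include <string.h>

#define DMA_REACH (1UL << 24)             /* 24 bit address limit            */

static unsigned char bounce[512];         /* buffer inside DMA-able memory   */

static void start_disk_dma(void *dst, unsigned long len)
{
    (void)dst; (void)len;                 /* placeholder for the hardware    */
}

void read_sector(unsigned char *dst)
{
    if ((unsigned long)dst + 512 <= DMA_REACH) {
        start_disk_dma(dst, 512);         /* straight DMA, one crossing/word */
    } else {
        start_disk_dma(bounce, 512);      /* DMA into the low buffer...      */
        memcpy(dst, bounce, 512);         /* ...then a CPU copy on top       */
    }
}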

> Valentin

-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
           Be careful what you wish for -- you just might get it

rachamp@mbunix.mitre.org (Richard A. Champeaux) (08/11/89)

In article <8908092130.AA23369@jade.berkeley.edu> 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>>  (Valentin Pepelea) writes:
>>
>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.
>> >Unfortunately this means that there are two transfers occurring, a slow
>> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
>> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
>> >directly from the hard drive into internal memory, thus tying up your CPU
>> >much longer.
>>
>>       Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period. Unless of course your DMA circuitry is
>> totally braindead.
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel. That is why DMAing directly from the hard
>disk to internal memory is a losing proposition. GVP provides a cache into
>which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
>does it transfer the data from the cache into internal memory at full speed,
>without having to wait for the mechanical limitations of the hard disk.

You keep claiming that the transfer from the disk is slow, and the transfer by
the CPU is fast.  Have you ever bothered to calculate the speeds?

The minimum loop required for transferring words from GVP's onboard buffer to
main memory is the following:

loop:	move  (a0)+,(a1)+	(24 clock cycles)
	dbra  d0,loop		(18 clock cycles)

The execution times were found in Motorola's 68000 programmer reference manual.
So it takes 42 clock cycles to transfer 2 bytes.  At 7.12 MHz, that's 5.89 us to
transfer 2 bytes, or 339 kbytes/sec, maximum.  Assuming no transfer time from
the disk, and no track to track stepping time, the maximum transfer rate is
339 kbytes/sec.

I don't even pretend to know all of the delays and bottlenecks associated with
a SCSI drive, but let's look at the one you mentioned: the disk RPM.  Hard
disk drives, I believe, spin at 3600 RPM.  My ST296N has 34 sectors per track
with 512 bytes per sector.  It is formatted with an interleave of 1.  Assuming
that there are no other delays, the disk itself can deliver 1.04448 Mbytes/sec.
I realize that there are a bunch of other factors, but the drive rpm is not
the bottleneck.
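
Here's the same arithmetic in C, in case anyone wants to plug in their own
numbers (the cycle counts and drive geometry are simply the figures quoted
above, not fresh measurements):

/* CPU copy-loop rate vs. raw platter rate, using the numbers above.       */
#include <stdio.h>

int main(void)
{
    double clock      = 7.12e6;           /* CPU clock in Hz                 */
    double cycles     = 24.0 + 18.0;      /* cycles per move/dbra iteration  */
    double bytes_iter = 2.0;              /* one word moved per iteration    */
    double copy_rate  = bytes_iter * clock / cycles;

    double rpm        = 3600.0;           /* platter speed                   */
    double track      = 34.0 * 512.0;     /* bytes per track on an ST296N    */
    double disk_rate  = track * rpm / 60.0;

    printf("copy loop: %.0f kbytes/sec\n", copy_rate / 1000.0);
    printf("platter  : %.0f kbytes/sec\n", disk_rate / 1000.0);
    return 0;
}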

Your argument now seems to be sitting on a pretty poor foundation.

You also claim that part of the advantage of the GVP is that the processor can
do other things while data is being transferred to the onboard buffer.  Let's
look at that.  My HardFrame is giving me 655 kbyte/sec reads.  Let's call the
time it takes to transfer a chunk of data X.  The transfer rate from GVP's
onboard buffer to main memory is roughly half that of the HardFrame, so let's
call its transfer time 2X.  The time it takes to transfer from the drive to
the onboard buffer can not be bigger than X, so let's also call it X.  Let's
also assume that with the HardFrame, the processor can not access the bus
during the transfer.

So, the time it takes for the transfer to complete on the GVP is 3X.  During
that time, the processor is busy two thirds of the time.  During the same time
period 3X, however, the HardFrame ties up the processor only one third of the
time.  So where's this mythical free time advantage the processor is supposed
to have with the GVP?
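
The same accounting in C, with X normalized to 1 (this just restates the
paragraph above, taking the roughly-half transfer rate as an assumption):

/* CPU time tied up during a 3X window, per the reasoning above.           */
#include <stdio.h>

int main(void)
{
    double X        = 1.0;          /* time for the HardFrame's DMA transfer */
    double window   = 3.0 * X;      /* total time the GVP needs              */
    double gvp_busy = 2.0 * X;      /* CPU copying buffer -> main memory     */
    double dma_busy = 1.0 * X;      /* CPU locked off the bus during DMA     */

    printf("GVP      : CPU tied up %.0f%% of the window\n", 100.0 * gvp_busy / window);
    printf("HardFrame: CPU tied up %.0f%% of the window\n", 100.0 * dma_busy / window);
    return 0;
}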

>
>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.
>
>Obviously some EE types are better than others.

Oh that's OK, don't put yourself down.  You'll do better next time.

>Valentin

Rich Champeaux   (rachamp@mbunix.mitre.org)

addison@pollux.usc.edu (Richard Addison) (08/14/89)

In article <63241@linus.UUCP> rachamp@mbunix (Champeaux) writes:
>The minimum loop required for transferring words from GVP's onboard buffer to
>main memory is the following:
>
>loop:	move  (a0)+,(a1)+	(24 clock cycles)
>	dbra  d0,loop		(18 clock cycles)

Try again.  This is an obvious way of doing it, but it is not the fastest.

Richard Addison
"No comment."

ckp@grebyn.com (Checkpoint Technologies) (08/15/89)

In article <19160@usc.edu> addison@pollux.usc.edu (Richard Addison) writes:
>In article <63241@linus.UUCP> rachamp@mbunix (Champeaux) writes:
>>The minimum loop required for transferring words from GVP's onboard buffer to
>>main memory is the following:
>>
>>loop:	move  (a0)+,(a1)+	(24 clock cycles)
>>	dbra  d0,loop		(18 clock cycles)
>
>Try again.  This is an obvious way of doing it, but it is not the fastest.
>
>Richard Addison

	Well, assuming you have the registers available and the move is
a multiple of 12 longwords, how about this:

loop:	movem.l	(a0)+,d1-d7/a2-a6	;grab 12 longwords (48 bytes) from the buffer
	movem.l	d1-d7/a2-a6,(a1)	;write them out (movem to memory can't postincrement)
	lea	48(a1),a1		;so bump the destination pointer by hand
	dbra	d0,loop

	Won't give you loop mode on the 68010, but it won't matter. You
get 24 words moved for each 8 words of instruction fetch.
-- 
First comes the logo: C H E C K P O I N T  T E C H N O L O G I E S      / /  
                                                                    \\ / /    
Then, the disclaimer:  All expressed opinions are, indeed, opinions. \  / o
Now for the witty part:    I'm pink, therefore, I'm spam!             \/