lphillips@lpami.wimsey.bc.ca (Larry Phillips) (08/09/89)
In <8908092130.AA23369@jade.berkeley.edu>, 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>> (Valentin Pepelea) writes:
>>
>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.
>> >Unfortunately this means that there are two transfers occurring, a slow
>> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
>> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
>> >directly from the hard drive into internal memory, thus tying up your CPU
>> >much longer.
>>
>> Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period. Unless of course your DMA circuitry is
>> totally braindead.
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel.

No, it's you who do not understand. Clearly, you don't have the faintest
idea of how the 2090A works. The speed of the data coming off the disk is
the same in all cases, given the same disk drive and a sufficiently small
amount of data that things like MaxTransfer do not come into play.

>That is why DMAing directly from the hard disk to internal memory is a losing
>proposition. GVP provides a cache into which it reads from the disk while
>leaving the Amiga's 680x0 alone. Only then does it transfer the data from the
>cache into internal memory at full speed, without having to wait for the
>mechanical limitations of the hard disk.

The 2090 waits for the mechanical limitation of the hard disk, yes, but it
most definitely does NOT hang on to the bus for the entire time. It has a
FIFO that fills from the HD and empties in small bursts. The effect is
that the data comes into the FIFO at a fixed rate, exactly the same rate
as into the cache on the GVP, and it is in unloading the data into main
memory that the 2090 comes out a resounding winner over the GVP. The only
time it does not come out a clear winner is when multiple sectors are
being transferred and there is a lot of contention from other DMA sources.
In that case, the 2090 has to retry because of data overruns.

I have yet to play with or study a HardFrame, so I cannot vouch for its
method of transfer, but since it is as fast as or faster than the 2090, I
can only assume they did it right.

>Perhaps it should then DMA from its cache into internal memory, but that is
>another question. Even if it did that, it still would get lower diskperf's than
>the A2090 or HardFrame. The improvement would be rather limited, and the cost
>would be higher. The GVP controller is expensive enough as it is.

Yes, perhaps they should, but they don't. That, and the necessity of
specifying a low MaxTransfer value, are what make the thing a comparative
slug. Check out the 2090A and the HardFrame first, then repost your
perceptions. There will be a quiz. :-)

>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.
>
>Obviously some EE types are better than others. Good luck on your '030
>accelerator design.

A rather unfortunate choice of parting shot, wouldn't you say? You've been
a tad cranky lately, Valentin. Mellow out, willya?

-larry

--
"So what the hell are we going to do with a Sun?"
    - Darlene Phillips -
+-----------------------------------------------------------------------+
|   // Larry Phillips                                                   |
| \X/  lphillips@lpami.wimsey.bc.ca -or- uunet!van-bc!lpami!lphillips   |
| COMPUSERVE: 76703,4322  -or-  76703.4322@compuserve.com               |
+-----------------------------------------------------------------------+
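
A rough way to see why a FIFO-backed DMA controller does not hold the bus for the whole transfer is sketched below. This is only an illustration: the two rates are made-up round numbers, not measurements of a 2090, and the function name is hypothetical. It assumes the FIFO fills at the disk rate and empties in bursts at the full bus rate, so on average the bus is held for only the ratio of the two.

```python
def bus_busy_fraction(disk_bytes_per_s, bus_bytes_per_s):
    """Average share of time the bus is held while a FIFO drains in bursts.

    Assumes the FIFO fills at the disk rate and each burst empties it at
    the bus rate; arbitration overhead and retries are ignored.
    """
    if disk_bytes_per_s > bus_bytes_per_s:
        raise ValueError("FIFO would overrun: disk is faster than the bus")
    return disk_bytes_per_s / bus_bytes_per_s


# Illustrative values only (assumptions, not figures from the thread).
ASSUMED_DISK_RATE = 500_000    # bytes/sec arriving from the drive
ASSUMED_BUS_RATE = 3_500_000   # bytes/sec during a burst

share = bus_busy_fraction(ASSUMED_DISK_RATE, ASSUMED_BUS_RATE)
print(f"bus held about {share:.0%} of the time; free the rest")
```

With these assumed rates the bus is tied up for about a seventh of the transfer time, and the CPU has it for the remainder.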
atheybey@lcs.mit.edu (Andrew Heybey) (08/09/89)
In article <120232@sun.Eng.Sun.COM> raz%kilowatt@Sun.COM (Steve -Raz- Berry) writes:
> Let's look at a typical bus cycle for the GVP or any polled
> device. First, your device driver has to find out that data is waiting
> to be transferred; in either a DMA or polled transfer this is likely to
> be a similar amount of overhead. Secondly the data must be transferred.
> To do this a polled device has to perform at least three bus cycles.
> One to fetch the data, two to transfer the data to its new
> destination and three to decrement and branch to the top of the loop
> again. This of course is the absolute minimum for the loop.

Sounds like a good argument to me. That said, I've got a GVP and as
soon as I can scrape together the cash to buy a drive, I'll even have
it installed :-(. *If* GVP's software has this hypothetical tight
loop to transfer data, I should be able to win big by installing a
68010, no?
Am I all wet? Has anyone disassembled their GVP driver to find out
what's going on in there?
andrew
--
------------
Andrew Heybey, atheybey@ptt.lcs.mit.edu, uunet!ptt.lcs.mit.edu!atheybey
MIT Laboratory for Computer Science
Room 509, 545 Technology Square, Cambridge, MA 02139 (617) 253-6011
451061@UOTTAWA.BITNET (Valentin Pepelea) (08/10/89)
Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>

> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
> (Valentin Pepelea) writes:
>
> >The net result is that the processor therefore spends less time on the data
> >transfer and is available more often for other concurrent tasks.
> >Unfortunately this means that there are two transfers occurring, a slow
> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
> >directly from the hard drive into internal memory, thus tying up your CPU
> >much longer.
>
> Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
> Logically, if you look at the time to complete a given task, based only
> on the number of bus cycles it takes to transfer a given block of data,
> DMA will always win. Period. Unless of course your DMA circuitry is
> totally braindead.

Clearly you don't understand, or perhaps I did not explain well. The bottleneck
here is the speed at which the hard disk turns, and therefore the rate at which
data is available to the DMA channel. That is why DMAing directly from the hard
disk to internal memory is a losing proposition. GVP provides a cache into
which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
does it transfer the data from the cache into internal memory at full speed,
without having to wait for the mechanical limitations of the hard disk.

Perhaps it should then DMA from its cache into internal memory, but that is
another question. Even if it did that, it still would get lower diskperf's than
the A2090 or HardFrame. The improvement would be rather limited, and the cost
would be higher. The GVP controller is expensive enough as it is.

> Sorry, this is one EE type that
> just won't believe it. The Amiga is a DMA machine, that is part of
> what gives it its amazing speed for graphics and sound.

Obviously some EE types are better than others. Good luck on your '030
accelerator design.

> (this is part of my effort to ensure a "kinder and gentler" netdome)

Perhaps you meant "thunder.net.dome". Two men enter, one flamed leaves.

Valentin
_________________________________________________________________________

"An operating system without        Name: Valentin Pepelea
 virtual memory is an operating     Phonet: (613) 231-7476
 system without virtue."            Bitnet: 451061@Uottawa.bitnet
                                    Usenet: Use cunyvm.cuny.edu gate
         - Ancient Inca Proverb     Planet: 451061@acadvm1.UOttawa.CA
raz%kilowatt@Sun.COM (Steve -Raz- Berry) (08/10/89)
In article <ATHEYBEY.89Aug9093829@allspice.lcs.mit.edu> atheybey@lcs.mit.edu (Andrew Heybey) writes:
>In article <120232@sun.Eng.Sun.COM> raz%kilowatt@Sun.COM (Steve -Raz- Berry) writes:
>
[I delete my own diatribe comparing and contrasting DMA vs. polling]
>
>Sounds like a good argument to me. That said, I've got a GVP and as
>soon as I can scrape together the cash to buy a drive, I'll even have
>it installed :-(. *If* GVP's software has this hypothetical tight
>loop to transfer data, I should be able to win big by installing a
>68010, no?

The three word instruction loop will definitely help your performance
*if* GVP wrote their software that way. BTW, the 3 bus cycle figure is
more than likely wrong for a plain jane 68K, mainly 'cause I didn't figure
in the instruction fetches. Probably more like 6 or so.

So for a 68010, you only have to count the data transfers. I'd guess a
bus cycle for the data fetch from the drive, and a bus cycle for storing
to the destination memory. You still have some latency waiting for the CPU
to decrement and branch to the start of the loop, but this is still not
too bad.

start:  move.w  (a0),(a1)+      ;move data from fifo to destination.
        dbra    d0,start        ;decrement counter and loop.

That should fit into the three word '010 instruction cache. Of course
you still win bigger with a DMA card.

>Am I all wet? Has anyone disassembled their GVP driver to find out
>what's going on in there?

I'd be curious to find out too.

---
Steve -Raz- Berry      Disclaimer: It wasn't me! I was volatilizing my esters.
UUCP: sun!kilowatt!raz                   ARPA: raz%kilowatt.EBay@sun.com
KILOWATT: sun!kilowatt!archive-server    archive-server%kilowatt.EBay@sun.com
raz%kilowatt@Sun.COM (Steve -Raz- Berry) (08/10/89)
In article <8908092130.AA23369@jade.berkeley.edu> 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>> (Valentin Pepelea) writes:
>>
> [old argument deleted]
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel. That is why DMAing directly from the hard
>disk to internal memory is a losing proposition. GVP provides a cache into
>which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
>does it transfer the data from the cache into internal memory at full speed,
>without having to wait for the mechanical limitations of the hard disk.

The HardFrame also provides a FIFO (call it a cache if you like) on
board. I would think that if a DMA controller operated slower than the
bus it's connected to, then that would fall under the category of
brain-dead.

>Perhaps it should then DMA from its cache into internal memory, but that is
>another question. Even if it did that, it still would get lower diskperf's than
>the A2090 or HardFrame. The improvement would be rather limited, and the cost
>would be higher. The GVP controller is expensive enough as it is.

How do you justify that? DMA means running at bus speeds, full tilt,
all out gangbusters etc. You are using every cycle to transfer data, up to
the limit imposed by the device driver.

The only increased cost that I see is in the engineering time put
into it. The HardFrame goes for $299. I don't see that as astronomical,
especially when you can probably get $50 off that price mail order.
{Computer Mart has it for $257}

>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.
>Obviously some EE types are better than others. Good luck on your '030
>accelerator design.

I'll take it. I *know* it's going to be a bitch.

>> (this is part of my effort to ensure a "kinder and gentler" netdome)

>Perhaps you meant "thunder.net.dome". Two men enter, one flamed leaves.

>Valentin

I told you via email, I mean no malice. If you refuse to accept
that, that's up to you.

---
Steve -Raz- Berry      Disclaimer: It wasn't me! I was volatilizing my esters.
UUCP: sun!kilowatt!raz                   ARPA: raz%kilowatt.EBay@sun.com
KILOWATT: sun!kilowatt!archive-server    archive-server%kilowatt.EBay@sun.com
daveh@cbmvax.UUCP (Dave Haynie) (08/11/89)
in article <8908092130.AA23369@jade.berkeley.edu>, 451061@UOTTAWA.BITNET (Valentin Pepelea) says:

> Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>> (Valentin Pepelea) writes:

>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.

>> Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period.

> Clearly you don't understand, or perhaps I did not explain well. The
> bottleneck here is the speed at which the hard disk turns, and therefore
> the rate at which data is available to the DMA channel.

>> Sorry, this is one EE type that just won't believe it.

> Obviously some EE types are better than others.

Well, you all know me as an EE type. I think there's confusion here
because the problem hasn't been properly decomposed. There are two
transfers going on in most hard drive systems -- from the drive to the
controller, and from the controller to system memory.

It's always a losing proposition to transfer the data directly, as read
from the drive, to the system memory, regardless of whether you go via a
CPU read method or a DMA method. Fortunately, it's almost impossible as
well, unless you're dealing with direct manipulation of an ST-506
interface. Assuming a SCSI device, you really don't have any idea how the
data is handled between the physical hard drive and the SCSI channel.
Still, the best a direct asynchronous SCSI read or DMA can do is
significantly less than any buffering scheme you might come up with.
The Apple Macintosh is a good example of what happens when you don't
buffer up your SCSI, if for no other reason than to convert the SCSI byte
stream to a word stream before travelling between the controller and the
system memory. So let's agree not to take any simple, stupid approaches --
all the mentioned controllers, GVP, Commodore, and Microbotics, take a
much more intelligent approach.

GVP is the simplest in concept. It sucks up a whole block into local RAM,
then transfers this at memory-to-memory speeds across the bus, from its
local RAM to its final destination. On a 68000, even with some cleverly
designed copy loops like CopyMemQuick() or similar, you'll still have over
two bus crossings per word transferred -- one from the local RAM to the
68000, one from the 68000 to the system RAM, and occasional stops to fetch
opcodes. With a 68010 or better, you can basically ignore the opcode fetch
time, but you still have the two complete bus crossings per word. With a
68020 or 68030 and some 32 bit memory, you can reduce this to two slow and
one fast bus crossings per longword, which comes pretty close to one bus
crossing per word, but not quite.

The Commodore controllers are all DMA driven and backed by a FIFO. The
2090 will read from the SCSI controller into its FIFO, and when the FIFO
starts to fill, it'll take the bus, dump 32 words across at full speed,
and then give back the bus. This results in one bus crossing per word,
plus a small bus arbitration time. Most other DMA driven controllers work
very similarly.

The main idea here is that the fastest a non-DMA controller will ever run
is approximately the same as the normal speed of a DMA controller. Without
a 68020 or 68030 and some 32 bit RAM, the DMA controller is always a win.
You can, of course, pick a bad DMA controller and compare it to a good
programmed controller, or vice versa, to accentuate the point of YOUR
particular religious views, but I'm dealing in science here.

There is one situation where a non-DMA device will run faster than a DMA
device in Amiga systems. If you have a 68020 or 68030 system with 32 bit
memory above the 24 bit address space of the 68000, a good non-DMA device
like GVP's will go faster under FFS. The deal here is that the programmed
transfer doesn't have any 24 bit limits, while the DMA transfer does.
Plus, with a 32 bit card, the non-DMA transfer is already approaching the
speed of the DMA transfer (the difference with a fast '030 card may be as
much software overhead as hardware differences). So while the non-DMA
transfer works normally, the DMA device must dump its data to a temporary
RAM buffer, and then run a CPU driven copy to the final destination. That
copy is likely about as fast as the non-DMA transfer, so in this
situation, the non-DMA device may be around twice as fast as the DMA
device. This situation will disappear with full 32 bit DMA devices, but
you won't be having them on the A2000 bus.

> Valentin

--
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
           Be careful what you wish for -- you just might get it
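
The bus-crossing counts in this post can be tallied in a few lines. What follows is only a sketch: the per-method formulas follow the description above, but the numeric weights (opcode-fetch overhead on the 68000, the cost of a "fast" 32 bit crossing relative to a slow one, and the arbitration cost per burst) are assumptions chosen for illustration, and the function and method names are hypothetical.

```python
def crossings_per_word(method, opcode_overhead=0.5, fast_cost=0.5,
                       burst_words=32, arbitration=1.0):
    """Approximate bus crossings per 16-bit word for each transfer method.

    All default weights are illustrative assumptions, not measurements.
    """
    if method == "cpu_68000":
        # one read, one write, plus occasional opcode fetches
        return 2.0 + opcode_overhead
    if method == "cpu_68010":
        # opcode fetches ignored; still a read and a write
        return 2.0
    if method == "cpu_68020_32bit":
        # two slow crossings plus one fast crossing per longword (2 words)
        return (2.0 + fast_cost) / 2.0
    if method == "dma":
        # one crossing per word, plus arbitration once per burst
        return 1.0 + arbitration / burst_words
    if method == "dma_bounce":
        # DMA into a temporary buffer, then a CPU copy to 32-bit memory
        return (crossings_per_word("dma", burst_words=burst_words,
                                   arbitration=arbitration)
                + crossings_per_word("cpu_68020_32bit", fast_cost=fast_cost))
    raise ValueError(f"unknown method: {method}")


for name in ("cpu_68000", "cpu_68010", "cpu_68020_32bit", "dma", "dma_bounce"):
    print(f"{name:16s} {crossings_per_word(name):.3f} crossings/word")
```

Under these assumed weights plain DMA comes out lowest, the 32 bit programmed transfer lands a little above one crossing per word, and DMA followed by a bounce copy costs close to twice the programmed transfer, which is the shape of the argument above.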
rachamp@mbunix.mitre.org (Richard A. Champeaux) (08/11/89)
In article <8908092130.AA23369@jade.berkeley.edu> 451061@UOTTAWA.BITNET (Valentin Pepelea) writes:
>Steve -Raz- Berry <raz%kilowatt@sun.com> writes in <120232@sun.Eng.Sun.COM>
>
>> In article <8908072207.AA14796@jade.berkeley.edu> 451061@UOTTAWA.BITNET
>> (Valentin Pepelea) writes:
>>
>> >The net result is that the processor therefore spends less time on the data
>> >transfer and is available more often for other concurrent tasks.
>> >Unfortunately this means that there are two transfers occurring, a slow
>> >DMA from hard disk to the cache, and a fast CPU transfer from the cache
>> >to internal memory. Other controllers such as the A2090 and HardFrame DMA
>> >directly from the hard drive into internal memory, thus tying up your CPU
>> >much longer.
>>
>> Yikes! I'm sorry, but I TOTALLY disagree with you on this one.
>> Logically, if you look at the time to complete a given task, based only
>> on the number of bus cycles it takes to transfer a given block of data,
>> DMA will always win. Period. Unless of course your DMA circuitry is
>> totally braindead.
>
>Clearly you don't understand, or perhaps I did not explain well. The bottleneck
>here is the speed at which the hard disk turns, and therefore the rate at which
>data is available to the DMA channel. That is why DMAing directly from the hard
>disk to internal memory is a losing proposition. GVP provides a cache into
>which it reads from the disk while leaving the Amiga's 680x0 alone. Only then
>does it transfer the data from the cache into internal memory at full speed,
>without having to wait for the mechanical limitations of the hard disk.

You keep claiming that the transfer from the disk is slow, and the transfer
by the CPU is fast. Have you ever bothered to calculate the speeds?

The minimum loop required for transferring words from GVP's onboard buffer
to main memory is the following:

loop:   move    (a0)+,(a1)+     (24 clock cycles)
        dbra    d0,loop         (18 clock cycles)

The execution times were found in Motorola's 68000 programmer reference
manual. So it takes 42 clock cycles to transfer 2 bytes. At 7.12 MHz,
that's 5.89 us to transfer 2 bytes, or 339 kbytes/sec, maximum. Assuming
no transfer time from the disk, and no track to track stepping time, the
maximum transfer rate is 339 kbytes/sec.

I don't even pretend to know all of the delays and bottlenecks associated
with a SCSI drive, but let's look at the one you mentioned: the disk rpm.
Hard disk drives, I believe, spin at 3600 rpm. My ST296N has 34 sectors
per track with 512 bytes per sector. It is formatted with an interleave
of 1. Assuming that there are no other delays, the disk itself can deliver
1.04448 Mbytes/sec. I realize that there are a bunch of other factors, but
the drive rpm is not the bottleneck. Your argument now seems to be sitting
on a pretty poor foundation.

You also claim that part of the advantage of the GVP is that the processor
can do other things while data is being transferred to the onboard buffer.
Let's look at that. My HardFrame is giving me 655 kbyte/sec reads. Let's
call the time it takes to transfer a chunk of data X. The transfer rate
from GVP's onboard buffer to main memory is roughly half that of the
HardFrame, so let's call its transfer time 2X. The time it takes to
transfer from the drive to the onboard buffer cannot be bigger than X, so
let's also call it X. Let's also assume that with the HardFrame, the
processor cannot access the bus during the transfer. So, the time it takes
for the transfer to complete on the GVP is 3X. During that time, the
processor is busy two thirds of the time. During the same time period 3X,
however, the HardFrame is busy only one third of the time. So where's this
mythical free time advantage the processor is supposed to have with the
GVP?
>
>> Sorry, this is one EE type that
>> just won't believe it. The Amiga is a DMA machine, that is part of
>> what gives it its amazing speed for graphics and sound.
>
>Obviously some EE types are better than others.

Oh that's OK, don't put yourself down. You'll do better next time.

>Valentin

Rich Champeaux (rachamp@mbunix.mitre.org)
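
The arithmetic in this post can be reproduced with a short script. This is just a sketch for checking the numbers: every input (42 clock cycles per word, 7.12 MHz, 3600 rpm, 34 sectors of 512 bytes, and the X / 2X / 3X time shares) is taken as quoted in the post rather than verified independently, and the function names are hypothetical.

```python
from fractions import Fraction


def copy_rate(bytes_per_pass, cycles_per_pass, clock_hz):
    """Bytes per second moved by a CPU copy loop."""
    return bytes_per_pass * clock_hz / cycles_per_pass


def disk_rate(rpm, sectors_per_track, bytes_per_sector):
    """Bytes per second passing under the head at 1:1 interleave."""
    return rpm / 60 * sectors_per_track * bytes_per_sector


# Figures as quoted in the post (not independently verified).
cpu = copy_rate(bytes_per_pass=2, cycles_per_pass=24 + 18, clock_hz=7.12e6)
disk = disk_rate(rpm=3600, sectors_per_track=34, bytes_per_sector=512)
print(f"CPU copy loop: {cpu / 1000:.0f} kbytes/sec")
print(f"Disk surface : {disk / 1e6:.5f} Mbytes/sec")

# Time shares over a window of 3X: the GVP's CPU copy takes 2X,
# while the HardFrame's DMA burst takes X.
window = Fraction(3)
gvp_cpu_busy = Fraction(2) / window
hardframe_bus_busy = Fraction(1) / window
print(f"GVP CPU busy: {gvp_cpu_busy}, HardFrame bus busy: {hardframe_bus_busy}")
```

Because the cycle counts and clock are plain parameters, other figures can be substituted to see how sensitive the 339 kbytes/sec ceiling is to them.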
addison@pollux.usc.edu (Richard Addison) (08/14/89)
In article <63241@linus.UUCP> rachamp@mbunix (Champeaux) writes:
>The minimum loop required for transferring words from GVP's onboard buffer to
>main memory is the following:
>
>loop:   move    (a0)+,(a1)+     (24 clock cycles)
>        dbra    d0,loop         (18 clock cycles)

Try again. This is an obvious way of doing it, but it is not the fastest.

Richard Addison
"No comment."
ckp@grebyn.com (Checkpoint Technologies) (08/15/89)
In article <19160@usc.edu> addison@pollux.usc.edu (Richard Addison) writes:
>In article <63241@linus.UUCP> rachamp@mbunix (Champeaux) writes:
>>The minimum loop required for transferring words from GVP's onboard buffer to
>>main memory is the following:
>>
>>loop:   move    (a0)+,(a1)+     (24 clock cycles)
>>        dbra    d0,loop         (18 clock cycles)
>
>Try again. This is an obvious way of doing it, but it is not the fastest.
>
>Richard Addison

Well, assuming you have the registers available and the move is a
multiple of 12 longwords, how about this:

loop:   movem.l (a0)+,d1-d7/a2-a6   ;load 12 longwords from the buffer
        movem.l d1-d7/a2-a6,(a1)    ;store them (no postincrement form
        lea     48(a1),a1           ; for register-to-memory movem)
        dbra    d0,loop             ;decrement counter and loop

Won't give you loop mode on the 68010, but it won't matter. You get
24 words moved for each 8 words of instruction fetch.
--
First comes the logo: C H E C K P O I N T  T E C H N O L O G I E S     / /
                                                                    \\ / /
Then, the disclaimer:  All expressed opinions are, indeed, opinions. \  / o
Now for the witty part:    I'm pink, therefore, I'm spam!             \/
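
To compare the instruction-fetch overhead of the simple word loop and the movem loop, the word counts can be tallied as below. This is a rough sketch that counts 16-bit bus accesses only, not clock cycles, and ignores any extra accesses the processor makes internally. The function name is hypothetical. The instruction lengths assumed are one word for the move and two words each for dbra, movem.l, and lea with a displacement, so a movem loop costs 6 words per pass with three such instructions or 8 words with four.

```python
def bus_accesses_per_word(data_words, instruction_words):
    """16-bit bus accesses per data word copied in one pass of a loop.

    Each data word costs one read and one write; the instruction fetches
    for the pass are spread over the data words it moves.
    """
    return 2 + instruction_words / data_words


# Simple loop: move.w (1 word) + dbra (2 words) moves 1 data word per pass.
simple = bus_accesses_per_word(data_words=1, instruction_words=3)

# movem loop: 12 longwords = 24 data words per pass.
movem_three_instr = bus_accesses_per_word(data_words=24, instruction_words=6)
movem_four_instr = bus_accesses_per_word(data_words=24, instruction_words=8)

print(f"simple loop          : {simple:.2f} accesses per word")
print(f"movem, 6 instr words : {movem_three_instr:.2f} accesses per word")
print(f"movem, 8 instr words : {movem_four_instr:.2f} accesses per word")
```

Either way the movem loop spends well under half an access per word on instruction fetch, against three for the simple loop, which is why losing 68010 loop mode hardly matters here.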