GORRIEDE@UREGINA1.BITNET (Dennis Robert Gorrie) (11/15/89)
I know this has been discussed a lot already, but it's still not completely
clear to me.  I would appreciate any information regarding this subject.

The story goes, DMA is faster.  But, as many people point out, it sometimes
is slower than non-DMA, when there is contention for the bus.  Case in point
is the hi-res interlace screen situation, where co-processors and your DMA
hard disk device are contending for cycles on the coprocessor bus.

Then someone says a 'proper' DMA device is faster than non-DMA, even for the
situation above.  How is this so?  What is a 'proper' DMA device?

The current solution for DMA controllers with slow loads during the above
situation, from my understanding, is to limit the DMA to small block-sized
transfers.  It seems that this is basically just like a non-DMA transfer.

What about other solutions?  Like dynamically allocating DMA transfer sizes
based on device priority, contention, etc.  Isn't that what Bus Mastering
is all about?

+-----------------------------------------------------------------------+
|Dennis Gorrie                                'Chain-Saw Tag...          |
|GORRIEDE AT UREGINA1.BITNET                   Try It, You'll Like It!'  |
+-----------------------------------------------------------------------+
ckp@grebyn.com (Checkpoint Technologies) (11/16/89)
In article <8911150430.AA24506@jade.berkeley.edu> GORRIEDE@UREGINA1.BITNET
(Dennis Robert Gorrie) writes:
>
>I know this has been discussed a lot already, but it's still not completely
>clear to me.  I would appreciate any information regarding this subject.
>
> [The Universal Question (condensed): Which is faster, DMA or non-DMA?  Why?]

All this talk of DMA vs non-DMA *has* to be because of the Commodore
A2090(A) disk controller, which is a DMA device that performs poorly under
high-chip-RAM-contention situations.  So it brought DMA into a bad light.

First of all, "contention" is when two or more devices want to use the same
bus at the same time.  Only one device at a time may use a given bus, so
when several want it, somebody has to wait.  Every Amiga has about 26
devices which want the chip RAM bus; 25 of them are in the custom chips,
and one is the 68000.

There are many ways in which contention can be resolved, and all of them
involve choosing one device which "wins" the bus.  In the Amiga, there are
assigned priorities for chip RAM, and the device with the highest priority
wins.  First come the carefully-programmed cycles which are never contended
(bit plane, DRAM refresh, audio, sprites, floppy disk), then comes the
Copper, then the Blitter.  The 68000 is last; it only gets the chip RAM bus
when the custom chips don't want it.

Now, when you plug in a card which performs DMA, it takes its place in the
priority scheme.  In fact, bus cards will win over the CPU, but will lose
to the custom chips.  This is only for the chip RAM bus, however; when
contending for fast RAM, a DMA device will always win over the CPU, which
is the only other contender (unless there are other DMA devices; then
Buster picks a winner, based on slot number I think).
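The priority scheme described above can be sketched as a fixed-priority
arbiter (a modern Python illustration of the idea, not actual Amiga logic;
the device names are my own shorthand for the classes of requesters the
post lists):

```python
# Fixed-priority arbitration for the chip RAM bus, per the post:
# the never-contended DMA slots win first, then Copper, then
# Blitter, then expansion DMA cards, and the 68000 last.
PRIORITY = [
    "fixed-slot DMA",   # bitplane, DRAM refresh, audio, sprite, floppy
    "copper",
    "blitter",
    "dma-card",         # expansion bus cards beat the CPU...
    "cpu",              # ...which only gets leftover cycles
]

def arbitrate(requests):
    """Return the highest-priority device among current requesters."""
    for dev in PRIORITY:
        if dev in requests:
            return dev
    return "idle"       # nobody wants the bus this cycle

print(arbitrate({"cpu", "blitter"}))   # blitter
print(arbitrate({"cpu", "dma-card"}))  # dma-card
print(arbitrate({"cpu"}))              # cpu
```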
There are times when the custom chips want the chip RAM bus a LOT; high res
overscanned 4 bit plane with the Blitter and Copper running, for example,
in which case you may not win the bus for a WHOLE FRAME, which is about a
sixtieth of a second.  That's a LONG TIME.

If your DMA device is a disk, then the data comes off the disk at the rate
that the disk is spinning.  The disk will *not* stop and wait for the bus
to become available.  So the DMA device has to be able to do something else
with the data.  The A2090(A) has a 64 byte FIFO, which is a place to hold
64 bytes of a transfer while waiting for the bus.  This is good, but not
good enough, since a whole sector is 512 bytes; at the speed of an ST506
disk, 512 bytes will arrive in less than a thousandth of a second.  So the
A2090(A) will *lose* that sector.  The device driver is smart enough to try
again, after the disk comes around again, but this takes another 60th of a
second.  This is the real reason that the A2090(A) will run soooo slooooow
when in a bad situation like this.

Now, a properly designed DMA device will be able to handle the situation
where it can't get the bus for a LONG time, and to do this it'll have a
FIFO big enough to hold at least one sector (512 bytes).  More FIFO is
better.  When it does win the bus, it can then transfer everything into RAM
quickly (DMA's forte is fast transfers).  The Microbotics HardFrame disk
controller is designed this way; therefore it is the fastest available
Amiga hard disk controller, and does not have problems with chip RAM
contention.

Another note: SCSI disk drives have FIFOs built into them, big enough for a
sector.  This is good enough to handle the problems of DMA contention.
Commodore's new A2091 DMA controller actually has a *smaller* FIFO than the
A2090(A), and they depend on the SCSI disk drive's FIFO to keep them out of
trouble.
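The timing claim above checks out with some back-of-the-envelope
arithmetic (a modern Python sketch; the 5 Mbit/s ST-506 data rate and the
60 Hz frame time are nominal figures I'm assuming, not numbers from this
post):

```python
# Rough timing model of the A2090(A) FIFO-overflow scenario.  An
# ST-506 drive streams data at a fixed rate; the platter will not
# pause, so the FIFO must absorb the entire bus stall.

ST506_RATE = 5_000_000 / 8    # bytes/second (~625 KB/s, nominal)
SECTOR = 512                  # bytes per sector
FIFO = 64                     # A2090(A) FIFO depth in bytes

sector_time = SECTOR / ST506_RATE   # time for one sector to arrive
fifo_grace = FIFO / ST506_RATE      # stall the FIFO can absorb
frame_time = 1 / 60                 # worst-case stall: a whole frame

print(f"sector arrives in {sector_time * 1e3:.2f} ms")        # 0.82 ms
print(f"FIFO overflows after {fifo_grace * 1e3:.2f} ms")      # 0.10 ms
# The FIFO overflows long before a frame-length stall ends, so
# the sector is lost and the retry costs a full disk revolution:
print(fifo_grace < frame_time)                                # True
```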
There are "handshaking" wires on a SCSI connector that let the computer
tell the SCSI disk drive how fast it can accept data from the disk drive.
Why then does the A2090(A) have contention problems with SCSI disk drives,
just like it has with ST506 drives?  Because they botched the A2090's SCSI
adapter.  It doesn't properly use the available handshaking to slow the
SCSI disk drive down when its FIFO fills.  They have fixed this on the
A2091, and so they don't need such a large FIFO.  (Incidentally, the A2091
doesn't have an ST506 controller.)

On non-DMA controllers: These devices (like the GVP controller) move disk
drive data to a built-in RAM chip first.  Since this movement does not use
the Amiga's bus at all, it is never affected by chip RAM contention.  When
the built-in RAM is filled, the 68000 CPU copies that RAM to the place in
system RAM where the data's really needed.

Now, with this design, the contention problem is solved.  However, it's
using your CPU to move the data, and while it's doing that, it's *not*
running your applications.  So you may get acceptable IO performance, but
your Amiga's CPU is being penalized.  I dislike this, personally; I think
it goes against the Amiga's high-performance IO philosophies.

Well, I hope this helps everyone in their understanding of the Amiga and
DMA.  To sum up: the A2090(A) has given DMA a bad name, which it does not
deserve.  The A2090(A) deserves the bad name; DMA is still the fastest
possible way to perform IO, even on the Amiga.
swarren@eugene.uucp (Steve Warren) (11/16/89)
In article <8911150430.AA24506@jade.berkeley.edu> GORRIEDE@UREGINA1.BITNET
(Dennis Robert Gorrie) writes:
>
>I know this has been discussed a lot already, but it's still not completely
>clear to me.  I would appreciate any information regarding this subject.
>
>The story goes, DMA is faster.  But, as many people point out, it sometimes
>is slower than non-DMA, when there is contention for the bus.  Case in point
>is the hi-res interlace screen situation where co-processors and your DMA
>hard disk device are contending for cycles on the coprocessor bus.

No matter where the request comes from, the same bus cycle will have to
occur to read or write the chip RAM.  The non-DMA transfer to chip RAM will
experience the same contention that the DMA transfer would.  If the DMA
device has trouble where the CPU doesn't, then it just isn't glued into the
bus properly.

>Then someone says a 'proper' DMA device is faster than non-DMA, even for the
>situation above.  How is this so?  What is a 'proper' DMA device?

The only thing I can think of is that maybe the DMA device is held off so
long that it overflows its buffer and has to wait for another disk rotation
to get the rest of the data.  In this case a bigger buffer would have fixed
it.  Otherwise the two devices (DMA & CPU) should both see the same
contention and suffer the same constraints.

>The current solution for DMA controllers with slow loads during the above
>situation, from my understanding, is to limit the DMA to small block-sized
>transfers.  It seems that this is basically just like a non-DMA transfer.

That sounds like a work-around for a too-small buffer on the disk interface
card.  The only time I think you might find non-DMA faster is with 32-bit
RAM on a coprocessor board (where DMA is 16-bit), and you still take a
performance hit, because you can't use your CPU cycles for other tasks
while you are doing the transfer.
Of course, if the DMA device uses 100% of the bus bandwidth then you
couldn't use those cycles anyway, unless the CPU is playing in chip-land.

>What about other solutions?  Like dynamically allocating DMA transfer sizes
>based on device priority, contention, etc.  Isn't that what Bus Mastering
>is all about?

How about emergency-dumping into unused fast RAM whenever contention
threatens to overrun the device buffer, then transferring from fast to chip
when possible?  This might save a disk rotation, if that is the problem.

                                        --Steve
-------------------------------------------------------------------------
          {uunet,sun}!convex!swarren; swarren@convex.COM
cmcmanis%pepper@Sun.COM (Chuck McManis) (11/16/89)
Dennis Robert Gorrie writes:
> The story goes, DMA is faster.  But, as many people point out, it sometimes
> is slower than non-DMA, when there is contention for the bus.  Case in point
> is the hi-res interlace screen situation where co-processors and your DMA
> hard disk device are contending for cycles on the coprocessor bus.

DMA == Direct Memory Access.  If a peripheral does DMA it will always be
faster than a non-DMA device for one simple reason: the path between the
peripheral and its destination is shorter.  Consider the following diagram:

           +---------------+        +-----+         +--------+
           |               |        |     |         |        |
Device >---+ I/O Interface +--+  +->+ CPU +--+  +-->+ Memory |
           |               |  |  |  |     |  |  |   |        |
           +---------------+  |  |  +-----+  |  |   +--------+
                              V  ^           V  ^
            /-----------------+--+-----------+--+---------/
           /  (1)             +--+           +--+        /
          /  (2)              +-----------------+       /
         /             Computer Backplane              /
        /----------------------------------------------/

(1) is the path of non-DMA data; it is read by the CPU from the peripheral
interface, then it is written by the CPU into the destination memory
address.  This involves one read and one write cycle on the main bus.

(2) is the path of the DMA data; it is written directly to memory by the
peripheral interface.  This involves a single write cycle on the main bus.

Many things compete for the bus; in the Amiga these can be the CPU and
other peripherals.  Further, the bus is sometimes blocked by the CPU
waiting on information to come out of "Chip" memory.  This is particularly
true during high overscan situations.  For any given set of bus cycles, the
DMA device will transfer more data over the bus than the CPU would be able
to move.

Some people have been led to believe that DMA devices are sometimes slow
because the 2090 interface is slow in transferring data sometimes and other
non-DMA drives seem to be faster.  This is not due to non-DMA being faster;
rather it is due to the 2090 being incapable of dealing with bus-inactive
conditions.
Its internal FIFO overflows and it must abort and restart the entire
transfer.  This is a bug in the 2090, and you will notice that the 2091
(and A590) don't have this problem.

>Then someone says a 'proper' DMA device is faster than non-DMA, even for the
>situation above.  How is this so?  What is a 'proper' DMA device?

For any number of cycles, DMA will be faster than non-DMA because it can
transfer data at full bus speeds.  A 'proper' DMA device is one that can
operate even when the bus has become unavailable for relatively long
periods of time.  This implies some flow control on the peripheral
interface itself, or sufficient buffering to allow for the maximum bus
latency delay.

>The current solution for DMA controllers with slow loads during the above
>situation, from my understanding, is to limit the DMA to small block-sized
>transfers.  It seems that this is basically just like a non-DMA transfer.

That is the 2090 solution: by limiting the transfers to sizes that are less
likely to overflow its inadequate FIFO, it maintains its DMA performance
advantage.  Other controllers, such as the HardFrame and 2091, do not have
this problem.

>What about other solutions?  Like dynamically allocating DMA transfer sizes
>based on device priority, contention, etc.  Isn't that what Bus Mastering
>is all about?

Again, if you design your board to be able to deal with long bus latency
times, as many people have, then you don't have any problem.  You are
apparently confusing a weakness of the 2090 design with a problem in the
concept of DMA.  They are not related at all.

--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis   ARPAnet: cmcmanis@Eng.Sun.COM
These opinions are my own and no one else's, but you knew that didn't you.
"If it didn't have bones in it, it wouldn't be crunchy now would it?!"
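The bus-cycle accounting behind paths (1) and (2) in Chuck's diagram can be
written out directly (an illustrative Python sketch; the function names are
mine, and the 16-bit word size is the 68000's bus width):

```python
# Main-bus cycles needed to move N words, per the two paths in the
# diagram.  Path (1), programmed I/O: the CPU reads each word from
# the interface and writes it to memory -- two bus cycles per word.
# Path (2), DMA: the interface writes each word straight to memory
# -- one bus cycle per word.

def pio_bus_cycles(words):
    return 2 * words      # one read + one write per word

def dma_bus_cycles(words):
    return words          # a single write per word

words_per_sector = 512 // 2   # a 512-byte sector in 16-bit words
print(pio_bus_cycles(words_per_sector))   # 512
print(dma_bus_cycles(words_per_sector))   # 256
```

So for the same number of bus cycles won, DMA moves twice the data, which
is Chuck's point about the shorter path.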
daveh@cbmvax.UUCP (Dave Haynie) (11/16/89)
in article <8911150430.AA24506@jade.berkeley.edu>, GORRIEDE@UREGINA1.BITNET
(Dennis Robert Gorrie) says:

> The story goes, DMA is faster.

You first have to look at the problem you're trying to solve.  The problem,
in this case, is data transfer from a hard disk controller to the Amiga's
main memory.  Except for this transfer mechanism, there's nothing
intrinsically different between DMA and non-DMA devices.

For the DMA transfer, a device of some kind requests the Amiga's bus and
transfers a number of words of data to or from the Amiga's main memory.
Once this transfer is complete, it will probably have to involve the main
CPU, at least to tell the main CPU that it's done.  But the transfer is
very efficient, because the CPU isn't involved during the transfer (eg, no
interrupts, no need to push and pop stacks, etc.), and there's only one bus
crossing per word transferred; data flows only between main memory and the
DMA device.

For a non-DMA transfer, the CPU is involved to some degree or another.  At
the worst, it works like the Mac's hard disk interface, where the CPU is
required to talk directly to a SCSI chip, and must basically sit and wait
for each byte to be available.  Much better is the GVP approach, where the
SCSI device itself transfers a whole block (or possibly several blocks)
into local memory.  At this point, the CPU is called upon to transfer that
data to or from this local memory.  This transfer requires two bus
crossings for each word; data flows between the main memory and the CPU,
then between the CPU and the local memory (or vice versa).

> But, as many people point out, it sometimes is slower than non-DMA,
> when there is contention for the bus.  Case in point is the hi-res
> interlace screen situation where co-processors and your DMA hard disk
> device are contending for cycles on the coprocessor bus.
DMA to chip memory, unless you really need it, is a bad idea with any kind
of controller, since you can be kept out of chip memory for an extended
period of time.

In order to even start a transfer from the hard disk controller, you can't
have the CPU waiting on chip memory, for either the DMA or non-DMA
controller.  Assuming DMA, the controller will request the bus from the
CPU.  The CPU can grant the bus right away, but the DMA device can't
actually take over the bus until the CPU finishes its current cycle.  When
waiting for chip bus access in a high-activity display mode, this can be a
long wait.  For the non-DMA device, the CPU will get an interrupt signaling
it's needed for a transfer.  However, it can't service that interrupt until
the current instruction is complete, which of course can't complete until
the CPU has chip bus access.  So in either case, when the CPU's involved in
a delayed access to the chip bus, you have to wait.

As long as the actual transfer goes to fast memory, you'll only have this
initial lag (or possibly a few of them, if the transfer is done in several
pieces, as it often is with FIFO-based controllers), and you won't see too
much DMA slowdown.  If the transfer is into chip memory, you'll of course
see a rather noticeable slowdown.

> Then someone says a 'proper' DMA device is faster than non-DMA, even for the
> situation above.  How is this so?  What is a 'proper' DMA device?

The main problem you can get into in this situation is essentially a flow
control problem.  You have data coming from a hard disk which needs to get
stuck into memory somewhere.  You can have an undetermined length of time
to wait for access to that memory.  If the controller is capable of
stopping the flow of data into the device based on its success at getting
data out of the device, everything's cool.  Some, like the GVP controller,
do this by only dealing in whole disk blocks.
Others, like the A2091, do this by starting and stopping the data transfer
from the SCSI device itself.

The problem that's been seen is when a device, be it DMA or non-DMA, isn't
capable of starting and stopping this data flow.  The A2090 is an example
of such a device, at least as supported by its current software.  When it
can't get access to the bus within a certain amount of time, its FIFO
overruns, and it has to attempt the transfer all over again.  If it could
tell the SCSI device to stop sending as its FIFO fills up, there'd be no
problem (in fact, the A2091 has a smaller FIFO but works much better,
because it can start and stop the data flow).

I'm told that part of this problem is the A2090 support for ST-506.  SCSI
is a rather high-level protocol, with intelligent drives, and it can
support things like start and stop.  ST-506 is a low-level, dumb protocol
that must transfer whole blocks in a fixed amount of time.  You start a
transfer from the disk into the FIFO, then start DMA out of the FIFO.  If
the DMA is held off too long, the FIFO overruns, and you have to start over
again.

You have to take a look at the particular controller in question.  Any
modern review of a DMA controller should include its performance with a
full bandwidth screen up (eg, 640 across, 4 bitplanes, overscan if you
like).  Modern DMA controllers like the A2091 and the Microbotics HardFrame
have no trouble with this situation.

> +-----------------------------------------------------------------------+
> |Dennis Gorrie                                'Chain-Saw Tag...          |
> |GORRIEDE AT UREGINA1.BITNET                   Try It, You'll Like It!'  |
> +-----------------------------------------------------------------------+
-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                  Too much of everything is just enough
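The flow-control distinction Dave draws (SCSI can be paused, ST-506 cannot)
can be shown with a toy simulation (a modern Python sketch under assumed
numbers; the function and its parameters are mine, not anything from the
actual drivers):

```python
# Toy model of a drive feeding a controller FIFO while the bus is
# stalled and DMA cannot drain it.  A SCSI-style drive honors the
# handshake and pauses when the FIFO is full; an ST-506-style drive
# keeps streaming and overruns the FIFO, forcing a full retry.

def transfer_survives(fifo_size, stall_bytes, can_pause):
    """Return True if a bus stall, during which `stall_bytes`
    arrive from the drive, does not lose data."""
    fifo = 0
    for _ in range(stall_bytes):
        if fifo == fifo_size:
            if can_pause:
                return True    # handshake: drive waits for the bus
            return False       # dumb drive keeps sending -> overrun
        fifo += 1
    return True                # stall ended before the FIFO filled

# A long stall delivers 400 bytes against a 64-byte FIFO:
print(transfer_survives(64, 400, can_pause=True))    # True  (A2091-style)
print(transfer_survives(64, 400, can_pause=False))   # False (A2090 ST-506)
```

A big enough FIFO is the other way out: with `fifo_size` larger than the
stall, even the non-pausing drive survives, which is the HardFrame approach
described earlier in the thread.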
33014-18@sjsumcs.sjsu.edu (Eduardo Horvath) (11/17/89)
Can you DMA directly into FAST RAM, or is it necessary to go through CHIP
RAM?  If a controller DMA'd into FAST RAM, wouldn't that solve the problem
of contention with the custom chips?

===============================================================================
      //x                    =
     ///  \                  =   Try: 33014-18@sjsumcs.SJSU.EDU
    ///    \                 =   Early to bet
   ///      \                =   And early to raise
  ///        \               =   Makes a man poor
\\\ ///=======\              =   In a gambling craze!
 \\\///        \             =        -me
  \xxx          \miga.  The computer for the corruptive mind.

                 Eduardo Horvath
===============================================================================
pds@quintus.UUCP (Peter Schachte) (11/17/89)
In article <14035@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
[Severely edited]
->There are "handshaking" wires on a SCSI connector that
->lets the computer tell the SCSI disk drive how fast it can accept data....
->the A2090's SCSI adapter doesn't properly use the available handshaking to
->slow the SCSI disk drive down when it's FIFO fills.... fixed on the A2091...
Does the A590 have this problem, or is it more like the A2091 than the
A2090A?
--
-Peter Schachte
pds@quintus.uucp
...!sun!quintus!pds
steveb@cbmvax.UUCP (Steve Beats) (11/17/89)
In article <1284@quintus.UUCP> pds@quintus.UUCP (Peter Schachte) writes:
>In article <14035@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
>[Severely edited]
>->the A2090's SCSI adapter doesn't properly use the available handshaking to
>->slow the SCSI disk drive down when it's FIFO fills.... fixed on the A2091...
>
>Does the A590 have this problem, or is it more like the A2091 than the
>A2090A?

No, the A590 uses the same DMA chip as the A2091; the problem is fixed.  I
have tested the driver with multiple fast SCSI drives and a 4 plane
overscanned hi-res screen.  There is some slowdown, but not much.  Data is
certainly never lost.

	Steve
daveh@cbmvax.UUCP (Dave Haynie) (11/18/89)
in article <1989Nov16.185706.29328@sjsumcs.sjsu.edu>, 33014-18@sjsumcs.sjsu.edu
(Eduardo Horvath) says:

> Can you DMA directly into FAST RAM,

Yes.  In fact, it's greatly preferred and recommended.

> If a controller DMA'd into FAST RAM, wouldn't that solve the problem
> of contention with the custom chips?

It can solve most of the problem.  There are two components to the DMA
transfer.  DMA to Fast memory will solve the second, which is the basic
transfer rate for whatever block size the controller transfers in one
chunk.

The first problem is what I call "DMA lag", or how long it takes from the
time your controller asks for the bus to when it actually gets the bus.  In
order to acquire the bus, the CPU must be finished with its current bus
cycle.  If the CPU is in wait states, waiting for access to the chip bus,
the DMA controller will have to wait for the CPU to finish its cycle (eg,
wait for the chip bus to be free) before it can take over the bus.  DMA
controllers often transfer a whole block (512 bytes) in several DMA passes,
so it's actually possible to incur this lag several times for each block,
if your CPU is doing lots of stuff with video memory.

Also, if you have an autoboot controller of any kind that copies its code
to RAM before using it, you get slowdowns if your autoboot card is the
first one in the machine, since that code will get copied into chip memory.
So, unless you know your code is running from ROM, or you have something
like an A2620/A2630 that puts autoconfig RAM in before your device is
configured, it's best to put a memory card in before your device.
Hopefully all-in-one memory/disk cards autoconfig the memory before the
disk.

-- 
Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: hazy     BIX: hazy
                  Too much of everything is just enough
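Dave's "DMA lag" accumulates once per bus acquisition, so splitting a block
into many DMA passes multiplies it.  A quick model (all figures here are
made up for illustration, not measured Amiga timings):

```python
# "DMA lag" accumulation: each DMA pass re-requests the bus and may
# wait for the CPU's current (possibly stalled) cycle to finish.
# Moving one 512-byte block in several passes pays that lag once
# per pass, on top of the raw transfer time.

def block_time_us(passes, lag_us, xfer_us):
    """Total time for one block: per-pass bus-acquisition lag
    plus the raw DMA transfer time."""
    return passes * lag_us + xfer_us

RAW = 100.0    # hypothetical raw DMA time for 512 bytes, microseconds
LAG = 50.0     # hypothetical worst-case lag per bus acquisition

print(block_time_us(1, LAG, RAW))   # one big pass:  150.0 us
print(block_time_us(8, LAG, RAW))   # eight passes:  500.0 us
```

Even when the data lands in fast RAM, more passes means more exposure to
chip-bus stalls, which is why the display activity still matters.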
swarren@eugene.uucp (Steve Warren) (11/18/89)
In article <1989Nov16.185706.29328@sjsumcs.sjsu.edu> 33014-18@sjsumcs.SJSU.EDU
(Eduardo Horvath) writes:
>
> Can you DMA directly into FAST RAM, or is it necessary to go through
> CHIP RAM?  If a controller DMA'd into FAST RAM, wouldn't that solve
> the problem of contention with the custom chips?

Yeah, but when the data is needed in the chip space (sound data or graphics
data), or when the machine is all chip RAM, then you still need to move the
bytes into contention-land.  When the destination is fast RAM there isn't a
problem.

On the same subject, has anyone seen the misleading product sheet put out
for the KRONOS SCSI controller?  It claims that DMA is fundamentally flawed
when trying to access chip RAM.  This is absolutely false.  Certain
controllers were not designed to handle contention gracefully, but the fact
that they were DMA was irrelevant.  The only reason the KRONOS is fast is
because the controller-to-memory path is 16 bits wide.  If they would just
say it like it is I would be more impressed with their product.  I am
suspicious when they erect straw men.

                                        --Steve
-------------------------------------------------------------------------
          {uunet,sun}!convex!swarren; swarren@convex.COM