alex@.UUCP (Alex Laney) (01/19/88)
[Don't have any reference for the '030 yet]

Since CSA has an '030 card coming out (which sounds great), does anyone know if the '030 boots up in a manner compatible with the '020?  What I'm wondering is whether the MMU on the Commodore 68020 card is going to constrain the internals of Exec/Dos to not use the MMU on the '030.  I doubt that it is compatible, but hopefully it does a subset, at least.

It sounds like another round of "well, you can upgrade to the new CPU, but you can't use any of the features of it."  You know, turning off caches, etc., seems to be mandatory with the '020.  And the '020 really only speeds up with 32-bit memory, as well.  I suppose the rumored(:-) Unix from Commodore probably won't run on an '030, because of the MMU issue.  I assume that if CSA is releasing an '030 card, the current Exec/Dos runs on it.

I hope that some of the lag time until the next OS release is used to look at these issues.  I think everyone wants over time to upgrade to at least the '020, then the '030 [of course, the chips have to go down to $5 each :-)], so let's try to make the path less bumpy!

I hope this is a change of pace from piracy/virus/mtask flaming!
-- 
Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
Xicom Technologies, 205-1545 Carling Av., Ottawa, Ontario, Canada
We may have written the SNA software you use.  The opinions are my own.
daveh@cbmvax.UUCP (Dave Haynie) (01/26/88)
in article <494@.UUCP>, alex@.UUCP (Alex Laney) says:
> Summary: What about the new '030 Card?
> [Don't have any reference for the '030 yet]
>
> ...is the MMU on the Commodore 68020 card going to constrain the
> internals of Exec/Dos to not use the MMU on the '030.  I doubt if it is
> compatible, but hopefully it does a subset, at least.

The 68030's MMU is in fact a subset of the 68851 on the A2620 card.

> It sounds like another round of, well you can upgrade to the new CPU, but
> you can't use any of the features of it.  You know, turning off caches, etc.,
> seems to be mandatory with the '020.

Not at all.  The Amiga OS does in fact turn on the 68020 cache when it starts up, and most things run without any trouble with the cache on.  A few games are apparently running self-modifying code or something else that causes a problem with the cache, but most Amiga software runs fine.

> And the '020 really only speeds up with 32-bit memory, as well.

Well, the A2620 runs our production line test a bit faster than the 68000 even with its on-board memory disabled.  It has going for it the cache, and the fact that anything happening on-chip is still going to happen at 14.3 MHz.  And 68881 performance goes up with the 68020's real coprocessor interface.  But for plain old integer operations, the fast 32-bit memory helps a lot.

> I hope that some of the lag time until the next OS release is used to look
> at these issues.  I think everyone wants over time to upgrade to at least the
> '020, then the '030, [of course, the chips have to go down to $5 each :-)] so
> let's try to make the path less bumpy!

The '030 running with data cache enabled introduces the most problems.  Still, many of these can be solved in hardware if the designer takes the time; you really do want to use that data cache!
> Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
-- 
Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
{ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
"I can't relax, 'cause I'm a Boinger!"
harald@ccicpg.UUCP ( Harald Milne) (01/28/88)
In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> The '030 running with data cache enabled introduces the most problems.  Still,
> many of these can be solved in hardware if the designer takes the time;
> you really do want to use that data cache!

Amen!

The only problem I can imagine at this point is DMA to FAST ram.  Would anybody be silly enough to do this?  This will kill the 680x0.  Well, kind of; after all, this is still the Amiga!  You get kinda jaded, knowing all the burdens the coprocessors remove.

I'm really curious how AudioMaster can play 11 minutes of digitized sound (obviously with 9.5 meg of ram).  How about an entire Compact Disk off an Ethernet connection!  I figure about 50 meg!  Beats .6 giga.

> Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
> {ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
> "I can't relax, 'cause I'm a Boinger!"

I've got a sneaking suspicion that, to get a few thousand miles away, you were at AmiExpo!  Darn.
-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business!  Home of the CCI POWER 6/32)
UUCP: uunet!ccicpg!harald
harald@ccicpg.UUCP ( Harald Milne) (01/28/88)
In article <2044@antique.UUCP>, cjp@antique.UUCP (Charles Poirier) writes:
> In article <494@.UUCP> alex@.UUCP writes:
> >Since CSA has an '030 card coming out, (which sounds great), does anyone
> >know if the '030 boots up in a manner compatible with the '020.  What I'm
>
> A friend who works for a competitor of CSA (so apply grains of salt to
> taste) says to be wary of the CSA 68030 board.  Supposedly they did not
> do a "real" design at all, but rather used verbatim the circuit from
> Motorola's application notes that basically says "Here's how to get the
> 68030 to work exactly like a 68020."  I.e., they cripple it.  End quote.

This wouldn't surprise me.  Being 68020 compatible is a safe move, with no OS issues.  That would be crippled.

/* Personal opinion follows */

I liked the fact that CSA pursued Amiga performance in terms of hardware.  BUT, I think CSA assumes we want to pay IBM and MacII prices!  I think CSA said "performance at all cost," and I think that this is a bad engineering decision.  The goal should have been the best price/performance ratio.  Brute force is not always a win, especially with a "cost is no object" frame of mind.

That's why I sighed with relief on seeing at least the Hurricane board.  Competition!  And now with the A2620 from CBM, we may even achieve economy of scale!  I'm waiting.

/* End of my bullshit opinion */

> Charles Poirier   (decvax,ihnp4,attmail)!vax135!cjp
>
> "Docking complete...  Docking complete...  Docking complete..."
-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business!  Home of the CCI POWER 6/32)
UUCP: uunet!ccicpg!harald
daveh@cbmvax.UUCP (Dave Haynie) (02/02/88)
in article <10170@ccicpg.UUCP>, harald@ccicpg.UUCP ( Harald Milne) says:
> In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>> The '030 running with data cache enabled introduces the most problems.  Still,
>> many of these can be solved in hardware if the designer takes the time;
>> you really do want to use that data cache!
>
> Amen!
>
> The only problem I can imagine at this point, is DMA to FAST ram.
> Would anybody be silly enough to do this?  This will kill the 680x0.

This happens all the time with things like hard disk drives.  It sure does hurt the 68000's speed, but consider the alternative.  You've got to get that disk data into memory somehow.  If you make the 68000 go and read it from an I/O port somewhere, you're running several memory cycles per data transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to RAM, instruction fetch, test and branch, something like that.  Once a DMA driven controller is set up (simple, nothing like setting up the blitter), you have a bus arbitration, then one word transferred by the controller per memory cycle.  If you're a 68020, you may even run a little from cache after the arbitration.  So this is much faster than possible without DMA.

> I got a sneaky suspicion, that to get a few thousand miles away, you
> were at AmiExpo!  Darn.

No, actually, Paradise Island, The Bahamas.  Didn't make AmiExpo.  Woulda been nice too, but I had all this work piled up here.

> Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
>       Irvine, CA (RISCy business!  Home of the CCI POWER 6/32)
> UUCP: uunet!ccicpg!harald
-- 
Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
{ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
"I can't relax, 'cause I'm a Boinger!"
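[Editor's note: Dave's per-word accounting above can be sketched in a few lines.  This is a back-of-the-envelope model using his rough cycle counts (fetch/read/fetch/write/fetch/branch for programmed I/O vs. one memory cycle per word for DMA), not a measurement of any real controller.]

```python
# Rough bus-cycle model of programmed I/O vs. DMA for moving one
# 512-byte sector as 16-bit words on a 68000 (cycle counts are the
# approximate ones from the post above, not measured figures).

# Programmed I/O: instruction fetch, I/O read, instruction fetch,
# RAM write, instruction fetch, test-and-branch -- per word.
pio_cycles_per_word = 6
dma_cycles_per_word = 1      # one transfer per memory cycle once set up

words = 256                  # 512 bytes as 16-bit words
pio = pio_cycles_per_word * words
dma = 1 + dma_cycles_per_word * words   # plus one bus arbitration

assert pio == 1536
assert dma == 257
assert pio > 5 * dma         # DMA wins by better than 5x on bus cycles
```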
peter@sugar.UUCP (Peter da Silva) (02/03/88)
In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> in article <494@.UUCP>, alex@.UUCP (Alex Laney) says:
> > you can't use any of the features of it.  You know, turning off caches, etc.,
> > seems to be mandatory with the '020.
>
> few games are apparently running self-modifying code or something that
> causes a problem with the cache, but most Amiga software runs fine.

Well, I'd think that you'd probably want to invalidate the cache when you LoadSeg() something... just in case it's LoadSegging it at the same address as something that's already in the cache.  It's a real long-shot, but it's almost certain it's gonna hit someone sometime.
-- 
-- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
-- Disclaimer: These U aren't mere opinions... these are *values*.
alex@xicom.UUCP (Alex Laney) (02/04/88)
In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>
> The 68030's MMU is in fact a subset of the 68851 on the A2620 card.

Welllll, the fact that the A2620 card is using the Motorola MMU is news to me.  What I had read before is that Commodore was intending to use a custom MMU.  So, I'm happy!  [I know that some people don't care for Motorola MMUs based on past experience, but it's too late for that.]

Is there a release date other than RSN?  Or even a release date on specs, etc., that may include a release date of the board?  Just wondering ...
-- 
Alex Laney   alex@xicom.UUCP   ...utzoo!dciem!nrcaer!xios!xicom!alex
Xicom Technologies, 205-1545 Carling Av., Ottawa, Ontario, Canada
We may have written the SNA software you use.  The opinions are my own.
stever@videovax.Tek.COM (Steven E. Rice, P.E.) (02/05/88)
In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:

[ discussion of, among other things, DMA to fast ram ]

> This happens all the time with things like hard disk drives.  It sure does
> hurt the 68000's speed, but consider the alternative.  You've got to get
> that disk data into memory somehow.  If you make the 68000 go and read it
> from an I/O port somewhere, you're running several memory cycles per data
> transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
> RAM, instruction fetch, test and branch, something like that.  Once a DMA
> driven controller is set up (simple, nothing like setting up the blitter),
> you have a bus arbitration, then one word transferred by the controller per
> memory cycle.  If you're a 68020, you may even run a little from cache after
> the arbitration.  So this is much faster than possible without DMA.

This is true for a 68000 or 68010, and perhaps even for a 68020 or 68030 on a 16-bit-wide bus.  However, for best performance you want to put the DMA peripherals on one side of a dual-ported memory and let the CPU do the data moving.  Why?  The reasons are as follows:

  1. Most DMA peripherals are incredibly sluggish.  An example is the
     LANCE, an Ethernet interface chip.  It transfers data in blocks of
     eight 16-bit words.  The *minimum* time to perform this transfer is
     4.8 microseconds, with no-wait-state memory.  Add arbitration time
     to this and it becomes more like 5.1 microseconds.  And if you
     can't complete a memory cycle in less than 105 nanoseconds, each
     cycle (remember, there are eight of them!) gets longer in
     100-nanosecond steps.

     To keep up with the Ethernet, the LANCE will arbitrate for the bus
     about every 12.8 microseconds, tying it up for 5.1 microseconds
     minimum.  This is about 40% of the bus bandwidth.

  2. On a 32-bit bus, the 68020 can move data very efficiently -- once
     the instructions have been loaded into the cache, the only thing on
     the bus will be (32-bit) data transfers.
Even with reasonably slow memory (180-nanosecond access, 300-nanosecond cycle time), this means that the 68020 can transfer data twice as fast as a LANCE running on 100-nanosecond access memory.

If you dual-port the LANCE memory properly (32 bits wide to the 68020, 16 bits wide to the LANCE), you can move the data from the dual-ported memory *while* the LANCE is transferring other data into it, thus achieving an effective doubling of the transfer rate and freeing the bus for other purposes the rest of the time.

The same thing applies to hard disks, too.  The 68020 can sustain a 48 Mbit/second transfer rate.  Typical hard disks run at 5 to 10 Mbit/second rates.  Unless the hard disk interface is fast as greased lightning *and* 32 bits wide, the 68020 or 68030 can move the data faster!

So, for maximum performance, hide your peripherals behind dual-ported memory, and then mark those pages as "non-cacheable."

					Steve Rice
-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
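[Editor's note: the 40% bus-occupancy figure quoted for the LANCE follows directly from the numbers in the post; here is the arithmetic spelled out, using only figures stated above.]

```python
# LANCE bus occupancy: an eight-word burst takes 4.8 us minimum with
# no-wait-state memory, plus ~0.3 us of arbitration, and one burst is
# needed every 12.8 us to keep up with Ethernet.
burst_us = 4.8 + 0.3      # transfer time plus arbitration
period_us = 12.8          # one SILO full per 12.8 us
fraction = burst_us / period_us

assert abs(fraction - 0.40) < 0.01   # "about 40% of the bus bandwidth"
print(f"LANCE occupies {fraction:.0%} of the bus")
```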
daveh@cbmvax.UUCP (Dave Haynie) (02/05/88)
in article <1431@sugar.UUCP>, peter@sugar.UUCP (Peter da Silva) says:
>
> In article <3200@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>> in article <494@.UUCP>, alex@.UUCP (Alex Laney) says:
>> > you can't use any of the features of it.  You know, turning off caches, etc.,
>> > seems to be mandatory with the '020.
>
> Well, I'd think that you'd probably want to invalidate the cache when you
> LoadSeg() something... just in case it's LoadSegging it at the same address
> as something that's already in the cache.  It's a real long-shot, but it's
> almost certain it's gonna hit someone sometime.

True.  Unless you can be certain that the LoadSeg function itself fills up the cache.  That would be the simplest way to handle this without having to make the 68020 a special case.  Certainly in the future, when data caches and larger instruction caches are being used, the cache will have to be explicitly dumped in such cases.  I'm not sure whether LoadSeg actually does this or not, but the cache is only 64 longwords long; it doesn't take long to overrun this.  So I expect that implicit cache clearing works in this case.  Any OS gurus out there know fer shure?

> -- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
> -- Disclaimer: These U aren't mere opinions... these are *values*.
-- 
Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
{ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
"I can't relax, 'cause I'm a Boinger!"
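[Editor's note: the stale-code hazard Peter and Dave are discussing is easy to see in a toy model.  This is a minimal simulation of a 64-entry direct-mapped instruction cache, not the real 68020 cache logic; the addresses and "instructions" are invented for illustration.]

```python
# Toy direct-mapped instruction cache (64 longword entries) showing why
# loading new code at a previously cached address needs an invalidate.
CACHE_LINES = 64

class ICache:
    def __init__(self):
        self.tags = {}   # line index -> full address tag
        self.data = {}   # line index -> cached "instruction"

    def fetch(self, addr, memory):
        line = (addr >> 2) % CACHE_LINES
        tag = addr >> 2
        if self.tags.get(line) == tag:
            return self.data[line]      # hit -- possibly stale!
        self.tags[line] = tag           # miss: fill from memory
        self.data[line] = memory[addr]
        return self.data[line]

    def invalidate(self):
        self.tags.clear()
        self.data.clear()

memory = {0x1000: "old NOP"}
cache = ICache()
assert cache.fetch(0x1000, memory) == "old NOP"

# A hypothetical LoadSeg() reuses the same address for new code...
memory[0x1000] = "new RTS"
# ...and without invalidation the CPU still sees the old instruction:
assert cache.fetch(0x1000, memory) == "old NOP"

cache.invalidate()
assert cache.fetch(0x1000, memory) == "new RTS"
```

Dave's point is that the real cache holds only 64 longwords, so in practice LoadSeg's own loop usually evicts everything anyway; the explicit invalidate only becomes essential with the larger caches he anticipates.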
harald@leo.UUCP ( Harald Milne) (02/05/88)
In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> in article <10170@ccicpg.UUCP>, harald@ccicpg.UUCP ( Harald Milne) says:
> > The only problem I can imagine at this point, is DMA to FAST ram.
> > Would anybody be silly enough to do this?  This will kill the 680x0.
>
> This happens all the time with things like hard disk drives.  It sure does
> hurt the 68000's speed, but consider the alternative.

I'm painfully aware of the alternative.  My question was a bit rhetorical.  I have an A1000 at home with HD and 1.75 meg, and an A2000 at work with Ethernet and 3 meg.  You are right, you have to DMA to get reasonable performance.

My reference to 680x0 was in reference to the 68030, 68020, and 68000, and the best solution overall.  More specifically, the 68030.  I think we have hit the hard spot.  Solutions?  Hmm....

Somehow, for the 68030 at least, you have to invalidate these entries in the cache, or prevent them from ever appearing.  To invalidate is a software solution.  Preventing them from appearing could possibly be done by hardware, or even by a combination of hardware/software via the MMU.  The real question is which yields the most performance while maintaining compatibility.  Hmmm....  I have to think about this a bit.  This gets even tougher when you consider all the possible configurations, memory timings, etc.  Ack!

Looks like your prophecy in AC comes true after all: "And on the '020 vs. '030 question, we may have a surprise or two you aren't considering."  This sure gives me enough to chew on.

> Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
> {ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
> "I can't relax, 'cause I'm a Boinger!"
-- 
Work: Computer Consoles Inc. (CCI), Advanced Development Group (ADG)
      Irvine, CA (RISCy business!  Home of Regulus and hamiga)
UUCP: uunet!ccicpg!leo!harald
daveh@cbmvax.UUCP (Dave Haynie) (02/10/88)
in article <4822@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:

Summary: DMA is still *FAST*er

> Summary: DMA is the *SLOW* way to go!
>
> In article <3246@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
>> This happens all the time with things like hard disk drives.  It sure does
>> hurt the 68000's speed, but consider the alternative.  You've got to get
>> that disk data into memory somehow.  If you make the 68000 go and read it
>> from an I/O port somewhere, you're running several memory cycles per data
>> transfer.  I mean, instruction fetch, I/O fetch, instruction fetch, write to
>> RAM, instruction fetch, test and branch, something like that.  Once a DMA
>> driven controller is set up (simple, nothing like setting up the blitter),
>> you have a bus arbitration, then one word transferred by the controller per
>> memory cycle.  If you're a 68020, you may even run a little from cache after
>> the arbitration.  So this is much faster than possible without DMA.
>
> This is true for a 68000 or 68010, and perhaps even for a 68020 or 68030 on
> a 16-bit-wide bus.  However, for best performance you want to put the DMA
> peripherals on one side of a dual-ported memory and let the CPU do the
> data moving.

No, what you want is intelligently designed peripherals.

> Why?  The reasons are as follows:
>
> 1. Most DMA peripherals are incredibly sluggish...
>
>    To keep up with the Ethernet, the LANCE will arbitrate for the
>    bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>    minimum.  This is about 40% of the bus bandwidth.

This is why we have things like FIFOs.  Even the 68020 running with cache enabled typically uses only around 50% of the bus bandwidth.  This is not a bad thing, though, but a good argument for DMA.

> 2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>    instructions have been loaded into the cache, the only thing on the
>    bus will be (32-bit) data transfers.
>    Even with reasonably slow
>    memory (180-nanosecond access, 300-nanosecond cycle time), this means
>    that the 68020 can transfer data twice as fast as a LANCE running
>    on 100-nanosecond access memory.

Like I said, intelligently designed peripherals.  Let's look at a hard disk controller with FIFO.  The Amiga 2090 controller is such a beast.  Though only a 16 bit device, the same principals work in 32 bit land.

So my hard disk controller is chugging away, fetching data from the relatively slow hard disk and stuffing this in the FIFO.  It sees the FIFO filling up, and interrupts the 68020.  The '020 springs to action, being that the disk is run by a high priority task that was just waiting on this interrupt.  So far we have to do this whether the disk controller is DMA or shared memory.

Now let's consider the shared memory.  Say we've got 512 bytes to move.  You jump into a block move routine, where the cache immediately gets set up with the move code after the first loop pass.  You've got one memory cycle to read the data from shared RAM, one memory cycle to stuff it into your destination RAM.  So you get 256 memory cycles, plus maybe 2 extra for cache setup.

Now we go to the DMA controller, moving the same 512 bytes.  We have to set up the controller with the destination RAM address; that should take maybe 3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next, maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we already have the data at hand, so all the controller has to do is stuff it in memory.  That's 128 memory cycles.  And another to re-arbitrate.

So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all by itself.
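[Editor's note: Dave's 136-vs-258 comparison checks out; here it is as explicit arithmetic, using his estimated setup and arbitration costs.]

```python
# Back-of-the-envelope check of Dave's cycle counts for moving 512
# bytes over a 32-bit bus (setup/arbitration costs are his estimates).
BYTES = 512
WORDS = BYTES // 4           # 32-bit longword transfers

# CPU block move: one read + one write cycle per longword, plus ~2
# cycles to get the move loop into the instruction cache.
cpu_cycles = 2 * WORDS + 2
assert cpu_cycles == 258

# DMA: the controller already holds the data in its FIFO, so only the
# write cycles hit the bus, plus setup and (re)arbitration overhead.
dma_cycles = 3 + 3 + 1 + WORDS + 1
assert dma_cycles == 136
```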
> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
> 16 bits wide to the LANCE), you can move the data from the dual-ported
> memory *while* the LANCE is transferring other data into it, thus
> achieving an effective doubling of the transfer rate and freeing the
> bus for other purposes the rest of the time.

I get the exact same effect with my FIFO, only through use of DMA I'm tying up the bus much less.

But not really, unless you've got some screaming RAM in that dual port section.  Maybe you can use some true dual-ported SRAM, or a FIFO like what we've got on this hard disk controller, but if you're talking DRAM, forget it; the 68020's going to eat all the available time on anything in the 80ns or slower range.

> So, for maximum performance, hide your peripherals behind dual-ported
> memory, and then mark those pages as "non-cacheable."

There's no question that having a peripheral device dump to shared RAM is much better than directly banging it with the CPU, Macintosh style.  And for very small transfer situations, it's better.  A DMA controller has a fixed setup time.  But if you're transferring more than a few bytes at a time, DMA is a win.  And unless you're dealing with something that needs immediate response (e.g., you can't wait until you've got 64 or 512 or whatever bytes to block transfer), DMA is still a win on a 68020 system, if done correctly.  The 68020 at 32 bits/transfer will tie a 16 bit DMA device at transfer rate, plus it's got less setup, so you definitely want that DMA to be 32 bits wide.

Finally, in a decent system, you can have DMA on your backplane going at the same time you've got CPU access going on your local bus, so the DMA won't always kick the CPU off the bus.  Amigas aren't doing it this way, yet.

> Steve Rice
-- 
Dave Haynie   "The B2000 Guy"   Commodore-Amiga   "The Crew That Never Rests"
{ihnp4|uunet|rutgers}!cbmvax!daveh   PLINK: D-DAVE H   BIX: hazy
"I can't relax, 'cause I'm a Boinger!"
hah@mipon3.intel.com (Hans Hansen) (02/13/88)
In article <3291@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
$in article <4822@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
$> Summary: DMA is the *SLOW* way to go!
$Summary: DMA is still *FAST*er
What both of you are overlooking is the fact that the system w/o DMA must
do a task switch each time it goes to the well, (~50us/68000, ~30us/68020,
~20us/68030). As the transfer data is coming in sloooooooowly the processor
is constantly switching tasks to service the "deadhead port" instead of
being left alone to calculate the next iteration of the Ray Tracing, setting
up the next pretty picture, balancing YOUR checkbook (sic).
Hans
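[Editor's note: Hans's task-switch overhead can be put in rough numbers.  The per-switch times are the ones he quotes; the chunk arrival period is an invented example figure, not from the thread.]

```python
# If each visit to a non-DMA "deadhead port" costs a full task switch,
# slowly trickling data eats a large fraction of the CPU.
switch_us = {"68000": 50, "68020": 30, "68030": 20}  # Hans's figures

chunk_period_us = 100    # hypothetical: one small chunk every 100 us
for cpu, cost in switch_us.items():
    lost = cost / chunk_period_us
    print(f"{cpu}: {lost:.0%} of CPU spent just switching tasks")
```

Even on a 68030, a fifth of the machine would go to context switching alone under this (hypothetical) arrival rate, before a single byte is actually copied.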
stever@videovax.Tek.COM (Steven E. Rice, P.E.) (02/20/88)
Hmmmm. . .  I expressed my belief that (at least in a 32-bit wide 68020 system) "DMA is the *SLOW* way to go!"  In article <3291@cbmvax.UUCP>, Dave Haynie (daveh@cbmvax.UUCP) replied:

> Summary: DMA is still *FAST*er

Now, don't get me wrong -- I'm not suggesting that we go back to the bad old days of "programmed data transfers" (i.e., interrupt-per-byte transfers, with the CPU stacking and unstacking its entire context for each byte that comes in or goes out).  Long, long ago, in a galaxy far, far away, I did that with a 6800 (our options were limited).  Maximum data transfer rate was about 20K bytes per second, using every CPU cycle that was available.

However, I will continue to insist that there are some things that are not fit for genteel company, and should be relegated to an appropriate closet.  And right at the top of my list of such things is DMA I/O!!!

In my previous article, I suggested:

>> . . . However, for best performance you want to put the DMA
>> peripherals on one side of a dual-ported memory and let the CPU do the
>> data moving.

Dave disagreed:

> No, what you want is intelligently designed peripherals.

(AMD may be bent out of shape at such calumnies!!)  But I would suggest that the reasons I gave are valid:

>> Why?  The reasons are as follows:
>
>> 1. Most DMA peripherals are incredibly sluggish...
>
>>    To keep up with the Ethernet, the LANCE will arbitrate for the
>>    bus about every 12.8 microseconds, tying it up for 5.1 microseconds
>>    minimum.  This is about 40% of the bus bandwidth.
>
> This is why we have things like FIFOs.  Even the 68020 running with cache
> enabled typically uses only around 50% of the bus bandwidth.  This is not
> a bad thing, though, but a good argument for DMA.

I guess I wasn't being quite as explicit as I should have been!  First, the LANCE contains its own FIFO (they call it a "SILO").  Second, when I was talking about the LANCE taking up 40% of the bus bandwidth, I didn't relate it to the transfer efficiency.
So, let me give an example I know well -- our system:

  -- 16.67 MHz 68020 on a 32-bit wide bus.

  -- Actual memory access time about 240 nsec (from assertion of AS' to
     the CPU responding to DSACKx' by un-asserting AS').  Full memory
     cycle time about 330 nsec (provides RAS' precharge time).  Memory
     is asynchronous to the processor.

  -- LANCE Ethernet interface behind a 128K byte dual-ported memory
     which is organized as 32K x 32 bits from the 68020's perspective
     and 64K x 16 bits from the LANCE's perspective.

The LANCE (along with its companion, the SIA) is an integrated solution to Ethernet interfacing.  The LANCE manages its own "rings" of input and output buffers, discriminates against messages that aren't intended for it (it recognizes when it is addressed), and performs all the housekeeping functions associated with Ethernet packet creation and validation.  Thus, the LANCE can receive and store a complete (maximum length 1536 octet) Ethernet packet before it pulls the CPU's chain.

For all that it interfaces to a fast bus (Ethernet is a 10 Mbit/sec data transfer rate), the LANCE has some disadvantages.  It has a minimum 600 nsec data transfer time with 100 nsec memory.  With our memory, which responds in about 240 nsec, the LANCE would have an 800 nsec nominal data transfer cycle.  Thus, the LANCE would transfer eight 16-bit words (one SILO full) every 12.8 microseconds, tying up the CPU bus for about 6.7 microseconds, which is 52% of the available CPU bus bandwidth.  The LANCE can transfer only 16 bits with each memory cycle.
Thus, its data transfer rate, during the time it is using the bus, is:

    (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second

On the other hand, in our system the 68020 has an effective data transfer rate (once the cache is loaded with the instructions) of:

    (1 long word) * (32 bits) / (330 nanoseconds) = 96 Mbits/second

If you cut that in half to reflect the fact that the 68020 has to both pick the (32-bit long word) up and store it away, it still has a data transfer rate of 48 Mbits/sec, which is over twice that of the LANCE.

>> 2. On a 32-bit bus, the 68020 can move data very efficiently -- once the
>>    instructions have been loaded into the cache, the only thing on the
>>    bus will be (32-bit) data transfers.  Even with reasonably slow
>>    memory (180-nanosecond access, 300-nanosecond cycle time), this means
>>    that the 68020 can transfer data twice as fast as a LANCE running
>>    on 100-nanosecond access memory.
>
> Like I said, intelligently designed peripherals.  Let's look at a hard disk
> controller with FIFO.  The Amiga 2090 controller is such a beast.  Though
> only a 16 bit device, the same principals work in 32 bit land.

Most principals work in schools. . .

> So my hard disk controller is chugging away, fetching data from the
> relatively slow hard disk and stuffing this in the FIFO.  It sees the FIFO
> filling up, and interrupts the 68020.  The '020 springs to action, being
> that the disk is run by a high priority task that was just waiting on this
> interrupt.  So far we have to do this whether the disk controller is
> DMA or shared memory.
>
> Now let's consider the shared memory.  Say we've got 512 bytes to move.  You
> jump into a block move routine, where the cache immediately gets set up with
> the move code after the first loop pass.  You've got one memory cycle to read
> the data from shared RAM, one memory cycle to stuff it into your destination
> RAM.  So you get 256 memory cycles, plus maybe 2 extra for cache setup.
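[Editor's note: Steve's two rate formulas and the 52% occupancy figure are internally consistent; here is the arithmetic checked directly, all figures from the post.]

```python
# LANCE: eight 16-bit words per SILO full, ~6.7 us of bus time with
# 240 ns memory, one SILO full every 12.8 us.
lance_rate = (8 * 16) / 6.7        # Mbit/s while it holds the bus
assert abs(lance_rate - 19.1) < 0.1

# 68020: one 32-bit longword per 330 ns memory cycle...
cpu_rate = 32 / 0.330              # Mbit/s
assert 96 <= cpu_rate <= 97
# ...halved because each longword must be both read and written.
assert 48 <= cpu_rate / 2 <= 48.5

# Bus-bandwidth fraction the LANCE eats on this system:
assert abs(6.7 / 12.8 - 0.52) < 0.01
```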
>
> Now we go to the DMA controller, moving the same 512 bytes.  We have to set
> up the controller with the destination RAM address, that should take maybe
> 3 cycles.  Give it another 3 to tell the DMA controller to go ahead.  Next,
> maybe a cycle to arbitrate the bus.  Now we run the DMA transfer.  But we
> already have the data at hand, so all the controller has to do is stuff it
> in memory.  That's 128 memory cycles.  And another to re-arbitrate.
>
> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
> by itself.

Now, let's come back down to earth!  We (Tektronix Television Systems) have a 68020-based professional television measurement instrument (the VM700) that is just about ready to ship to customers.  It is 32 bits wide all over the place, for maximum data transfer rate consistent with reasonable cost.

While I will admit that the principle of the A2090 would work just fine if one could only do it 32 bits wide, in fact it is not (reasonably) possible for us to do it 32 bits wide!  Why?  Well, we will probably ship about as many instruments in one year as Commodore ships Amigas in a week.  (Not bad for an instrument with a sticker price of $16,495!)  So, I can't afford to go out and generate a 32-bit wide DMA chip with a 512-byte onboard FIFO.  I have to use what I can buy from Motorola or Hitachi or whomever.

Believe me, we did look at DMA chips before making the basic system design decisions -- and the DMA chips are nearly as bad as the LANCE!  Minimum DMA cycle time I found was 500 nsec, again assuming nearly instantaneous memory response.  And the best of them were only 16 bits wide.

>> If you dual-port the LANCE memory properly (32 bits wide to the 68020,
>> 16 bits wide to the LANCE), you can move the data from the dual-ported
>> memory *while* the LANCE is transferring other data into it, thus
>> achieving an effective doubling of the transfer rate and freeing the
>> bus for other purposes the rest of the time.
>
> I get the exact same effect with my FIFO, only through use of DMA I'm tying
> up the bus much less.
>
> But not really, unless you've got some screaming RAM in that dual port
> section.  Maybe you can use some true dual-ported SRAM, or a FIFO like
> what we've got on this hard disk controller, but if you're talking DRAM,
> forget it, the 68020's going to eat all the available time on anything
> in the 80ns or slower range.

Remember, our system memory access is about 240 nsec (asynchronous).  The dual-ported RAM on the LAN card is made of 4, 32K x 8 bit static RAM chips, and a boatload of SSI, MSI, and PALs.  The static parts are garden-variety, 150 nsec parts, but the actual memory access time is about 240 nsec, because there is clock-driven, no-deadlock, positive arbitration logic to ensure that one and only one customer gets the memory at a time [it works, too! 8^) ].  (Signetics now has a chip that allows you to do the same thing with dynamic RAMs -- it even takes care of the refresh!)

Because of this, the LANCE can access memory once per 800 nsec (or so), and the 68020 can get one or two 32-bit accesses in between each of the LANCE's 16-bit accesses.  Remember, too, that while the LANCE has the bus, its effective data rate is about 19.1 Mbits/second.  Thus, even with the 68020 having to read the data from the dual-port RAM on one memory cycle and write it to system memory on the next memory cycle, the effective data transfer bandwidth for the 68020 is 48 Mbits/second.

Thus, even without the rest of the argument, my conclusion is still:

>> So, for maximum performance, hide your peripherals behind dual-ported
>> memory, and then mark those pages as "non-cacheable."

Consider something else, though.  When you read from or write to your hard disk, the CPU is going to have to copy the data at least once.  On a read from the disk, you do a getchr() (or whatever), which stimulates the system to go read a sector into a buffer of its own.
Then (and only then) it passes a byte back to you. If the disk DMA
transfer occurs on the system bus, the data moves over that bus *twice*
before it gets to the user.

On the other hand, if the hard disk controller board has its own
(dual-ported) memory, which is accessible to the CPU, the DMA can
transfer into dual-ported memory without disturbing the CPU at all. When
the data is passed to the user, it moves over the system bus only once.

> There's no question that having a peripheral device dump to shared RAM
> is much better than directly banging it with the CPU, Macintosh style. And
> for very small transfer situations, it's better. A DMA controller has a
> fixed setup time. But if you're transferring more than a few bytes at a
> time, DMA is a win. And unless you're dealing with something that needs
> immediate response (eg, you can't wait until you've got 64 or 512 or
> whatever bytes to block transfer), DMA is still a win on a 68020 system,
> if done correctly. The 68020 at 32 bits/transfer will tie a 16 bit DMA
> device at transfer rate, plus it's got less setup, so you definitely want
> that DMA to be 32 bits wide.

Agreed that I want the DMA to be 32 bits wide. That is just very
difficult for those of us that cannot crank up a silicon foundry whenever
we get the itch. . .

Note again, that in real life the processor is going to have to copy the
data somewhere else (to the ultimate consumer) once it is DMA-ed into the
system disk buffer. There will be fewer transfers over the system bus
(and thus more cycles available to the CPU) if the DMA moves data from
the disk into dual-ported memory, so it need only pass over the system
bus once.

> Finally, in a decent system, you can have DMA on your backplane going at
> the same time you've got CPU access going on your local bus, so the DMA
> won't always kick the CPU off the bus. Amigas aren't doing it this way,
> yet.

But Amigas will, I hope, I hope, I hope. . .
8^) (By the way, if you've followed what I was saying, that's what we
have in the VM700 -- except the DMA runs on its own private "bus," and
the CPU *always* has the system bus available to it!)

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
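Steve's dual-port throughput figures can be checked with a quick
back-of-the-envelope model. This is only a sketch: the 6.7 us LANCE
burst and the ~330 ns system memory cycle are the numbers quoted in the
thread, and the two-memory-cycles-per-word copy loop is an assumption,
not measured behavior.

```python
# Back-of-the-envelope model of the VM700 LAN card numbers above.
# Assumptions: the LANCE moves 8 x 16-bit words per burst, holding the
# dual-port RAM for ~6.7 us; the 68020 copies one 32-bit word per pair
# of ~330 ns memory cycles (read dual-port RAM, write system RAM).

def lance_burst_rate_mbits(words=8, bits_per_word=16, burst_us=6.7):
    """Effective LANCE data rate while it holds the memory."""
    return words * bits_per_word / burst_us

def cpu_copy_rate_mbits(bits=32, cycle_ns=330, cycles_per_word=2):
    """68020 copy rate: one read plus one write per 32-bit word."""
    return bits / (cycles_per_word * cycle_ns / 1000.0)

print(round(lance_burst_rate_mbits(), 1))  # ~19.1 Mbits/s, as quoted
print(round(cpu_copy_rate_mbits(), 1))     # ~48 Mbits/s, as quoted
```

Both figures land where the post says they should: the copy loop runs
roughly two and a half times faster than the LANCE can fill the buffer,
which is why the interleaved accesses work out.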
daveh@cbmvax.UUCP (Dave Haynie) (03/01/88)
in article <4853@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:

> Keywords: DMA, closet
> Summary: DMA is great -- in its proper place. . .

> Hmmmm. . . I expressed my belief that (at least in a 32-bit wide 68020
> system) "DMA is the *SLOW* way to go!" In article <3291@cbmvax.UUCP>,
> Dave Haynie (daveh@cbmvax.UUCP) replied:

>> Summary: DMA is still *FAST*er

> In my previous article, I suggested:
>
>>> . . . However, for best performance you want to put the DMA
>>> peripherals on one side of a dual-ported memory and let the CPU do the
>>> data moving.

Thus re-creating a situation very much like the way the chip bus works.
Your design forces memory typing (MEMF_CHIP, MEMF_LAN, MEMF_HARDDISK,
etc.).

> I guess I wasn't being quite as explicit as I should have been! ...

> Thus, the LANCE would transfer 8, 16-bit words (one SILO full) every 12.8
> microseconds, tying up the CPU bus for about 6.7 microseconds, which is
> 52% of the available CPU bus bandwidth.
>
> The LANCE can transfer only 16 bits with each memory cycle.

Here we go again with what I meant by intelligently designed peripherals.
If you're on a 32 bit bus, your DMA should be 32 bits wide. And you
should use a larger FIFO, like maybe 64-128 bytes. If you can't do
either or both of these, then, as I showed before, you'll get better
performance from a 68020 move.

> Thus, its data transfer rate, during the time it is using the bus, is:
> (8 words) * (16 bits) / (6.7 microseconds) = 19.1 Mbits/second

No intelligence here! Why would you take over the bus and then just sit
there? If you are only transferring 16 bits at a time, this should give
you half the 68020 rate, 48 Mbits/second, once arbitration has taken
place. A big enough FIFO makes the arbitration time negligible. Extend
this to 32 bits wide and you're at twice the 68020 rate. If this can't
be done from a circuit point of view, either redesign the LAN chip to
make effective use of DMA, or admit that it's a bad design.
If there are other reasons, like the software or user can't handle
buffering delays, then this isn't a good application for DMA, and we can
turn our attention over to problems that are well suited to DMA, like
hard disk controllers. But don't pan DMA because it doesn't fit an
arbitrary case on an arbitrary chip.

>> [Timing analysis removed]
>>
>> So in this case, DMA comes out 136 cycles, vs. 258 if the 68020 moved it all
>> by itself.

> Now, let's come back down to earth!

Naa, that's where IBM does their design work.

> While I will admit that the principle of the A2090 would work just fine if
> one could only do it 32 bits wide, in fact it is not (reasonably) possible
> for us to do it 32 bits wide!

> Why? Well, we will probably ship about as many instruments in one year
> as Commodore ships Amigas in a week. (Not bad for an instrument with a
> sticker price of $16,495!) So, I can't afford to go out and generate a
> 32-bit wide DMA chip with a 512-byte onboard FIFO. I have to use what I
> can buy from Motorola or Hitachi or whomever.

OK, but again, you shouldn't blast the concept of DMA just because you
can't use it in your particular situation. We make lots of Amigas, and
lots of custom chips. Like the DMA chip on the A2090 card. That's only
16 bits in this case, but we're only dealing with a 16 bit bus.

> Remember, our system memory access is about 240 nsec (asynchronous). The
> dual-ported RAM on the LAN card is made of 4, 32K x 8 bit static RAM chips,
> and a boatload of SSI, MSI, and PALs. The static parts are garden-variety,
> 150 nsec parts, but the actual memory access time is about 240 nsec,
> because there is clock-driven, no-deadlock, positive arbitration logic to
> ensure that one and only one customer gets the memory at a time [it works,
> too! 8^) ]. (Signetics now has a chip that allows you to do the same thing
> with dynamic RAMs -- it even takes care of the refresh!)

Well, I make the memory cycle time of a 16.67 MHz 68020 at just under
180ns.
So you're slowing down already. But obviously a DMA device has to follow
the same rules as the 68020.

Now we have this dual ported memory. I certainly believe you can build
an arbiter that'll allow access to the RAM by only one customer at a
time. But what happens when they both want it? It appears to me that
one of them is getting wait stated. That's what I meant by having very
FAST memory there.

The FIFO scheme starts DMA before the FIFO is completely filled, so that
it fills just a bit before the transfer is complete. You get your chunk
of memory DMAed at full bus speed, and you get it from the disk as fast
as it could be received. Now with the dual port scheme, you can start
filling the shared RAM early, too, since your data isn't coming in at
full bus speeds. But eventually you want the transfer to start. If
stuff is still coming into that memory, your transfer is going to suffer
unless the RAM is very fast.

The Amiga's CHIP RAM, for instance, is twice the speed of the 68000
memory cycle, so once you're synced up with it, there are no wait states
in normal operation (eg, blitter's well behaved, graphics are medium
resolutions). So this is a good scheme. If I ran memory at the same
speed as the 68000 memory cycle, I'd hit wait states all the time trying
to access CHIP RAM. What you're describing would only work well if the
shared memory has relatively little truly shared access.

> Consider something else, though. When you read from or write to your
> hard disk, the CPU is going to have to copy the data at least once. On
> a read from the disk, you do a getchr() (or whatever), which stimulates
> the system to go read a sector into a buffer of its own. Then (and only
> then) it passes a byte back to you.

No. The latest Amiga DOS software is set up to read data directly into
its final destination.
From C or whatever language, you may get double buffering if you use
character-by-character I/O, but if you make a direct OS call, the DMA
device can use the given buffers directly.

> Agreed that I want the DMA to be 32 bits wide. That is just very
> difficult for those of us that cannot crank up a silicon foundry whenever
> we get the itch. . .

Oh, well, I guess some of you will always have to live like that :-).

> Note again, that in real life the processor is going to have to copy the
> data somewhere else (to the ultimate consumer) once it is DMA-ed into the
> system disk buffer.

No it isn't. The only time the shared memory scheme wins is if the final
destination happens to be in the area of shared memory, in MEMF_HARDDISK
so to speak. Otherwise, you'll have to do a CPU copy to the final
destination, whereas the DMA device could have put it directly there,
since it can address all of memory. I guess you can always tune your
system software to take advantage of the hardware, and perhaps the other
way 'round too.

> Steve Rice

-- 
Dave Haynie  "The B2000 Guy"  Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh     PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"
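The 136-vs-258 cycle estimate quoted in this exchange can be written out
as a small model. The setup and arbitration costs are the rough guesses
from the original article, and the two-cycle loop overhead for the CPU
move is back-solved from the quoted 258 figure, so treat this as an
illustration rather than a measurement.

```python
# Cycle-count model of moving 512 bytes over a 32-bit bus, following
# the estimate quoted in the thread.  All costs are rough guesses from
# the original article, not measurements.

def dma_cycles(nbytes, bus_bytes=4, setup=3, start=3, arb=1, rearb=1):
    # Program the destination address, tell the controller to go,
    # arbitrate for the bus, stuff the data (already in the FIFO) into
    # memory one longword per cycle, then re-arbitrate.
    return setup + start + arb + nbytes // bus_bytes + rearb

def cpu_move_cycles(nbytes, bus_bytes=4, loop_setup=2):
    # The 68020 must both read and write each longword: two memory
    # cycles per 32-bit word, plus a little loop setup (assumed to be
    # 2 cycles, matching the 258 figure in the thread).
    return loop_setup + 2 * (nbytes // bus_bytes)

print(dma_cycles(512))       # 136, as quoted
print(cpu_move_cycles(512))  # 258, as quoted
```

The gap comes almost entirely from the read-then-write double touch: the
DMA controller pays its setup once, while the CPU pays two memory cycles
for every word it moves.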
stever@videovax.Tek.COM (Steven E. Rice, P.E.) (03/11/88)
Dave Haynie's (daveh@cbmvax) most recent article was number
<3394@cbmvax.UUCP>. In it, he cast aspersions on the poor, struggling
LANCE and suggested that real systems do 32-bit DMA. Well, maybe -- but
if you want to use Ethernet, the LANCE is about the only way to go, slow
or no!

In a perfect world, 32-bit DMA with a 512-byte assembly buffer and
fast-as-a-speeding-bullet burst transfers would be possible. In real
life, we have to make do with what we can buy. (Commodore can build what
it needs; the economics in the Television Test and Measurement market are
different than those in the personal computer market.)

There is another thought, too -- if you have only one DMA device, you
could argue that it shouldn't make much difference if it DMAs into system
RAM or into a dual-ported buffer. If you have more than one device
contending for the system bus, however, multiple dual-ported buffers are
a clear win.

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
daveh@cbmvax.UUCP (Dave Haynie) (03/25/88)
in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
>
> Dave Haynie's (daveh@cbmvax) most recent article was number
> <3394@cbmvax.UUCP>. In it, he cast aspersions on the poor, struggling
> LANCE and suggested that real systems do 32-bit DMA. Well, maybe --
> but if you want to use Ethernet, the LANCE is about the only way to
> go, slow or no!

Calm down! That's not what I said. I said that very high
bandwidth-consuming operations, such as hard disk interfacing, where the
transfer between an I/O device and CPU-addressable main memory can be
sent in large atoms, are best served by DMA, even in a 68020 or 68030
system. I also said that in systems where transfers must occur in small
atoms or at relatively slow speed (like perhaps networks or things which
must be highly interactive), the I/O scheme to shared CPU memory was a
good idea.

> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and
> fast-as-a-speeding-bullet burst transfers would be possible. In real
> life, we have to make do with what we can buy. (Commodore can build
> what it needs; the economics in the Television Test and Measurement
> market are different than those in the personal computer market.)

That's true, Commodore can build what it needs for those cases. The 16
bit wide DMA driven hard disk controller on the 16 bit bus delivers
around 625K bytes/second with the Fast FileSystem. The Fast FileSystem
allows DMA from the hard disk directly to the target memory; no
intermediate buffers are used. I believe that any peripheral going this
fast wants DMA. It's fully extensible to a 32 bit machine, though at
_conservative_ 32 bit machine rates that's 2.5 megabytes/second
throughput (not even getting to things like burst transfers, which are
ideally suited to DMA). If your LAN is only going 2.5 megabits/sec,
that's certainly overkill and extra cost.
Which seems to make sense even today; most Amiga hard drives are DMA
driven, most Amiga LANs are CPU driven via shared RAM DMA.

> There is another thought, too -- if you have only one DMA device, you
> could argue that it shouldn't make much difference if it DMAs into
> system RAM or into a dual-ported buffer. If you have more than one
> device contending for the system bus, however, multiple dual-ported
> buffers are a clear win.

Not unless you have multiple CPUs to read them.

> Steve Rice

-- 
Dave Haynie  "The B2000 Guy"  Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh     PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"
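Dave's throughput scaling can be sketched in a few lines. The 4x factor
(2x for bus width, 2x for a conservative 32-bit machine's memory speed)
is an assumption read out of the post, as is treating "625K" as decimal
thousands; the Ethernet figure is the 10 Mbit/s rate Steve cites in his
reply.

```python
# Sketch of the scaling argument above.  The 4x factor (2x bus width,
# 2x for a conservative 32-bit machine's memory speed) is an assumption
# read out of the posts, as is treating "625K" as decimal thousands.

FFS_16BIT_RATE = 625_000             # bytes/s, A2090 + Fast FileSystem
rate_32bit = FFS_16BIT_RATE * 2 * 2  # ~2.5 megabytes/second

ethernet_rate = 10_000_000 / 8       # 10 Mbit/s Ethernet = 1.25 MB/s

print(rate_32bit)                    # 2500000, Dave's 2.5 MB/s figure
print(rate_32bit / ethernet_rate)    # 2.0 -- the disk outruns the LAN
```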
stever@videovax.Tek.COM (Steven E. Rice, P.E.) (04/01/88)
In article <3507@cbmvax.UUCP>, Dave Haynie (daveh@cbmvax.UUCP) writes:

> in article <4890@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:
>>
>> Dave Haynie's (daveh@cbmvax) most recent article was number
>> <3394@cbmvax.UUCP>. In it, he cast aspersions on the poor, struggling
>> LANCE and suggested that real systems do 32-bit DMA. Well, maybe --
>> but if you want to use Ethernet, the LANCE is about the only way to
>> go, slow or no!
>
> Calm down! That's not what I said. I said that very high
> bandwidth-consuming operations, such as hard disk interfacing, where the
> transfer between an I/O device and CPU-addressable main memory can be
> sent in large atoms, are best served by DMA, even in a 68020 or 68030
> system. I also said that in systems where transfers must occur in small
> atoms or at relatively slow speed (like perhaps networks or things which
> must be highly interactive), the I/O scheme to shared CPU memory was a
> good idea.

I think there is still some misunderstanding here. When I mention
dual-ported memories, I am speaking of memory that is "CPU addressable
main memory"! It just happens to also be shared (on a cycle-by-cycle
basis) with some other device, which could be an I/O device or another
CPU.

The Amiga implements a form of "shared" memory -- chip memory. The CPU
gets access to chip memory on a shared basis, arbitrated cycle by cycle.

Another form of "shared" memory is seen on the A2620 (?) card -- the
68020 CPU card. The 68020 will have 2 or 4 megabytes of 32-bit wide
memory which no one can deny it access to. Thus, if DMA is occurring to
"main" memory, the 68020 may not be blocked at all.

Carrying the idea one step further simply removes more limitations from
the system, giving the CPU unrestricted access to the system bus and
immediate access to any memory that is not in use during that memory
cycle.
>> In a perfect world, 32-bit DMA with a 512-byte assembly buffer and
>> fast-as-a-speeding-bullet burst transfers would be possible. In real
>> life, we have to make do with what we can buy. (Commodore can build
>> what it needs; the economics in the Television Test and Measurement
>> market are different than those in the personal computer market.)
>
> That's true, Commodore can build what it needs for those cases. The 16 bit
> wide DMA driven hard disk controller on the 16 bit bus delivers around 625K
> bytes/second with the Fast FileSystem. The Fast FileSystem allows DMA from
> the hard disk directly to the target memory; no intermediate buffers are
> used. I believe that any peripheral going this fast wants DMA. It's fully
> extensible to a 32 bit machine, though at _conservative_ 32 bit machine
> rates that's 2.5 megabytes/second throughput (not even getting to things
> like burst transfers, which are ideally suited to DMA). If your LAN is
> only going 2.5 megabits/sec, that's certainly overkill and extra cost.

Ethernet is 10 megabits/sec.

> Which seems to make sense even today; most Amiga hard drives are DMA driven,
> most Amiga LANs are CPU driven via shared RAM DMA.

In the case of Ethernet I/O, transmissions are packetized with quite a
bit of protocol overhead. Thus, the data to be transmitted must be
broken into chunks no larger than the largest legitimate packet and
shipped out one packet at a time. To do this, the CPU is going to have
to move the data anyway -- it has to configure it in a form the I/O
device can use. In this case, the copy from what you might consider
"main" memory to "shared" memory is free.

Starting with the FFS rate of 625K bytes/second and doubling that for a
32-bit bus gives 1.25 megabytes/second. This translates to a 10
megabit/second transfer rate, which is the same as the Ethernet. Using
your figure of 2.5 megabytes per second gives 20 megabits/second
throughput.
But our CPU bus bandwidth is about 100 megabits/second (approximately 330
nsec main memory cycle time [not *access* time -- *cycle* time]). Thus,
a 2.5 megabyte/second disk transfer would occupy only 20% of the bus
bandwidth. If the disk DMA is transferring into unshared main memory,
the CPU will just have to wait. At 2.5 megabytes/second (assuming 32-bit
transfers), the disk will request one memory access every 1.6
microseconds.

One possibility is to arbitrate for the bus for each transfer. Looking
at the timing diagrams in the Motorola 68020 manual, one finds that there
is a minimum of 1/2 clock period and a maximum of 1 clock period from the
end of clock state S5 until Bus Grant* is asserted. There is also a note
in paragraph 5.2.7.4 which says that "all asynchronous inputs to the
MC68020 are internally synchronized in a maximum of two cycles of the
system clock." This implies that the minimum time to resume processing
is 1 clock cycle. There is probably one additional cycle needed for the
CPU to resume driving the address and data lines.

Assuming a memory cycle time of 330 ns (which is what ours is) with 240
ns read or write access time, each 32-bit word transferred would hold the
CPU bus for one arbitration time (1/2 to 1 clock cycles, or 30 to 60 ns
in a 16.7 MHz system) plus one transfer time (240 ns) plus one bus
relinquishment time (1 to 2 clock cycles, or 60 to 120 ns) plus one
driver turnon time (1 clock cycle, or 60 ns). The minimum time required
would be 390 ns, the maximum time would be 480 ns, and the mean time
would be 435 ns.

435 ns out of 1.6 us is 27.2% of the bus bandwidth occupied. But not
only is 27.2% of the bus bandwidth occupied, the CPU is denied the bus
27.2% of the time! This translates directly into throughput reduction.

Another possibility is to block the data into (e.g.) 512 byte blocks and
then arbitrate for the bus once per block.
This drops the bus bandwidth occupation to 20% (since one arbitration is
insignificant compared to the time to transfer 512 bytes as 128 32-bit
words). But the CPU is still denied the bus 20% of the time.

If, however, the disk data is DMAed into dual-ported memory, it can deny
an access to the CPU a *maximum* of 20% of the time, and then only if the
CPU is fetching all of its instructions from the shared memory! In
actual operation, it is likely to be much less than that. There is also
no reason the receiving process cannot use the data directly from the
dual-ported memory, although in many cases there will be at least one
copy between initial transfer and use of the data.

>> There is another thought, too -- if you have only one DMA device, you
>> could argue that it shouldn't make much difference if it DMAs into
>> system RAM or into a dual-ported buffer. If you have more than one
>> device contending for the system bus, however, multiple dual-ported
>> buffers are a clear win.
>
> Not unless you have multiple CPUs to read them.

Given just a single hard disk transfer as you have described it, DMA into
a dual-port buffer avoids losing 20% of the CPU's processing capability.
That seems worthwhile to me!

					Steve Rice

-----------------------------------------------------------------------------
* Every knee shall bow, and every tongue confess that Jesus Christ is Lord! *
new: stever@videovax.tv.Tek.com
old: {decvax | hplabs | ihnp4 | uw-beaver}!tektronix!videovax!stever
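Steve's per-transfer overhead arithmetic can be checked mechanically.
All figures come from his post (16.7 MHz clock, 240 ns access time, one
32-bit transfer every 1.6 us at 2.5 megabytes/second); the model just
re-derives the 390/480/435 ns spread and the 27.2% occupancy.

```python
# Mechanical check of the per-transfer overhead arithmetic above.
# Figures from the post: 16.7 MHz clock (60 ns period), 240 ns access
# time, one 32-bit transfer every 1.6 us at 2.5 megabytes/second.

CLK = 60.0      # ns per clock at 16.7 MHz
ACCESS = 240.0  # ns read or write access time

def transfer_ns(arb_clks, release_clks, turnon_clks=1):
    # arbitration + transfer + bus relinquishment + driver turn-on
    return (arb_clks + release_clks + turnon_clks) * CLK + ACCESS

t_min = transfer_ns(0.5, 1)   # 390 ns
t_max = transfer_ns(1.0, 2)   # 480 ns
t_mean = (t_min + t_max) / 2  # 435 ns

occupancy = t_mean / 1600.0   # fraction of each 1.6 us slot

print(t_min, t_max, t_mean)       # 390.0 480.0 435.0
print(round(occupancy * 100, 1))  # 27.2 -- percent, as in the post
```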
daveh@cbmvax.UUCP (Dave Haynie) (04/12/88)
in article <4937@videovax.Tek.COM>, stever@videovax.Tek.COM (Steven E. Rice, P.E.) says:

> Another possibility is to block the data into (e.g.) 512 byte blocks and
> then arbitrate for the bus once per block. This drops the bus bandwidth
> occupation to 20% (since one arbitration is insignificant compared to the
> time to transfer 512 bytes as 128 32-bit words). But the CPU is still
> denied the bus 20% of the time.

First of all, with a better bus design (eg, not the current Amiga bus,
but perhaps a future version that's 32 bits wide), there's zero or very
near zero arbitration time; the bus's owner is determined dynamically on
a cycle by cycle basis.

Secondly, since the 68020 with cache running only wants the bus 50% or so
of the time, on average, you take your 20% figure and immediately reduce
it to 10%, on average. It could be as bad as 20%, it could be as good as
0%, depending on what the CPU is doing. Now we add a priority scheme.
If the CPU operation is more important, it gets the bus for any cycles it
needs, and the DMA device gets whatever it wants from the remaining 50%
of the bus.

And that's assuming that the bus is limited to CPU bus speeds. It's
pretty simple to make DMA devices run nybble or page mode cycles that the
CPU can't keep up with, and most memory systems can be designed with this
in mind for nearly free. So with DMA going with a nybble transfer,
you're now down to less than 5% of the bus bandwidth for that transfer.
VME and non-Apple NuBus both do things like this.

> Given just a single hard disk transfer as you have described it, DMA into
> a dual-port buffer avoids losing 20% of the CPU's processing capability.
> That seems worthwhile to me!

But you're still missing the point. The CPU has to stop what it's doing
to transfer the data by hand.
If it did that JUST as efficiently as the DMA device, you'd still be
losing whatever CPU time you claim is being eaten by the DMA transfer,
20% or whatever (keep in mind this 20% figure only applies during an
actual transfer). If the DMA transfer happens twice as fast as the CPU
could transfer the data, then I'm gaining in CPU speed, even though I'm
kicking the CPU off the bus for a while. DMA transfers on the Amiga bus
with a 68020 go twice as fast as the 68020 could possibly transfer them.
68000 based CPU transfers are more like 1/4th the speed of the DMA
device.

My point is that someone has to do the work of transfer unless you can
live with the data exactly where it's dumped in your shared memory
scheme. If you know there's no transfer required, share the memory, but
if there is, and especially if the memory can be used as is once it
reaches its destination (like NewFS), DMA wins.

There's actually a test case of this available in the Amiga world. As
I've already mentioned, the A2090 controller uses a FIFO and DMA to
complete its transfer, and achieves about 625K bytes/second. There's a
new SCSI controller out there, from a company called Great Valley
Peripherals, that uses an I/O chip DMA to shared RAM (4K of static RAM
on-board, so once you're in sync I suspect there will rarely be a
collision between the CPU and the peripheral chip). I don't have any
benchmarks on this new board, but I guarantee it'll be slower.

> Steve Rice

-- 
Dave Haynie  "The B2000 Guy"  Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh     PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"
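Dave's closing argument reduces to a toy model: even though DMA kicks
the CPU off the bus, the CPU loses less time overall whenever the DMA
device moves data faster than a copy loop would. The 2x DMA speedup and
the 50% CPU bus demand are the figures from his posts; the normalized
times are purely illustrative.

```python
# Toy model of the final argument: who loses more CPU time moving a
# block of data?  The 2x DMA speedup and the 50% CPU bus demand are the
# figures from the posts; the normalized times are illustrative.

def cpu_time_lost_copy(t_copy):
    # CPU copy loop: the CPU is fully busy for the whole transfer.
    return t_copy

def cpu_time_lost_dma(t_copy, speedup=2.0, cpu_bus_demand=0.5):
    # DMA finishes in t_copy / speedup; during that window the CPU only
    # stalls on the fraction of cycles where it actually wants the bus.
    return (t_copy / speedup) * cpu_bus_demand

print(cpu_time_lost_copy(1.0))  # 1.0
print(cpu_time_lost_dma(1.0))   # 0.25 -- a 4x cut in lost CPU time
```

Under these assumptions the DMA path costs the CPU a quarter of what the
copy loop does, which is the crux of the "DMA is still *FAST*er" side of
the thread.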