sandrock@uxe.cso.uiuc.edu (05/25/89)
I am interested in the pros and cons of DMA transfers in RISC systems. In particular, I am interested in the notion that the DECsystem 3100 has no DMA to its main memory, but instead relies upon the CPU to copy i/o buffers to/from an auxiliary memory. First, is this statement accurate? And second, if true, is this a reasonable tradeoff to make on a RISC system?

We are interested in the DECsystem 3100 (allegedly the same h/w as the DECstation) versus the MIPS M/120 as far as multiuser (32 simultaneous, say) performance. The load would likely be a mix of compute-bound and i/o-bound applications, including possibly Ingres and NFS-serving, along with various chemistry codes.

Also, wrt RISC systems, would we be better off segregating the interactive usage, i.e., editing, email, etc., from the compute-bound batch-mode jobs by running them on two separate systems? My thinking is that a high rate of context switching caused by interactive processes would tend to counteract the advantage of having large instruction and data caches on the machine.

Any pertinent advice is welcome, and I will summarize if need be.

Mark Sandrock
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
BITNET: sandrock@uiucscs          University of Illinois at Urbana-Champaign
Internet: sandrock@b.scs.uiuc.edu School of Chemical Sciences Computing Serv.
Voice: 217-244-0561               505 S. Mathews Ave., Urbana, IL 61801 USA
Home of the Fighting Illini of 'Battle to Seattle' fame. NCAA Final Four, 1989.
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
henry@utzoo.uucp (Henry Spencer) (05/27/89)
In article <46500067@uxe.cso.uiuc.edu> sandrock@uxe.cso.uiuc.edu writes:
>In particular I am interested in the notion that the DECsystem 3100 has
>no DMA to its main memory, but instead relies upon the CPU to copy i/o
>buffers to/from an auxiliary memory. First, is this statement accurate?
>And second, if true, is this a reasonable tradeoff to make on a RISC system?

Not infrequently, a fast, well-designed CPU can copy data faster than all but the very best DMA peripherals. The DMA device may still be a net win if it can use the memory while the CPU is busy elsewhere, giving worthwhile parallelism, but this depends on how hard the CPU works the memory. The bottleneck nowadays is usually memory bandwidth rather than CPU crunch, and caches aren't a complete solution, so DMA may end up stalling the CPU. If that happens, it's not clear that DMA is worth the trouble, especially since it's easier to design memory to serve only one master.

Having the CPU do the copying is not an obviously *un*reasonable idea. Much depends on the details.

DMA historically was more popular than auxiliary memory because memory was expensive. This is no longer true.
-- 
Van Allen, adj: pertaining to  | Henry Spencer at U of Toronto Zoology
deadly hazards to spaceflight. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
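To put rough numbers on the memory-bandwidth argument, a back-of-envelope comparison can be sketched in C as below. Every figure in it (clock rates, clocks per word) is an illustrative assumption, not a measurement of the 3100 or any other machine discussed in this thread.

    #include <stdio.h>

    /* Illustrative only: a hypothetical 20 MHz CPU copying a 4 KB block
     * word by word, versus a DMA engine stealing one bus cycle per word. */
    int main(void)
    {
        double cpu_hz      = 20e6;        /* assumed CPU clock        */
        double bus_hz      = 20e6;        /* assumed memory bus clock */
        long   block_bytes = 4096;
        long   words       = block_bytes / 4;

        /* CPU copy: assume ~6 clocks per word for a load/store/count loop. */
        double cpu_copy_us = words * 6.0 / cpu_hz * 1e6;

        /* DMA: assume one stolen bus cycle per word; the CPU only notices
         * these when it wants the bus at the same time (e.g. cache misses). */
        double dma_steal_us = words * 1.0 / bus_hz * 1e6;

        printf("CPU copy time per 4 KB block: %.0f us\n", cpu_copy_us);
        printf("bus time stolen by DMA:       %.0f us\n", dma_steal_us);
        return 0;
    }

Under these made-up numbers the DMA engine uses far fewer bus cycles, but whether the CPU actually notices them depends, as Spencer says, on how hard the CPU is working the memory at the same time.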
chris@softway.oz (Chris Maltby) (05/29/89)
In article <1989May26.170247.1165@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
> Having the CPU do the copying is not an obviously *un*reasonable idea.
> Much depends on the details.
> DMA historically was more popular than auxiliary memory because memory was
> expensive. This is no longer true.

Of course, there are many benefits that can be gained by having controllers with their own buffers. Disk drivers can stop worrying about rotational placement if the disk controller is providing whole tracks or cylinders at a time for no extra bus overhead. LAN drivers can avoid copying stuff like protocol headers etc into and out of main memory. Generally, the CPU can be a lot smarter about I/O than any brain-damaged microprocessor controlled device interface.
-- 
Chris Maltby - Softway Pty Ltd  (chris@softway.sw.oz)

PHONE:  +61-2-698-2322          UUCP:           uunet!softway.sw.oz.au!chris
FAX:    +61-2-699-9174          INTERNET:       chris@softway.sw.oz.au
rec@dg.dg.com (Robert Cousins) (05/30/89)
In article <46500067@uxe.cso.uiuc.edu> sandrock@uxe.cso.uiuc.edu writes:
>
>I am interested in the pros and cons of DMA transfers in RISC systems.
>In particular I am interested in the notion that the DECsystem 3100 has
>no DMA to its main memory, but instead relies upon the CPU to copy i/o
>buffers to/from an auxiliary memory. First, is this statement accurate?

Yes, there is no DMA in the traditional form on the PMAX.

>And second, if true, is this a reasonable tradeoff to make on a RISC system?

IMHO, no. When designing the AViiON 88K-based workstations, we found that it is possible to provide the increased performance from DMA at a low cost (lower than the 3100).

There are some basic requirements, IMHO, which must be met to be considered state-of-the art: I/O which does not involve the CPU for moving every byte, graphics which does not require total CPU dedication for normal operations such as line drawing or bit blitting, dedicated LAN controllers to handle the low levels of the LAN protocol, and a number of similar minimums which most new machines have. It is interesting, however, to notice the number of machines which do not meet up with even the most basic criteria.

I make this point to begin discussion. What are some of the minimum standards which should be applied to these classes of machines, and which machines fail to meet them?

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
rec@dg.dg.com (Robert Cousins) (05/31/89)
In article <1552@softway.oz> chris@softway.oz (Chris Maltby) writes:
>In article <1989May26.170247.1165@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>> Having the CPU do the copying is not an obviously *un*reasonable idea.
>> Much depends on the details.
>> DMA historically was more popular than auxiliary memory because memory was
>> expensive. This is no longer true.
>
>Of course, there are many benefits that can be gained by having controllers
>with their own buffers. Disk drivers can stop worrying about rotational
>placement if the disk controller is providing whole tracks or cylinders
>at a time for no extra bus overhead. LAN drivers can avoid copying stuff
>like protocol headers etc into and out of main memory.

The same or similar tricks can generally be played using DMA. However, there are certain penalties paid for using buffers:

1. Additional latency -- effectively, disk or LAN devices perform DMA operations into their own buffers. After this, the CPU must perform a copy into main memory. Since these peripheral buffers are not cached (or if they are, then there is no excuse for not copying into main memory to begin with), the copy will be more expensive. There are already several versions of Unix which page programs directly from disk to user code space. The use of a dedicated buffer will substantially slow this down. Future versions of Unix may choose to take advantage of these features in greater ways for performance enhancements. The bottom line is that this approach requires an additional copy, which can make CPU latency a problem.

2. Buffer size -- provision of a private buffer for a peripheral implies that the driver must now manage the buffer memory. Since certain classes of peripherals such as Ethernet can have semi-continuous traffic, this management must be timely and efficient. The CPU must be able to drain the buffer in a short period of time (which can be a problem under standard Unix due to the design of the dispatcher). The easiest way to handle this is to provide a LARGE buffer to store the data. So, at this point in time, one must ask oneself: "Would I rather have 4 megabytes of dedicated LAN buffer or 4 megabytes of additional main memory?" Most people would rather have the main memory.

3. Architectural generality -- there are a variety of cases where having the data "beamed down" into main memory is useful though strictly not required. In tightly coupled multiprocessors (TCMPs) it is convenient to avoid excessive data movement and to simplify the driver to minimize the time in which a particular device's code is single threaded.

The real reason why some machines avoid DMA is CPU braindamage. Many CPUs are either poorly cached (causing them to demand too much bus bandwidth and therefore suffer major performance loss when minor peripherals begin to take bus cycles) or have defective architectures which do not support cache coherency (or at least do not support it effectively). Some examples of the first include some of the low end microprocessors which can take 100% of the bus bandwidth for extended periods of time. Some examples of the second include some of the higher end microprocessors with on-chip caches or cache controllers.

A number of DMA buffer workarounds have been used over the years. One favorite hack is to provide a hole in the cache coverage so that some areas of memory are not cached. In one form or another almost every system provides for this. Sometimes it is on a page by page basis (the 88K, for example). Others create a dedicated area of memory for it (MIPS).

>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>microprocessor controlled device interface.

However, just remember that you are throwing MIPS away doing the copying. I would rather have a $5 DMA controller spending the time than my high powered CPU. Sure, it works to use the CPU to do the copying, but when you realize the amount of time the CPU may be forced to spend because of the copy (including extra interrupt service, context switches, polling loops, cache flushes, etc.), it often turns out that a DMA controller can provide the user with VERY CHEAP MIPS by freeing up the CPU. It is this logic which allows people to avoid using graphics processors in workstations by saying "the CPU is fast therefore I don't need one."

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
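As a concrete illustration of penalty 1 above, a minimal C sketch of the extra pass over the data follows. The pointer names and the flat byte-at-a-time loop are hypothetical; a real driver would add buffer management, locking, and error handling.

    #include <stddef.h>

    /* Hypothetical pointers, for illustration only. */
    volatile unsigned char *lan_buffer;   /* controller's private, uncached RAM */
    unsigned char          *main_memory;  /* destination buffer in main memory  */

    /* With an on-board buffer, the device deposits the packet in its own RAM
     * and the CPU must then copy it into main memory -- an extra pass over
     * the data, made with uncached (slow) reads. */
    void drain_packet(size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            main_memory[i] = lan_buffer[i];
    }

    /* With DMA to main memory, the controller would deposit the packet at
     * main_memory directly, and the CPU would not touch the bytes until the
     * protocol code actually needed them. */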
henry@utzoo.uucp (Henry Spencer) (05/31/89)
In article <181@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>There are some basic requirements, IMHO, which must be met to be considered
>state-of-the art: I/O which does not involve the CPU for moving every
>byte, graphics which does not require total CPU dedication for normal
>operations such as line drawing or bit blitting, dedicated LAN controllers
>to handle the low levels of the LAN protocol...
>
>I make this point to begin discussion. What are some of the minimum
>standards which should be applied to these classes of machines and which
>machines fail to meet them?

The three obvious ones are:

1. A serious assessment of what performance in each of these areas is
   necessary to meet the machine's objectives, and what fraction of
   the CPU would be necessary to do so with "dumb" hardware. It is
   not likely to be cost-effective to add hardware to save 1% of
   the CPU. 10% might be a different story. 50% definitely is.

2. A serious assessment of the overheads of adding smart hardware, like
   the extra memory bandwidth it eats and the software hassles that
   all too often are necessary.

3. A serious assessment of whether the added performance can be had in a
   more versatile and cost-effective way by just souping up the CPU.

Simply saying "we've got to have smart i/o, and smart graphics, and smart networks" without justifying this with numbers is marketingspeak, not a sound technical argument. As RISC processors have shown us, taking things *out* of the hardware can result in better systems.
-- 
You *can* understand sendmail,  | Henry Spencer at U of Toronto Zoology
but it's not worth it. -Collyer | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
aglew@mcdurb.Urbana.Gould.COM (06/01/89)
>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>microprocessor controlled device interface.

Smarter and faster. One of the big problems with smart I/O is that it is done using slow microprocessors 1 or 2 generations old. Now, if your smart I/O cards (1) run the latest, greatest processors (which requires a big development commitment) and (2) share software with the "standard" UNIX, so that you don't throw out your software investment when you upgrade I/O cards - or even move the functionality back to the CPU for some models - then you may have got something...

Of course, some people prefer symmetric multiprocessing...
rec@dg.dg.com (Robert Cousins) (06/01/89)
In article <1989May31.163057.543@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <181@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>>There are some basic requirements, IMHO, which must be met to be considered
>>state-of-the art: I/O which does not involve the CPU for moving every
>>byte, graphics which does not require total CPU dedication for normal
>>operations such as line drawing or bit blitting, dedicated LAN controllers
>>to handle the low levels of the LAN protocol...
>>
>>I make this point to begin discussion. What are some of the minimum
>>standards which should be applied to these classes of machines and which
>>machines fail to meet them?
>
>The three obvious ones are:
>
>1. A serious assessment of what performance in each of these areas is
>   necessary to meet the machine's objectives, and what fraction of
>   the CPU would be necessary to do so with "dumb" hardware. It is
>   not likely to be cost-effective to add hardware to save 1% of
>   the CPU. 10% might be a different story. 50% definitely is.

Your point is well taken. However, it is clear from reading about some products on the market that the sum total of the penalty considered in the calculations is the number of bus cycles required to do the copy. In fact, many of these copies will take place with interrupts disabled or at some spl() level which restricts some interrupts. Secondly, some copies will require additional interrupt service, which will require additional overhead for context switches. Thirdly, some of the management will require task activation, which can take a long time under traditional Unix. When these are considered, the cost function should be computed as

	               Cost of smarter peripherals
	cost/MIPS = -----------------------------------
	            MIPS freed up by better peripherals

If this number comes out lower than the cost per MIPS of your CPU, the odds are that you would be better off using the smarter peripherals.

>2. A serious assessment of the overheads of adding smart hardware, like
>   the extra memory bandwidth it eats and the software hassles that
>   all too often are necessary.

As a rule, smarter peripherals require less memory bandwidth than dumb ones. For example, copying data from a dedicated disk buffer to main memory using software entails not only the instruction bandwidth but also a read and a write for each word, or two bus cycles per word. DMA, on the other hand, should be able to perform the same operation with a single cycle per word.

As for the software hassles, I've written drivers for both smart and dumb devices. It is true that some classes of smart devices can be more difficult to program than their dumb counterparts. However, in my experience, this is not the rule but the exception. In fact, if you graph the software hassle factor (if one can truly be quantified), my experience shows that as hardware goes from brain dead to genius the curve is "U" shaped. Managing very stupid hardware can be as difficult as managing the most sophisticated.

>3. A serious assessment of whether the added performance can be had in a
>   more versatile and cost-effective way by just souping up the CPU.

Agreed. As was pointed out in the equation above, the real issue is getting the end user the most bang for the buck.

>Simply saying "we've got to have smart i/o, and smart graphics, and smart
>networks" without justifying this with numbers is marketingspeak, not
>a sound technical argument. As RISC processors have shown us, taking things
>*out* of the hardware can result in better systems.

I disagree. There are certain requirements for a product to be considered useful. It is possible to design 500 MHz Z80s. There are also a number of users who would find this an attractive product, though many people would mutter that this is a waste of technology. There are certain things which users have a right to demand: quality software, state of the art hardware, reasonable performance for the price, dependable support. I would suggest that provision of non-braindead peripherals is in this class almost (but not quite) by default.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
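For concreteness, the cost/MIPS comparison Cousins describes can be worked through with made-up numbers; everything below (prices, MIPS ratings, the 20% programmed-I/O overhead) is invented for illustration and is not taken from any of the postings.

    #include <stdio.h>

    /* Worked example of the cost-per-deliverable-MIPS figure of merit.
     * All prices and percentages here are invented for illustration. */
    int main(void)
    {
        double cpu_cost = 1000.0;   /* assumed CPU cost, dollars            */
        double cpu_mips = 12.0;     /* assumed raw CPU throughput           */
        double dma_cost = 50.0;     /* assumed cost of a dumb DMA channel   */
        double pio_loss = 0.20;     /* CPU fraction eaten by programmed I/O */

        double pio = cpu_cost / (cpu_mips * (1.0 - pio_loss));
        double dma = (cpu_cost + dma_cost) / cpu_mips;

        printf("programmed I/O: $%.0f per deliverable MIPS\n", pio);
        printf("dumb DMA:       $%.0f per deliverable MIPS\n", dma);
        return 0;
    }

With these particular numbers the DMA channel wins ($88 versus $104 per deliverable MIPS); with a cheap enough CPU or a small enough I/O overhead the answer flips, which is exactly the tradeoff being argued in this thread.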
rcd@ico.ISC.COM (Dick Dunn) (06/02/89)
> There are some basic requirements, IMHO, which must be met to be considered
> state-of-the art: I/O which does not involve the CPU for moving every
> byte, graphics which does not require total CPU dedication for normal
> operations such as line drawing or bit blitting, dedicated LAN controllers
> to handle the low levels of the LAN protocol and a number of similar
> minimums which most new machines have...

Why are these requirements? At some point in the past, for some set of architectural constraints, smart DMA was an improvement over having the CPU move the data. *However*, remember that smart DMA was not the goal! The goal was to speed up I/O, and you can only substitute the goal of smart DMA if the numbers work right--that is, if you get enough performance gain to justify the cost of the DMA controller, dual-porting the memory (or putting it on the bus and making the bus fast enough), etc.

Similar arguments go for the other two putative requirements. For example, if it's going to make sense to have a separate bit-blitter, you have to be able to set it up quickly. If the setup time is longer than the time it would take the CPU to draw the typical line or blt the typical bits, you haven't gained anything. Even if the setup is fast, you have to be able to do something useful with the CPU--which is likely to mean that you have to be able to do a blazingly fast context switch to another process while the drawing is going on.

Folks working on networking here have found that it tends to be easier and faster to run "host-based" TCP than to deal with "smart" boards.

>...It is interesting, however, to
> notice the number of machines which do not meet up with even the most
> basic criteria.

But the criteria you've given are artificial...they come not from the direct goals (such as performance in a particular application) but from derived goals based on certain assumptions of how to increase performance. I suggest that the problem is *not* that these machines are deficient, but that the assumptions are wrong.

> I make this point to begin discussion. What are some of the minimum
> standards which should be applied to these classes of machines and which
> machines fail to meet them?

I suggest that we proceed with the discussion by looking at the machines as black boxen for a bit--establish the standards based on WHAT you want the machine to do, not HOW it gets it done.
-- 
Dick Dunn     UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...CAUTION: I get mean when my blood-capsaicin level gets low.
peter@ficc.uu.net (Peter da Silva) (06/02/89)
In article <15809@vail.ICO.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
> if it's going to make sense to have a separate bit-blitter, you have to be
> able to set it up quickly. ... Even if the setup is fast, you have to be
> able to do something useful with the CPU--which is likely to mean that you
> have to be able to do a blazingly fast context switch...

Which is not out of the question. There is plenty of room for improvement in this department in most operating systems (understatement of the century).
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.
andrew@frip.WV.TEK.COM (Andrew Klossner) (06/03/89)
[]

	"basic requirements ... which must be met to be considered
	state-of-the art ... [include] dedicated LAN controllers to
	handle the low levels of the LAN protocol ..."

This can backfire on you. I've seen more than one example of a very smart LAN interface board which actually slowed down system throughput, because its Chevy on-board processor couldn't do nearly as fast a job as the Formula 1 dragcar that was the main CPU, and the single active process was blocked waiting for LAN completion.

Ranging a bit, I've also dealt with systems with a fast main CPU, a SCSI channel, and a wimpy Z8 in the on-disk SCSI controller. Yep, the Z8 was the overall system bottleneck -- lots of time wasted while it slooooowly processed all the messages that SCSI bus master and slave must exchange.

If you're going to buy into off-CPU agents to move I/O around, make sure that those agents will improve as fast as the CPU, or your future generation machines will be crippled.

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]
seanf@sco.COM (Sean Fagan) (06/03/89)
In article <28200325@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>>Generally, the CPU can be a lot smarter about I/O than any brain-damaged
>>microprocessor controlled device interface.
>Smarter and faster. One of the big problems with smart I/O is that it is done
>using slow microprocessors 1 or 2 generations old. Now, if your smart I/O
>cards (1) run the latest, greatest processors (which requires a big
>development commitment) and (2) share software with the "standard" UNIX,
>so that you don't throw out software investment when you upgrade I/O cards
>- or even move the functionality back to the CPU for some models -
>then you may have got something... Of course, some people prefer symmetric
>multiprocessing...

Well, more than 20 years ago, a machine was built which had smart I/O processors. Just for the sake of fun, let's call the central processor a "CP," and the I/O processors "PP"'s. Fun, huh? Now, the "CP" was 60-bits, had something like 70-odd instructions, and was a load-store/3-address design. The "PP"'s were 12-bit, accumulator-based machines, also with a small instruction set. With each "CP" you got at least 10 "PP"'s. Incidentally, the "PP"'s were a barrel processor: each set of 10 had only 1 ALU.

This machine *screamed*. It had, for the time, an incredibly fast processor (the "CP"), which, even today, will outperform things like Elxsi's. With the "PP"'s, it even had I/O that causes it to outperform most of today's mainframes, at a fraction of the price. True, it didn't run UNIX(tm) (although I did a paper design of what it would take), but, if you need the speed, it doesn't always matter, does it?

For those of you who haven't guessed, the machine was the CDC Cyber, designed (chiefly) by Seymour Cray (God). The machine I played on mostly was a Cyber 170/760, which was estimated at about 10 MIPS or so, and could support hundreds of people, all doing "real" work (database, editing, compiling, etc.). As I, and others, keep trying to say, MIPS are fine, but can it do I/O?
-- 
Sean Eric Fagan  | "[Space] is not for the timid."
seanf@sco.UUCP   |    -- Q (John deLancie), "Star Trek: TNG"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
sauer@dell.dell.com (Charlie Sauer) (06/03/89)
In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>Well, more than 20 years ago, a machine was built which had smart I/O ...
>...
>For those of you who haven't guessed, the machine was the CDC Cyber,
>designed (chiefly) by Seymour Cray (God). The machine I played on mostly
>was a Cyber 170/760, ...

Until I got to the punch line, I was sure you were going to say "CDC 6600," which was the first of that series of machines. I'm at home and can't lay my hands on Thornton's book ("Design of the CDC 6600" I think was the title), and I didn't actually use a 6600 until 1970, but it seems like you would have to say 6600, or at least 7600, to make the "more than 20 years" accurate.
-- 
Charlie Sauer  Dell Computer Corp.     !'s:cs.utexas.edu!dell!sauer
               9505 Arboretum Blvd     @'s:sauer@dell.com
               Austin, TX 78759-7299
               (512) 343-3310
brooks@vette.llnl.gov (Eugene Brooks) (06/04/89)
In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>because its Chevy on-board processor couldn't do nearly as fast a job
>as the Formula 1 dragcar that was the main CPU, and the single active

For the record, a CHEVY won INDY this year! That meager '71 Vette of mine, with truly OBSOLETE CHEVY power, is rarely challenged on the roads of California either.

brooks@maddog.llnl.gov, brooks@maddog.uucp
seanf@sco.COM (Sean Fagan) (06/05/89)
In article <1429@dell.dell.com> sauer@dell.UUCP (Charlie Sauer) writes:
>In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>>Well, more than 20 years ago, a machine was built which had smart I/O ...
>>...
>>For those of you who haven't guessed, the machine was the CDC Cyber,
>>designed (chiefly) by Seymour Cray (God). The machine I played on mostly
>>was a Cyber 170/760, ...
>
>Until I got to the punch line, I was sure you were going to say "CDC 6600,"
>which was the first of that series of machines.
>but it seems like you would have
>to say 6600, or at least 7600, to make the "more than 20 years" accurate.

For all intents and purposes, they're the same machine. However, the 760 is *much* faster than the 6600 (the 760 is the second fastest 170 machine; the fastest being one that has 2 processors and an extra 3 bits of addressing [for the OS, not user]). They have the same architecture, but I never played on a 6600, so I had to use what I knew. However, what I wrote is still true for the 6600.

Also, note that I said "the machine was the CDC Cyber," but that the model "I played on mostly" was the 760...
-- 
Sean Eric Fagan  | "[Space] is not for the timid."
seanf@sco.UUCP   |    -- Q (John deLancie), "Star Trek: TNG"
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
snoopy@sopwith.UUCP (Snoopy) (06/05/89)
In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes: |For those of you who haven't guessed, the machine was the CDC Cyber, |designed (chiefly) by Seymour Cray (God). The machine I played on mostly |was a Cyber 170/760, which was estimated at about 10 MIPS or so, and could |support hundreds of people, all doing "real" work (database, editing, |compiling, etc.). As I, and others, keep trying to say, MIPS are fine, but |can it do I/O? I remember an 11/70 doing 1/3 the number of jobs of a pair of CDC 6500s. It supported ~50 users with *very* fast response, faster baud rate terminals (19.2k vs 300/1200), upper/lower case vs. the 6500's upper-case only, and of course Unix. To this day, I haven't used a machine with multi-user response that's even close. It's not what you have, it's what you do with it. _____ .-----. /_____\ Snoopy ./ RIP \. /_______\ qiclab!sopwith!snoopy | | |___| parsely!sopwith!snoopy | tekecs | |___| sun!nosun!illian!sopwith!snoopy |_________| "I *was* the next man!" -Indy
rec@dg.dg.com (Robert Cousins) (06/05/89)
In article <3480@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner) writes:
>[]
>	"basic requirements ... which must be met to be considered
>	state-of-the art ... [include] dedicated LAN controllers to
>	handle the low levels of the LAN protocol ..."
>
>If you're going to buy into off-CPU agents to move I/O around, make
>sure that those agents will improve as fast as the CPU, or your future
>generation machines will be crippled.

I agree. However, I fear that some people misunderstood my point concerning the "RDA of hardware support." There are a number of ways to produce brain-damaged hardware. For example, Seeq makes an Ethernet controller chip which requires external DMA support. If a dumb DMA channel (or no DMA) is used, the lowest levels of software will end up being exceptionally complex, since all of the buffer management and scatter/gather will be in software. There is also the danger of dropping packets on the floor, which has nasty implications for performance. :-) If, however, some slightly more reasonable DMA is supplied (similar to the LANCE or Intel chips), the software complexity drops substantially. While I never intended my comments to imply INTELLIGENT control, it is worthwhile to add it to the discussion.

At DG, our experience is that it is possible to provide DMA services at prices below competitive non-DMA products. Does this mean that the DMA products run faster than the non-DMA ones? Often the peripherals are the limiting factor. However, the following analysis may be enlightening:

Scenario one: Programmed I/O.

Given that a disk channel will be averaging 200K bytes/second in 4K byte bursts 20 milliseconds apart using a 1 megabyte/second SCSI channel, the time required to transfer the data will be SCSI limited (given a CPU of > ~3 MIPS). However, since each byte takes 1 microsecond, the CPU will be forced to be dedicated to the SCSI channel for 4 milliseconds each tenure, 50 times per second, for a total of 200 milliseconds each second. This has cost the user 20% of the available computing power.

Scenario two: Small dedicated buffer.

The buffer is 4K bytes long, so the processor is no longer required to make as timely a response as above. The real issue is now the copy time, of which there are two components: transfer time and context overhead. The transfer time will be limited by the memory/cache/CPU bottleneck. Since the buffer is not cacheable (by implication), half of the transfer will involve a bus cycle in all cases. Given a minor penalty of 4 or 5 instruction periods for this half, and assuming a cache hit on the other side always, the code will look something like this:

	ld	r1,4096
	ld	r2,bufferaddress
	ld	r3,destaddress
loop:	ldb	r4,(r2)		/ byte load = 4 clocks for miss
	stb	r4,(r3)		/ byte store = 1 for cache hit
	addi	r2,1		/ 1
	addi	r3,1		/ 1
	addi	r1,-1		/ 1
	brnz	loop		/ 1 (code could be reorg'd)

	Total clocks required: 9*4096 = 36864 per block
	                       * 50 blocks/second = 1843200 clocks

Given a CPU speed of 20 MHz, this translates into 9% of the CPU time.

If the CPU is required to perform the copy during an interrupt service, there is the danger that lower priority interrupts may be lost. If the copy takes place in the top half of the driver, then task latency becomes an issue. The buffer will not be drained until after the task wakes up and completes the copy. On some Unix implementations, the task wake-up time can be long -- enough to impact total throughput.

Scenario three: Stupid DMA.

Here, the CPU just sets up the DMA and awaits completion. The overhead is approximately 0 compared to the above examples.

Where does the DMA pay off, given that all three examples have approximately identical throughput?

DMA is preferable to the first choice whenever the cost of DMA is less than 20% of the cost of the CPU, or less than the cost of speeding up the CPU by 20%.

DMA is preferable to the second choice whenever the cost of DMA is less than 9% of the cost of the CPU, or less than the cost of speeding up the CPU by 9%.

I am the first to admit that these models are simplistic, but they do represent valid considerations and reasonable approximations to the actual solutions.

Comments?

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.
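A reader who wants to check the arithmetic of the three scenarios can reproduce it with the short C program below; the clock counts and transfer rates are simply the assumptions stated in the posting (20 MHz CPU, 4 KB blocks, 50 blocks per second).

    #include <stdio.h>

    /* Reproduces the rough arithmetic of the three scenarios above. */
    int main(void)
    {
        double cpu_hz  = 20e6;    /* CPU clock            */
        double blk     = 4096.0;  /* bytes per transfer   */
        double per_sec = 50.0;    /* transfers per second */

        /* Scenario one: programmed I/O at 1 us per byte on the SCSI channel. */
        double pio_frac = blk * 1e-6 * per_sec;

        /* Scenario two: byte copy out of an uncached buffer, ~9 clocks/byte. */
        double buf_frac = blk * 9.0 * per_sec / cpu_hz;

        printf("programmed I/O:   %4.1f%% of the CPU\n", pio_frac * 100.0);
        printf("dedicated buffer: %4.1f%% of the CPU\n", buf_frac * 100.0);
        printf("dumb DMA:          ~0%%  (setup only)\n");
        return 0;
    }

This prints roughly 20% and 9%, the figures used in the two comparisons above.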
sauer@dell.dell.com (Charlie Sauer) (06/05/89)
In article <2822@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>For all intents and purposes, they're the same machine. However, the 760 is
>*much* faster than the 6600 (the 760 is the second fastest 170 machine; the
>fastest being one that has 2 processors and an extra 3 bits of addressing
>[for the OS, not user]).

I know I was being picky, but since I'm stuck in that vein, let me offer a slightly more substantive quibble, slightly more relevant to the original topic reflected in the subject line: Isn't it true that the 6600 and the 7600 differed in that the PP's were all peers in the 6600 but the 7600 had one PP that had authority over the others?
-- 
Charlie Sauer  Dell Computer Corp.     !'s:cs.utexas.edu!dell!sauer
               9505 Arboretum Blvd     @'s:sauer@dell.com
               Austin, TX 78759-7299
               (512) 343-3310
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/06/89)
In article <2819@scolex.sco.COM> seanf@scolex.UUCP (Sean Fagan) writes:
>In article <28200325@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>>Smarter and faster. One of the big problems with smart I/O is that it is done
>>using slow microprocessors 1 or 2 generations old. Now, if your smart I/O
>Well, more than 20 years ago, a machine was built which had smart I/O
>processors. Just for the sake of fun, let's call the central processor a
>"CP," and the I/O processors "PP"'s. Fun, huh? Now, the "CP" was 60-bits,
>had something like 70-odd instructions, and was a load-store/3-address
>This machine *screamed*. It had, for the time, an incredibly fast processor
>(the "CP"), which, even today, will outperform things like Elxsi's. With
>For those of you who haven't guessed, the machine was the CDC Cyber,

There were two rather distinct flavors of Cybers. The main distinguishing feature was the kind of peripheral processors the machine had. (A hybrid, the Cyber 176, could accept both kinds.)

The PP's in the 7600 ("upper Cyber") did *not* write to any designated memory, but instead to dedicated memory locations, just like, in effect, the on-board buffers previously referred to in some postings. These PP's would interrupt the CPU *every few hundred words* of I/O to *copy* the data from one group of memory locations to another. Strangely enough, this gave the ~20 VAX MIPS (please, let's argue about the exact rating off-line) 7600 (which was about 2X the similar "lower Cyber" 760) the fastest I/O around of any commercial machine for many years. So, is DMA a "good idea"? - it depends. Overall, you may get more performance for your dollar that way.

The PP's on the lower Cybers (6600-Cyber 760) could write to any memory location, and could make your coffee for you in the morning too. The NOS operating system had major pieces in the PP's, and PP saturation was the usual bottleneck, rather than CPU saturation. You could keep about twice as many people happy per CPU "MIP" (whatever that is) on the Cybers, because of all the hidden MIPS in the PP's, compared with, say, a VAX-11/7xx or similar machines.

So, within the Cyber line, you had examples of both extremes: a machine where the CPU had to copy data (7600), and machines where not only smart I/O but other major pieces of the system ran in the PP's. Both architectures worked successfully, especially with respect to performance, and only came to grief after many years over the weird word size and lack of virtual memory and memory address space.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035
  Phone:  (415)694-6117
martyi@sun.Eng.Sun.COM (Marty Itzkowitz) (06/06/89)
The 6600 and 7600 differed in a number of respects, one of which was the PPU architecture.

On the 6600, all of the PPs were equivalent, although the OS (at least the version developed at LBL) treated them differently. Every PPU could access every channel, could read and write anywhere in central memory, and could exchange-jump (context switch) the CPU. On later, 20-PPU versions, each set of 10 PPUs shared a set of channels, and I don't believe one could talk to the other set's channels. The CPU on the 6600 could NOT do its own context switches, and system calls were handled by placing the request in a known location relative to the process (job, task) address space (word 1, actually). The monitor PPU, so designated by software, checked these words, and then assigned one of the other PPUs to process a request. Later versions of the machine did have a central processor exchange jump instruction. A two-CRT display, with refresh done entirely in SW, was managed by one of the PPUs.

On the 7600, there were several types of PPUs. PPU zero, also known as the MCU, or maintenance and control unit, could read and write anywhere in central (small core) memory, and could send stuff on channels to the other PPUs. It could also do (force) an exchange jump in the CPU. The other PPUs came in either high- or low-speed versions. The high-speed ones worked in pairs, and shared a common external channel to a disk, for example, and a single channel to central memory. Each pair's channel went to a specific hard-wired buffer in central memory, and generated a CPU interrupt (exchange jump) whenever the buffer was half full/empty, or when it executed a specific instruction to do so. The CPU managed copying the data out of the hard-wired buffer, typically into large core memory, since that was the fastest path, and telling the PPUs when it was OK to send more data. The CPU could reset its buffer to a pair of PPUs and generate an interrupt to them. A high-speed buffer was 400 (octal) words, and a disk sector was 1000 (octal) words, so the CPU got 4 interrupts for each disk sector. High-speed PPUs also had channels connecting the pair, so that, with much cleverness, one could actually stream data at close to 40 Mb/s from disk, with one PPU reading the disk and the other dumping a previously read sector to CM. On the 819 disks, one had about 8 microseconds between sectors to avoid missing revs, requiring a hand-off between the PPUs of the pair and the CPU. On LBL's system, we could do it in time. Slow PPUs, such as needed for a hyperchannel, worked as individuals, and had a half-size buffer, again with handshaking between PPU and CPU at the half-way mark.

The CPU on the 7600 did have an exchange jump instruction. Non-privileged tasks could only exchange to an address given in their XJ package (some 16 60-bit words); on error, the exchange went to a second address (Normal Exch. Addr and Error Exch. Addr, respectively). Privileged tasks, i.e., the OS, could exchange to anywhere. A context switch took 28 clocks of 27.5 nanosec. each, counting from the time the XJ instruction was the next to issue to the time the first instruction from the new context could issue.

For scalar arithmetic, the 7600 was the fastest machine in the world until the Cray-2.

	Marty Itzkowitz, Sun Microsystems
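The half-full/half-empty handshake described above is the classic double-buffering scheme. The C sketch below is a generic, much-simplified illustration of that idea, not actual 7600 PPU or CPU code; the buffer size and flag names are invented.

    #define HALF_WORDS 256            /* one half of a hard-wired buffer (invented size) */

    /* Filled alternately by the I/O processor; drained by the CPU. */
    volatile long io_buffer[2][HALF_WORDS];
    volatile int  half_ready[2];      /* set by the I/O side, cleared by the CPU */

    /* CPU side: on each "half full" interrupt, copy out the completed half
     * while the I/O processor keeps filling the other one. */
    void half_full_interrupt(int half, long *dest)
    {
        int i;
        for (i = 0; i < HALF_WORDS; i++)
            dest[i] = io_buffer[half][i];
        half_ready[half] = 0;         /* tell the I/O side it may reuse this half */
    }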
chris@softway.oz (Chris Maltby) (06/06/89)
In article <182@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
> However, just remember that you are throwing MIPS away doing the copying.
> I would rather have a $5 DMA controller spending the time than my high
> powered CPU. Sure, it works to use the CPU to do the copying, but when
> you realize the amount of time the CPU may be forced to spend because of
> the copy (including extra interrupt service, context switches, polling
> loops, cache flushes, etc.), it often turns out that a DMA controller
> can provide the user with VERY CHEAP MIPS by freeing up the CPU. It
> is this logic which allows people to avoid using graphics processors
> in workstations by saying "the CPU is fast therefore I don't need one."

Without rejecting anything you said, let me point out that the opposite logic can also apply: why install special-purpose I/O intelligence if you can only use it for I/O? A general purpose (extra, perhaps) CPU can do all that I/O nonsense as well as other good things.

I guess it all depends on what you want the machine to do best. Select the criteria - then design the machine. At this point we should adopt Mr Mashey's approach... measure, then draw conclusions on actual data.
-- 
Chris Maltby - Softway Pty Ltd  (chris@softway.sw.oz)

PHONE:  +61-2-698-2322          UUCP:           uunet!softway.sw.oz.au!chris
FAX:    +61-2-699-9174          INTERNET:       chris@softway.sw.oz.au
chris@mimsy.UUCP (Chris Torek) (06/07/89)
In article <185@dg.dg.com> rec@dg.dg.com (Robert Cousins) writes:
>Scenario two: Small dedicated buffer.
>
>The buffer is 4K bytes long, so the processor is no longer required
>to make as timely a response as above. The real issue is now the
>copy time, of which there are two components: transfer time and
>context overhead. The transfer time will be limited by the memory/
>cache/CPU bottleneck. Since the buffer is not cacheable (by implication),
>half of the transfer will involve a bus cycle in all cases. Given a minor
>penalty of 4 or 5 instruction periods for this half, and assuming a cache
>hit on the other side always, the code will look something like this:
>
>	ld	r1,4096
>	ld	r2,bufferaddress
>	ld	r3,destaddress
>loop:	ldb	r4,(r2)		/ byte load = 4 clocks for miss
>	stb	r4,(r3)		/ byte store = 1 for cache hit
>	addi	r2,1		/ 1
>	addi	r3,1		/ 1
>	addi	r1,-1		/ 1
>	brnz	loop		/ 1 (code could be reorg'd)
>
>	Total clocks required: 9*4096 = 36864 per block
>	                       * 50 blocks/second = 1843200 clocks

This is not an unreasonable approach to analysing the time required for the copies, but the code itself *is* unreasonable---it is more likely to be something like

	ld	r1,4096/4
	lea	r2,dual_port_mem_addr
	lea	r3,dest_addr
loop:	ld	r4,(r2)		/ 4-byte load ...
	st	r4,(r3)		/ 4-byte store
	addi	r2,4
	addi	r3,4
	addi	r1,-1
	brnz	loop

which is four times faster than your version. Still, 50 blocks/second is much too slow, especially if the blocks are only 4 KB; modern cheap SCSI disks deliver between 600 KB/s and 1 MB/s. With 8 KB blocks, we should expect to see between 75 and 125 blocks per second. So we might change your 9% estimate to 4.5% (copy four times as fast, but twice as often). Nevertheless:

>Scenario three: Stupid DMA.
>
>Here, the CPU just sets up the DMA and awaits completion. The overhead
>is approximately 0 compared to the above examples.

The overhead here is not zero. It has been hidden. The overhead lies in the fact that dual ported main memory is expensive, so either the DMA steals cycles that might be used by the CPU (and it can easily take about half the cycles needed to do the copy in Scenario two), or the main memory costs more and/or is slower.

>Where does the DMA pay off given that all three examples have approximately
>identical throughput? ...
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.

You have converted `% of available cycles' to `% of cost' (in the first half of the latter statement) and assumed a continuous range of price/performance in both halves, neither of which is true.

(I happen to like DMA myself, actually. But it does take more parts, and those do cost....)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
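In C, the word-at-a-time copy Torek sketches would look roughly like the following; treating the dual-ported buffer as a plain uncached pointer and assuming 32-bit longs and a length that is a multiple of the word size are all simplifications.

    #include <stddef.h>

    /* Word-at-a-time drain of a dual-ported (uncached) buffer;
     * illustrative only. */
    void drain_buffer(unsigned long *dest,
                      const volatile unsigned long *dual_port,
                      size_t bytes)
    {
        size_t words = bytes / sizeof(unsigned long);

        while (words-- > 0)
            *dest++ = *dual_port++;   /* one 4-byte load + one 4-byte store */
    }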
jhood@biar.UUCP (John Hood) (06/07/89)
In article <185@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>Where does the DMA pay off given that all three examples have approximately
>identical throughput?
>
>DMA is preferable to the first choice whenever the cost of DMA is less
>than 20% of the cost of the CPU or less than the cost of speeding up
>the CPU by 20%.
>
>DMA is preferable to the second choice whenever the cost of DMA is
>less than 9% of the cost of the CPU or less than the cost of speeding
>up the CPU by 9%.
>
>I am the first to admit that these models are simplistic, but they
>do represent valid considerations and reasonable approximations to
>the actual solutions.
>
>Comments?

Robert has also ignored the cost of setting up the DMA controller, which can be significant, especially for scatter/gather type operation.

Also note that with modern operating systems that do buffering or disk caching, there is going to be a bcopy or its moral equivalent in there somewhere. This doesn't reduce CPU time used during programmed I/O, but it does change the trade-off from, say, 3 vs 10% to 13 vs 23% of CPU availability used for disk I/O. This makes the nature of the trade-off different.

My other thought is that regardless of the CPU cost, programmed I/O is often acceptable on single-user machines anyway. The situation often arises where the user is only concerned about the speed of the one process he's using interactively. That process will usually have to wait till its data arrives anyway, at least under current programming models. If the CPU has to sit and wait, it might as well do the data movement too. On the other hand, effective multi-channel DMA can be used to have several things going at once-- a bunch of disk drives, or as in the Macintosh and NeXT machines, sound in parallel with other stuff.

I'm not about to make any pontifications about what I think is the best solution for the future, because I don't know myself ;-)

 --jh--
John Hood, Biar Games snail: 10 Spruce Lane, Ithaca NY 14850 BBS: 607 257 3423
domain: jhood@biar.uu.net bang: anywhere!uunet!biar!jhood
"Insanity is a word people use to describe other people's lifestyles.
There ain't no such thing as sanity."-- Mike McQuay, _Nexus_
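The bcopy Hood refers to is the copy out of the kernel's buffer cache into user space. A schematic sketch of where it sits in the read path follows; the structure and function names are made up and do not belong to any particular kernel.

    #include <string.h>

    /* Schematic only: even when the device fills the cache buffer by DMA,
     * a read() that goes through the buffer cache still costs one CPU copy
     * into user space (bcopy in the kernels of the day; memcpy here). */
    struct buf { char *b_data; int b_count; };

    int cached_read(struct buf *bp, char *user_dst, int nbytes)
    {
        /* 1. the device (DMA or programmed I/O) has filled bp->b_data */
        /* 2. the unavoidable copy mentioned above:                    */
        memcpy(user_dst, bp->b_data, (size_t)nbytes);
        return nbytes;
    }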
slackey@bbn.com (Stan Lackey) (06/07/89)
In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>Also note that with modern operating systems that do buffering or disk
>caching, there is going to be a bcopy or its moral equivalent in there
>somewhere.

1) Is it possible, if not now then possibly in the future, for programmed I/O to _eliminate_ some of the 'bcopy's?

2) This discussion brings to mind one that went around some time ago, which was: is it better to supply a bunch of specialized processors (then bitblt's, now including DMA controllers), or a bunch of identical processors connected together? Theory was, when the bitblt and DMA are done, the other processor(s) can be applied to a compute bound task. It seems to me this might make an interesting product; price/perf range is varied by the number of [identical] processors, and all I/O hardware is very very dumb.

-Stan
stein@pixelpump.osc.edu (Rick 'Transputer' Stein) (06/07/89)
In article <41042@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>2) This discussion brings to mind one that went around some time ago,
>   which was: is it better to supply a bunch of specialized processors
>   (then bitblt's, now including DMA controllers), or a bunch of identical
>   processors connected together? Theory was, when the bitblt and DMA are
>   done, the other processor(s) can be applied to a compute bound task.
>   It seems to me this might make an interesting product; price/perf
>   range is varied by the number of [identical] processors, and all I/O
>   hardware is very very dumb.
>
>-Stan

This sure sounds like a physically objective parallel i/o mechanism: a processor controlling some portion of the i/o stream as mapped to a specific device. Sounds like a job for Transputer Man! :-).
-=-
Richard M. Stein (aka Rick 'Transputer' Stein)
Concurrent Software Specialist @ The Ohio Supercomputer Center
Ghettoblaster vacuum cleaner architect and Trollius semi-guru
Internet: stein@pixelpump.osc.edu, Ma Bell Net: 614-292-4122
rik@june.cs.washington.edu (Rik Littlefield) (06/08/89)
In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
< Given that a disk channel will be averaging 200K bytes/second ...
< [comparison of programmed I/O vs small dedicated buffer vs stupid DMA,
< evaluated against cpu cost and speed]
< I am the first to admit that these models are simplistic, but they
< do represent valid considerations and reasonable approximations to
< to the actual solutions.
<
< Comments?
The methodology seems sound, but I question the numbers. Just guessing,
but I suspect that workstation class systems have an *average* disk
throughput that is at least 10X lower than this number, even when
they are working full out. (Remember that 200K bytes/second is
720 Mbytes/hour.) If so, then the value of DMA is also 10X lower.
Would someone with real utilization numbers care to fill us in?
--Rik
jonasn@ttds.UUCP (Jonas Nygren) (06/08/89)
In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< Given that a disk channel will be averaging 200K bytes/second ...
<deleted>
>< Comments?
>
>The methodology seems sound, but I question the numbers. Just guessing,
>but I suspect that workstation class systems have an *average* disk
>throughput that is at least 10X lower than this number, even when
>they are working full out. (Remember that 200K bytes/second is
>720 Mbytes/hour.) If so, then the value of DMA is also 10X lower.
>
>Would someone with real utilization numbers care to fill us in?
>
>--Rik

I have performed a small test on a DECstation 3100 with an RZ55 230 MB disk. The test used 15 processes reading/writing 2 MB files each, with the following results:

	Write 15x2 MB:                                    113 s, 265 KB/s
	Read  15x2 MB:                                    117 s, 256 KB/s
	Read 15x2 + write 15x2 MB (new and in parallel):  281 s, 213 KB/s

	Mean value: 234 KB/s

/jonas
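The posting does not say how the test was run; a plausible harness for this kind of measurement (N child processes each writing a 2 MB file, timed externally with time(1)) might look like the sketch below. The file names, block size, and absence of error handling are all simplifications.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NPROC 15
    #define FSIZE (2L * 1024 * 1024)   /* 2 MB per process */
    #define BLK   8192

    int main(void)
    {
        static char block[BLK];
        int p;

        memset(block, 'x', sizeof block);

        for (p = 0; p < NPROC; p++) {
            if (fork() == 0) {                 /* each child writes its own file */
                char name[32];
                long done;
                int fd;

                sprintf(name, "testfile.%d", p);
                fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0)
                    _exit(1);
                for (done = 0; done < FSIZE; done += BLK)
                    (void) write(fd, block, BLK);
                close(fd);
                _exit(0);
            }
        }
        for (p = 0; p < NPROC; p++)            /* wait for all children */
            wait(NULL);
        return 0;
    }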
rec@dg.dg.com (Robert Cousins) (06/08/89)
In article <17925@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <185@dg.dg.com> I write:
>>Scenario two: Small dedicated buffer.
>
>>The buffer is 4K bytes long, so the processor is no longer required
>>to make as timely a response as above. The real issue is now the
>>copy time, of which there are two components: transfer time and
>>context overhead. The transfer time will be limited by the memory/
>>cache/CPU bottleneck. Since the buffer is not cacheable (by implication),
>>half of the transfer will involve a bus cycle in all cases.
>> [ code fragment using byte loads and stores excerpted ]
>> Total clocks required: 9*4096 = 36864 per block
>>                        * 50 blocks/second = 1843200 clocks
>
>This is not an unreasonable approach to analysing the time required for
>the copies, but the code itself *is* unreasonable---it is more likely to
>be something like
> [ code excerpted -- uses 4-byte loads and stores ]
>
>which is four times faster than your version.

Actually, you have created a scenario 2.5. I was making the assumption that cost was a driving factor here, which would rule out the use of real two-ported RAMs and 32-bit wide data paths. The increase in peripheral complexity is substantial (there aren't many 32-bit peripherals yet, but there will be soon! :-)) along with the cost of RAM. However, this scenario should be treated as reasonably as the rest. The equation of reference is:

	CPU cost + I/O scheme cost
	--------------------------- = $/deliverable compute unit
	 CPU speed - I/O overhead

I use percentages simply to avoid arguments about what reasonable units are. For your suggestion to be true, the following inequality must hold:

	CPU cost + 32-bit buffer cost      CPU cost + DMA cost
	-----------------------------  <  ------------------------
	 CPU speed - buffer overhead      CPU speed - DMA overhead

which is approximately equal to (when converting speed to percent):

	CPU cost + 32-bit buffer cost      CPU cost + DMA cost
	-----------------------------  <  ------------------------
	            95.5%                          ~100%

or

	1 * (CPU cost + 32-bit buffer cost) < .955 * (CPU cost + DMA cost)

or

	.045 * CPU cost + 32-bit buffer cost < .955 * DMA cost

which is clearly dominated by the CPU cost. If the CPU cost is simply $100, DMA wins if it costs less than about $5 more than the buffer.

>Still, 50 blocks/second is
>much too slow, especially if the blocks are only 4 KB; modern cheap SCSI
>disks deliver between 600 KB/s and 1 MB/s. With 8 KB blocks, we should
>expect to see between 75 and 125 blocks per second.

The purpose of the 50 blocks assumption was to estimate average CPU demand for support of I/O, not for peak situations. Relatively few machines of the low end class will be used at 1 MB/s continuously.

>So we might change your 9% estimate to 4.5% (copy four times as fast,
>but twice as often). Nevertheless:
>>Scenario three: Stupid DMA.
>>Here, the CPU just sets up the DMA and awaits completion. The overhead
>>is approximately 0 compared to the above examples.
>The overhead here is not zero. It has been hidden. The overhead lies in
>the fact that dual ported main memory is expensive, so either the DMA
>steals cycles that might be used by the CPU (and it can easily take about
>half the cycles needed to do the copy in Scenario two), or the main
>memory costs more and/or is slower.

Almost any product we are talking about will have a cache (or two) with a reasonable hit rate, which will allow DMA activity to take place with little or no performance impact. In fact, the major reason for speeding up RAM is to improve processor performance for cache line loads, not for improved peripheral performance. Anyway, few busses in the machines of this class have usable memory bandwidths less than 25 megabytes/second sustainable indefinitely. If the CPU is hogging 90% of this, there is still 2.5 megabytes per second available for I/O. This adds up to a continuously active Ethernet (1.25 MB/s) along with healthy disk bandwidth (1.25 megabytes/second). Since both of these are bursty, in reality there is a greater amount of instantaneously available bandwidth.

In an earlier life, designing a 64-processor 80386 machine (there is a working prototype somewhere but the company is no more :-(), I hit upon the idea of predicting when a CPU will need bus cycles and using cycles which were predicted not to be needed so that they could be used for I/O. On an 80386, it is possible to 100% predict bus cycle requirements with a small amount of logic by cheating. My calculations showed that a 16 MHz 80386 would leave almost 10 megabytes per second of bandwidth unused which this method could tap for non-time-critical I/O operations such as SCSI. Time critical peripherals would have to take CPU cycles if "free" cycles were not available within their time frame, which would not be very often.

>>Where does the DMA pay off given that all three examples have approximately
>>identical throughput? ...
>>DMA is preferable to the second choice whenever the cost of DMA is
>>less than 9% of the cost of the CPU or less than the cost of speeding
>>up the CPU by 9%.
>You have converted `% of available cycles' to `% of cost' (in the first
>half of the latter statement) and assumed a continuous range of price/
>performance in both halves, neither of which is true.

Actually, the true measure of a machine is the amount of work that it can do for an end user divided by the cost. The user must define the measure of work. Since I'm not able to define what the user will use to measure the machine, I must substitute a rough approximation -- deliverable CPU power in the form of MIPS, Dhrystones, or whatever. This value is directly tailorable by a number of factors in the system. Slowing down RAM can drop cost and performance. Sometimes it improves the ratio, sometimes it doesn't. While there is not a "continuous" or even "twice differentiable" curve here, there are so many points on it that for the purposes of this discussion it can be assumed to be a line. For each price point, there is an associated performance level. Obviously, plotting each price point vs. each performance point does not yield a line, but a cloud of points. However, these points are easily reducible into a family of general lines based upon CPU clock speed, DRAM speed, peripherals, and data path size, among others.

>(I happen to like DMA myself, actually. But it does take more parts,
>and those do cost....)

I happen to like low cost myself, and have been surprised when certain solutions turned out to be cheaper than others in counterintuitive ways.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.
rec@dg.dg.com (Robert Cousins) (06/08/89)
In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes: >In article <185@dg.dg.com> uunet!dg!rec (Robert Cousins) writes: >>Where does the DMA pay off given that all three examples have approximately >>identical throughput? >> >>DMA is preferable to the first choice whenever the cost of DMA is less >>than 20% of the cost of the CPU or less than the cost of speeding up >>the CPU by 20%. >>DMA is preferable to the second choice whenever the cost of DMA is >>less than 9% of the cost of the CPU or less than the cost of speeding >>up the CPU by 9%. >>I am the first to admit that these models are simplistic, but they >>do represent valid considerations and reasonable approximations to >>to the actual solutions. >>Comments? >Robert has also ignored the cost of setting up the DMA controller, >which can be significant, especially for scatter/gather type >operation. True. However I was assuming DUMB DMA which does not have these features. Clearly, Scatter/Gather has its associated costs and benefits. I was remiss in not including this as an additional scenario. >Also note that with modern operating systems that do buffering or disk >caching, there is going to be a bcopy or its moral equivalent in there >somewhere. This doesn't reduce CPU time used during programmed I/O, >but it does change the trade off from, say, 3 vs 10% to 13 vs 23% of >CPU availability used for disk I/O. This makes the nature of the >trade off different. True, however, many operating systems perform program loads and paging directly to user space bypassing a buffer cache. Since these make up a substantial portion of the actually performed disk operations, I don't think I was totally out of line, but your point is not only valid but important. I didn't include it because I was searching for a simple approximation. >My other thought is that regardless of the CPU cost, programmed I/O is >often acceptable on single-user machines anyway. The situation often >arises where the user is only concerned about the speed of the one >process he's using interactively. That process will usually have to >wait till its data arrives anyway, at least under current programming >models. If the CPU has to sit and wait, it might as well do the data >movement too. I strongly disagree here. In the modern Unix world, multitasking is of critical importance for simple system survival. Take the classic single user application: workstations. Sometime go look at the number of processes which are running on a Unix workstation some time. THen, use a single task heavily while keeping track of the time consumed by the other tasks. You will notice that the LAN related tasks will continue to use some amount of time. If you are running X Windows, you are then really using an application plus the X server -- two tasks. Unix depends upon being able to run more than one task to handle a variety of jobs up to and including gettys for each user's terminal. Programmed I/O is the enemy of multitasking since it effectively keeps the CPU from servicing tasks for a potentially long interval killing the ability for another task to service a request. Also, multitasking can provide a performance improvement even when only a single task is running. Clearly the task must wait while disk reads are being performed, but writes can be posted and flushed by a daemon in the background giving the program the effect of 0 wait writes. For some classes of applications, this can be a substantial win. 
>I'm not about to make any pontifications about what I think is the
>best solution for the future, because I don't know myself ;-)
>
> --jh--
>John Hood, Biar Games  snail: 10 Spruce Lane, Ithaca NY 14850  BBS: 607 257 3423
>domain: jhood@biar.uu.net  bang: anywhere!uunet!biar!jhood
>"Insanity is a word people use to describe other people's lifestyles.
>There ain't no such thing as sanity." -- Mike McQuay, _Nexus_

The truth is that all alternatives need to be considered.  Sometimes
the answers will fool you.

Robert Cousins
Dept. Mgr, Workstation Dev't
Data General Corp.

Speaking for myself alone.
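The rule of thumb being argued over in this exchange -- DMA pays when
its cost is less than some fraction (9% or 20% in the quoted examples)
of the CPU's cost -- reduces to a one-line comparison.  A minimal
sketch in C, with placeholder dollar figures rather than real prices:

    /* Sketch of the simple cost rule under debate: dumb DMA is "worth
     * it" if its added hardware cost is less than the value of the CPU
     * fraction it frees.  The dollar figures are placeholders. */
    #include <stdio.h>

    static int dma_pays_off(double dma_cost, double cpu_cost,
                            double cpu_fraction_freed)
    {
        return dma_cost < cpu_fraction_freed * cpu_cost;
    }

    int main(void)
    {
        double cpu_cost = 1000.0;   /* hypothetical CPU cost, $       */
        double dma_cost = 150.0;    /* hypothetical DMA logic cost, $ */

        printf("vs. the 20%% case: %s\n",
               dma_pays_off(dma_cost, cpu_cost, 0.20) ? "yes" : "no");
        printf("vs. the  9%% case: %s\n",
               dma_pays_off(dma_cost, cpu_cost, 0.09) ? "yes" : "no");
        return 0;
    }

Torek's objection still applies: the rule quietly converts a fraction
of available cycles into a fraction of cost, which is only a rough
equivalence.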
rec@dg.dg.com (Robert Cousins) (06/08/89)
In article <41042@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>>Also note that with modern operating systems that do buffering or disk
>>caching, there is going to be a bcopy or its moral equivalent in there
>>somewhere.
>
>1) Is it possible, if not now but possibly in the future, for programmed
>   I/O to _eliminate_ some of the 'bcopy's?

Some already do, for paging and program loads.  Some Unix DBMS products
already bypass the file system and talk through the raw character
drivers straight to the disks for performance reasons (bypassing the
sector cache).  While I don't know the implementation details, I do
know that it has been known to do substantial good for standard DBMS
jobs.  This is because many character drivers for disks do DMA directly
into user space.

>2) This discussion brings to mind one that went around some time ago,
>   which was, is it better to supply a bunch of specialized processors
>   (then bitblt's, now including DMA controllers), or a bunch of identical
>   processors connected together?  Theory was, when the bitblt and DMA are
>   done, the other processor(s) can be applied to a compute bound task.
>   It seems to me this might make an interesting product; price/perf
>   range is varied by the number of [identical] processors, and all I/O
>   hardware is very very dumb.

In an earlier life, I headed up the development of just such a machine,
the CSI-150, which supported up to 32 V30 CPUs, each of which could be
connected to a private SCSI channel capable of about 1.25 MB/s.  It
didn't catch on, but boy could it handle some classes of I/O-based
jobs!  Each CPU ran in its own private memory and sent messages to
other CPUs.  The operating system was designed so that the file systems
were locally managed and cached in each CPU, so the messages were
higher-level requests similar to NFS or RFS today.

We did have one additional problem: the system supported exactly one
user per CPU.  This meant the CRTs could be driven at 38.4 Kbps all day
long, since they effectively had a dedicated CPU driving them.  There
were very few CRTs which could keep up with 19.2 Kbps, much less 38.4.
We found that at 38.4, most CRTs couldn't even manage to send ^S out to
shut off transmission!

>-Stan

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
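For readers who haven't seen it, the raw-device path Cousins mentions
looks something like this from user code.  This is a minimal sketch
only: the device name is hypothetical, and real raw drivers typically
impose sector-multiple transfer sizes and buffer-alignment rules that
vary from system to system.

    /* Sketch of reading through the raw (character) disk device rather
     * than the block device, so the transfer bypasses the buffer cache
     * and -- on many drivers -- is DMA'd straight into user space.
     * "/dev/rdsk0a" is a hypothetical name; alignment and size
     * restrictions are glossed over here. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define XFER_SIZE (64 * 1024)          /* a sector-multiple size */

    int main(void)
    {
        char *buf = malloc(XFER_SIZE);
        int fd = open("/dev/rdsk0a", O_RDONLY);

        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }
        if (read(fd, buf, XFER_SIZE) < 0)  /* no buffer-cache copy */
            perror("raw read");
        close(fd);
        free(buf);
        return 0;
    }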
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/08/89)
In article <1213@ttds.UUCP> jonasn@ttds.UUCP (Jonas Nygren) writes:
>In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>>In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>>< Given that a disk channel will be averaging 200K bytes/second ...
>I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>Write 15x2 Mb: 113 s, 265 kb/s
>Read 15x2 Mb: 117 s, 256 kb/s
>Read 15x2 + write 15x2 Mb (new and in parallel): 281 s, 213 kb/s
>Mean value: 234 kb/s

I have performed some single-user tests.  A 200 KBytes/sec reading rate
is typical for small workstations with SCSI or similar disks, etc.
With SMD on one of the newer controllers you can do ~600 KBytes/sec.
On mainframes, I have seen single applications which *averaged*
3 MB/sec on 4.5 MB/sec channels over 8 simultaneous data streams.

So, the ratios quoted seem reasonable to me.

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
sandrock@uxe.cso.uiuc.edu (06/08/89)
Yes, it would be fantastic to have *real* numbers which represented the
overall *system* performance, in particular expressing *multi-user* cpu
& i/o throughput, similar perhaps to the transaction-processing
benchmarks used on certain types of systems.  I have recently seen some
multi-user benchmarks called MUSBUS 5.2 and AIM III for various SGI
machines, but these tests do not yet appear to be in wide use, nor do I
have any feeling for how valid a measurement they would provide.
Anyone care to comment on this?

Mark Sandrock
UIUC Chemical Sciences
rik@june.cs.washington.edu (Rik Littlefield) (06/09/89)
In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
< < < Given that a disk channel will be averaging 200K bytes/second ...
< <
< < I suspect that workstation class systems have an *average* disk
< < throughput that is at least 10X lower than this number, even when
< < they are working full out.
< <
< < Would someone with real utilization numbers care to fill us in?
<
< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
< The test used 15 processes reading/writing 2Mb files each, with the following
< results:
< <stuff deleted>
< Mean value: 234 kb/s

Sure, but how much of the time does your workstation run 15 processes
reading and writing the disk as fast as it can?  Program loads and file
copies run at 200 Kb/sec, program builds do maybe 10X less I/O, and
SPICE just crunches.  Whether DMA (or any other feature) is worthwhile
depends on what the machine spends its time doing.

Apparently my question was not clear, so I will restate it.  Does
anybody have numbers that reflect actual usage over an extended period?
If so, please tell us what sort of work was being done, and how much
I/O was required to do it.

--Rik
hammondr@sunroof.crd.ge.com (richard a hammond) (06/09/89)
One might want to look at "An Analysis of TCP Processing Overhead" by
Clark, Jacobson, Romkey, and Salwen in the June 1989 issue of IEEE
Communications Magazine (Vol. 27, No. 6).  They point out that the CPU
has to checksum the data during protocol processing anyway, and propose
(as one alternative) removing the DMA controller from the network
controller and letting the CPU do the checksum and the byte copy
together in a single loop.

Also, the disk throughput numbers of interest are not peak numbers but
average use over applications.  In previous jobs I've helped collect
the processing and disk I/O requirements of months of jobs on Convex
computers (i.e. big enough to do lots of number crunching and I/O), and
there were very few jobs which were both CPU-intensive and
I/O-intensive at the same time.

So, DMA on RISC-based workstations might have a different cost function
than simply the CPU cycles lost; one has to consider whether the CPU
cycles spent doing the data movement would have been of use to another
process.  I estimate that the answer is yes only some fraction of the
time, so the numbers given are upper bounds on the costs.  Note that
I'm talking about workstations and not large mainframes or number
crunchers, where the load or applications may be different.

Definite numbers for applications which do both CPU crunching and
require some sort of I/O AT THE SAME TIME would be interesting, since
the Convex wasn't a personal workstation.

Rich Hammond
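The combined loop Hammond refers to is easy to picture.  Here is a
minimal sketch in C of a copy-and-checksum routine in the spirit of
that proposal -- an illustration of the idea, not code from the paper;
it assumes an even byte count and ignores byte-order details.

    /* Copy-and-checksum in one pass: since the CPU must touch every
     * word to checksum it anyway, folding the copy into the same loop
     * gets the data movement nearly for free.  Computes the usual
     * 16-bit ones-complement sum. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src,
                                      size_t nbytes)
    {
        uint32_t sum = 0;
        size_t nwords = nbytes / 2;      /* nbytes assumed even here       */

        while (nwords-- > 0) {
            uint16_t w = *src++;
            *dst++ = w;                  /* the data movement ...          */
            sum += w;                    /* ... and the checksum, one pass */
        }
        while (sum >> 16)                /* fold the carries back in       */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t src[4] = { 0x1234, 0xabcd, 0x0001, 0xff00 };
        uint16_t dst[4];

        printf("checksum: 0x%04x\n", copy_and_checksum(dst, src, sizeof src));
        return 0;
    }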
mlord@bnr-rsc.UUCP (Mark Lord) (06/09/89)
In article <8499@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
>< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
>< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< < < Given that a disk channel will be averaging 200K bytes/second ...
>< < ...
>< < Would someone with real utilization numbers care to fill us in?
><
>< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>< The test used 15 processes reading/writing 2Mb files each, with the following
>< results:
>< <stuff deleted>
>< Mean value: 234 kb/s
>
>Sure, but how much of the time does your workstation run 15 processes reading
>and writing the disk as fast as it can?  Program loads and file copies run at
>200 Kb/sec, program builds do maybe 10X less I/O, SPICE just crunches.
> ...

Er, uhm.. excuse me.. but I think there may be two issues here.  One is
the quantity of I/O, and the other is the rate of I/O.  This experiment
with 15 processes doing lots of I/O probably (IMHO) comes close to
determining the transfer rate that is maintained when the system is
actually reading from disk.  Thus, for brief intervals, the system is
doing transfers at 234 kb/s, and it is this rate which the CPU/DMA
device must keep up with, IN ADDITION to keeping up with all other
events/interrupts at the same time.  Sure, it may only be busy for a
second once every 30 seconds, but it still ought to be able to handle
the load when it comes.

Now imagine a system with several users running BIG simulations, with
the associated paging going on as their tasks (and the 30 or so daemons
which are always running) get swapped.  Personally, I'd like the I/O to
be fast, and I'd also like not to have to type slowly as I am doing
right now (at times there is about a two-second pause between hitting a
key and seeing the result).  DMA might be appropriate for such a
system, especially since the CPU could easily be running something else
from its huge caches, leaving lots of idle bus cycles for the DMA.
This does not require dual-ported memory, but it does require some sort
of snooping (h/w or s/w) to maintain cache consistency.

As processors continue to become much faster than the memory bus, DMA
begins to look better and better for bulk data transfers with slower
I/O devices.

-Mark
ken@hugi.ACA.MCC.COM (Ken Zink) (06/10/89)
Without getting totally mired in ancient history, I think it is germane
to the discussion to point out that the "I/O processors" (called
"peripheral processors" by CDC) had sufficient intelligence to do more
than I/O.  In fact, for the first ten years or so of the 6000
architecture (6600, 6400, Cyber 70, Cyber 170, lower), the ENTIRE
operating system resided in the domain of the "lowly" PPs.  It seems
that very few operating system functions really require floating point
capability or 60 bits of significance.  One of the PPs was dedicated to
"system monitor" functions and was statically allocated; one other PP
was statically allocated and assigned to driving the system console
(dual 15-inch CRTs).  The rest of the PPs were dynamically assigned, by
PP Monitor, to perform I/O or some OS function as necessary.  [Note: on
the 20-PP systems, any PP could access any I/O channel; otherwise it
would have caused too many problems in the OS to determine which
available PP could handle a given I/O request.]

Since the early systems were limited to 131K words of central (60-bit)
memory, having the OS reside in the PP world preserved the critical
memory resource for user job space.  As larger memory configurations
became available, a few of the most often used OS functions were
migrated to CP code -- for performance.  The 7000 architecture (7600,
Cyber 170/Model 176), with the PPs hard-wired to external channels and
to central memory buffers, dictated that the OS be CPU (60-bit) based.

In summary, there is a spectrum of system architectures with
"intelligent" I/O, ranging perhaps from the dumbest of DMA-like schemes
to clusters of identical processors equally capable of performing an
I/O function or executing user-provided code.  An optimal architecture
would trade off all of the options in that spectrum against the
available-technologies, cost-to-manufacture, price-in-the-market,
performance-in-the-desired-target-application(s) and
market-acceptability variables to determine which of the options is
"optimal."  In short, as we know, there is no single best architecture.

Ken Zink        zink@mcc.com
MCC             Austin, TX
elg@killer.DALLAS.TX.US (Eric Green) (06/10/89)
in article <26636@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) says:
>>I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>
>>Write 15x2 Mb: 113 s, 265 kb/s
>>Read 15x2 Mb: 117 s, 256 kb/s
>>Read 15x2 + write 15x2 Mb (new and in parallel): 281 s, 213 kb/s
>>Mean value: 234 kb/s

Note that this is probably not an accurate account of disk drive
bandwidth at all.  Unix (at least older AT&T versions) DMAs its data
into the disk cache, then has the CPU manually copy it into the user's
own buffer.  With a plain-jane ST157N and a non-DMA SCSI controller
pushed by a plain old 8 MHz 68000, I get 550K/second (at least until my
disk gets fragmented).  And there are still visible pauses where the
68000 takes a while to digest the data.  Another (DMA) disk controller
gets 650K/second out of the same disk drive (of course, a 68020 or
faster processor wouldn't have run out of steam like my 68000, so this
isn't really an argument that DMA is better than CPU-driven I/O).

Strangely enough, I have never seen anything on preferential caching
schemes for file systems.  You'd want to cache small I/O requests, as
is currently done... but what about the scientific types who want to
stream in a few megabytes of data, crunch on it, then stream it back
out -- as fast as possible?  That'd blow any reasonable cache to
pieces.  You'd want to DMA it straight into the user's memory.  Or even
use CPU-driven I/O straight into the user's memory... you'd still come
out at least as well as the traditional DMA-it-to-cache-then-copy-it.
Thinking on it a bit, it seems you'd want to cache only small I/O
requests that don't overwhelm the amount of cache you have, while
DMA'ing large I/O requests straight into the user's memory ASAP.  That
way crontab, whotab, and other small files hit fairly often would stay
cached longer.  An interesting problem... I suppose it irritates the
designers of these disk subsystems that all their beautiful bandwidth
is chewed to shreds by OS overhead.

> On
> mainframes, I have seen single applications which *averaged* 3 MB/sec on
> 4.5 MB/sec channels on 8 simultaneous data streams.

Which particular mainframes?  Sounds like something a Cray could do...
very little overhead there at all (don't have to cope with memory
protection, can DMA straight into the user's data space without
worrying about how "real" memory maps into the user's "virtual" memory,
etc.).  Sounds to me like another speed reason for Crays to not have
virtual memory :-) (for the old veterans of past comp.arch
discussions).  Have to consider all aspects of the architecture,
including disk subsystem performance, not just what it looks like from
a user or CPU point of view.

> So, the ratios quoted seem reasonable to me.

Yes, it seems reasonable to me too.  But somewhat sad, considering the
performance that the hardware is capable of.

--
Eric Lee Green    P.O. Box 92191, Lafayette, LA 70509
..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849
"I have seen or heard 'designer of the 68000' attached to so many names
that I can only guess that the 68000 was produced by Cecil B. DeMille."
                                                              -- Bcase
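Green's "cache only the small requests" idea can be sketched as a
simple size test.  The threshold and the helper routines below are
hypothetical, invented purely for illustration (the stubs just print
what they would do); no real kernel interface is implied.

    /* Sketch of the policy mused about above: small requests go
     * through the buffer cache, large streaming requests bypass it
     * and go straight to/from user memory. */
    #include <stdio.h>
    #include <stddef.h>

    #define CACHE_BYPASS_BYTES (256 * 1024)    /* assumed cutoff */

    static int cached_io(void *uaddr, size_t len)  /* via buffer cache */
    {
        (void)uaddr;
        printf("cached I/O, %zu bytes\n", len);
        return 0;
    }

    static int direct_io(void *uaddr, size_t len)  /* straight to user */
    {
        (void)uaddr;
        printf("direct I/O, %zu bytes\n", len);
        return 0;
    }

    static int do_disk_io(void *uaddr, size_t len)
    {
        if (len >= CACHE_BYPASS_BYTES)
            return direct_io(uaddr, len);   /* don't blow the cache away  */
        return cached_io(uaddr, len);       /* keep small, hot files warm */
    }

    int main(void)
    {
        /* Lengths only; the stubs never touch the buffer addresses. */
        do_disk_io(NULL, 512);              /* -> cached */
        do_disk_io(NULL, 4 * 1024 * 1024);  /* -> direct */
        return 0;
    }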
elg@killer.DALLAS.TX.US (Eric Green) (06/10/89)
in article <188@dg.dg.com>, rec@dg.dg.com (Robert Cousins) says:
> In article <620@biar.UUCP> jhood@biar.UUCP (John Hood) writes:
>>>Where does the DMA pay off given that all three examples have approximately
>>>identical throughput?
>>My other thought is that regardless of the CPU cost, programmed I/O is
>>often acceptable on single-user machines anyway.  The situation often
> I strongly disagree here.  In the modern Unix world, multitasking is
> of critical importance
> of jobs up to and including gettys for each user's terminal.  Programmed
> I/O is the enemy of multitasking since it effectively keeps the CPU from
> servicing tasks for a potentially long interval killing the ability
> for another task to service a request.

I have a non-DMA hard disk controller on the multitasking machine that
I use (so multitasking that each filesystem and device driver runs as a
separate task, albeit a high-priority one).  I'll agree that programmed
I/O sucks the wind out of any other task (of lower priority) that is
running, but it's not quite as bad as you put it.  The CPU can fetch
data from the buffer on the disk controller faster than it can do
anything with that data, so most things, e.g. compiles, consist of long
moments of Deep Thought by the process, punctuated by occasional disk
hits.  This is hardly the "disaster" that you claim, although I do wish
I had been able to get a DMA controller (I couldn't trust one with the
bus expander I'm using, alas).

And for the person who thought that the buffer on the disk controller
would have to be double-ported -- nope.  Just bus-oriented, with access
from either the disk side or the bus.  I'm no hardware wiz (beware of
programmers with slaughtering irons!), but I can think of a couple of
ways to do it with standard TTL.  E.g., use '273s as the "data port",
and loadable counters to address the RAM buffer sequentially (without
having to do math on it for each additional byte).

> The truth is that all alternatives need to be considered.  Sometimes
> the answers will fool you.

Yep.  And the "answer", in this case, is that while programmed I/O is
certainly nothing to be proud of (at least with a slow processor like
my 8 MHz 68000), it's not the big disaster for multitasking that you
might expect.

--
Eric Lee Green    P.O. Box 92191, Lafayette, LA 70509
..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849
"I have seen or heard 'designer of the 68000' attached to so many names
that I can only guess that the 68000 was produced by Cecil B. DeMille."
                                                              -- Bcase
jonasn@ttds.UUCP (Jonas Nygren) (06/10/89)
In article <8499@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>In article <1213@ttds.UUCP>, jonasn@ttds.UUCP (Jonas Nygren) writes:
>< In article <8479@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield -- that's me) writes:
>< < In article <185@dg.dg.com>, rec@dg.dg.com (Robert Cousins) writes:
>< < < Given that a disk channel will be averaging 200K bytes/second ...
>< <
>< < I suspect that workstation class systems have an *average* disk
>< < throughput that is at least 10X lower than this number, even when
>< < they are working full out.
>< <
>< < Would someone with real utilization numbers care to fill us in?
><
>< I have performed a small test on a DECstation3100 with a RZ55-230 Mb disk.
>< The test used 15 processes reading/writing 2Mb files each, with the following
>< results:
>< <stuff deleted>
>< Mean value: 234 kb/s
>
>Sure, but how much of the time does your workstation run 15 processes reading
>and writing the disk as fast as it can?  Program loads and file copies run at
>200 Kb/sec, program builds do maybe 10X less I/O, SPICE just crunches.
>Whether DMA (or any other feature) is worthwhile depends on what the machine
>spends its time doing.
>
>Apparently my question was not clear, so I will restate it.  Does anybody have
>numbers that reflect actual usage over an extended period?  If so, please tell
>us what sort of work was being done, and how much I/O was required to do it.
>
>--Rik

My figures were intended to show the upper limit of throughput on a
commercially available workstation.  I have been informed by people
inside Digital that figures at least twice what I stated would be
possible if you care to look for disk-drive sources outside Digital.

/jonas
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/12/89)
In article <8327@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>in article <26636@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) says:
>> mainframes, I have seen single applications which *averaged* 3 MB/sec on
>> 4.5 MB/sec channels on 8 simultaneous data streams.
>Which particular mainframes? Sounds like something a Cray could do...

This exact performance figure is from a Cyber 205, but I have seen
similar performance on Crays (not quite as good *then*, but it should
be better now because of faster disks -- newer disks run at
~100 Mbits/sec transfer rate as opposed to the older 36 Mbits/sec
disks).  Also, I expect large IBM mainframes to do almost as well.
Although the disk transfer rate is not as high, the
disk-controller-to-channel connection runs at 4.5 MBytes/sec on some
models.

(*Aside*)  These I/O rates are not particularly high by mainframe
standards, just by mini/micro standards.  There used to be a rule of
thumb that for balance a system should have a constant ratio of
1 MIPS / 1 MByte / 1 MByte/sec I/O.  The latter was slightly nebulous,
but it was usually interpreted as channels capable of that rate and
disks capable of reading at that rate sustained.  It was also
considered a "good idea" if disk and channel utilization was less than
5% of raw aggregate capacity, in order to guarantee that the disk
subsystem was not the bottleneck.  I actually did a study once and
found that one heavily used (i.e. many users) system here *averaged*
15 KB/sec/MIPS.  This (mainframe) system was capable of at least
.5 MB/sec/MIPS of I/O.  This 3% utilization helped make the CPU the
bottleneck.  Disk I/O is the usual bottleneck on mini/micro systems.
This is not necessarily a "problem"; it is just a system design and
configuration tradeoff.  (*end Aside*)

On a Cray, if you have an SSD, your I/O rate can run a *lot* faster
than the above disk rates.

>very little overhead there at all (don't have to cope with memory
>protection, can DMA straight into the user's data space without

Yes, this is part of the reason such rates can be sustained.  These
rates were always with data copied directly into user memory.  I note
that there is a way to do this on some Unix systems: a facility to map
virtual memory to files.  Then "paging" can potentially move the data
directly into memory without copying.  This is a case where virtual
memory actually helps.  Most of the time it doesn't matter one way or
the other for this problem.

>worrying about how "real" memory maps into the user's "virtual"
>memory, etc.).

Anyway, the Cyber 205 is a virtual-memory machine.  VM has nothing to
do with it specifically.  The cost of copying large blocks of data is
much less on a Cray or Cyber 205/ETA machine because block data copies
are done at vector rate, and there is enough memory bandwidth available
to sustain such rates.  Crays have memory protection, and the operating
system still has to figure out what real memory addresses the user's
buffers occupy.  It takes a few microseconds to do this either way,
virtual or not.  These operations were actually faster on the Cyber 205
than on the Cray X-MP/48, for various reasons.  The cost of an I/O
operation has generally been in figuring out where the data is on disk
and in initiating and sustaining the transfer.  The Cyber 205 did this
quickly because the hardware had *very* capable controllers which did
all the cylinder/track/sector mapping and presented a simple
blockserver interface to the operating system.
(The 205 did not have the complicated "channel program" problem that
IBM mainframes have, because this overhead was all handled in the
controllers.)

>Sounds to me like another speed reason for Crays to not have virtual
>memory :-) (for the old veterans of past comp.arch discussions). Have

It sounds like a reason for systems to support fast I/O to me :-)

1) parallel I/O paths to memory (aka "channels")
2) fast disks
3) low overhead to do a raw disk operation
4) lots of memory bandwidth
5) operating systems which support multiple asynchronous I/O requests
6) operating systems which support transfer of data directly into user
   memory without being buffered elsewhere

**********************************************************************
I have an actual number to present here:  I have seen a significant
number of applications which can only do about 20 floating point
operations per word of I/O, unless the entire problem can be contained
in memory.  The memory required for the entire problem is in the range
of 1 million words for every 1 to 10 MFLOPS.  So, a single job running
at ~100 MFLOPS may need about 800 MBytes, *or* the ability to do I/O at
a rate of 40 MBytes/sec.  The single job referred to earlier was
running at about 200 MFLOPS on a Cyber 205 and needed about
50 MBytes/sec of I/O (it didn't get it -- it only got ~24 MBytes/sec).
I do not remember exactly how much memory was needed, but it was
significantly more than 32 MW (256 MBytes).  You have to look at the
requirements of the entire problem before you can say what your system
requirements are.
**********************************************************************

Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035    Phone:  (415)694-6117
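The arithmetic behind LaMaster's 40 MBytes/sec figure is worth making
explicit.  A minimal sketch in C that simply restates it, using his
stated ratios (20 flops per word of I/O, 8-byte words, a 100 MFLOPS
job) as the assumptions:

    /* Worked form of the figure above: a job doing ~20 floating point
     * operations per 8-byte word of I/O at 100 MFLOPS needs roughly
     * 40 MB/s of I/O unless the whole problem fits in memory. */
    #include <stdio.h>

    int main(void)
    {
        double mflops         = 100.0;  /* sustained compute rate        */
        double flops_per_word = 20.0;   /* observed ratio for many codes */
        double bytes_per_word = 8.0;

        double mb_per_sec = mflops / flops_per_word * bytes_per_word;
        printf("required I/O: %.0f MB/s\n", mb_per_sec);  /* -> 40 MB/s */
        return 0;
    }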
gws@Xylogics.COM (Geoff Steckel) (06/12/89)
One concern that doesn't seem to have been addressed in the DMAC vs.
CPU data-movement question is transfer latency: the maximum delay
between individual data items moved.  Restrictions on latency (due to
small buffers somewhere, or a small decision time) can override the
otherwise attractive choice of using the CPU for data movement.  If the
CPU is not interruptable (or restricts interrupts) during the data
transfer, other services (communications, screen, mouse, etc.) may
experience unacceptable latency as well.

Sufficient buffering or other hardware to `smooth over' large latency
times can become more expensive (choose your metric) than putting in
real or simulated multiport memory and a dedicated DMAC; it depends
entirely on the desired performance.  For example, if the CPU must move
data between every pair of disk transfers, you may miss revolutions on
the disk... or maybe not.  Read, read, measure, and read again.

geoff steckel (steckel@alliant.COM, gws@xylogics.COM)
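A related back-of-the-envelope for Steckel's latency point: the
buffering a device needs in order to ride out CPU unavailability is
just its data rate times the worst-case service delay.  A minimal
sketch in C with made-up numbers (the rate and latency below are
illustrative assumptions, not measurements):

    /* Sketch of the latency arithmetic: a device delivering data at
     * some rate needs at least rate * worst_case_delay bytes of
     * buffering if programmed I/O can be kept away that long by
     * interrupt masking or other work; otherwise data is lost or a
     * disk revolution is missed. */
    #include <stdio.h>

    int main(void)
    {
        double rate_bytes_per_sec = 1.25e6;   /* e.g. a busy SCSI disk */
        double worst_latency_sec  = 2.0e-3;   /* CPU tied up for 2 ms  */

        double buffer_needed = rate_bytes_per_sec * worst_latency_sec;
        printf("buffer needed: %.0f bytes\n", buffer_needed);  /* ~2500 */
        return 0;
    }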