schow@leibniz.uucp (Stanley Chow) (04/15/89)
Since Michael Slater has already posted a good summary of the '040 and 486, there is not much point in starting a new battle in the 68K vs. i86 war. (At least wait till there is more public information.) In the meantime, I offer a new topic to burn up bandwidth (that is, net.bandwidth, not bus.bandwidth).

In a recent series of articles about address modes and other topics, some posters claim that memory bandwidth is not a problem - to quote Brian Case, "bandwidth can be had in abundance". I happen to think that we do not have enough bandwidth now. What do other people think?

Just to make sure there are enough pieces so that everyone can post a different answer, I will start with a list of pieces and you can fill in the interfaces. Please try to at least type what you mean and, by all means, put in a couple of real (or maximum) numbers. [Feel free to talk about parallel/multi-processing.]

    Piece of system
       Execution Core   (possibly many)
       On-chip Cache    (possibly split)          chip
       ---------------------------------
       Off-chip Cache   (possibly multi-level)    board
       Main memory      (possibly multi-level)
       Bulk memory      (for lack of a better term)

Specific interfaces that may be of interest:

1)  Execution Core to On-chip I-Cache. It seems people can already build cores that are faster than the on-chip cache. One can always throw silicon at a multiplier to make it faster (I know, there are limits with loading, ...).

1a) Straight-line execution. Even in this simpler case, I understand that most chips are limited by the cache, not the core. Any chip designers want to comment?

2)  Execution Core to On-chip D-cache. A very hard problem by all accounts. Everyone (almost) adds delay slots one way or another.

3)  On-chip to off-chip. A well known problem. How wide do you think buses will get? How fast?

4)  How should off-chip cache be controlled? By the cpu chip?

5)  Invent the interface problem of your choice. This can be made as hard or as easy as you want.

Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
Please don't tell my boss I am starting this discussion, he thinks I am working hard on software!
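To put a rough number on interface (1) above, here is a minimal back-of-envelope sketch in C. The 33 MHz clock, 4-byte instructions, and single-issue rate are assumptions picked purely for illustration, not figures for any particular chip.

    /* Rough bandwidth estimate for the core-to-I-cache interface (item 1 above).
     * The clock rate, instruction size, and issue rate are illustrative
     * assumptions, not figures from any particular chip. */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz   = 33.0;  /* assumed core clock            */
        double instr_bytes = 4.0;   /* assumed fixed 32-bit opcodes  */
        double ipc         = 1.0;   /* assumed one issue per cycle   */

        double fetch_mb_per_s = clock_mhz * instr_bytes * ipc;
        printf("Instruction fetch alone needs about %.0f MB/s\n", fetch_mb_per_s);
        /* Data references, typically a fraction of an access per instruction,
         * add more demand on top of this. */
        return 0;
    }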
davis@clocs.cs.unc.edu (Mark Davis) (04/15/89)
In article <407@bnr-fos.UUCP> schow@leibniz.uucp (Stanley Chow) writes:
>In a recent series of articles about address modes and other topics,
>some posters claim that memory bandwidth is not a problem - to quote
>Brian Case, "bandwidth can be had in abundance". I happen to think that
>we do not have enough bandwidth now. What do other people think?
> ...
> It seems people can already build cores that are faster than the
> on-chip cache. One can always throw silicon at a multiplier to make
> it faster (I know, there are limits with loading, ...).

You can always improve bandwidth with silicon (and wires). To double bandwidth, double the data bus size. You can also use interleave or special chip modes (static column or page mode access) to improve bandwidth. As I remember, Brian Case's statement was indeed referring to bandwidth.

On the other hand, latency (roughly the number of cycles to get the data after you figure out its address) is a much more difficult problem. Making the latency twice as good (50% as long) can be very tough. Some latencies are not possible with current technology (1 ns latency for a 1-Megaword system, for example).

Can you rephrase your questions to discriminate between bandwidth and latency?

Thanks - Mark (davis@cs.unc.edu or uunet!mcnc!davis)
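To make the bandwidth/latency distinction concrete, here is a minimal sketch in C of the usual first-order model, access time = latency + bytes / bandwidth. The latency, bus widths, and cycle time are made-up round numbers, not measurements of any real system; the point is only that doubling the bus width halves the transfer term and leaves the latency term untouched.

    /* First-order memory access model: time = latency + bytes / bandwidth.
     * The latency and bus figures below are illustrative assumptions only. */
    #include <stdio.h>

    static double xfer_ns(double bytes, double latency_ns, double bus_bytes,
                          double cycle_ns)
    {
        double bandwidth = bus_bytes / cycle_ns;       /* bytes per ns */
        return latency_ns + bytes / bandwidth;
    }

    int main(void)
    {
        double line = 32.0;          /* a 32-byte cache line             */
        double lat  = 120.0;         /* assumed DRAM latency, ns         */
        double cyc  = 25.0;          /* assumed 40 MHz bus cycle, ns     */

        printf("32-bit bus:  %.0f ns\n", xfer_ns(line, lat, 4.0, cyc));
        printf("64-bit bus:  %.0f ns\n", xfer_ns(line, lat, 8.0, cyc));
        /* Doubling the bus halves the transfer term (200 -> 100 ns here)
         * but leaves the 120 ns latency term untouched. */
        return 0;
    }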
mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/16/89)
In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>You can always improve bandwidth with silicon (and wires). To double
>bandwidth, double the data bus size.
>
>On the other hand, latency (roughly the number of cycles to get the
>data after you figure out its address) is a much more difficult problem.
>Making the latency twice as good (50% as long) can be very tough.
>Mark (davis@cs.unc.edu or uunet!mcnc!davis)

One place where the distinction between latency and bandwidth shows up very clearly is in the CDC/ETA line of supercomputers. These machines (the Cyber 205 and ETA-10) use a memory-to-memory vector architecture.

The machine being installed now at FSU (an ETA-10G) supports a sustained bandwidth of 6.85 GByte/s from each CPU's 32 MByte (soon 128 MByte) local memory through the vector pipes and back to memory. This bandwidth consists of two 64-bit loads and one 64-bit store for each of 2 vector pipes on each CPU every 7 ns cycle. Total: 6 words/clock * 8 Bytes/word * 143 M clock/s = 6.85 GB/s. I think that this memory is implemented in 35 ns SRAM. Latency on the 10.5 ns machine is about 6-8 cycles, or 60-80 ns. I don't know how this will scale with the faster CPUs.

The second-level memory consists of 1 GByte of DRAM. It has 8 ports, each capable of sustaining transfers of one 64-bit word per clock. The aggregate transfer rate is thus 9.1 GByte/s. The hardware setup time for a transfer is supposed to be about 256 cycles, but I don't know what fraction of this is the actual memory latency.

Disclaimer: I don't work for CDC/ETA. In fact, I don't work much at all....
--
---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu    mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------
mccalpin@loligo.cc.fsu.edu (John McCalpin) (04/17/89)
In article <592@loligo.cc.fsu.edu> I wrote:
>One place where the distinction between latency and bandwidth shows up
>very clearly is in the CDC/ETA line of supercomputers. These machines
>(the Cyber 205 and ETA-10) use a memory-to-memory vector architecture.

I then went on to discuss the bandwidth, but not the latency. I guess that I didn't make a very clear distinction. :-)

Recap: we have a machine with four 7 ns CPUs, each with 32 MB of SRAM and a 6850 MB/s memory channel. The CPUs share another 1 GB of DRAM, with four 1140 MB/s channels currently installed (one to each CPU).

The latency is important because the overhead of setting up a memory-to-memory vector operation includes the memory latency plus the pipe length, plus other stuff relating to decoding the instruction, etc. The latency of the SRAM on the ETA-10 is about 6-8 cycles, and the pipe length is 5. So even if the instruction took zero time to decode (don't we all wish!), there should be an overhead of 11-13 cycles on each vector operation. In fact, the hardware overhead on the ETA-10 is down to about 16-23 cycles, depending on how the banks are aligned for the input and output vectors. This allows very good performance on fairly short vectors.

The latency tends to be more of a bother in the random gather/scatter instructions. The ETA-10 (like the Cray machines) uses banked memory, set up so that sequential accesses come at full (6850 MB/s) speed. Random accesses can be MUCH slower. Repeated accesses to the same bank (typically resulting from a stride through an array which is a multiple of 8 or 16) result in a full latency delay on each access. Most ETA-10 users would really like to see the latency go down so that bank conflicts would be less trouble on random gathers/scatters.

Disclaimer: I don't work for CDC/ETA. In fact, I don't work much at all....
--
---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu    mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------
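A small sketch in C of how that startup overhead limits short-vector performance. The 20-cycle overhead and 2-results-per-cycle peak rate are illustrative round numbers in the neighbourhood of the figures quoted above, not exact ETA-10 parameters.

    /* Effective vector speed as a function of vector length, given a fixed
     * startup overhead (the latency-plus-pipe-length cost discussed above).
     * The overhead and peak rate below are assumed round numbers. */
    #include <stdio.h>

    int main(void)
    {
        double overhead_cycles = 20.0;  /* assumed vector startup cost  */
        double results_per_cyc = 2.0;   /* assumed peak pipe throughput */
        int    lengths[] = { 10, 20, 50, 100, 1000 };

        for (int i = 0; i < 5; i++) {
            double n      = lengths[i];
            double cycles = overhead_cycles + n / results_per_cyc;
            double eff    = (n / cycles) / results_per_cyc;  /* fraction of peak */
            printf("length %5d: %.0f%% of peak\n", lengths[i], 100.0 * eff);
        }
        /* Halving the startup overhead roughly halves the vector length
         * needed to reach any given fraction of peak. */
        return 0;
    }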
schow@bnr-public.uucp (Stanley Chow) (04/18/89)
In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>You can always improve bandwidth with silicon (and wires). To double
>bandwidth, double the data bus size. You can also use interleave or
>special chip modes (static column or page mode access) to improve
>bandwidth.
>

Within a chip, yes, one can widen the bus. Even there, routing problems will restrict it. The 128-bit bus on the recent Intel chips seems to be a practical limit for now. 512-bit buses probably need triple metal levels even in sub-micron processes.

Outside of a chip, I would have serious doubts about a very wide bus unless you have lots of money. A 128-bit bus with a 32-bit address comes to 160 pins before control lines; add in power and ground, double it for I & D, and we are looking at a packaging problem. Come to think of it, ground bounce will probably make the packaging look easy.

My view is that even with interleave and page-mode, etc., we can make execution cores and on-chip caches that are much faster than any bus. Even in terms of raw bandwidth, but especially in the latency time required.

>Can you rephrase your
>questions to discriminate between bandwidth and latency?
>

Actually, this question on bandwidth is my followup to the recent discussion. I am frustrated by bandwidth and latency at every turn; yet all the people on the net seem to think bandwidth and/or latency is not a problem. Is this because everyone knows something I don't? Do I have a particularly difficult problem?

Basically, I would like to get a feel for what other people think are the bottlenecks. So comments on bandwidth *and* latency are appreciated.

The real hidden agenda is the old RISC-CISC war. If bandwidth is a real problem, then RISC is not a good solution. [Oh no, did I just restart a religious war?]

Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-pulic
What opinion? Did I say something? Come on, you wouldn't fire me just because I didn't put in a disclaimer?
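A quick tally in C of where the pins go for the wide off-chip bus described above. The data and address widths come from the post; the control-pin count and the one-power/ground-pair-per-eight-signals rule are assumptions for illustration only.

    /* Back-of-envelope pin count for the wide off-chip bus described above.
     * Data and address widths come from the post; the control-pin count and
     * the power/ground ratio are assumed. */
    #include <stdio.h>

    int main(void)
    {
        int data    = 128;
        int address = 32;
        int control = 20;               /* assumed strobes, byte enables, etc.     */

        int one_bus = data + address + control;
        int both    = 2 * one_bus;      /* separate I and D buses                  */
        int pwr_gnd = both / 8 * 2;     /* assumed one Vcc/Gnd pair per 8 signals  */

        printf("signals: %d, with power/ground: %d pins\n", both, both + pwr_gnd);
        return 0;
    }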
jesup@cbmvax.UUCP (Randell Jesup) (04/21/89)
In article <418@bnr-fos.UUCP> schow@bnr-public.UUCP (Stanley Chow) writes:
>Outside of a chip, I would have serious doubts about a very wide bus
>unless you have lots of money. A 128-bit bus with a 32-bit address comes
>to 160 pins before control lines; add in power and ground, double it for
>I & D, and we are looking at a packaging problem. Come to think of it,
>ground bounce will probably make the packaging look easy.
...
>Actually, this question on bandwidth is my followup to the recent
>discussion. I am frustrated by bandwidth and latency at every turn; yet
>all the people on the net seem to think bandwidth and/or latency is not
>a problem. Is this because everyone knows something I don't? Do I have
>a particularly difficult problem?

Well, my opinions on this are pretty well known here, I think. To restate: I agree bandwidth could well become a problem. Bandwidth problems come in several flavors: packaging is a big one, and RAM speed gets in there also (especially if you don't want to pay astronomical prices for it). I won't even go into bus bandwidth.

The traditional ways to improve bandwidth are running out of steam, or at least starting to. It's getting harder to keep adding pins to these (very large) packages, while still running them at reasonable rates. Also, the signals are getting fast enough that capacitive pad loads from static protection (combined with fan-out) are limiting the speed at which you can run the lines. However, there are interesting non-traditional solutions to these problems that may save our bacon.

RAM speed is also an issue: you can get 50 MHz '030s, but the RAM to keep up with them is EXPENSIVE. Processor speed has been increasing faster than RAM access time has been decreasing (for both CISC and RISC).
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
erskine@dalcsug.UUCP (Neil Erskine) (04/21/89)
In article <6658@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>The traditional ways to improve bandwidth are running out of steam,
>or at least starting to. It's getting harder to keep adding pins to these
>(very large) packages, while still running them at reasonable rates. Also,
>the signals are getting fast enough that capacitive pad loads from static
>protection (combined with fan-out) are limiting the speed at which you can
>run the lines.

I'm no engineer, but if the capacitive pad loads are restricting the speed of off-chip signalling, why not dispense with them, and provide the static protection at the board level? This might make board assembly more costly (due to the increased care required), and the board itself more costly (it might have to be encased in metal), but if it gives a significant degree of additional speed, the bother and expense seem worth it.

Alternatively, there may be some reasons why board-level protection can't do the job; in which case, what are those reasons?
matloff@crow.Berkeley.EDU (Norman Matloff) (04/27/89)
In article <7766@thorin.cs.unc.edu> davis@cs.unc.edu (Mark Davis) writes:
>In article <407@bnr-fos.UUCP> schow@leibniz.uucp (Stanley Chow) writes:
>>In a recent series of articles about address modes and other topics,
>>some posters claim that memory bandwidth is not a problem - to quote
>>Brian Case, "bandwidth can be had in abundance". I happen to think that
>>we do not have enough bandwidth now. What do other people think?
>You can always improve bandwidth with silicon (and wires). To double
>bandwidth, double the data bus size. You can also use interleave or
>special chip modes (static column or page mode access) to improve
>bandwidth.

These measures, e.g. wider buses, may just shift the bottleneck to something else. There is still a strong limitation on a chip's number of pins, right? The area of a rectangle grows much faster than the perimeter, and of course there are mechanical reasons why pins can't be too small. Thus the ratio of the number of I/O channels of a chip to the bits stored in the chip will probably get worse, not better.

We are developing an optical interconnect which has plenty of bandwidth, since it bypasses the pins and reads from the chip directly [see 1988 ACM Supercomputing Conf.]. But it does indeed seem to us -- at this stage, at least -- that huge bandwidth cannot be exploited fully in many, maybe most, applications.

I certainly would like to hear what others have to say about this.

Norm
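A small sketch in C of the perimeter-versus-area argument. The pad pitch and storage density are made-up round numbers, chosen only to show the trend, not to describe any real process.

    /* Perimeter vs. area scaling: pads go roughly with die edge length,
     * on-chip bits roughly with die area.  The pad pitch and bit-cell
     * density below are assumed round numbers, just to show the trend. */
    #include <stdio.h>

    int main(void)
    {
        double pad_pitch_mm = 0.15;      /* assumed pad spacing along the edge */
        double bits_per_mm2 = 20000.0;   /* assumed storage density            */

        for (double edge_mm = 5.0; edge_mm <= 20.0; edge_mm += 5.0) {
            double pads = 4.0 * edge_mm / pad_pitch_mm;        /* ~perimeter */
            double bits = edge_mm * edge_mm * bits_per_mm2;    /* ~area      */
            printf("%4.0f mm edge: %5.0f pads, %9.0f bits, %8.0f bits/pad\n",
                   edge_mm, pads, bits, bits / pads);
        }
        /* Bits per pad grows linearly with the edge length, so the
         * pin-to-storage ratio keeps getting worse as dice get bigger. */
        return 0;
    }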
jps@wucs1.wustl.edu (James Sterbenz) (05/02/89)
In article <23649@agate.BERKELEY.EDU> matloff@heather.ucdavis.edu (Norm Matloff) writes:
>These measures, e.g. wider buses, may just shift the bottleneck to
>something else. There is still a strong limitation on a chip's
>number of pins, right? The area of a rectangle grows much faster
>than the perimeter, and of course there are mechanical reasons why
>pins can't be too small.

This is a problem with packages that have pins only on the perimeter (such as DIPs), but not for PGAs. Of course, pin limitation is still a problem, but not quite as bad for PGAs.

>Thus the ratio of the number of I/O channels
>of a chip to the bits stored in the chip will probably get worse, not
>better.

In spite of all the other things that most of us think of as important, packaging remains one of the most important limitations on system performance. This is one of the reasons that micros and workstations will have trouble reaching the performance of mainframes and supercomputers; current high-performance packaging and cooling is just too expensive.

It will be very interesting to see what happens when a cheap, easy chip interconnect allowing close 3-D stacking becomes available (assuming corresponding cooling).
--
James Sterbenz   Computer and Communications Research Center
                 Washington University in St. Louis   314-726-4203
INTERNET: jps@wucs1.wustl.edu   UUCP: wucs1!jps@uunet.uu.net
schmitz@fas.ri.cmu.edu (Donald Schmitz) (05/03/89)
>In spite of all the other things that most of us think of as important,
>packaging remains one of the most important limitations on system
>performance. This is one of the reasons that micros and workstations
>will have trouble reaching the performance of mainframes and supercomputers;
>current high-performance packaging and cooling is just too expensive.

Just saw a blurb in one of the trade papers about a new interconnect technology, developed by Cinch, called "Cinapse". I'm still waiting on info, but from the description, they use silver butt contacts - I'm guessing the trick is to somehow make the contacts springy so that all of them in an array touch. From memory, the density was about twice that of current PGAs (240 contacts/in^2 sticks in my mind). They are pushing this as a bus connector technology, but it seems possible to use it for chips too.

Interesting question: if packaging allowed you to have twice as many pins per CPU (pick your favorite existing design), what would you do with them?

Don Schmitz
--
jesup@cbmvax.UUCP (Randell Jesup) (05/04/89)
In article <833@wucs1.wustl.edu> jps@wucs1.UUCP (James Sterbenz) writes:
>This is a problem with packages that have pins only on the perimeter
>(such as DIPs), but not for PGAs. Of course, pin limitation is still
>a problem, but not quite as bad for PGAs.

But then again, PGAs have thermal expansion coefficient problems due to mismatch with the coefficient of the board they mount on (or so I was told). That's why the RPM-40 is in a leadless chip carrier instead of a PGA. (Perhaps PGAs with sufficient pins weren't rated for 40 MHz, either.)
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
mark@mips.COM (Mark G. Johnson) (05/04/89)
In article <6759@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>But then again, PGAs have thermal expansion coefficient problems
>due to mismatch with the coefficient of the board they mount on (or so I
>was told). That's why the RPM-40 is in a leadless chip carrier
>instead of a PGA. (Perhaps PGAs with sufficient pins weren't
>rated for 40 MHz, either.)
>Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

Seems to me that PGAs do just fine with respect to both temperature and frequency. The de facto industry-standard CMOS video DAC (Brooktree 458) comes in an 84-pin PGA, dissipates 2.2 Watts, and runs at 125 MHz. Maybe Brooktree is more clever with "rating" their package than RPM40 was.

Intel's i860 CMOS microprocessor comes in a 168-pin PGA, runs at 40 MHz, and dissipates 3.5 Watts. Maybe Intel .....

Hewlett-Packard's most recent HP-n000 series microprocessor, built in NMOS, dissipates 26 Watts and is mounted in a 408-pin PGA.

Various ECL and GaAs RISC processors are about to be introduced, from several sources, dissipating godzilla amounts of power and running at mucho Megahertz --- and several of them are in PGA packages.

Looks like folks who really want to can make PGAs go quite far indeed.
--
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
...!decwrl!mips!mark   (408) 991-0208
jesup@cbmvax.UUCP (Randell Jesup) (05/05/89)
In article <18753@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Seems to me that PGAs do just fine with respect to both temperature
>and frequency. The de facto industry-standard CMOS video DAC (Brooktree 458)
>comes in an 84-pin PGA, dissipates 2.2 Watts, and runs at 125 MHz.
>Maybe Brooktree is more clever with "rating" their package than RPM40 was.

Maybe. We needed 140 or so pins, most of them running at 40 MHz, CMOS device, 2-3 watts. Also, this was in '85 or '86, and we wanted to use an out-of-the-box package (this wasn't a production part, so we didn't want to pay to develop/certify a package).
--
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (05/06/89)
In article <18753@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Hewlett-Packard's most recent HP-n000 series microprocessor, built in
>NMOS, dissipates 26 Watts and is mounted in a 408-pin PGA.

That's an impressive number of pins. It's probably not very dense, though - on 100-mil centres, that's over four square inches, to hold a square centimeter (or so) of chip. Long paths! Also, pins take up board area, since they go through all signal layers, rather than just through the relevant ones.

One obvious improvement is to go surface mount, with a pad-grid array. The Motorola "Hypermodule" uses 88000's mounted in these. I've only seen photos so far, but the claim is 288 pads in 1.1 inch x 1.1 inch, using 60-mil centres. Actually, there are only 143 signal lines - the rest are power, ground, and thermal.

It will be interesting to see the response from competitors.
--
Don    D.C.Lindsay    Carnegie Mellon School of Computer Science
--
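For comparison, here is the arithmetic behind the two densities in C, using the figures quoted above and assuming a simple square grid at the stated pitch in both cases.

    /* Density comparison: the 408-pin PGA on 100-mil centres vs. the
     * pad-grid module (288 pads in 1.1 x 1.1 inch on 60-mil centres).
     * Assumes a plain square grid for the PGA. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* 408-pin PGA, 100-mil (0.1 inch) centres */
        double pga_side_in = ceil(sqrt(408.0)) * 0.1;    /* ~21 x 21 grid */
        double pga_density = 408.0 / (pga_side_in * pga_side_in);

        /* Hypermodule pad grid: 288 pads in 1.1 x 1.1 inch */
        double pad_density = 288.0 / (1.1 * 1.1);

        printf("PGA:      ~%.1f in square, %.0f pins/in^2\n", pga_side_in, pga_density);
        printf("pad grid:  1.1 in square, %.0f pads/in^2\n", pad_density);
        return 0;
    }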