aglew@ccvaxa.UUCP (11/21/87)
..> Superlinear speedups
>> Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a
>> throughput of 2.1 to 2.25 over a 1-CPU system, depending on just what we
>> gave it for workload.
> If I understand correctly, you are saying that going from 1 to 2 CPUs
> _more than doubled_ the throughput of the system? This is counterintuitive.
> If you could expand on it, I would appreciate it.
> For instance, what are we talking about as a measure of throughput, and
> was the job mix similar?

Many people and companies are finding superlinear speedups, where N
processors are more than N times better than a single processor -
especially on job mixes, but also in single programs - on many different
performance indices, including the most important one, elapsed time.

The reason is basically reduced context switching, in ways that you can't
necessarily get by applying tricks such as interrupt stacking on a single
processor. Now, it should be obvious how multiple processors reduce
context switching wrt. a single processor running a timeshared system,
but how does this improve the performance of a single program?

Well, programs typically have a lot of hidden, internal context. Consider
a program that has four sections, A1, A2, B1, B2, where both A1 and B1
have to run before either A2 or B2. But suppose that A1 and A2 share a
lot of context - e.g. they manipulate a huge array - and B1 and B2 share
a lot of context, but there is little context shared between the Ai and
Bi families. E.g. A1 does enough processing on the huge array it shares
with A2 that it will completely fill the cache, and flush out any data
that B1 might have loaded, and so on. Now say that the dependency from
B1 to A2 is just a small amount of data.
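The A/B example can be put into a toy cost model. All the numbers below
are illustrative, invented for the sketch; only the structure of the
arithmetic follows the argument:

```python
# Toy cost model for the A1/B1 -> A2/B2 example (illustrative numbers only).
P = {"A1": 100, "A2": 100, "B1": 100, "B2": 100}  # processing time per step
C_AB = 50   # cost to swap the A-context for the B-context on one CPU
M_AB = 5    # cost to pass B1's small result over to A2 (and vice versa)

# Uniprocessor: A1, context switch, B1, context switch, then A2 and B2
# (the A-context is back in place for A2, so two switches total).
uni = P["A1"] + P["B1"] + P["A2"] + P["B2"] + 2 * C_AB

# Two processors: CPU 1 keeps the A-context hot, CPU 2 keeps the B-context
# hot; only the small cross-dependency travels as a message.
multi = max(P["A1"], P["B1"]) + max(P["A2"], P["B2"]) + M_AB

print(uni, multi, uni / multi)
```

With these numbers the uniprocessor takes 500 units and the two-processor
version 205, a speedup of about 2.4 on two CPUs - superlinear whenever
the saved context-switch cost outweighs the message cost.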
On a single processor, the time for this program to run is

	P(A1) + P(B1) + P(A2) + P(B2) + 2*C(AB)

where P(jobstep) = processing time, and C(AB) = time to change context
between A and B (you can analyze this in more ways, but that's the basic
idea).

On a multiprocessor, the total run time is P(A1) + .. + P(B2) + 2*M(AB),
but the elapsed time is

	max(P(A1),P(B1)) + max(P(A2),P(B2)) + M(AB)

where M(AB) is the cost to send a message between the two processors
containing the information from B1 that A2 depends on, and vice versa.
If C(AB) is large, it is quite easy for the total elapsed time on the
multiprocessor to be less than half that on the uniprocessor.

The key point here appears to be context. Here we are comparing N
processors with N*M memory to 1 processor with M memory, not N
processors with small memories M/N. Or actually not memories, but
"context" - caches can play the same role for shared memory machines, as
can private register sets. The small memory multiprocessor can still win
out in certain cases, but I suspect fewer than for the large memory
machine.

So, not only has superlinear speedup been observed in practice, but a
theoretical understanding of it is possible too.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are
people who let themselves be affected by silly things, so: my opinions
are my own, and not the opinions of my employer, or any other
organisation with which I am affiliated. I indicate my employer only so
that other people may account for any possible bias I may have towards
my employer's products or systems.
aglew@ccvaxa.UUCP (11/28/87)
..> John Mashey talking about barrel architectures

Just making sure - barrel architectures are systems that multiplex each
pipeline stage between different threads of execution? I assume this in
my response.

>Assume we're using split I & D caches. Assume that the cache line
>is N words long, filled 1 word/cycle after a latency of L cycles.
>One would expect that efficient cache designs have L <= N.

I expect this is quite basic, but how do you show this for barrel
architectures (the demonstrations I've seen have required regular
architectures)?

>When filling an I-cache miss, you can do L more barrel slots,
>then you must stall for N slots (or equivalent), because it doesn't
>make sense to have the I-cache run faster than the chip (if it did,
>you would run the chip faster).

Why not have the I-cache run faster than the chip? I-caches are more
regular structures than the CPU, and are probably that much easier to
make run faster.

Also, why stall for N slots? There are several schemes to deliver data
from a partially filled cache line as soon as it is available.

Finally, why not continue on a completely separate thread while the
cache is filling for the thread that caused the cache miss? Barrel need
not imply round robin.

I think that I've missed part of John's point here.

Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801
aglew@mycroft.gould.com    ihnp4!uiucdcs!ccvaxa!aglew    aglew@gswd-vms.arpa

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
mash@mips.UUCP (11/30/87)
In article <28200073@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> John Mashey talking about barrel architectures
>>Assume we're using split I & D caches. Assume that the cache line
>>is N words long, filled 1 word/cycle after a latency of L cycles.
>>One would expect that efficient cache designs have L <= N.
>I expect this is quite basic, but how do you show this for barrel ..

This wasn't intended as a proof or design, just background for a
first-order back-of-the-envelope analysis.

>>When filling an I-cache miss, you can do L more barrel slots,
>>then you must stall for N slots (or equivalent), because it doesn't
>>make sense to have the I-cache run faster than the chip (if it did,
>>you would run the chip faster).
>Why not have the I-cache run faster than the chip? I-caches are more regular
>structures than the cpu, and are probably that much easier to make run
>faster.

The original assumption, perhaps not stated strongly enough, is that the
I-cache was built from ordinary SRAMs (to ride the SRAM cost curve).
(There are other ways to do this; however, this is The Right Way :-)
see below.)

> Also, why stall for N slots? There are several schemes to deliver data
>from a partially filled cache line as soon as it is available.

There are, of course, all sorts of schemes. All I meant was that if you
want to fill N words of I-cache, 1/cycle, you will chew up N cycles where
either:
a) No process will be making progress, because all of the bandwidth
   will be eaten up. OR
b) At best, the process that caused the I-cache miss gets to progress,
   up to N cycles (but this looks suspiciously like a single pipelined
   processor).
However you cut it, given the original assumptions, I think you end up
eating N cycles of I-cache bandwidth.

> Finally, why not continue on a completely separate thread while
>the cache is filling for the thread that caused the cache miss?
>Barrel need not imply round robin.
Again, the assumption was that each access to write a word into the
cache blocked either all threads, or at least, all threads except the
one that caused the fill.

re: The Right Way (with SRAMs)

I'm no expert in this either, but there are a lot of people here who are
world-class experts in this, and maybe one will correct me if the
following summary of what they've said is too wrong:

1) CMOS VLSI CPU and SRAM technologies are very closely coupled:
   a) SRAMs are what VLSI folks often use to debug new processes.
   b) SRAMs are usually a VLSI-process generation ahead of CPUs.
2) One design process is to drive the CPU design entirely by your model
   of projected SRAM performance curves (sort of like surfboarding):
   a) Aim a CPU design to come out about when the SRAMs have debugged
      the technology you want to use. The first CPUs work with some
      expensive SRAMs obtained in small quantities.
   b) By the time the CPU is yielding reasonably, the SRAMs you need
      are getting to be reasonably available.
   c) By the time you want to ship lots, the SRAM prices have started
      dropping reasonably.
   MEANWHILE:
   d) The SRAM guys are working on their next generation, and
   e) You're working on your next generation. Goto a)

Note: although the CPU is more irregular:
- Some critical things, like register files, are pretty similar to SRAMs.
- The CPU is all there on one chip, and it doesn't have chip-crossing
  time penalties. The cycle thru the SRAMs costs you delays, hence in
  some sense the SRAMs need to be faster just to stay even with the CPUs.

This strategy assumes that one can track the SRAM performance curves,
and generally get better cost/performance by using standard SRAMs, whose
cost rapidly drops, and whose technology is constantly being pushed by
multiple vendors. It also assumes that the highest-performance systems
of the future will want large caches. Of course, this strategy remains
to be verified, although the early evidence seems in favor of it.
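To make the earlier back-of-the-envelope stall accounting concrete, here
is a tiny sketch under the thread's assumptions (N-word lines, first
word arriving after L cycles, one word per cycle, fill blocking all
threads). The line size, latency, and miss rate below are invented for
illustration, not from any real machine:

```python
# I-cache miss accounting for a barrel machine, per the assumptions above.
N = 8             # words per cache line (assumed)
L = 4             # fill latency in cycles (assumed)
miss_rate = 0.05  # I-cache misses per issued barrel slot (assumed)

# After a miss, L more barrel slots can still issue (the latency overlaps
# with other threads' slots), but the N fill cycles eat N slots outright,
# since the cache can't serve fetches while it is being written.
stall_slots_per_miss = N

# Effective cost of a barrel slot once misses are charged in:
cycles_per_slot = 1 + miss_rate * stall_slots_per_miss
print(cycles_per_slot)
```

At these numbers each slot effectively costs 1.4 cycles: whichever
scheme you pick (stall everyone, or let only the missing thread run),
the N fill cycles of I-cache bandwidth get eaten either way.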
Of delivered RISC machines, the ones that use standard SRAMs {MIPS,
SPARC} outperform those that have special-purpose cache-mmu chips
{Clipper}. The AMD29000 (no special RAM chip designs) and Motorola
78000 {cache/mmu chips} will add a few more data points.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086