brooks@maddog.llnl.gov (05/07/89)
A fellow from SUN has asked "What is so magic about register scoreboarding?" Register scoreboarding allows you to change the pipeline delays of the functional units in the processor, and to tolerate stochastic pipeline delays for memory, without any hassles. A consequence of this, for the Cray line, is that the same binaries can be run on the Cray-1, Cray-XMP, and Cray-YMP series, even though the pipeline delays are different for these three machines. On all of these machines accesses to main memory are fully pipelined, and do not stall the processor on a "cache miss" as is common for their little microchip brethren. Pipelining cache misses, providing full memory bandwidth which can keep the functional units of the processors fed, is a trick that the micros have yet to exploit. Register scoreboarding is again part of this game, because memory latency is stochastic. brooks@maddog.llnl.gov, brooks@maddog.uucp
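[Editor's sketch: the binary-compatibility point above can be shown with a toy interlocked-pipeline model. None of this is real Cray logic; the instruction encoding and latencies are invented. The same instruction stream produces the same results on implementations with different load latencies; only the cycle count changes.]

```python
# Toy interlocked pipeline: per-register "ready at cycle N" bookkeeping.
# Issue stalls until an instruction's sources are ready, so the same
# program runs unchanged whatever the load latency happens to be.

def run(program, load_latency):
    regs = {}
    ready_at = {}                  # reg -> cycle its value becomes available
    cycle = 0
    for inst in program:
        op = inst[0]
        if op == "load":           # ("load", dst, memory_value)
            _, dst, val = inst
            cycle += 1
            regs[dst] = val
            ready_at[dst] = cycle + load_latency
        elif op == "add":          # ("add", dst, src1, src2)
            _, dst, a, b = inst
            # interlock: stall issue until both sources are ready
            cycle = max(cycle, ready_at.get(a, 0), ready_at.get(b, 0))
            cycle += 1
            regs[dst] = regs[a] + regs[b]
            ready_at[dst] = cycle
    return regs, cycle

prog = [("load", "r1", 40), ("load", "r2", 2), ("add", "r3", "r1", "r2")]
fast = run(prog, load_latency=1)   # short memory pipe
slow = run(prog, load_latency=5)   # deeper pipe, same "binary"
print(fast, slow)                  # same register results, more cycles
```

The interlock makes the latency an implementation detail rather than part of the architecture, which is exactly the property Brooks is describing.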
grunwald@flute.cs.uiuc.edu (05/09/89)
Hennessy recently gave a talk here & in response to a question concerning either scoreboarding or register windows, posited that object-code recompilers would be able to solve the same set of problems (binary compatibility across architectural changes), cheaper. In his picture, when you have a new architecture with, say, more registers, different delay costs or deeper pipelines, you translate your .o files and/or your final binaries. This is essentially what scoreboarding is doing, albeit dynamically. There are some limits to this approach, some obvious, some not so obvious. Comments? -- Dirk Grunwald Univ. of Illinois grunwald@flute.cs.uiuc.edu
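[Editor's sketch of the object-code translation idea, in its most naive form. The nop-padding strategy and names here are invented for illustration, not Hennessy's actual proposal: a real translator would reschedule instructions into the new delay slots rather than pad them.]

```python
# Retarget code scheduled for a 1-cycle load delay to a machine whose
# loads take `new_delay` cycles, by padding each load with nops.

def retarget(insts, new_delay):
    """insts: list of (opcode, operands...) tuples scheduled for delay 1."""
    out = []
    for inst in insts:
        out.append(inst)
        if inst[0] == "load":
            # the old schedule already guarantees one independent
            # instruction after the load; pad the remaining cycles
            out.extend([("nop",)] * (new_delay - 1))
    return out

old = [("load", "r1"), ("add", "r4", "r5", "r6"), ("add", "r2", "r1", "r1")]
new = retarget(old, new_delay=3)
# a smarter pass would hoist independent instructions into those slots
```

This is the static analogue of what scoreboarding does dynamically, which is Grunwald's point; the limits he alludes to (e.g. latencies unknown until run time, such as cache misses) are exactly what the rest of the thread argues about.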
andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) (05/10/89)
On using object-code recompilers to port to new implementations of a non-scoreboarding architecture: The notion that the problems addressed by scoreboarding can be resolved at compile time is intellectually appealing, but not completely true -- some latencies just cannot practically be predicted at compile time. The usual example is the added latency involved in a data cache miss. Clamping the whole CPU on cache miss isn't a technique that can survive into the 1990s. I'm very curious to see what the non-scoreboard folks will do. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
henry@utzoo.uucp (Henry Spencer) (05/12/89)
In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes: >Clamping the whole CPU on cache miss isn't a technique that can survive >into the 1990s. I'm very curious to see what the non-scoreboard folks >will do. Probably pretty much what they do now: let access and execution proceed in parallel, assuming the data isn't needed right away. That technique didn't survive far into the 80s, never mind the 90s. Are you assuming that it's done only by the people who loudly trumpet it as an advantage? -- Mars in 1980s: USSR, 2 tries, | Henry Spencer at U of Toronto Zoology 2 failures; USA, 0 tries. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
mash@mips.COM (John Mashey) (05/12/89)
In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes:
>On using object-code recompilers to port to new implementations of a
>non-scoreboarding architecture:
>
>The notion that the problems addressed by scoreboarding can be resolved
>at compile time is intellectually appealing, but not completely true --
>some latencies just cannot practically be predicted at compile time.
>The usual example is the added latency involved in a data cache miss.
>
>Clamping the whole CPU on cache miss isn't a technique that can survive
>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>will do.

First, we need to get some definitions clear, and then some facts, and then do some quick back-of-the-envelope numbers.

1) Of the currently popular RISCs, scoreboarding could be used, or not, on any of them. Scoreboarding, or the lack thereof, seldom shows its head in the architecture of these machines. [The MIPS load-delay slot being the one obvious exception.]

2) Most of the existing RISCs obtain at least some, and perhaps substantial, concurrency of FP operations. As far as I can tell, most of them have architectures that require interlocks or scoreboarding, as the latencies and issue rates of FP ops are permitted to move around from implementation to implementation, and FP ops tend to want to be multi-cycle, and tend to be more amenable to more silicon.

3) I'm not sure about some of the other RISCs, but R3000s only clamp the main pipeline on a cache miss, not the independent units (integer mul/div; FP add; FP mul; FP div), all of which will keep cranking away if they're already started.

4) In the analysis below, it is important to note that typical high-speed RISC pipelines try to have a very regular flow of data in and out of the registers, with as few ports as they can get away with.
Note, for example, that if you're doing out-of-order execution, and 3 operations finish together, all wanting into a 1-write-port register file, it will take 3 cycles for them to get there... This is one of the reasons that most of the RISCs went with separate integer and FP regs, and maybe even 2-write-port FP regs, even when using 1-write-port integer regs. More generically, suppose you design a machine so that you can start operations even while others are still pending. You'd better make sure that what you gain at the front-end (by not stalling) you don't mostly lose back at the back-end (by then stalling waiting for register access).

5) Back-of-the-envelope: how much would an R3000 gain by continuing execution beyond a cache miss? (Assume an M/2000-style machine; 20% loads, 10% stores; about 25% memory-system degradation on top of instruction cycles, on real programs.) Most of this will concern integer programs.

a) I-cache miss: it seems pretty hard to get beyond an I-cache miss; we already do the instruction-streaming gimmick to do what's possible to cover the latency.

b) Load misses (R3000s use write-thru caches, so there aren't really store misses), using the typical rule-of-thumb:
	20% loads
	70% of 1st load-delay slot fillable (MIPS)
	30% of 2nd load-delay slot fillable (Stanford)
	10% of 3rd load-delay slot fillable (Stanford)
i.e., these give you an idea of how far you can get before you REALLY need that data, because the compilers are trying hard already. I think what this says is that, on the average, you get .7 * 1 + .3 * 2 + .1 * 1 = .7 + .6 + .1 = 1.4 instructions beyond the load before you want that data.

c) Suppose you get a 90% hit-rate on data (not unreasonable), and you make a bunch of favorable assumptions: 10% misses x 20% loads = 2% of instructions are cache-missing loads. Assume 1.4 instruction cycles gained by non-stalling. This gives 1.4 * 2% = 2.8% that might be gained.
If you consider multi-cycle operations that could be fired up, then you might gain a little more. As a cross-check: in this machine, data-cache misses have something like 12 cycles of latency, then transfer 16 words of data, and NOT doing the scoreboarding is like adding 1.4 cycles to this. If you figure (very approximately) that every extra cycle of latency costs you 1% of performance, then this gives you 1.4 x 1% = 1.4%, which (for this level of enveloping) is not bad.

d) Now, why are these favorable assumptions? The instructions that follow the cache-missing loads might be MORE cache-missing loads, and even if you're willing to have multiple outstanding requests, you still have to get the data into the regs sooner or later, which probably means more write ports than you would have had [mixing integer & FP will help]. I.e., it is hard to make a generic description of the added stalls on the back-end of the process, as it is very dependent on the system partitioning, read/write ports, memory interface, etc. If you had a lot of long-cycle-count operations, it would help you to start them while waiting for a cache refill [i.e., long FP ops; of course, most of the frequent ones you want to be quick anyway...]

Bottom line: the part of scoreboarding that lets you continue beyond a load-cache-miss gets you 1-2%, in an R3000-like architecture, in back-of-the-envelope style, which of course really needs a lot of simulation to confirm. Maybe other people can post some real DATA about this, as this was a quick guess for one machine.

This is not to say scoreboarding is good or bad: it is a reasonable way to design things. Note, of course, that one must be extremely careful to think thru any error-handling scenarios. For example, if state is actually committed beyond a load-cache-miss [rather than register-relabeling games], what do you do if you get a transient bus-error when the loaded data actually arrives?
You cannot just back up the PC and try again, unless you track any side-effects and undo them, or skip those instructions, or something. Of course, out-of-order initiation of FP ops usually implies the possibility of imprecise exceptions, which is OK, but does cost you somewhere. Out-of-order initiation of anything implies a little trickier exception-handling in general. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
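[Editor's note: Mashey's envelope above can be rerun in a few lines. The 1.4 figure is his posted rule of thumb for how far a compiler already gets past a load; the cache-missing-load fraction is taken here as miss rate times load frequency, i.e. 10% x 20% = 2%.]

```python
# Back-of-the-envelope redo of the scoreboard-gain estimate.

load_freq   = 0.20                          # fraction of instructions that are loads
d_miss_rate = 0.10                          # 1 - 90% data-cache hit rate
run_on      = 0.7 * 1 + 0.3 * 2 + 0.1 * 1   # = 1.4 instructions past the load

missing_loads = d_miss_rate * load_freq     # fraction of all instructions
gain = run_on * missing_loads               # cycles saved per instruction, roughly

print(f"cache-missing loads: {missing_loads:.1%}")
print(f"upper-bound gain:    {gain:.2%}")
```

A couple of percent either way, which is consistent with the 1.4% latency-based cross-check: the envelope is robust to the exact assumptions.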
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)
In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>>Clamping the whole CPU on cache miss isn't a technique that can survive
>>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>>will do.
>
>Probably pretty much what they do now: let access and execution proceed
>in parallel, assuming the data isn't needed right away.

Ok, now I'm curious. What DOES an R[23]000, an AMD29000, or a Sparc really do if there's a data cache miss on a load but there are several instructions behind it which aren't interested in the result? I guess machines with a common I/D cache can't do anything, as the cache will be busy getting the missed data, so Sparc (current implementations) seems out (except for the floating-point coprocessor queue). I also know that at least the multiply and divide units in the R[23]000 will continue to work, but what about other independent instructions past the load delay slot?

	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
	neideck@nestvx.dec.com
mpogue@dg.dg.com (Mike Pogue) (05/12/89)
In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes: > >Bottom line: the part of scoreboarding that lets you continue beyond a > load-cache-miss gets you 1-2%, in an R3000-like architecture, John, I think you have missed the point here. The performance improvement due to register scoreboarding is only of minor interest. The real point from an architectural point of view is that all binaries continue to run in a predictable way, even when implementation details change. Mike Pogue Data General Corp. These are my opinions, of course....
rpw3@amdcad.AMD.COM (Rob Warnock) (05/12/89)
In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com writes: +--------------- | In article 10198 henry@utzoo.uucp (Henry Spencer) writes: | >>Clamping the whole CPU on cache miss isn't a technique that can survive... | >> I'm very curious to see what the non-scoreboard folks will do. | >Probably pretty much what they do now: let access and execution proceed | >in parallel, assuming the data isn't needed right away. | Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do | if there's a data cache miss on a load but there are several instructions | behind it which aren't interested in the result. +--------------- Assuming the natural split-I/D memory system (split caches, in this case), and burst-mode instruction memory/cache, an Am29000 will continue to execute instructions until either (a) the result of the load is needed, or (b) the pipeline stalls waiting to start another "primary access" [since there's only one address bus], which could be due either to a jump which misses in the Branch Target Cache or another load/store which wants to start. [While in theory, with really good load scheduling, you could thus avoid the stall entirely, in practice you typically run on for only a few more cycles before stalling, since in "general-purpose" code it's hard to schedule a load far enough ahead to absorb a cache miss.] Rob Warnock Systems Architecture Consultant UUCP: {amdcad,fortune,sun}!redwood!rpw3 DDD: (415)572-2607 USPS: 627 26th Ave, San Mateo, CA 94403
tim@crackle.amd.com (Tim Olson) (05/12/89)
In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes: | Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do | if there's a data cache miss on a load but there are several instructions | behind it which aren't interested in the result. I guess machines with | a common I/D cache can't do anything as the cache will be busy getting the | missed data, so Sparc (current implementations) seems out (except for the | floating point coprocessor queue). I also know that at least the multiply | and divide units in R[23]000 will continue to work, but what about other | independent instructions past the load delay slot ? Since loads are fully interlocked on the Am29000, the processor will continue to execute instructions until 1) The results of the missed load are required 2) The channel is required for another load or store -- Tim Olson Advanced Micro Devices (tim@amd.com)
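[Editor's sketch of the run-on rule Warnock and Olson both describe for the Am29000: after a missed load, keep issuing until an instruction either needs the load's result or needs the single memory channel. Purely illustrative; the trace format is invented.]

```python
# Count how far execution can run on past a missing load, Am29000-style.

def run_on_distance(trace, loaded_reg):
    """trace: instructions after a missing load, as (op, source_regs) tuples."""
    n = 0
    for op, srcs in trace:
        if loaded_reg in srcs:        # (a) result of the load is needed
            break
        if op in ("load", "store"):   # (b) channel wanted for another access
            break
        n += 1
    return n

trace = [("add", ("r5", "r6")),
         ("sub", ("r7", "r5")),
         ("add", ("r2", "r3")),       # first use of the loaded register
         ("load", ("r9",))]
print(run_on_distance(trace, "r2"))   # runs past 2 instructions
```

As Warnock notes, in general-purpose code condition (a) or (b) usually hits within a few cycles, so the miss latency is only partially covered.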
marc@oahu.cs.ucla.edu (Marc Tremblay) (05/13/89)
In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes: > b) Load-misses (R3000s use write-thru caches, so there aren't really > store misses). Since the R3000 uses write-thru caches and since the depth of the write buffer is 4, I assume that "stores" stall the processor only when more than 4 consecutive writes occur or whenever the buffer overflows. Saving the register file on a context switch could certainly generate stalls if it is done sequentially (i.e without interleaving register stores with non-store intructions). From your comment I suppose that those situations do not occur very often? Marc Tremblay marc@CS.UCLA.EDU Computer Science Department, UCLA
les@unicads.UUCP (Les Milash) (05/13/89)
In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>>>Clamping the whole CPU on cache miss [sucks]
>>
>>Probably pretty much what they do now: let access and execution proceed
>>in parallel, assuming the data isn't needed right away.
>
>Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
>if there's a data cache miss on a load but there are several instructions
>behind it ....

i'm curious too, but about scoreboarding. will somebody please point me at some article that shows what kind of logic implements scoreboarding? i have questions like
* can existing scoreboards handle chained dependencies?
* does an instruction (like R1 OP R2, where R1 ain't ready yet, but where
  the next instruction could use the same alu and is ready-to-go)
  allocate an alu? or do ya queue the decoded instruction and wait
  for all of {R1 OP R2} to be available simultaneously?
* an assignment to R[12] must therefore be held off till all of the
  R[12]s contents are either in the alu or in yet
* a bad scenario:
	Rg = some constant
	[lots of time]
	mem = R1	(perhaps a "many"-cycle thing)
	r2 = R1 op Rg
	r3 = R2 op Rg	(perhaps this is "too hairy")
	r4 = R3 op Rg
	r5 = R4 op Rg
	r6 = R5 op Rg
	r7 = R6 op Rg	(perhaps generate a "hostile user" trap :-)
  so seems like if you had finite (not a dataflow machine) scoreboarding
  you'd need to teach your compiler how to not exceed it, or maybe
  you could quit issuing instructions if things got "too hairy", and
  expect that that'd be infrequent.

i kind of see scoreboarding as "idiotproofing" (where a MIPSCo compiler is an example of a non-idiot), in that machines that have to have a load delay slot filled with an independent instruction by the compiler can run (wrong) programs that compute with numbers that aren't fetched yet.
i can't myself imagine what kind of logic could be built to handle all the pathological cases i could think up, unless you can depend on the compiler to not generate too bad a case. i can imagine that some finite logic could help in the "M% of cases that get you the easiest N%", and give up otherwise by quitting issuing more instructions. although Mr. Mashey had an argument that you wouldn't get that much (except of course for being able to run R2000 code on the GaAs R7000).

that kind of stuff led me to my attraction to transputers. i think of 'em as the ultimate scoreboarding; but it's at the system level, not in the processor chip. so i realize that doesn't help us run many mips out of slow dram. but the whole program, in all its millions of bytes of glory, is one big dependency graph, and the processor assures you that it will be executed in a "correctness-preserving" order, even if multiple processors are executing on your program.
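[Editor's sketch, answering the "what kind of logic" question in its simplest form: a plain (non-Tomasulo) scoreboard is just one pending bit per register, and issue stalls while any source or the destination is pending. Running the chained scenario above through it shows nothing pathological happens; each op simply waits for its predecessor. The 6-cycle load latency is an arbitrary assumption.]

```python
# Minimal "ready bit" scoreboard: per-register busy tracking, in-order issue.

LOAD_LATENCY = 6   # assumed miss latency, cycles (illustrative)

def issue(program):
    """program: (op, dst, src_list, latency) tuples. Returns final issue cycle."""
    pending_until = {}              # reg -> cycle its result arrives
    cycle = 0
    for op, dst, srcs, latency in program:
        # stall until all sources and the destination are free
        waits = [pending_until.get(r, 0) for r in srcs + [dst]]
        cycle = max([cycle] + waits) + 1
        pending_until[dst] = cycle + latency - 1
    return cycle

# the chained scenario: a long-latency fill of r1, then r2=r1 op rg, etc.
chain = [("load", "r1", [], LOAD_LATENCY)] + \
        [("op", f"r{i+1}", [f"r{i}", "rg"], 1) for i in range(1, 7)]
print(issue(chain))   # each op in the chain just waits for the one before it
```

The hardware cost is fixed (one bit and a comparator per register), regardless of how hairy the dependency chain gets; "too hairy" simply becomes "more stall cycles".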
mash@mips.COM (John Mashey) (05/13/89)
In article <23884@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>> b) Load-misses (R3000s use write-thru caches, so there aren't really
>> store misses).
>Since the R3000 uses write-thru caches and
>since the depth of the write buffer is 4,
>I assume that "stores" stall the processor only when more
>than 4 consecutive writes occur or whenever the buffer overflows.
>Saving the register file on a context switch could certainly
>generate stalls if it is done sequentially (i.e without interleaving
>register stores with non-store intructions).
>From your comment I suppose that those situations do not occur very often?

Stores stall the processor whenever the write-buffer is full. It actually takes a bit longer than 4 consecutive stores, in the simple case, as (for example, on an M/2000) writes are retired to memory at 1 word / 2 cycles. You actually get something like 6 consecutive writes before going into a stall-every-other-cycle state.

The register-save sequence typically has just a few stalls in it, as there are other instructions interleaved, and it first saves the registers that C won't, while letting those that C would save be saved later. I think the FP-regs-save sequence stalls a little more, as there's less to do in the middle of it, but then it happens less often also. The issue is statistically nonexistent for function calls, as relatively few (dynamic) function calls save 6 or more registers. Block copies do their best to interleave the loop overhead and reads and writes, so they don't see it either.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
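[Editor's sketch: the "something like 6 consecutive writes" figure falls out of a toy model of the write buffer: 4 entries deep, with memory retiring one word every 2 cycles. The retirement start-up latency of 4 cycles is a guess made to match the quoted figure, not a measured number.]

```python
# Toy model of a 4-deep write-through buffer retiring 1 word / 2 cycles.

def first_stall(depth=4, retire_period=2, retire_start=4, n_stores=12):
    """Return how many back-to-back stores complete before the first stall."""
    occupancy, issued, cycle = 0, 0, 0
    while issued < n_stores:
        cycle += 1
        # memory retires one buffered word every retire_period cycles
        if (cycle >= retire_start
                and (cycle - retire_start) % retire_period == 0
                and occupancy > 0):
            occupancy -= 1
        if occupancy < depth:      # room in the buffer: the store completes
            occupancy += 1
            issued += 1
        else:                      # buffer full: the processor stalls
            return issued
    return None

print(first_stall())
```

After the first stall the buffer drains at half the store-issue rate, so the machine settles into exactly the stall-every-other-cycle state described above.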
aglew@mcdurb.Urbana.Gould.COM (05/13/89)
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes: >> b) Load-misses (R3000s use write-thru caches, so there aren't really >> store misses). >Since the R3000 uses write-thru caches and >since the depth of the write buffer is 4, >I assume that "stores" stall the processor only when more >than 4 consecutive writes occur or whenever the buffer overflows. >Saving the register file on a context switch could certainly >generate stalls if it is done sequentially (i.e without interleaving >register stores with non-store intructions). >From your comment I suppose that those situations do not occur very often? > > Marc Tremblay > marc@CS.UCLA.EDU > Computer Science Department, UCLA What about block copies (like, copying I/O data from kernel to user)? (SUNOS 4 mapped files reduce this a bit) Sometimes there's no substitute for having a good, interleaved, memory system (which I believe MIPS does(?))
aglew@mcdurb.Urbana.Gould.COM (05/13/89)
>i'm curious too, but about scoreboarding.
>will somebody please point me at some article that shows what kind of
>logic implements scoreboarding? i have questions like

Mark Johnson pointed out an Intel ISSCC (I always get those initials wrong) paper on circuit implementations of scoreboarding for the '960. Here are a few references at a slightly higher level of abstraction:

Keller, "Look-Ahead Processors", Computing Surveys, Vol 7, No 4, December 1975, pp. 177-195 (good bibliography).

Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal of Research and Development, 11, 1 (Jan 1967), pp. 25-33.

Hwu, "Exploiting Instruction Concurrency to Achieve High Performance in a Single-chip Microprocessor", PhD thesis, Report No. UCB/CSD 88/398, University of California, Berkeley, January 1988 (similar papers in COMP ARCH 13 and IEEE Trans. Comp.).

The above references are for varieties of "Tomasulo" scheduling, which is a bit more extreme than the scoreboarding in, say, CDC or Cray processors, or the 88K. I can't find any "regular" scoreboarding references at hand, but Siewiorek, Bell, and Newell, _Computer Structures: Principles and Examples_, contains papers on the CDC 6600 and CRAY-1.

The distinction between "regular" scoreboarding and "Tomasulo" scoreboarding is important. Roughly described: dumb interlocking halts the entire processor if a later stage's result isn't ready (at some fixed, designated time), even if the next instruction doesn't need the output; regular scoreboarding only halts an instruction if its operands aren't ready, but doesn't necessarily provide execute-past; and Tomasulo scheduling allows execution of subsequent instructions. That's really rough; and I have also learned that "Tomasulo" scheduling means different things to different people. I.e., to some people it means Tomasulo's implementation with reservation stations and a common data bus, but to me it means, basically, a limited form of dataflow computation in a small window.
Plus, of course, you don't need to implement a scheduling technique everywhere. I would call the AMD29000's load scheduling a limited form of scoreboarding, but in the rest of the pipeline they don't bother. And, of course, nobody's pure - below, when I describe "regular" scoreboarding, I do not mean the scoreboarding on the 88K. The 88K is a bit more aggressive than that, but not quite as aggressive as Tomasulo. >* can existing scoreboards handle chained dependencies? Tomasulo: yes, limited only by resource constraints (the number of reservation stations) Regular scoreboarding: exactly what do you mean? A->B->C type dependencies? Tomasulo would dispatch all instructions to wait for their result; regular scoreboarding would halt at the second until the first is ready, then halt at the third until the second is ready. Or do you mean vector chaining? See how Cray does things. The only reason for not applying scoreboarding to vectors is that there are a *lot* of registers => a lot of scheduling hardware. But, things like N. Jouppi's pseudo-vector operations for TITAN can help. >* does an instruction (like R1 OP R2, where R1 ain't ready yet, but where > the next instruction could use the same alu and is ready-to-go) > allocate an alu? or do ya queue the decoded instruction and wait > for all of {r1 OP R2} to be available simultaneously? In Tomasulo scheduling, you queue the instruction in a reservation station. Typically, you also snarf the inputs as they become ready, so as not to prevent other instructions that may write to those from dispatching. (I use a notation like O->I dependencies satisfied, with execute past; O->O dependencies satisfied, in the extreme with execute past, in less extreme cases without) Of course, if ALUs are cheap enough then your ALU input latches become your reservation station. That's OK for integer units, but floating point units and multipliers still are complicated enough to warrant multiple reservation stations. 
BTW, reservation stations do not necessarily imply the serial dependencies of queues.

>* an assignment to R[12] must therefore be held off till all of the
>  R[12]s contents are either in the alu or in yet
>* a bad scenario:
>	Rg = some constant
>	[lots of time]
>	mem = R1 (perhaps a "many"-cycle thing)
>	r2 = R1 op Rg
>	r3 = R2 op Rg (perhaps this is "too hairy")
>	r4 = R3 op Rg
>	r5 = R4 op Rg
>	r6 = R5 op Rg
>	r7 = R6 op Rg (perhaps generate a "hostile user" trap :-)
> so seems like if you had finite (not a dataflow machine) scoreboarding
> you'd need to teach your compiler how to not exceed it, or maybe
> you could quit issuing instructions if things got "too hairy", and
> expect that that'd be infrequent.

You can't dispatch an instruction unless it has a reservation station ready -- that's your throttling. It doesn't happen too often (see Hwu). There is still benefit from teaching your compiler about your machine.

I think of Tomasulo scheduling as the route to realistic dataflow machines. The bottleneck in dataflow machines has always been the routing of results to inputs; with Tomasulo we are learning how to do that easily and efficiently in VLSI, and then, hopefully, it will scale (note that I do not consider the common data bus a defining feature).

For this I had better sign myself (I'd better get used to being a student again)

Andy Glew
IMPACT - Illinois Microprocessor Project using Advanced Compiler Technology
"The Hwuligans"
aglew@urbana.mcd.mot.com

Disclaimer: neither my advisor nor my employer share the opinions above.
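[Editor's sketch: the reservation-station mechanism Glew describes, in miniature. Stations capture each operand's value at dispatch, or the tag of the station that will produce it; completed results broadcast on a "common data bus" to waiting stations. This is the textbook Tomasulo scheme boiled down, not any real chip, and it ignores issue widths, station counts, and memory ops.]

```python
# Minimal Tomasulo-style dispatch/fire/broadcast loop.

class RS:
    def __init__(self, op, q1, v1, q2, v2, dst):
        self.op, self.dst = op, dst
        self.q1, self.v1 = q1, v1   # q = tag (producing RS) or None
        self.q2, self.v2 = q2, v2   # v = value, valid once q is None

def tomasulo(program, reg_init):
    regs = dict(reg_init)           # architectural register values
    tag_of = {}                     # reg -> RS currently producing it
    stations = []
    for op, dst, a, b in program:   # dispatch (assume enough stations)
        rs = RS(op, tag_of.get(a), regs.get(a), tag_of.get(b), regs.get(b), dst)
        stations.append(rs)
        tag_of[dst] = rs            # later readers wait on this tag
    pending = list(stations)
    while pending:                  # fire any station whose operands are ready
        rs = next(s for s in pending if s.q1 is None and s.q2 is None)
        result = rs.v1 + rs.v2 if rs.op == "add" else rs.v1 * rs.v2
        for s in pending:           # broadcast on the "common data bus"
            if s.q1 is rs: s.q1, s.v1 = None, result
            if s.q2 is rs: s.q2, s.v2 = None, result
        if tag_of.get(rs.dst) is rs:
            regs[rs.dst] = result
        pending.remove(rs)
    return regs

prog = [("add", "r3", "r1", "r2"),   # r3 = 1 + 2
        ("mul", "r4", "r3", "r2"),   # waits on r3 via its tag
        ("add", "r1", "r5", "r5")]   # independent: fires without waiting
regs = tomasulo(prog, {"r1": 1, "r2": 2, "r5": 5})
```

Note that the third instruction overwrites r1 while the first two are still "in flight", yet the mul still sees the old r1-derived value: the tags are doing register renaming, which is why O->O and O->I hazards don't block dispatch.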
mash@mips.COM (John Mashey) (05/14/89)
In article <170@dg.dg.com> mpogue@dg.UUCP (Mike Pogue) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>
>>Bottom line: the part of scoreboarding that lets you continue beyond a
>> load-cache-miss gets you 1-2%, in an R3000-like architecture,
>
> John,
>	I think you have missed the point here. The performance improvement
>due to register scoreboarding is only of minor interest. The real point
>from an architectural point of view is that all binaries continue to run
>in a predictable way, even when implementation details change.

I MUST NOT BE COMMUNICATING RIGHT, OR ELSE PEOPLE AREN'T HEARING WHAT I'M SAYING, OR THE NET IS LOSING POSTINGS. My posting <19162@winchester.mips.COM> included:

-In particular, Motorola & co are persistent in claiming that the world
-will fall apart for MIPS if the timings of the floating-point operations
-change, despite the fact that it has clearly been stated many times
-that we have complete interlocking on ALL of the multi-cycle operations.
-Really, the only things that don't have interlocking are loads and
-equivalents (i.e., move-from-coprocessor), and they all have a 1-cycle
-delay that is predictable to the compilers. The (Without) in
-Microprocessor (Without) Interlocking Pipeline Stages, which may have
-been appropriate for the Stanford MIPS, is pretty much irrelevant
-when it comes to MIPSco MIPS.
-As I've said here before, if we ended up with loads that had another
-cycle of latency, we'd build a machine with an interlock on the extra
-cycle. If we decided to put in load interlocks, that would
-be upward-compatible, although we'd likely compile 3rd-party executables
-with R3000-style forever. (Of course, if we did add load interlocks at
-some point, and if there got to be more of those machines around, at some
-point maybe we'd start advising people to compile for that, and then do
-a reverse-translate on R3000-machines!)
-If the timings of floating-point operations
-are different (and they are) in forthcoming products, the existing object
-code works fine. However, even with completely interlocked and/or
-scoreboarded code, you STILL want the compilers to be as aggressive
-as possible. Fortunately, the way most of these things work, if you
-try to optimize for the version with the longest latencies, it usually
-works pretty well for ones with shorter latencies as well. To see this,
-suppose you had a 5-cycle FP multiply, and so you'd been generating code
-that tried to issue 4 more instructions before using the result of the
-multiply. If the multiply expanded to 10 cycles, the compiler folks
-would try to work harder and find more things to do while the multiply
-was running, which wouldn't usually hurt the machine with the 5-cycle
-multiply. It's just a question of the number of stall cycles, and
-it's obvious that it almost always pays to spread the computation of a
-multi-cycle result, and the use of that result, as far apart as possible.
-
-This, of course, is not remotely a new issue: any of the long-lived
-computer product lines has faced this, especially those that
-cover a range of implementation technologies, such as VAXen or S/360s.
-The solutions are the same, except that the simplicity of RISC-style
-instructions makes it marginally easier to manipulate object code.
-Our experience with these methods tends to make us more willing to
-consider object code translation as one more trick to use when it makes
-sense, and it's really not that weird once you get used to it.

So now, I'll try again. Here are some assertions that have appeared:

ASSERTION 1: scoreboarding lets you modify latencies for different operations while using the same object code.

ASSERTION 2: scoreboarding is the ONLY way to do this; if a fully-interlocked machine doesn't use scoreboarding, somehow, it is deemed impossible for the object of ASSERTION 1 to be accomplished.
ASSERTION 3: well, if interlocking works after all, then scoreboarding is better for performance reasons.
	3A: in supercomputers
	3B: in current microprocessors
------
ASSERTION 1 is clearly true.

ASSERTION 2 is silly; there are too many machines implemented without scoreboards, but with interlocks, that let you modify the latencies. I'll say, one more time: there is exactly one kind of user-visible latency in a MIPS R3000 that is not interlocked, and that's the {load, move-from-coprocessor}, and it's one cycle that needs to be covered by the compiler. The likely-to-be-variable-and/or-long latencies [FP, int mul/div] are all fully interlocked, even though the compilers work hard to do their best to schedule the pipelines well. If we did an implementation where our simulations claimed that out-of-order instruction initiation was a sensible overall design choice, then we might use scoreboarding as an implementation choice. My back-of-the-envelope analysis showed why we weren't yet overly excited about that.

It is CRAZY to build a machine with multiple-overlapped-functional-units and NOT do pipeline-scheduling compilers [which, after all, were present in early CDC compilers, for example], whether one uses scoreboarding (to permit out-of-order initiation) or interlocking (which expects in-order initiation, but freezes the instruction-issue unit (not necessarily the other functional units) until the needed result is ready). The amount of code in MIPS compilers to put a nop after a load if no other instruction can be found is trivial [when I looked last, I found 3 lines of code that were doing that] compared to the rest of the reorganizations.
People who worry about this being a cause of bugs or complexity in the compilers (compared with everything else) should NEVER, EVER fly on a Boeing 747: after all, going from San Francisco to London, you might fall out of your seat and break your leg due to a defective seat-belt buckle :-)

ASSERTION 3
	3A: supercomputers. Probably true.
	3B: current microprocessors. Seems unlikely, until they start having memory systems like 3A.

I'm out of time, and getting the SPEC bench-a-thon together is going to occupy most of my time for the next couple weeks, so maybe somebody else can continue this. Specifically, maybe brooks@maddog.llnl.gov (or anybody else really familiar with supercomputer architecture) could describe the memory systems of such things. There are some fairly sensible reasons why the answers on 3A & 3B might be opposite....

Finally, maybe somebody from 88K-land could describe how far into out-of-order execution the 88K goes, i.e., assuming no scoreboard block:
1) how many instructions can be issued beyond a load that cache-misses, or tlb-misses, or both?
2) how many instructions beyond a stalled-FP-multiply (for example) can you execute?
(I haven't seen anything that said definitely what the current 88K's do.)

Like both the 88K and MIPS, SPARC is defined to allow different-latency FP implementations, and in fact 3 different ones are already extant. Perhaps the SPARC guys would care to join the fun and talk about differences in latencies, overlap, etc. [If you haven't noticed, Sun-4s recently got the FPU2 that raised FP performance in the same systems.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
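[Editor's sketch: the "schedule for the longest latency" point in Mashey's quoted 5-cycle vs 10-cycle multiply example reduces to one line of arithmetic: with an interlocked multi-cycle op, the stall at the use site is just the uncovered part of the latency.]

```python
# Stall cycles at the first use of a multi-cycle result, given how many
# independent instructions the compiler scheduled in between.

def stalls(latency, independent_between):
    return max(0, (latency - 1) - independent_between)

# code scheduled for a 5-cycle multiply (4 independent ops inserted):
print(stalls(5, 4), stalls(10, 4))   # fine on the 5-cycle machine, 5 stalls on the 10
# code scheduled for a 10-cycle multiply (9 independent ops inserted):
print(stalls(5, 9), stalls(10, 9))   # the aggressive schedule stalls on neither
```

Scheduling for the worst-case latency never hurts the shorter-latency machine here, which is why the same aggressively scheduled binary is the right answer for both implementations.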
mash@mips.COM (John Mashey) (05/15/89)
In article <28200311@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:
>What about block copies (like, copying I/O data from kernel to user)?
>(SUNOS 4 mapped files reduce this a bit)
>
>Sometimes there's no substitute for having a good, interleaved,
>memory system (which I believe MIPS does(?))

The older machines had various flavors of interleaving. The M/2000 uses 2-way interleaving per memory board to sustain read traffic of 1 word/cycle, or write traffic of 1 word per 2 cycles. Bcopy works fine; there's enough reading to do that write stalls aren't an issue. The only case that write-stalls much is bzero, because there isn't much else to do; the only answer to that is a more expensive memory system, which of course is possible.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
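Mashey's interleaving numbers fall out of a simple bank-busy model. The sketch below is a toy simulation, not the M/2000's actual memory controller; the 2-cycle bank-busy time is an assumption chosen so that accesses alternating between two banks sustain one word per cycle.

```python
# Toy model of 2-way interleaved memory (illustrative only; the
# BANK_BUSY value of 2 cycles is an assumption, not M/2000 timing).
# Sequential word addresses alternate banks, so a read stream sustains
# one word per cycle even though each bank is busy for two cycles.

BANK_BUSY = 2  # cycles a bank stays busy per access (assumed)

def cycles_for_stream(addresses, banks=2):
    free_at = [0] * banks               # cycle at which each bank is next free
    cycle = 0
    for a in addresses:
        b = a % banks
        cycle = max(cycle, free_at[b])  # stall if the target bank is busy
        free_at[b] = cycle + BANK_BUSY
        cycle += 1                      # otherwise, one access per cycle
    return cycle
```

Eight sequential reads complete in 8 cycles (full bandwidth), while four accesses hammering the same bank take 7 cycles, i.e. roughly one word per 2 cycles once the stream settles.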
khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/16/89)
In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>
>>Ok, now I'm curious. What DOES a ... or Sparc really do
>>if there's a data cache miss on a load but there are several
>>instructions ..

A good instruction scheduler places the "don't cares" after the load, so the load miss is simply irrelevant. I can't comment on how well the current production SPARC compilers do at this; but the Cydra 5 compiler did quite well at the equivalent (keeping the memory pipe busy). Since the Cydra 5 had a minimum read latency of 17 cycles, this indicates (to me at least :>) that keeping the current SPARCs from stalling should be achievable.

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &    | Languages and Performance Tools.
Opus                  | (* strange as it may seem, I do more engineering now *)
khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/16/89)
In article <19661@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>ASSERTION 3: well, if interlocking works after all, then scoreboarding
>is better for performance reasons.
>	3A: in supercomputers
>	3B: in current microprocessors
>------
....
>ASSERTION 3
>	3A: supercomputers
>		Probably true

??? I can't think of a supercomputer which _does_ use scoreboarding, at least not as I understand the term. Supercomputers tend not to have data caches... instead they have very fast, high-bandwidth memory systems (using many banks of memory) and rely on the fact that real scientific programs tend to have well-behaved, in some sense (i.e. sort of vectorizable), key loops. The fact that these are typically FORTRAN machines alters the program statistics somewhat.

As has been demonstrated by many vendors, lots of registers, software and sometimes hardware pipelines, loop transformations (percolation scheduling, etc.) and other unnatural acts are key to getting good performance. Note that these remarks apply to machines like the CDC 6600, the various Crays, misc. Japanese vector machines, the Cydra 5 and the Multiflow machines (which is why I said "sort of vectorizable"). These machines typically have interlocks; but nothing like the scoreboard scheme of the 88K (unless my memory is very leaky this week).

>	3B: current microprocessors
>		Seems unlikely, until they start having memory systems
>		like 3A.

Scoreboarding probably works; but there seems to be a certain lack of evidence that it is necessary. Seems overly complex to me .... but what do I know... I studied math and grew up working in Kalman filtering applications ... :>

>Like both 88K and MIPS, SPARC is defined to allow different-latency
>FP implementations, and in fact, 3 different ones are already extant.
>Perhaps the SPARC guys would care to join the fun and talk about
>differences in latencies, overlap, etc.
>[If you haven't noticed it,
>Sun-4s recently got the FPU2 that raised FP performance in the same
>systems.]

I don't know what to say... other than that it works just fine. All the binaries I tested (and it was more than a few) worked up and down the line. It is true that it is possible to tickle the compiler into generating code which is better for one chip or another, and this is likely to continue. There will always be a compatible mode (which most people will use exclusively) which will have good performance for the most popular implementations, and different implementations will have some special code ordering/other implementor neat stuff ...

As John pointed out in an earlier posting, Sun has chosen to give implementors more leeway (the N-design-teams notion), and folks at Prisma probably will have lots of neat stories to tell around Jan 1990 (4 nsec SPARC ... design goal of 100 Mflops) about how well SPARC really scales, and how the cleverly picked interlocks were just right (or caused them endless grief :>).

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &    | Languages and Performance Tools.
Opus                  | (* strange as it may seem, I do more engineering now *)
andrew@frip.WV.TEK.COM (Andrew Klossner) (05/17/89)
> Finally, maybe somebody from 88K-land could describe how far into
> out-of-order execution the 88K goes, i.e., assuming no scoreboard
> block,
> 1) how many instructions can be issued beyond a load that
> cache-misses, or tlb-misses, or both?
> 2) how many instructions beyond a stalled-FP-multiply
> (for example) can you execute?

[I'm a user, not a designer, of the 88k architecture.]

The instruction fetch unit, the data fetch/store unit, and the two FPU pipelines (one multiply, one add/subtract/convert/divide) operate fairly independently. There is no inherent reason why a stall in one would force the others to stall, although, as John pointed out, if you stall on instruction fetch, the other units are going to starve pretty quickly.

1) Caching and MMU translation are done external to the CPU, in the CMMU (cache/MMU), so cache- and TLB-misses are all the same to the CPU; they just mean that the CMMU will take a little longer to finish the load. If you don't use the load target and you don't issue another load or store, you can continue executing forever. On our system, the code- and data-CMMUs talk to the same memory bus ("M bus" in Motorola parlance), so if you take a code-cache miss during a load, the code CMMU will have to wait for the bus.

2) Again, if you don't use the FP-multiply target and you don't issue another FP-multiply, you can continue executing forever.
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/17/89)
In article <104999@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
>I can't think of a supercomputer which _does_ use scoreboarding, at
>least not as I understand the term. Supercomputers tend to not have

I don't know much about the Japanese machines, but all the CDC and Cray machines I am familiar with do register scoreboarding. They do not do the full "Tomasulo" scheme, in that instruction issue stalls if registers are busy. Note that what scoreboarding buys you is the ability to continue issuing instructions after a memory operation without knowing how long a particular memory access will take. This is very important in a machine which has 8 processors, 32 memory ports, 256 memory banks, and memory latencies of at least 4-5 clock periods, and sometimes longer.

Some architectures permit a certain number of instruction issues to continue even if registers are busy, and this is a sort of local dataflow, as someone mentioned previously. I don't think any of the current crop of machines do that, although I assume that the IBM 360/91 did, since that was the machine that Tomasulo wrote his paper about (it is reprinted in Siewiorek, Bell, and Newell's "Computer Structures", which most people probably have). Specifically, current machines do not seem to have "reservation stations" which permit instruction issue to continue (on the 360/91 it was floating point add and multiply only, I believe) if the register is busy.

Anyway, with instruction scheduling in the compilers (for the last 15 years ...), the claim is that this sort of multiple-outstanding-instruction-issue capability buys you very little over the current technique (stall if busy). I believe that the level of scoreboarding used, plus instruction scheduling, is considered "locally optimal" for a pipelined machine with segmented functional units. Reservation stations may increase the parallelism slightly, at some cost in functional unit latency.
The main point is that if you want to have multiple outstanding loads, indeterminate memory delays due to bank-busy conditions, multiple outstanding instructions (all or almost all the ALU operations on these machines are fully segmented - typically at one per clock cycle, though not always with divide etc.), and fairly long memory latencies (e.g. 5-13 clock periods), scoreboarding is a nice way to handle it.

>These machines typically have interlocks; but nothing like the
>scoreboard scheme of the 88K (unless my memory is very leaky this
>week).

The first use of the term "scoreboard" that I know of was Thornton's 1964 paper on the CDC 6600, also in Siewiorek, Bell, and Newell.

>Scoreboarding probably works; but there seems to be a certain lack of
>evidence that it is necessary. Seems overly complex to me .... but
>what do I know... I studied math and grew up working in Kalman
>filtering applications ... :>

I am not sure what is "overly complex" about it. If you have one path to memory on a one-CPU system, it isn't necessary. If you have multiple paths on a multiple-CPU system, it turns out to be just what you need. If you are concerned about increasing the critical path in instruction issue by one level, you are probably right, though clever hiding of these things may occur. Anyway, as always, the question is: do you get what you pay for out of it? The fact that it has been used for 25 years on machines with multiple paths between the CPU and memory should say something.

Hugh LaMaster, m/s 233-9,   UUCP ames!lamaster
NASA Ames Research Center   ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035     Phone: (415)694-6117
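LaMaster's "stall if busy" discipline is easy to model: keep a ready-time per register, and stall issue only when an instruction touches a register whose pending value is still outstanding. The toy simulator below uses invented latencies (no real machine's timing) to show an independent instruction issuing right behind a slow load while a dependent one must wait.

```python
# Toy "stall if busy" scoreboard (invented latencies, illustrative
# only). Issue proceeds in order, one instruction per cycle, but
# freezes whenever a source or destination register has a result
# still in flight -- which is what lets independent work continue
# past a memory access of unknown duration.

def simulate(instrs, latency):
    """instrs: list of (dest, srcs, op) in program order."""
    ready_at = {}                      # register -> cycle its value is ready
    cycle = 0
    for dest, srcs, op in instrs:
        waits = [ready_at.get(r, 0) for r in srcs + (dest,)]
        cycle = max([cycle] + waits)   # stall while any needed register is busy
        ready_at[dest] = cycle + latency[op]
        cycle += 1                     # issue itself takes one cycle
    return cycle

lat = {"load": 10, "add": 1}           # assumed latencies for the sketch

dependent   = [("r1", (), "load"), ("r2", ("r1",), "add")]  # add must wait
independent = [("r1", (), "load"), ("r2", ("r3",), "add")]  # add issues freely
```

The dependent pair takes the full load latency; the independent pair finishes in two issue cycles with the load still outstanding.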
kds@blabla.intel.com (Ken Shoemaker) (05/18/89)
Perhaps I'm missing something here, but I'd like to make a few observations:

1) a large register array seems to be useful as a user-managed data cache

2) caches, in general, don't freeze up instruction execution while doing replaces (or, at least, they don't need to)

3) register scoreboarding provides very simple support for cache (i.e., register) reloads by an autonomous unit (kicked off by the processor, but not requiring exclusive use of the resource until the operation is complete). This requires that the "register reloader" have a write port to the registers aside from the processor write port; otherwise you have a resource constraint.

I don't think this approach is appropriate to the MIPS machines because of their bus organization, i.e., they have a resource constraint in the use of the external data bus. But to paraphrase John's comments, such a machine, with the register-reloader autonomous functional unit, splits the operation of loading the registers into two functional units, whereas most current machines use a single functional unit (the execution unit). If you have two such units, it makes no sense to keep the execution unit frozen until you actually have the registers reloaded, much like it doesn't make sense to keep the floating point unit frozen while doing integer memory loads.

It would also have consequences for code reorganization. Procedures, or maybe even whole programs (because, like Henry Spencer says, compilers are real smart these days), would have a preamble which would load the entire complement of variables into the registers. The concept of "load delay slots" really becomes a don't-care, because the registers are loaded long before they are used. This is really just a smart data-prefetch algorithm driven by software (imagine that).
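The preamble idea can be sketched as a trivial code transformation. This is a hypothetical illustration (it ignores aliasing with stores, register pressure, and traps; it is not any shipping compiler's algorithm): hoist every load to the front of the block so the autonomous reloader can start them all long before their results are needed.

```python
# Hypothetical sketch of the "preamble" transformation: move all loads
# to the top of a basic block, preserving their relative order, so an
# autonomous register reloader can run them early and load-delay slots
# become don't-cares. (Ignores store aliasing and register pressure.)

def hoist_loads(block):
    """block: list of (op, dest, srcs) tuples."""
    loads = [i for i in block if i[0] == "load"]
    rest  = [i for i in block if i[0] != "load"]
    return loads + rest

block = [("load", "r1", ("x",)),
         ("add",  "r2", ("r1",)),
         ("load", "r3", ("y",)),
         ("mul",  "r4", ("r3",))]
```

After hoisting, both loads sit in the preamble and every use has the whole preamble's worth of instructions between it and its load.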
The same kind of thing can be done to force replaces in the external cache, so that by the time you get around to using a piece of data, it will be a cache hit and not require the latency time to system memory. Of course, this assumes that your register space isn't sufficiently large to hold all the variables that are going to be used.

I won't even worry about volatile variables. They are such infrequent cases as to be don't-cares. For the time being, assume that there are no problems with multi-processor data consistency between the registers (né user-managed data cache) of different processors or I/O devices.

I'm sorry if this is obvious. It certainly seems so to me, so there must be something I am not seeing. I'd appreciate it if someone could straighten me out. But as I am leaving for two months at the end of the week, and as we don't keep news around on the system that long, I'd appreciate a note in my mail instead. Thanks!
----------
I've decided to take George Bush's advice and watch his press conferences with the sound turned down... -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds
cik@l.cc.purdue.edu (Herman Rubin) (05/18/89)
In article <25484@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
> In article <104999@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
> .........................
> The main point is that if you want to have multiple outstanding loads,
> indeterminate memory delays due to bank busy conditions, multiple
> outstanding instructions (all or almost all the ALU operations on these
> machines are fully segmented- typically at one per clock cycle, though
> not always with divide etc.), and, fairly long memory latencies
> (e.g. 5-13 clock periods) scoreboarding is a nice way to handle it.

Scoreboarding and related items can be important even if load time is not the bottleneck. Much mathematical code can use this, as operations need the results of previous operations. This is especially the case if operations do not take constant amounts of time, or if branches occur. It is sometimes even possible to arrange operations to take into account unusual timings, letting scoreboarding keep things efficient.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
bruceh@zygot.UUCP (Bruce Henderson) (05/20/89)
I am looking for a source of high-bandwidth VRAMs. I am trying to remember who makes the chips with the built-in BITBLTs. In short, I am trying to design a high-speed graphics engine with some rather interesting [aka untried] architecture, and I have found a scalable hardware method that is elegant in its design. Therefore the only limit I have right now is how fast I can stuff pixels into memory. Any suggestions [I know this is a classical engineering problem] would be greatly appreciated!