tom@nud.UUCP (Tom Armistead) (06/18/88)
On RISC processors without a scoreboard*, how are the results of memory
references guaranteed to be available before they are used in subsequent
computations? I thought through several scenarios and couldn't find a very
good solution and, being unfamiliar with RISCs other than the 88k, thought
the net might have some answers.

1) If the processor waited until the load was complete before launching the
next instruction, that would guarantee the correct result was available
before an attempt was made to use it. This is obviously very poor from a
performance standpoint, though.

2) If the compiler/assembly writer is required to wait a certain number of
ticks after a load before using the results of the load, how is the
required amount of delay determined? Can you assume you always get a cache
hit (I doubt it)? Do you have to use the worst case (then why use a cache
at all)? What if the access is across a shared bus and the tape controller
is currently hogging the bus and will be for the next 10ms (you certainly
wouldn't want to use a 10ms delay after each load)? Does the memory system
have to give absolute top priority to CPU requests so that the required
delay is fixed? What if memory refresh were also pending - it seems here
you would have a nasty choice of letting memory be trashed or letting the
CPU use bogus results. Does #2 mean that a binary compiled for a
particular system won't run on a different system (having a different
memory speed) even though they have the same CPU chip, and that instead
the source must be recompiled for each different system the program is to
run on (or if the current system is sped up/slowed down)? This seems like
it would be clumsy and difficult to manage from a software distribution
aspect.

3) Is there some internal register that can be referenced that indicates
when the result is available, with software required to interrogate the
register to determine if the result is ready? This would add overhead to
every load instruction.
4) Is there some other method I haven't thought of that is the method of
choice? Which RISC processors don't have scoreboarding, and what method do
they use?

* For those unfamiliar with what scoreboarding is: the scoreboard is
merely an internal register which keeps track of which registers have
stale data (for example, the destination register of a load instruction
which has been launched but not completed). If the CPU encounters an
instruction which uses a stale register, it waits until the scoreboard
indicates the register is valid again before it begins the instruction's
execution, thus assuring the correct operands are used by the instruction.
Note this is a hardware feature - software is not required to examine the
scoreboard to accomplish this. As an example, on the 88k you could do the
following sequence:

	ld	r2,r0,0		; Assume this ld takes a long time.
	add	r3,r2,1		; Add 1 to result of ld and put in r3.

and the result would be correct (for any length of ld time). Without a
scoreboard, the result in r3 would be based on the stale data in r2 (which
is probably the wrong value).
--
Just a few more bits in the stream.
			The Sneek
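[The scoreboard behavior in the ld/add sequence above can be mimicked with
a toy simulation. The 5-cycle load latency and the instruction encoding
below are invented figures, purely for illustration, not 88k timing.]

```python
# Toy model of a per-register scoreboard: each destination register is
# marked stale until its result arrives; issue stalls on stale sources.
# Latencies and instruction format are illustrative only.

def run(program):
    """In-order issue; stall while any source register is marked stale.
    program: list of (opname, dest, sources, result_latency_in_cycles)."""
    valid_at = {}                     # reg -> cycle its value becomes valid
    cycle = 0
    for op, dest, srcs, latency in program:
        for s in srcs:                # scoreboard check: wait on stale regs
            cycle = max(cycle, valid_at.get(s, 0))
        cycle += 1                    # issue takes one cycle
        valid_at[dest] = cycle + latency - 1   # dest stale until then
    return max([cycle] + list(valid_at.values()))

# ld  r2,r0,0   ; assume this ld takes 5 cycles (a long time)
# add r3,r2,1   ; the hardware stalls until r2 is valid again
prog = [("ld",  "r2", ["r0"], 5),
        ("add", "r3", ["r2"], 1)]
print(run(prog))  # the add issues only after the load result is valid
```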
tim@amdcad.AMD.COM (Tim Olson) (06/19/88)
In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
| On RISC processors without a scoreboard*, how are the results of memory
| references guaranteed to be available before they are used in subsequent
| computations? I thought through several scenarios and couldn't find a
| very good solution and, being unfamiliar with RISCs other than the 88k,
| thought the net might have some answers.

The scoreboarding on the 88k (correct me if I'm wrong) associates a bit
with each register in the register file to tell whether the value is stale
or not. During the issuing of an instruction, the operand registers are
checked for staleness, and the pipe is stalled if required.

The bit-per-register is one way of detecting dependencies. Another way
(used on the Am29000) is to associate a destination register pointer (and
a "ready" bit) with each functional unit. Forwarding logic matches the
register operand fields of the instruction being issued with the
destination register pointers of the functional units, stalling if it
finds a match and the value is not available yet. When the value is
available, it is forwarded to the correct operand bus (A or B, or perhaps
both) so that the instruction does not have to wait for the value to be
written into the register file. This method has the benefit of integrating
closely with the forwarding logic.

-- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
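[The two dependency-detection schemes described here - a stale bit per
register vs. a destination pointer and ready bit per functional unit - can
be contrasted in a rough sketch. The unit names and data structures below
are invented for illustration, not actual 88k or Am29000 hardware.]

```python
# Two ways to detect that an operand is not yet valid, per the post above.
# Names and structure are illustrative, not real hardware.

# 88k-style: one "stale" bit per register.
scoreboard_stale = set()            # registers with results outstanding

def stale_scoreboard(reg):
    return reg in scoreboard_stale

# Am29000-style: each functional unit holds a destination pointer + ready bit.
functional_units = [
    {"name": "load",   "dest": "r2", "ready": False},  # load in flight to r2
    {"name": "fp_add", "dest": None, "ready": True},   # idle
]

def stale_comparators(reg):
    # Compare the operand field against every pending destination pointer.
    return any(u["dest"] == reg and not u["ready"] for u in functional_units)

scoreboard_stale.add("r2")          # mark the load's destination stale
print(stale_scoreboard("r2"), stale_comparators("r2"))  # both detect the hazard
```

The comparator scheme costs hardware per functional unit rather than per
register, which is why it pairs naturally with forwarding: the matching
unit is exactly the one whose result bus must be steered to the operand.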
mash@mips.COM (John Mashey) (06/19/88)
In article <1082@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
> On RISC processors without a scoreboard*, how are the results of memory
>references guaranteed to be available before they are used in subsequent
>computations?....

This was discussed a while back. Sounds like it's time to do it again.
NOTE: caveats: there are certainly architectures for which some of the
following doesn't fit. It is intended to be mostly applicable to
"mainline" RISC micros.

>2) If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined? Can you assume you always get a
>cache hit (I doubt it)?....

Following are:
1) How the MIPS R2000/R3000 do it.
2) Some thoughts on when scoreboarding does/doesn't make sense, and why.

HOW THE R2000/R3000 WORK

1) A load requires one cycle to initiate, plus one "architectural" cycle
of latency, which must be covered by the software, i.e., it must insert a
nop if it can't find anything useful.

2) If we ever built a machine in which the primary cache had 2 or more
cycles of latency, the software would be responsible for the 1st cycle of
latency, and the hardware responsible for cycles 2,...n. (This preserves
the ability to run the existing object code unchanged.) However, given the
domain of problems we worry about, this is probably irrelevant, as we'll
probably never build a system whose first-level cache has more than 1
cycle of latency, given the performance impacts. Of course, it's hard to
build a machine that has a zero-cycle load latency :-) I.e., there's a
reason for having exactly 1 cycle of latency, not 2 or 0.

3) A lot of statistics say:
	a) Reorganizers get to fill 70-80% of load delays in 1-cycle
	latency machines.
	b) Reorganizers get to fill (as I recall, it's been awhile) 30%
	of delays in 2-cycle latency machines.
	c) After that, it gets bad (10% or less).
[What does this mean? ans: if you load something, you expect to use it
pretty soon.]

4) All of that assumes a cache hit (which is, of course, the most frequent
case). If there is a cache miss, the main pipeline freezes while the cache
refill is done:
	a) One or more words of data are brought into the D-cache.
	b) The stalled instruction resumes after the fixup.
Thus, the software could care less how far away memory is, whether or not
DRAM refresh is running, etc.

5) There are split I- and D-caches, and a 1-cycle latency branch is used,
with software also being required to fill the branch delay slot.
Approximately similar numbers (1 slot rather than 0 or 2) apply as in the
load case.

6) As it happens, when I said the main pipeline freezes, what I glossed
over was that there are numerous independent functional units:
	a) The main integer unit (i.e., the one that really knows where
	the PC is), including all loads/stores.
	b) Integer multiplier/divider
	c) FP adder
	d) FP multiplier
	e) FP divider
The results of b) - e) are interlocked, i.e., if you attempt to use the
result of b) - e) and it isn't ready, it is interlocked by the hardware,
rather than software. You can consider this "partial scoreboarding",
although we don't particularly call it that, as the main pipeline does
completely freeze, rather than trying to issue the following instruction.

7) Now, why should it be that we interlock some things, and not others?
	a) Note that the 2 things that are NOT interlocked (load and branch):
		Have small (1-cycle) delays that are unlikely to be
		shortened, and are undesirable to lengthen.
		Have about a 70% chance of having a 1-cycle delay slot
		filled by software.
	b) Note that the things that ARE interlocked:
		Are multiple-cycle operations (DP ADD = 2, rest are longer).
		Are operations whose performance might well change in a
		future version (i.e., you might throw more silicon at DP
		MULT to improve it from 5 cycles, etc.).
		Are generally difficult for compilers to handle
		completely, for both of the above reasons.
SCOREBOARDING

1) If you look at where scoreboarding came from (I remember it first from
the CDC 6600 (see the Bell & Newell book, for example), but there may be
earlier ones), you had machines with:
	a) Many independent functional units
	b) Many multiple-cycle operations (integer add was 3 cycles in
	the 6600, for example, and the FP operations were of course longer).
	c) Multiple memory pipes, with a long latency to memory, and no
	caches.

2) Note that RISC micros probably have a), but normally only the FP ops
and maybe a few others have long multi-cycle operations, and they sure
don't have multiple memory pipes, and they often have caches.

3) Whether a RISC machine has no scoreboarding, some, or complete
scoreboarding, it had better be designed with a pipeline reorganizer in
mind, because it's absolutely silly not to be using one. With one
exception (in our case), a scoreboard can NEVER do any better than a
reorganizer. I'll show the exception later, but here is an example why I
say this. Consider:

	a = b + c; d = 0;

	1	lw	r1,b
	2	lw	r2,c
	3	add	r3,r2,r1
	4	sw	r3,a
	5	sw	zero,d

On a machine with a simple load interlock and 1-cycle-latency loads, the
machine would stall for 1 cycle before instruction 3. Suppose you do more
complex scoreboarding, i.e., you continue attempting to issue
instructions? Then you might do 1, 2, try 3 and save it, and then either
come back and do 3 again (probably) or go on to 4 and discover that it
stalls also. If one looks at integer code sequences, one discovers that it
is hard to find many things that don't quickly depend on something else
(barring, of course, Multiflow-style compiler technology and hardware
designs not currently feasible in VLSI micros). Now, the point is that no
simple scoreboarding hardware is going to be able to figure out that the
best sequence is:

	1	lw	r1,b
	2	lw	r2,c
	5	sw	zero,d
	3	add	r3,r2,r1
	4	sw	r3,a

which ALWAYS wins (given the memory system described).
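[The win from reordering here can be checked with a small cycle-count
sketch - a simple 1-cycle load-interlock model, not actual MIPS hardware;
the tuple encoding is invented for illustration.]

```python
# Count cycles for a sequence under a simple load interlock: a load's
# result is unavailable for one cycle after issue (1-cycle latency).

LOAD_LATENCY = 1

def cycles(seq):
    """seq: list of (dst, srcs, is_load). In-order issue, stall on a
    not-yet-ready source. Stores are modeled as writing a memory label."""
    ready = {}               # reg/label -> cycle at which its value is ready
    t = 0
    for dst, srcs, is_load in seq:
        t = max([t] + [ready.get(s, 0) for s in srcs])  # interlock stall
        t += 1
        ready[dst] = t + (LOAD_LATENCY if is_load else 0)
    return t

original = [("r1", [], True),               # 1  lw  r1,b
            ("r2", [], True),               # 2  lw  r2,c
            ("r3", ["r1", "r2"], False),    # 3  add r3,r2,r1 (stalls 1 cycle)
            ("a",  ["r3"], False),          # 4  sw  r3,a
            ("d",  [], False)]              # 5  sw  zero,d

reordered = [original[0], original[1], original[4],   # sw zero,d fills
             original[2], original[3]]                # the delay slot

print(cycles(original), cycles(reordered))  # 6 cycles vs. 5: reordering wins
```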
Note that a 2-cycle latency adds 1 more cycle of delay, even with the
reorganized code. Also, note that if you're willing to add mucho more
hardware, you can look ahead further (really, keeping 1 independent PC or
pipeline for each extra instruction, but compilers can always look ahead
further). Similar things go on for floating point, i.e., no matter how
much scoreboarding you do, the compiler MUST do some code-spreading or you
will lose performance. Note also that streamlined RISC micros usually do 2
register reads and 1 write per cycle; it's hard to get anything else
squeezed in. As one last example, one of the MIPSers here once wrote a
reorganizer for a machine architecture that was fully interlocked, and
improved the performance at least 5% on some benchmarks.

SUMMARY OF THIS PART: any pipelined machine ought to be designed expecting
to have a good reorganizer, even if fully scoreboarded.

The exception I mentioned:

	function(x) =>
		lw	r2,x
		jal	function
		nop

We can't move the lw to cover the jal delay slot, because we're not sure
the function's first instruction doesn't use it. It turns out that the
number of jal delay slots left unfilled is fairly low, and the fraction
that might have been filled by a load is even lower, as there is almost
always some argument-gathering that can be moved into the delay slot.
(Most of the jal delay slots come from code like this:

	if (something)
		func1();
	else
		func2();

Of course, one can turn on a linker-optimizer that looks at the targets of
jals and copies the target into the delay slot if it can.)

Anyway, I think the bottom line is that full scoreboarding:

	Is not very relevant to the integer part of the CPU.
	Might be useful for long multi-cycle operations (such as FP),
	maybe (although Earl Killian has a long analysis that argues
	otherwise that I don't have time to condense & post; maybe Earl
	will).

	Might be helpful for compilers that can't do code scheduling,
	even though it is silly to do high-performance pipelined RISCs
	without having a good reorganizer.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@urbsdc.Urbana.Gould.COM (06/20/88)
> On RISC processors without a scoreboard*, how are the results of memory
>references guaranteed to be available before they are used in subsequent
>computations? I thought through several scenarios and couldn't find a very
>good solution and, being unfamiliar with RISCs other than the 88k,
>thought the net might have some answers.
>...
>2) If the compiler/assembly writer is required to wait a certain number
>of ticks after a load before using the results of the load, how is the
>required amount of delay determined? Can you assume you always get a
>cache hit (I doubt it)?

Usually by assuming a cache hit; the entire processor stalls on a cache
miss. Details of the stall may vary.
kds@blabla.intel.com (Ken Shoemaker) (06/21/88)
But one very important constraint that John imposed was that the
processor/data memory system allow only one external data access at a
time. This just isn't a requirement with the current level of integration
possible on chips: having multiple pending external cycles is something
that is certainly possible. I don't know if the 88000 supports this, but
judging from some of the microarchitectural tradeoffs that they made, it
would seem that they had something along these lines in mind, maybe in a
future chip?

Another thing that register scoreboarding can be used for is to prefetch
data into caches: if you know that you are going to use a particular
vector at some time in the future, you can preload it into the cache so
that it is available when you get back to it. This certainly is a more
elegant solution than a "do the load, but don't wait for the data to ever
come back" method of doing the same thing. This could be useful when
working with very large data sets, i.e., those that don't fit into your
cache, in a typical system where the main memory latency is long but
transfer time is short.

But what do I know? And why am I defending Mot?

You don't have to break many eggs to hate omelets -- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
mce@tc.fluke.COM (Brian McElhinney) (06/21/88)
In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Anyway, I think the bottom-line is that full scoreboarding:
>
>	Is not very relevant to the integer part of the CPU.
>
>	Might be useful for long multi-cycle operations (such as FP),
>	maybe
>
>	Might be helpful for compilers that can't do code scheduling,
>	even though it is silly to do high-performance pipelined RISCs
>	without having a good reorganizer.

Register scoreboarding is also useful to software types and other end
users of this complicated compiler technology. No matter how mature the
compiler is, it is still a large software program and will have bugs. It
may not be worth the hardware trade-offs, but the software costs should
also be considered. Anything that makes the software "safer" seems like a
win to me, but I'll admit I'm biased!

Brian McElhinney
Software Engineer
mce@tc.fluke.com
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <22147@amdcad.AMD.COM> tim@delirun.UUCP (Tim Olson) writes:
>The scoreboarding on the 88k (correct me if I'm wrong) associates a bit
>for each register in the register file to tell whether the value is
>stale or not. During the issuing of an instruction, the operand
>registers are checked for staleness, and the pipe is stalled if
>required.

This sums it up very well.

[ Concerning a different method of handling stale registers ]
>value to be written into the register file. This method has the benefit
>of integrating closely with the forwarding logic.

The scoreboard method can allow register forwarding also.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <28200166@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>> On RISC processors without a scoreboard*, how are the results of memory
>>references guaranteed to be available before they are used in subsequent
>>2) If the compiler/assembly writer is required to wait a certain number
>>of ticks after a load before using the results of the load, how is the
>>required amount of delay determined? Can you assume you always get a
>
>Usually by assuming a cache hit; the entire processor stalls on a cache miss.
>Details of the stall may vary.

This makes sense for RISC processors with a common data and code bus.
After all, if the data cache misses, the bus is being hogged by the data
unit, and the instruction unit then has to stall until the data
transaction finishes (although a few instructions might have been
prefetched). I had gotten used to the split instruction and data buses on
the 88k, which yield different behaviour in the above case. On the 88k, a
data cache miss doesn't necessarily create an instruction pipe stall. I
think this is better from a performance standpoint, but it does require
some type of scoreboarding to handle the data dependencies properly.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/22/88)
In article <2417@mipos3.intel.com> kds@mipos2.intel.com. (Ken Shoemaker) writes:
>But one very important constraint that John imposed was that the processor/
>data memory system allow only one external data access at a time. This
>just isn't a requirement with the current level of integration possible on
>chips: having multiple pending external cycles is something that is
>certainly possible. I don't know if the 88000 supports this, but judging

It does; up to 3 memory transactions (4 in a special case) can be pending
simultaneously. 2 of the transactions can be in progress at the same time
(although at different stages of completion) on the external bus (the
PBUS). For example, the following sequence can be done:

	ld	r1,r2,0
	st	r3,r4,0
	ld	r5,r6,0

and all 3 transactions will then be simultaneously in progress by the
memory unit. The PBUS timing would look like (assuming cache hits):

	Clock	PBUS address	PBUS data	PBUS R/W	PBUS response
	1 rise	r2		-		R		-
	1 fall	r2		-		R		-
	2 rise	r4		data at *r2	W		-
	2 fall	r4		data in r3	W		SUCCESS to ld
	3 rise	r6		data in r3	R		-
	3 fall	r6		-		R		SUCCESS to st
	4 rise	-		data at *r6	-		-
	4 fall	-		-		-		SUCCESS to ld

Briefly put, the address lines on the current clock tick correspond to the
next tick's data lines, and the current tick's data lines correspond to
the previous tick's address lines, so 2 transactions are in progress at
the same time. This allows a memory access to be completed each clock tick
on a sustained basis.
--
Just a few more bits in the stream.
			The Sneek
tom@nud.UUCP (Tom Armistead) (06/23/88)
In article <2438@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>Following are:
>1) How the MIPS R2000/R3000 does it.
>result of b) - e), and it isn't ready, it is interlocked by the hardware,
>rather than software. You can consider this "partial scoreboarding",
>although we don't particularly call it that, as the main pipeline does
>completely freeze, rather than trying to issue the following instruction.

This is similar (sort of) to what the 88k does, although we do call it
scoreboarding, since we interlock on a per-register stale flag rather than
a per-functional-unit done flag. We use the scoreboard on memory loads as
well as on the FP unit.

>a) Note that the 2 things that are NOT interlocked (load and branch):
>	Have small (1-cycle delays) that are unlikely to be shortened,
>	and are undesirable to lengthen
>	Have about a 70% chance of having a 1-cycle delay slot filled
>	by software.

But why not interlock on loads as well? It removes the *requirement* that
a load be followed by a nop if something useful can't be found to put
there, and yet still allows the compiler/coder to put a useful instruction
in the delay slot if one exists. This provides the following benefits:

1) More compact code, since the loads that don't have a useful delay slot
instruction don't have to have a nop. Of course, this is a less common
case (only 30% of loads) than those that do, so the difference will likely
be slight, but it is nonetheless a slight improvement over not
interlocking.

2) As one other poster pointed out, it makes the coding easier and more
bug-proof, since the compiler or coder doesn't have to remember to insert
a nop on the loads with no useful delay slot instruction. On the 88k,
there is no way a user program can access stale registers, regardless of
how dumb/bug-ridden the compiler or writer is.
I see the above 2 items as improvements created by interlocking on load
results, and since the interlock circuitry is already partially in place
(for the FP instructions), I don't see any additional disadvantages - so
why not interlock on loads?

>SCOREBOARDING
>1) If you look at where scoreboarding came from (I remember it first from
>CDC 6600 (see Bell & Newell book, for example), but there may be earlier ones),
>you had machines with:
>	a) Many independent functional units
>	b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
>	for example, and the FP operations were of course longer).
>	c) Multiple memory pipes, with a long latency to memory, and no caches.
>2) Note that RISC micros probably have a), but normally, only the FP ops,
>and maybe a few others have long multi-cycle operations, and they sure don't
>have multiple memory pipes, and they often have caches.
      ^^^^^^^^^^^^^^^^^^^^^

The 88k has this as well as a & b (on-board FP unit).

>On a machine with a simple load-interlock, and 1-cycle-latency loads,
>the machine would stall for 1 cycle before instruction 3.
>Suppose you do more complex scoreboarding, i.e., you continue attempting
>to issue instructions? Then, you might do 1, 2, try 3 and save it,
>and then either come back and do 3 again (probably) or go on to 4 and
>discover that it stalls also. If one looks at integer code sequences,
>one discovers that it is hard to discover many things that don't quickly
>depend on something else (barring, of course, Multiflow-style compiler
>technology and hardware designs not currently feasible in VLSI micros).

The 88k scoreboarding logic doesn't perform this complex a task. It merely
monitors the instruction stream, stalls the instruction unit if an attempt
is made to execute an instruction which uses stale registers, and then
lets the instruction unit continue once the correct operands are
available.
For example, in the following code sequence (assume the ld is a cache miss):

	1)	ld	r2,r3,0		; Get value.
	2)	add	r3,r3,16	; Bump pointer.
	3)	add	r2,r2,1		; Increment value.
	4)	sub	r4,r4,1		; Dec count.

the instruction unit will stall on instruction 3 since it attempts to use
stale data. Even though instruction 4 theoretically could be executed (as
it isn't dependent on the ld results), it won't be started until the ld is
complete and instruction 3 is completed.

Efficient code on the 88k is just as dependent on a good compiler (or
coder) as on any other RISC micro. Scoreboarding on the 88k simply ensures
you won't use stale data inadvertently - it's up to the compiler to
produce efficient code that minimizes stalls.

>	Might be useful for long multi-cycle operations (such as FP),
>	maybe (although Earl Killian has a long analysis that
>	argues otherwise that I don't have time to condense & post;
>	maybe Earl will),

Consider the case where multiple FP operations can be in progress at the
same time (as on the 88k). How could you handle this situation without a
per-register good/stale indicator (read: scoreboard)? Without a
scoreboard, the processor would seem to be required to process memory
accesses and FP operations sequentially. This might be acceptable, but it
probably will impact performance adversely in FP- or memory-intensive
applications.

So it seems to me that scoreboarding is not required as long as the
multi-cycle functional units are single-threaded. Here, the
per-functional-unit done flag seems like it would work OK. However, to
obtain the performance boost from pipelined functional units (like on the
88k), register scoreboarding is a requirement and not an option.
--
Just a few more bits in the stream.
			The Sneek
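[The stall in this example can be reproduced in a toy in-order issue
model. The 10-cycle miss penalty is an invented figure for illustration,
not 88k timing.]

```python
# In-order issue with per-register stale bits: the sub (instruction 4) is
# independent of the load, but still issues only after the stall on
# instruction 3 resolves, exactly as described above.

MISS_LATENCY = 10   # hypothetical cache-miss penalty, in cycles

def trace(seq):
    """Return the issue cycle of each instruction.
    seq: list of (dest, sources, result_latency)."""
    ready, t, issued = {}, 0, []
    for dst, srcs, lat in seq:
        t = max([t] + [ready.get(s, 0) for s in srcs])  # stall on stale regs
        issued.append(t)
        t += 1
        ready[dst] = t + lat - 1
    return issued

code = [("r2", ["r3"], MISS_LATENCY),  # 1) ld  r2,r3,0   (cache miss)
        ("r3", ["r3"], 1),             # 2) add r3,r3,16
        ("r2", ["r2"], 1),             # 3) add r2,r2,1   (stalls on r2)
        ("r4", ["r4"], 1)]             # 4) sub r4,r4,1   (independent, but late)

print(trace(code))  # issue cycles [0, 1, 10, 11]: the sub waits behind the add
```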
tim@amdcad.AMD.COM (Tim Olson) (06/24/88)
In article <1095@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
| [ Concerning a different method of handling stale registers ]
| >value to be written into the register file. This method has the benefit
| >of integrating closely with the forwarding logic.
|
| The scoreboard method can allow register forwarding also.

Can you elaborate a bit? Forwarding requires that you also know which
functional unit(s) to forward from -- don't you need other hardware than
the scoreboard bits for this?

-- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
tom@nud.UUCP (Tom Armistead) (06/25/88)
In article <22196@amdcad.AMD.COM> tim@delirun.amd.com (Tim Olson) writes:
>| The scoreboard method can allow register forwarding also.
>
>Can you elaborate a bit? Forwarding requires that you also know which
>functional unit(s) to forward from -- don't you need other hardware than
>the scoreboard bits for this?

Yes, other hardware is required. My posting was intended to point out that
scoreboarding doesn't *preclude* register forwarding - not that
scoreboarding itself is sufficient to implement the feature.

Why do you say that forwarding requires knowledge of which functional unit
will be supplying the fed-forward result? The instruction unit knows which
register(s) it is waiting on and can enable feed-forward circuitry so
that, when the result is available, it is fed to the instruction unit
simultaneously with the result being written to the register file. With
this method, the feed-forward logic can be independent of the functional
units. The instruction unit doesn't know, and doesn't care, what is
supplying the result; it just knows that it needs that particular result
for the next instruction, and snags it.

Forwarding could also be implemented by connecting the input of the
instruction unit to the output of the particular functional unit that will
provide the result, but this would require a separate feed-forward path
for each functional unit. This would also work but, as you say, requires
knowledge of which functional unit the result is expected from.
--
Just a few more bits in the stream.
			The Sneek
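[The "snag the result as it is written back" scheme described above can be
sketched roughly. The class structure and names are invented for
illustration, not actual 88k hardware.]

```python
# Forwarding that is independent of the functional units: the instruction
# unit only watches the register-file writeback for the register it needs,
# so it never has to know which unit produced the value.

class InstructionUnit:
    def __init__(self):
        self.waiting_on = None   # register the stalled instruction needs
        self.operand = None      # value captured by the feed-forward path

    def writeback(self, reg, value, regfile):
        regfile[reg] = value     # normal register-file write
        if reg == self.waiting_on:
            self.operand = value # forwarded the same cycle; no re-read needed
            self.waiting_on = None

regfile = {}
iu = InstructionUnit()
iu.waiting_on = "r2"             # stalled waiting on the result of a load
iu.writeback("r2", 42, regfile)  # any unit's result; the source doesn't matter
print(iu.operand, regfile["r2"])
```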
earl@mips.COM (Earl Killian) (06/25/88)
In article <1098@nud.UUCP>, tom@nud.UUCP (Tom Armistead) writes:
> >SCOREBOARDING
> >1) If you look at where scoreboarding came from (I remember it
> >first from CDC 6600 (see Bell & Newell book, for example), but
> >there may be earlier ones), you had machines with:
> > a) Many independent functional units
> > b) Many multiple-cycle operations (integer add was 3 cycles in 6600,
> > for example, and the FP operations were of course longer).
> > c) Multiple memory pipes, with a long latency to memory, and no caches.
> >2) Note that RISC micros probably have a), but normally, only the FP ops,
> >and maybe a few others have long multi-cycle operations, and they sure don't
> >have multiple memory pipes, and they often have caches.
>       ^^^^^^^^^^^^^^^^^^^^^
> The 88k has this as well as a & b (on board FP unit).

John meant the ability to do multiple data references per cycle, not the
ability to do an instruction fetch and a data fetch per cycle (which any
well-designed RISC supports). So I don't think the 88k qualifies here.

> Consider the case where multiple FP operations can be in progress at
> the same time (as on the 88k). How could you handle this situation without
> a per register good/stale indicator (read scoreboard)? Without a
> scoreboard, the processor would seem to be required to process memory
> accesses and FP operations sequentially. This might be acceptable but
> it probably will impact performance adversely in FP or memory intensive
> applications.

You don't need a scoreboard to do what you say, but when the number of
pending results is large, it is the most appropriate technique. The
alternative, providing a register-field comparator for each pending
result, is appropriate when the number of pending results is small.

But my question in all this is: why did the 88000 choose to fully pipeline
floating point and thus allow such a large number of pending results? I
understand why the 6600 and its successors did it, but the same analysis
for the 88000 suggests it is unnecessary. You don't have the cache
bandwidth to make fp pipelining useful even on large, highly vectorizable
problems (i.e., 32b per cycle isn't enough); you can't feed the fp
pipeline fast enough. For example, for linpack on a load/store machine
with 32b/cycle, you need only start a new add every 8 cycles and a new
multiply every 8 cycles to run at memory speed. With fp latencies < 8
cycles, no pipelining is necessary in the individual fp units. Go through
the 24 Livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.

The price you paid for pipelining appears to be enormous: the 88100's fp
op latencies average 2.5x longer than the R3010's when measured in cycles;
3x longer when measured in ns.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
mash@mips.COM (John Mashey) (06/26/88)
In article <1098@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes: > But why not interlock on loads as well?.... >1) More compact code... This is a legitimate issue, although it's about a 5% effect. We considered doing load-interlocks (and can do so in the future in an upward -compatible way), but didn't for philosophical reasons, i.e., we had a rule that we wouldn't put features in that we'd have to live with forever if we couldn't prove they were worthwhile for performance (1%); there was some concern that for some design directions, there might be a cycle-time hit if this were in the critical path somewhere. Anyway, it is a legitimate topic for debate. >2) As one other poster pointed out, it makes the coding easier and more >bug proof since the compiler or coder doesn't have to remember to insert >a nop on the loads with no useful delay slot instruction. On the 88k, >there is no way for a user program can access stale registers >regardless of how dumb/bug ridden the compiler or writer is. This one, however, is completely a non-issue, although, to be fair, it's frequently asked by people running into the R2000 for the first time: a) It is trivial to make the assembler insert load-nops where needed. I doubt that our code took more than 10 minutes to write and debug, and I don't ever remember having problems with that in almost 4 years. b) Any software system that couldn't be trusted with this problem, shouldn't be trusted with ANY problem: really, there are zillions of harder problems that need to be solved. (well, maybe not zillions, but many :-) c) Figuring that a RISC compiler should do optimization, but worrying that this feature might be buggy, is like worrying about the safety of flying in a 747 and bringing your own seat-belt because you don't really trust Boeing to remember to include them :-) .... > The 88k scoreboarding logic doesn't perform this complex of a task. 
It >merely monitors the instruction stream and stalls the instruction unit if >an attempt is made to execute an instruction which uses stale >registers and then lets the instruction unit continue once the correct >operands are available. For example, in the following code sequence >(assume the ld is a cache miss): > >1) ld r2,r3,0 ; Get value. >2) add r3,r3,16 ; Bump pointer >3) add r2,r2,1 ; Increment value. >4) sub r4,r4,1 ; Dec count. > >the instruction unit will stall on instruction 3 since it attempts to >use stale data. Even though instruction 4 theoretically could be >executed (as it isn't dependent on the ld results), it won't be started >until the ld is complete, and instruction 3 is completed. Thanx: we weren't sure whether it had multiple streams or not. The example seems to indicate that the 88K indeed has a load with 2 cycles of latency (i.e., cycles 2 & 3 above). From the example in <1097@nud.UUCP> that gave cycles for the ld/st/ld code, one would have thought there was only 1 latency cycle. Can you say: a) Are there indeed 2 latency cycles (i.e., that instruction 3 will indeed stall above)? b) If so, what is the reason for the second latency slot? (I realize that you may not want to answer this one :-) Note that our numbers say that in our machines, it would cost us 10-15% in overall performance to go from 1 cycle latency to 2, and the similarity of machines probably means about the same amount for an 88K. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
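Mashey's point (a) above — that it is trivial to make the assembler insert load-nops where needed — can be sketched in C roughly as follows. The instruction record and the single-delay-slot rule here are hypothetical illustrations, not the actual MIPS assembler, which also tries to fill the slot with useful work rather than a nop:

```c
#include <string.h>

/* Hypothetical 3-address instruction record (not a real encoding). */
struct insn {
    char op[8];
    int dest, src1, src2;   /* register numbers; -1 = unused */
};

static int is_load(const struct insn *i) { return strcmp(i->op, "ld") == 0; }

static int uses_reg(const struct insn *i, int r) {
    return r >= 0 && (i->src1 == r || i->src2 == r);
}

/* Copy `in` to `out`, inserting a nop whenever the instruction right
 * after a load reads the load's destination register (one delay slot).
 * Returns the number of instructions written to `out`. */
int insert_load_nops(const struct insn *in, int n, struct insn *out) {
    int m = 0;
    for (int k = 0; k < n; k++) {
        out[m++] = in[k];
        if (is_load(&in[k]) && k + 1 < n && uses_reg(&in[k + 1], in[k].dest)) {
            struct insn nop = { "nop", -1, -1, -1 };
            out[m++] = nop;
        }
    }
    return m;
}
```

A one-instruction lookahead like this is indeed a 10-minute job; the genuinely hard part of a reorganizer is scheduling a useful independent instruction into the slot instead of a nop.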
ron@mucmot.UUCP (Ron Voss) (06/28/88)
In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: > c) Figuring that a RISC compiler should do optimization, but worrying > that this feature might be buggy, is like worrying about the safety > of flying in a 747 and bringing your own seat-belt because you > don't really trust Boeing to remember to include them :-) I have observed a few cases where hours were wasted searching for a programming bug, only to find that the CISC (68020) compiler's optimizer is buggy (so one turns on the optimizer only for a release, since it slows compilations anyway, thoroughly tests again after turning it on, etc.). Are there really *good* reasons to put more trust in RISC optimizers? Methinks the metaphor a bit overdone. -- ------------------------------------------------------------ Ron Voss Motorola Microsystems Europe, Munich mcvax!unido!mucmot!ron CIS 73647,752 my opinions are just that, and not necessarily my employer's ------------------------------------------------------------
colwell@mfci.UUCP (Bob Colwell) (06/29/88)
In article <318@mucmot.UUCP> ron@mucmot.UUCP (Ron Voss) writes: >In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: >> c) Figuring that a RISC compiler should do optimization, but worrying >> that this feature might be buggy, is like worrying about the safety >> of flying in a 747 and bringing your own seat-belt because you >> don't really trust Boeing to remember to include them :-) > >I have observed a few cases where hours were wasted searching for a >programming bug, only to find that the CISC (68020) compiler's optimizer >is buggy (so one turns on the optimizer only for a release, since it >slows compilations anyway, thoroughly tests again after turning it on, etc.). > >Are there really *good* reasons to put more trust in RISC optimizers? Ron, I couldn't let that one slide on by. I think it's a truism that good designers produce good designs, including compilers/optimizers, and not-so-good designers produce everything else. My experience has been that the quality of a piece of software is a far stronger function of the abilities of its creators than of its intrinsic complexity. One could perhaps make the case that a RISC optimizer is likely to be intrinsically less complex than its CISC counterpart due to the more-easily-understood machine spec, but it's probably impossible to prove, and I don't think it's as strong an argument as the above. I also have the feeling that more talented designers are turning their attentions to compilers than ever before, so perhaps the overall quality of this kind of software is improving, but I won't even try to defend that thesis if you don't like it. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
phil@osiris.UUCP (Philip Kos) (06/29/88)
In article <318@mucmot.UUCP>, ron@mucmot.UUCP (Ron Voss) writes: > Are there really *good* reasons to put more trust in RISC optimizers? I have to agree. I once saw an old version of the Pyramid cc optimizer totally trash a for(i = 0; i < 7; i++) loop with a printf() in the middle. I guess that bit of code didn't really do anything. :-) (Talk about FUN to debug...) And we've been told to avoid the global optimizer like the plague when any of the code to be tuned is asynchronous, like the signal handlers we have in 99.99% of our applications... Phil Kos Information Systems ...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital Baltimore, MD
tom@nud.UUCP (Tom Armistead) (06/30/88)
In article <2459@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >But my question in all this is why did the 88000 choose to fully >pipeline floating point and thus allow such a large number of pending >results? I understand why the 6600 and its successors did it, but the >same analysis for the 88000 suggests it is unnecessary. You don't >have the cache bandwidth to make fp pipelining useful even on large >highly vectorizable problems (i.e. 32b per cycle isn't enough). You >can't feed the fp pipeline fast enough. Assuming the FP operands are not in cache, this is true. However, there will be some class of problems which can make effective use of the FP pipelining and assuming that FP pipelining has no bad side effects (see paragraph below), it only makes sense to provide the feature. >The price you paid for pipelining appears to be enormous: the 88100's I didn't design the 88100 (I am not a chip designer at all). However, I doubt that any speed difference is due to a tradeoff made to provide FP pipelining. Whether the FP unit was pipelined or not, I think the FP latencies would still be the same. Do you have some reason to think that not pipelining can decrease FP latency? If no increase in latency is incurred to provide pipelining, then providing it can only help performance. >fp op latencies average 2.5x longer than the R3010's when measured in >cycles; 3x longer when measured in ns. Would you post the R3010 FP stats? Not that I doubt you, I'm just interested in how each particular R3010 FP instruction compares to the equivalent 88K instruction. -- Just a few more bits in the stream. The Sneek
tom@nud.UUCP (Tom Armistead) (06/30/88)
In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: > a) Are there indeed 2 latency cycles (i.e., that instruction 3 Yes, 2 on loads. 0 latency on stores. > b) If so, what is the reason for the second latency slot? Address calculation. The 88k provides a basic set of three addressing modes. The most frequently used construct for referencing memory is "register pointer + offset" (offset is rarely 0). Without providing this basic addressing mode, the code would have to do something like: add ptr,ptr,offset ; ld dest,ptr,0 sub ptr,ptr,offset ; If pointer must be preserved. for most loads and stores. This overhead is worse than a 2 tick latency on loads. Also note that there is 0 latency on 88k store instructions*. Since many lds have a corresponding st instruction somewhere, the averaged 88k latency will be < 2 ticks. * What is the latency on a R3000 store instruction? >Note that our numbers say that in our machines, it would cost us >10-15% in overall performance to go from 1 cycle latency to 2, Assuming the addressing modes and other aspects of the machine remain the same, this figure is in the ballpark (although a little high). However, I think the static machine assumption is not a valid one to make. -- Just a few more bits in the stream. The Sneek
guy@gorodish.Sun.COM (Guy Harris) (06/30/88)
> And we've been told to avoid the global optimizer like the plague when any > of the code to be tuned is asynchronous, like the signal handlers we have > in 99.99% of our applications... Umm, this has nothing whatsoever to do with RISC vs. CISC nor optimizer bugs; the problem here is that signal handlers (and other forms of asynchrony) violate assumptions that optimizers tend to make. For example, in the following case: extern int j; int i, k, l, a, b, c; ... i = j; k = a + b + c; l = j; if the value assigned to "i" happens to be lying around in a register (as a side effect of the code that performs that assignment) when the value of "j" is assigned to "l", you can just store that register into "l", right? Wrong. A signal could arrive between the point at which the value was loaded into the register and the point at which the value was stored into "l", and the signal handler can change the value of "j" in that interval. The same can happen with values stored in registers on I/O devices, or variables shared between processes. Unfortunately, optimizers tend not to know that certain external variables are "safe", and won't change asynchronously, while others are "unsafe" and may change asynchronously. Some optimizers may simply assume that all external variables are "safe", in which case you'd better disable the optimizer when writing code where not all external variables are "safe" (or, at least, cut the optimization down to a level where the optimizer no longer makes this assumption). To see a painfully elaborate discussion of ways to handle this problem, just search for the word "volatile" in all articles in "comp.lang.c" on your machine and print all the articles in question. :-)
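The escape hatch the comp.lang.c discussion converges on is the `volatile` qualifier: it tells the optimizer the object may change asynchronously, so every source-level read must be a real load. A minimal sketch of the signal-flag idiom, assuming an ANSI-style compiler with `volatile` and `sig_atomic_t`:

```c
#include <signal.h>

/* `volatile sig_atomic_t` is the type intended for objects written
 * from a signal handler; `volatile` forces the compiler to reload
 * the flag on every read instead of caching it in a register. */
static volatile sig_atomic_t got_signal = 0;

static void handler(int sig) {
    (void)sig;
    got_signal = 1;
}

/* Spin until the handler has run.  Without `volatile`, an optimizer
 * could legally load got_signal once and turn this into an infinite
 * loop -- exactly the asynchrony problem described above. */
void wait_for_signal(void) {
    while (!got_signal)
        ;
}
```

With the flag declared `volatile`, the optimizer may still cache genuinely "safe" variables in registers; only the declared-unsafe ones pay the reload cost.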
mash@mips.COM (John Mashey) (06/30/88)
In article <318@mucmot.UUCP> ron@mucmot.UUCP (Ron Voss) writes: >In article <2465@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: >> c) Figuring that a RISC compiler should do optimization, but worrying >> that this feature might be buggy, is like worrying about the safety >> of flying in a 747 and bringing your own seat-belt because you >> don't really trust Boeing to remember to include them :-) >I have observed a few cases where hours were wasted searching for a >programming bug, only to find that the CISC (68020) compiler's optimizer >is buggy (so one turns on the optimizer only for a release, since it >slows compilations anyway, thoroughly tests again after turning it on, etc.). >Are there really *good* reasons to put more trust in RISC optimizers? >Methinks the metaphor a bit overdone. Sorry, let me try again. I wasn't trying to say optimizers were easy or safe at all: they're not, in fact, they are like building a 747. The point was, in fact, that if you're doing an optimizer (building a 747), THAT's what you worry about, because it's hard. You don't worry about getting nops after loads ("this feature") (seatbelts), because if somebody can't get THAT right, they have no business building optimizers (747s). [No disrespect to Boeing, I just picked the biggest plane I've flown in.] 1) I have marginally more trust in RISC optimizers than in CISC ones, on general principles [i.e., at least some parts are simpler.] 2) What I mostly trust/distrust is the optimizer's author(s); I'd much rather have an expert's CISC optimizer than a turkey's RISC one. (Fortunately, I get to use a super-expert's RISC optimizer. :-) -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@urbsdc.UUCP (07/01/88)
> ... we've been told to avoid the global optimizer like the >plague when any of the code to be tuned is asynchronous, like the signal >handlers we have in 99.99% of our applications... > > Phil Kos > Information Systems >...!uunet!pyrdc!osiris!phil The Johns Hopkins Hospital > Baltimore, MD I'm interested... at first I thought you said that 99.99% of your applications were in the signal handlers, but that was wrong. Still, what do you do in the signal handlers - set flags that other steps can use? Sounds like you need volatile.
smryan@garth.UUCP (07/01/88)
In article <1639@osiris.UUCP> phil@osiris.UUCP (Philip Kos) writes: >In article <318@mucmot.UUCP>, ron@mucmot.UUCP (Ron Voss) writes: >> Are there really *good* reasons to put more trust in RISC optimizers? Hey! Why not ask somebody who has to write the compiler? Don't you trust us? No? Smart idea. Seriously, the reason given for delayed branches is that (1) the compiler can do it, (2) nobody uses assembly anyway. I would like to think that (1) is true, but compilers have just as many problems as anybody else. As for (2), I have not yet heard of any machine in general production without an assembler. Even if that were the case, because (1) is false, somebody has to debug the compiler. It is hard enough to backtranslate assembly through the compiler to the source language on a conventional machine. Given the current ratio of software cost to hardware cost, I feel the more extreme RISC proposals are misguided.
earl@mips.COM (Earl Killian) (07/01/88)
In article <1110@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
TA# I didn't design the 88100 (I am not a chip designer at all).
TA# However, I doubt that any speed difference is due to a tradeoff made
TA# to provide FP pipelining. Whether the FP unit was pipelined or not, I
TA# think the FP latencies would still be the same. Do you have some reason
TA# to think that not pipelining can decrease FP latency? If no increase
TA# in latency is incurred to provide pipelining, then providing it can only
TA# help performance.
.........................................................................
The answer to this is, yes, I believe the lack of pipelining does
partially explain why the R3010 has smaller latency than the pipelined
88k. Do you have an alternate suggestion?
For example, non-pipelined functional units can reuse hardware
resources. If it takes 2 shifts to do an fp add, a non-pipelined adder
can have just one shifter but a pipelined adder needs 2 shifters.
This is especially nasty for things like the divider which reuse some
hardware entities (e.g. quotient digit lookup table) many, many times.
An N-stage-pipelined divider would have needed N of these lookup tables.
By avoiding the need for extra hardware, the non-pipelined implementation
frees up valuable chip area. In the R3010 we used that area for
(i) 64-bits x 16-words x 4-ports register file; (ii) separate divider
independent of multiplier; (iii) delay-optimized physical layout and hand
tweaked circuit design employing LARGE transistors; (iv) control unit's
resource scheduler that permits 4 instructions to execute concurrently.
.........................................................................
TA# Would you post the R3010 FP stats? Not that I doubt you, I'm just
TA# interested in how each particular R3010 FP instruction compares to the
TA# equivalent 88K instruction.
.........................................................................
88100 R3010 cycle
Operation cycles cycles ratio
======================================
sp add 5 2 2.5
dp add 6 2 3.0
sp mul 5 4 1.2
dp mul 10 5 2.0
sp div 30 12 2.5
dp div 60 19 3.2
sp convert 5 1-3 1.7-5
dp convert 6 1-3 2-6
abs/neg/mov 5? 1
.........................................................................
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
earl@mips.COM (Earl Killian) (07/01/88)
In article <1110@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
eak> You don't have the cache bandwidth to make fp pipelining useful
eak> even on large highly vectorizable problems (i.e. 32b per cycle
eak> isn't enough). You can't feed the fp pipeline fast enough.
TA# Assuming the FP operands are not in cache, this is true. However,
TA# there will be some class of problems which can make effective use of
TA# the FP pipelining and assuming that FP pipelining has no bad side effects
TA# (see paragraph below), it only makes sense to provide the feature.
.........................................................................
I *was* talking about when the operands are in cache. Let me repeat the
example in my original posting:
For example, for linpack on a load/store machine with 32b/cycle you
need only start a new add every 8 cycles and a new multiply every 8
cycles to run at cache speed. With fp latencies < 8 cycles, no
pipelining is necessary in the individual fp units. Go through the 24
livermore loops and count the number of words loaded/stored per fp
op and you'll see similar results.
What's going on here? The DAXPY inner loop of linpack
DY(I) = DY(I) + DA*DX(I)
has a fp add, fp multiply, two loads, and one store (ignoring loop
overhead, which can be effectively eliminated by unrolling). With a
32b interface to the data cache, the two loads and one store take 6
cycles. The two fp ops take 2 cycles to issue. Total 8 cycles for
2 flops. So if your add latency and multiply latency are <= 8 cycles
you can execute DAXPY without any pipelining, running at 75% of your
cache bandwidth.
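Spelled out in C, the DAXPY kernel in question is just the following (a sketch only; the real BLAS routine also carries stride arguments, omitted here):

```c
/* DAXPY: DY := DY + DA*DX.  Per element: two loads, one store, one
 * add, one multiply.  With a 32-bit cache port, the three double-word
 * references take 6 cycles and the two fp issues take 2, giving the
 * 8-cycles-per-2-flops figure derived above. */
void daxpy(int n, double da, const double *dx, double *dy) {
    for (int i = 0; i < n; i++)
        dy[i] += da * dx[i];
}
```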
Let's formalize this:
Let m be the ratio of memory reference cycles to floating point ops
for a computation. Let f1, f2, ... fn be the frequencies of the
different floating point ops (they sum to 1). Let t1, t2, ... tn be
the latencies of the different floating point ops. Let r1, r2, ... rn
be the pipeline rates of the different op units. Assume the
computation is infinitely parallelizable (the most favorable
circumstance for pipelining).
Consider a load/store architecture with one instruction issue per
cycle. This means the time to do one flop is bounded below by m+1
cycles. A given functional unit will require pipelining (ri < ti) to
run at this rate iff ti*fi > m+1. The pipelining required is a new
op every ri=(m+1)/fi cycles. Alternatively, the latency required for
no pipelining is ti=(m+1)/fi.
Example: linpack with one 32-bit access per cycle (e.g. r3000 or
88000): n=2, f1=0.5, f2=0.5, m=3. Thus r1=r2=8.
Example: linpack with one 64-bit access per cycle, or half-precision
linpack with one 32-bit access per cycle: m=1.5. So 5-cycle adds and
multiplies are sufficient (R3010 latencies work fine). Or pipelining
to start new ops every 5.
So what you need to do to refute my claim is to show important
problems with low m values and with fi almost one.
One such computation is something like livermore loops 11 and 12 in
single precision, for which n=1, f1=1, m=2, so you need a new add
every 3 cycles. But of course the R3010's 2 cycle add handles that
just fine without pipelining.
Another such computation is something like the multiple-vector
techniques for LU-decomposition. E.g. for 8-way vectors, m=0.625,
f1=0.5, f2=0.5, so you need 3.25 cycle add and multiply to run at peak
rate for sp, 4.5 cycle add and multiply for dp.
Still, I would not consider these to prove the need for pipelining
with 32b cache interfaces. Do you know a computation which does?
What are its parameters in the above terms?
.........................................................................
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
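Earl's criterion above reduces to a one-line formula that can be checked mechanically: a functional unit keeps up with the issue-rate bound without pipelining iff its latency t satisfies t <= (m+1)/f.

```c
/* Under Earl's model: m = memory-reference cycles per flop, f = the
 * op's fraction of all flops.  Returns the largest latency (cycles)
 * an unpipelined unit may have and still run at one flop per m+1
 * cycles, i.e. (m+1)/f. */
double max_unpipelined_latency(double m, double f) {
    return (m + 1.0) / f;
}
```

Plugging in the numbers quoted above: linpack with a 32-bit port (m=3, f=0.5) gives 8 cycles; the 64-bit-port case (m=1.5, f=0.5) gives 5; loops 11/12 in sp (m=2, f=1) give 3.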
mash@mips.UUCP (07/01/88)
In article <1111@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes: >In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >> a) Are there indeed 2 latency cycles (i.e., that instruction 3 > Yes, 2 on loads. 0 latency on stores. >> b) If so, what is the reason for the second latency slot? > Address calculation. The 88k provides a basic set of three addressing >modes. The most frequently used construct for referencing memory is >"register pointer + offset" (offset is rarely 0). Without providing this basic >addressing mode, the code would have to do something like: > add ptr,ptr,offset ; > ld dest,ptr,0 > sub ptr,ptr,offset ; If pointer must be preserved. >for most loads and stores. This overhead is worse than a 2 tick latency >on loads. Also note that there is 0 latency on 88k store instructions*. >Since many lds have a corresponding st instruction somewhere, the averaged 88k >latency will be < 2 ticks. We agree on the desirability of non-zero offsets 100% (although our friends down the street at AMD may not :-) For this kind of design, I can't imagine that people would do add/ld/sub, but would use a temp register: add temp,ptr,offset ; > ld dest,temp,0 Anyway, maybe I'm missing something: I still don't understand where the second latency cycle comes from in the hardware design. You compute the address and get it on the bus, and the third instruction following can actually use the data. Moto info says that the CMMU returns data in one cycle, so it almost sounds like it's costing a cycle on the CPU to make the data available (i.e., forwarding logic is not fast enough, or something). Anyway, I still don't understand what address-mode computation has to do with it, at least not from the given example [which is identical to what R2000s do]. So please say some more. >* What is the latency on a R3000 store instruction? 0. 
>>Note that our numbers say that in our machines, it would cost us >>10-15% in overall performance to go from 1 cycle latency to 2, > Assuming the addressing modes and other aspects of the machine remain >the same, this figure is in the ballpark (although a little high). >However, I think the static machine assumption is not a valid one to make. Can you say what your numbers are, i.e., why is this high? My reasoning is as follows: a) Loads are about 20% of the instructions. b) You get to schedule approx. 70% of the 1st delay slots. c) You get to schedule approx 30% of the 2nd delay slots, i.e., the penalty for a second delay slot is the 70% that you don't get to schedule: .7 * 20% = 14% Now, in this area, R2000s and 88Ks are similar enough that I think the statistics carry over, but counterarguments would be useful. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
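Mashey's estimate above is a simple product, worth writing out since it is the crux of the 1-vs-2 latency-cycle argument:

```c
/* Cost of a second load delay slot, per Mashey's reasoning: loads are
 * ~20% of instructions and ~70% of second slots go unfilled, each
 * unfilled slot wasting one cycle, so the overall penalty is about
 * load_frac * unfilled_frac of performance. */
double second_slot_penalty(double load_frac, double unfilled_frac) {
    return load_frac * unfilled_frac;
}
```

With the figures given (0.20 and 0.70) this comes to about 14%, the number quoted.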
mpaton@oakhill.UUCP (Michael Paton) (07/02/88)
In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >> b) If so, what is the reason for the second latency slot? The reason for the additional delay slot is twofold. First, the M88100 has allotted .2 clocks to the data path, .7 clocks to a 32 bit adder, and more than .25 of a clock for driving a signal off chip with .1 clock setup margins. Since we desired all signals to be synchronous, we rounded this 1.25 clocks up to 1.5 clocks. These 1.5 clocks are spent performing the addition and delivering the address to the M88200 cache. The cache has 1 clock from valid address to valid data (1.2 valid address to hit). Then the processor uses the remaining .5 clock to perform byte extraction, sign extension, delivering the result to the result bus. Therefore the memory pipeline is 3 clocks in length. The MIPS Co. designers sliced this into 2 clocks and integrated it into their basic (only) pipeline. This means that when the data cache misses or is not available (e.g. bus snooping) that their entire processor skips a beat. Since the M88100 data pipeline is decoupled from the instruction pipeline, the beat is not skipped unless a data dependency occurs. This is more beneficial in the case of outbound store traffic on write-through pages than on the nominal load traffic. In two clocks they must perform a 32+16 addition and the mmu function before driving the pins. Moreover, the multiplexed bus requires that the addresses be latched (externally on both the R2000 and the R3000). Memory must still be accessed and delivered to the result bus/register file/forwarding unit. Therefore, the time must come from somewhere, and it seems to have come out of the memory SRAM speed in the R3000 and the circuit speeds and margins. According to the data sheet (using a F373 transparent latch), the MIPS design requires 20ns SRAMs. 
Our design (assuming a glue part cache implementation: F373 latch and 3ns t_OHA* SRAM with spec sheet timing) allows 32ns SRAM parts at 40ns cycle or 42ns SRAM at 50ns. The MIPS design requires 0ns SRAM at 40 MHz, while we do not hit this brick wall until 120 MHz. Since the two SRAM arrays' data pins are connected together and to three other interfaces, they must be sampled at 2/3 of the phase period by all interested parties and then rapidly removed from the multiplexed data bus (the data is only valid for 3.5ns on the data bus). This requires very strict timing control over this vital feature. In addition, the Cypress specs I checked had t_HZOE** of 15ns maximum while the R3000 requires this in 8ns. Note that this is not a problem as the other bank of SRAMs has a minimum output enable time of 9ns while the Cypress parts (CY7C161 16K*4) did not. The MIPS processors do not snoop their bus and therefore leave memory coherence to the write-through mechanism. In multiprocessing applications, the memory bus can become saturated with a few processors on the bus (~4?). Write-back caches cause a sufficient reduction in memory bus traffic to allow twice the number of processing ensembles to utilize the bus. Therefore, the answer to the original question is: we were more conservative in allotting nanoseconds to functions, and, in particular, we attempted to beat on the SRAM technology less hard. If we are correct, this should be more scalable in the future (read ECL/GaAs) as off-chip delays approach .4 cycle. Alternatively, our design costs less to manufacture in high volume and allows less costly SRAM parts than the MIPS Co. design. This should allow us to push the clock frequency farther than MIPS Co., but not at the current mask set. Remember the recent announcement of 33 MHz M68020 and M68030: a little shrink and a lot of tuning can obtain many MHz. We have just begun to tune these M88k parts.
>>Note that our numbers say that in our machines, it would cost us >>10-15% in overall performance to go from 1 cycle latency to 2, Our numbers show that with the decoupling of the pipeline and earliest load scheduling we suffer 6-8% cycle-to-cycle performance degradation when comparing the MIPS design to the M88100 design. This is not too bad when one considers the relaxation of external RAM speed requirements and allowing the hit logic to run in arrears of the data delivery. Independent of this current discussion on the length of load pipelines, one might want to ask the folks at MIPS Co. this question: Why did you multiplex your memory bus? Consider factors related to power dissipation. The current M88100 processors are running between .25 and .5 watts @ 20Mhz. If we were to multiplex the 2 memory ports as did MIPS Co., our worst case power consumption would be 4 watts. The problem is that the AC power dissipation is given by: 2*C*V**2*F*N, where: C = load capacitance (F), V = voltage swing (V), F = frequency (Hz), and N = number of pins which make transitions. (For the MC88100, V = 3.8 volts (TTL logic levels) and F = 20 MHz) The addresses from the instruction port are very highly correlated (about 1.4 bits per cycle change). The addresses from the data port are only partially correlated (less so with better compilers). Mixing these two streams results in almost uncorrelated address streams and therefore a bigger N, resulting in more power dissipation. Notice that the pin counts on the two packages are not that much different (144 for the R3000 vs. 180 for the MC88100) and neither are the power/grounds pin counts (30 for the R3000 vs. 36 for the MC88100). So why multiplex the memory bus? Your pin count isn't that much lower, your power dissipation suffers greatly and you tend to create difficulties in interfacing to your memory system. 
* t_OHA = time of output hold from address change. ** t_HZOE = time of output enable high to high-Z (impedance). Mitch Alsup, Manager of Architecture for the M88000. Remember: SPARC spelled backwards is .....!
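The AC power formula in the post is easy to evaluate. V = 3.8 V and F = 20 MHz are given above; any per-pin load capacitance C and toggling-pin count N plugged in below are illustrative assumptions, since the post does not state them:

```c
/* AC power dissipation per the formula in the post: P = 2*C*V^2*F*N,
 * where C = load capacitance (F), V = voltage swing (V),
 * F = frequency (Hz), N = number of pins making transitions. */
double ac_power(double c_farads, double v_volts, double f_hz, double n_pins) {
    return 2.0 * c_farads * v_volts * v_volts * f_hz * n_pins;
}
```

For instance, assuming a 50 pF load on each of 32 address pins toggling every cycle, ac_power(50e-12, 3.8, 20e6, 32) comes to roughly 0.9 W, the right order of magnitude for the multi-watt worst case discussed.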
earl@mips.COM (Earl Killian) (07/06/88)
In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
mp> The MIPS processors do not snoop their bus and therefore leave
mp> memory coherence to the write-through mechanism. In
mp> multiprocessing applications, the memory bus can become saturated
mp> with a few processors on the bus (~4?). Write-back caches cause a
mp> sufficient reduction in memory bus traffic to allow twice the
mp> number of processing ensembles to utilize the bus.
Write through to a 32b bus does indeed limit you to about 4 processors
(the max supported by the 88000). If you want more than that, use a
64b bus (~8 processors), or use a secondary cache (which has its own
benefits) and make it write-back. Both approaches have already been
implemented in R2000-based systems.
When you build a R3000-based MP, you don't have to limit the amount of
cache per processor (unlike, e.g., the 88000, which allows only 16KB
per processor in a 4-processor system). If you're building an MP,
presumably you're interested in performance, so it seems strange to
cripple each processor with a small cache.
mp> ...in particular, we attempted to beat on the SRAM technology less
mp> hard. If we are correct, this should be more scalable in the
mp> future (read ECL/GaAs) as off-chip delays approach .4 cycle.
It's hard to envision ECL output drive-times approaching 0.4 clocks;
modern ECL parts (e.g. Sony's CXB1100Q 3-NOR) can receive signals from
off-chip, do the NOR function, and drive off chip again (using 100K
levels) in 390 picoseconds. {EDN 6-23-88, p. 97} So (0.4 * Tclock) =
390ps giving Tclock = 975 picoseconds (1.03 GHz) !!! A more
"believable" clock period might be 4ns (a la Cray-2), in which case
the drive-off time is 0.1 clock.... a smaller fraction of the cycle
than you quote for your CMOS design.
mp> Alternatively, our design costs less to manufacture in high volume
mp> and allow less costly SRAM parts than the MIPS Co. design.
An R3000 may require faster SRAMs, but these are multiple-sourced,
off-the-shelf, commodity devices, and the price of 30 16Kx4 20ns SRAMs
is actually lower than that of eight 88200s (sole-sourced).
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
earl@mips.COM (Earl Killian) (07/06/88)
In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
mp> Independent of this current discussion on the length of load
mp> pipelines, one might want to ask the folks at MIPS Co. this
mp> question:
mp> Why did you multiplex your memory bus?
mp> Consider factors related to power dissipation. The current M88100
mp> processors are running between .25 and .5 watts @ 20Mhz. If we
mp> were to multiplex the 2 memory ports as did MIPS Co., our worst
mp> case power consumption would be 4 watts. The problem is that the
mp> AC power dissipation is given by:
mp> 2*C*V**2*F*N,
mp> N = number of pins which make transitions.
mp> The addresses from the instruction port are very highly correlated
mp> (about 1.4 bits per cycle change). The addresses from the data
mp> port are only partially correlated (less so with better
mp> compilers). Mixing these two streams results in almost
mp> uncorrelated address streams and therefore a bigger N, resulting
mp> in more power dissipation.
Your observation is interesting. But I hope the 88100's packaging
isn't based on an _average_case_ analysis. Malicious programmers (who
write, for example, a branch-to-branch infinite loop) can't melt the
88100, can they? So the worst case is really 4 watts, right? What
does the datasheet say?
Also the cpu subsystem power consumption isn't much reduced by your
observation, since it only applies to the address outputs on the cpu,
which is a small fraction of the total pins in the system (the I-cache
SRAM pins change every cycle even if the address doesn't change much).
Hmm, I guess the address outputs in an 88000 system are a significant
fraction of the pins, unlike the R3000, because you send the full 32b
virtual address off chip, instead of only 18b as in the R3000. That
wastes as much power as you save on average from the demultiplexing,
doesn't it?
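The correlation argument in the quoted posting can be illustrated with a toy transition count (the addresses below are my own invented examples, not measured data: sequential instruction fetches versus scattered data references):

```python
# Count address-pin toggles for separate vs. multiplexed address buses.
def transitions(stream):
    """Total pin toggles seen when `stream` is driven over one bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(stream, stream[1:]))

# Sequential instruction fetches: successive word addresses, few bits flip.
iaddrs = [0x1000 + 4 * i for i in range(8)]
# Data references: scattered, only weakly correlated addresses.
daddrs = [0x80004000, 0x2000FF00, 0x80004004, 0x2000FF08,
          0x80004008, 0x2000FF10, 0x8000400C, 0x2000FF18]

separate = transitions(iaddrs) + transitions(daddrs)
# Multiplexing interleaves the two streams onto one set of pins, so the
# bus swings between unrelated addresses every cycle: a bigger N.
multiplexed = transitions([a for pair in zip(iaddrs, daddrs) for a in pair])
print(separate, multiplexed)   # multiplexed > separate
```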
mp> Notice that the pin counts on the two packages are not that much
mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
mp> MC88100).
MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
time the difference between 180 pins and 144 pins was significant.
The i80386, designed at roughly the same time, actually has fewer
pins -- 132. The 2nd generation (R3000) could have changed the
interface, but we felt it was desirable to provide an upgrade path for
R2000 designs.
Back when RISC vs. CISC was still an issue, people complained RISC
required more bandwidth. We just provided the necessary bandwidth.
When we need more, we'll add more. If that means modifying the
R3000-style cache interface, so what?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
oconnor@nuke.steinmetz (Dennis M. O'Connor) (07/08/88)
An article by earl@mips.COM (Earl Killian) says:
] In article <1362@oakhill.UUCP> mpaton@oakhill.UUCP (Michael Paton) writes:
] mp> [...] one might want to ask the folks at MIPS Co. this
] mp> question:
] mp> Why did you multiplex your memory bus?
] mp> Consider factors related to power dissipation. The current M88100
] mp> processors are running between .25 and .5 watts @ 20Mhz. If we
] mp> were to multiplex the 2 memory ports as did MIPS Co., our worst
] mp> case power consumption would be 4 watts. The problem is that the
] mp> AC power dissipation is given by:
] mp> 2*C*V**2*F*N,
] mp> N = number of pins which make transitions.
] mp> The addresses from the instruction port are very highly correlated
] mp> (about 1.4 bits per cycle change). The addresses from the data
] mp> port are only partially correlated (less so with better
] mp> compilers). Mixing these two streams results in almost
] mp> uncorrelated address streams and therefore a bigger N, resulting
] mp> in more power dissipation.
]
] Your observation is interesting. But I hope the 88100's packaging
] isn't based on an _average_case_ analysis. Malicious programmers (who
] write, for example, a branch-to-branch infinite loop) can't melt the
] 88100, can they? So the worst case is really 4 watts, right? What
] does the datasheet say?
Seems to me a branch-to-branch i-loop would execute entirely
out of an onboard I-Cache if there is one. But, if not, a
5V 20MHz signal driving, say, 30 pF dissipates 30mW. Multiply
by 32 and you get about a 1 Watt power increase. During this
branch-to-branch loop, the data address bus would, of course, be idle.
Nastier would be something like :
7FFFFFFF : STORE-0-TO-H55555555
80000000 : BRANCH-TO-HFFFFFFFF
. . .
FFFFFFFF : STORE-FFFFFFFF-TO-HAAAAAAAA
00000000 : BRANCH-TO-H7FFFFFFF
This sequence changes, on average, 64 output pins per
cycle: 96 during the cycles with the stores, 32 on the branches,
assuming the buses do the most conservative thing ( float ) during
their non-driven periods. This produces 3 Watts of power dissipation.
Unless, of course, there's an on-board I-Cache, in which case
only 32 pins change per cycle, average ( 64 in the store cycles, 0 in
the branch cycles ), for a dissipation of 1 Watt.
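For what it's worth, the per-pin figure used above is consistent with the 2*C*V**2*F*N formula quoted from the earlier posting (a sketch assuming the same 30 pF load, 5 V swing, and 20 MHz rate):

```python
# Cross-check the 30 mW/pin number against P = 2 * C * V**2 * F * N.
C, V, F = 30e-12, 5.0, 20e6    # 30 pF per pin, 5 V swing, 20 MHz

def pin_power(n_pins):
    """AC dissipation for n_pins each making a transition every cycle."""
    return 2 * C * V**2 * F * n_pins

print(pin_power(1))    # ~0.03 W  -> the 30 mW per pin used above
print(pin_power(32))   # ~0.96 W  -> the "about 1 Watt" for a 32-bit bus
```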
] Also the cpu subsystem power consumption isn't much reduced by your
] observation, since it only applies to the address outputs on the cpu,
] which is a small fraction of the total pins in the system (the I-cache
] SRAM pins change every cycle even if the address doesn't change much).
How to handle an extra Watt of power dissipation on a chip is a
more difficult question than how to handle it in a board or
hybrid, or how to add it to your power supply.
] Hmm, I guess the address outputs in an 88000 system are a significant
] fraction of the pins, unlike the R3000, because you send the full 32b
] virtual address off chip, instead of only 18b as in the R3000. That
] wastes as much power as you save on average from the demultiplexing,
] doesn't it?
A 16-bit processor would consume less power than a 32-bit processor,
all else being equal. People build the 32-bit ones anyway. But I
digress. So, what can you address with 18 bits? 256K words? Not
enough. I'd be kinda surprised if MIPS's address bus were only 18 bits,
unless they have some kind of clever external mapping scheme.
] mp> Notice that the pin counts on the two packages are not that much
] mp> different (144 for the R3000 vs. 180 for the MC88100) and neither
] mp> are the power/grounds pin counts (30 for the R3000 vs. 36 for the
] mp> MC88100).
] MIPS designed its first RISC chip (the R2000) 4 years ago, and at that
] time the difference between 180 pins and 144 pins was significant.
] The i80386, designed at roughly the same time, actually has fewer
] pins -- 132. The 2nd generation (R3000) could have changed the
] interface, but we felt it was desirable to provide an upgrade path for
] R2000 designs.
Big-pin-count packages are a necessary evil. They cost more, take
up more real estate, have lower packaging yields, are more difficult
to reliably attach to boards, have more problems from thermal-
expansion coefficients, et cetera. PGAs can't be surface mounted,
which means you can't put them on both sides of the board, and
multi-layer routing gets painful. Fine-pitch pins have their
own problems.
The RPM40 uses a 132 pin LCC. It only outputs new instruction
addresses on branches, traps, et cetera. It uses the same pins
for instruction addresses as for operand addresses, even though
the data buses for each are separate. The instruction memory
system is therefore necessarily a pipelined look-ahead system,
much like those used on vector processors for vector fetches
( after all, instruction memory is essentially a set of vectors,
one vector per basic block ).
We felt we had to go this route, no choice : at the time ( late '85 )
the 132-pin LCC was the biggest thing we knew of that was fully
qualified at 40MHz, and trying to multiplex the instruction and
operand addresses at 80MHz was considered beyond CMOS capability.
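The RPM40-style fetch scheme described above can be modeled in a few lines (a hypothetical sketch; the class and method names are my own, purely to illustrate that sequential fetches need no instruction-address traffic on the shared pins):

```python
# Toy model: the CPU drives the shared address pins only on branches or
# traps; between branches, the external instruction-memory logic streams
# sequential words on its own, with no address-pin activity.
class LookaheadIMem:
    def __init__(self, memory):
        self.memory = memory
        self.pc = 0

    def redirect(self, target):
        """Branch/trap: the only time an instruction address crosses the pins."""
        self.pc = target

    def next_word(self):
        """Sequential fetch: external counter increments, address pins idle."""
        word = self.memory[self.pc]
        self.pc += 1
        return word

imem = LookaheadIMem(memory=list(range(100)))
straight_line = [imem.next_word() for _ in range(3)]   # no address traffic
imem.redirect(40)                                      # one address transfer
print(straight_line, imem.next_word())                 # [0, 1, 2] 40
```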
] Back when RISC vs. CISC was still an issue, people complained RISC
] required more bandwidth. We just provided the necessary bandwidth.
] When we need more, we'll add more. If that means modifying the
] R3000-style cache interface, so what?
] --
] UUCP: {ames,decwrl,prls,pyramid}!mips!earl
] USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
"RISC vs. CISC" is, I think, not so much an absolute issue
as an issue of which fits a particular technology better.
When memory speeds are up there with adder speeds, RISC wins.
When adders are much faster than memory, perhaps CISC does.
Different horses for different courses, I think.
Woops, gotta go.
--
Dennis O'Connor oconnor%sungod@steinmetz.UUCP ARPA: OCONNORDM@ge-crd.arpa
"Never confuse USENET with something that matters, like PIZZA."
hankd@pur-ee.UUCP (Hank Dietz) (07/09/88)
A while back, my brother and I sent the following abstract to ICCD88... we never heard anything as to how they liked it, but it seems a very reasonable thing to post here, given the recent discussion of power dissipation....

	Limiting Switching Transients in High-Speed Digital Processors

	Henry G. Dietz			Paul H. Dietz
	hankd@ee.ecn.purdue.edu		phd@speech1.cs.cmu.edu

Abstract

Pin counts on CMOS VLSI processors are currently very high and will probably continue to grow. This causes a variety of problems, not least of which is the possibility of encountering unacceptable switching transients when many output pins change state simultaneously. These transients can drastically reduce the noise immunity of internal gates, severely limiting performance. To limit the number of output pins simultaneously changing state, we propose to directly manage output requests on the basis of predictions of the switching transients implied in each output request. Each chip would be designed assuming a well-specified parameter, the Maximum Allowable Switching Transient (MAST), and an output request which could exceed the MAST would be serialized so that the MAST is not exceeded. This direct control of switching transients can be implemented in either a hardware-intensive or software-intensive style. The overall effect is that a processor chip may incorporate many pins, yet need not be designed to survive the worst case of all output pins attempting to change state simultaneously.

1. Background

There are a number of techniques that have been used to limit switching transients. These can be grouped into two major categories: reduction in the number of output pins that are active at any one time, or reduction of the observed transient itself. The number of output pins can be reduced by transmitting data serially or by time-multiplexing data buses to serve multiple functions. Alternatively, output times for various signals can be slightly skewed so that the outputs are not set simultaneously.
Unfortunately, the quest for higher operating speeds often precludes the obvious application of these techniques. To reduce the switching transient generated per output pin, some manufacturers have devoted large die areas to decoupling capacitors, but this is not practical for designs which are already pushing die-size constraints. Other manufacturers use off-chip capacitors mounted in the same package as the die, which can provide much larger decoupling capacitances; however, the series inductance inherent in going off-chip is greater, limiting the effectiveness. Another approach, perhaps more generally applicable, is to maintain separate power buses for output buffers and internal state logic [Car88]. Also, by careful design of the output buffer [GaT88], one can make buffer power consumption more consistent, hence reducing the worst-case values and achieving significant improvement. It is reasonable to assume that next-generation devices will incorporate some combination of these methods. Yet all of these techniques require that the chip be designed for the worst case; additional performance gains can be made by restricting simultaneous output operations only when the MAST otherwise would be exceeded.

2. Approach

There are two difficulties in directly controlling output state transitions based on potential MAST violation. The first problem is how to detect or predict when a MAST violation may occur; this may be done by placing the main burden either on hardware (detection) or on software (prediction). The second problem is, given that a particular output request would exceed the MAST if done simultaneously, how can the hardware arrange to perform the output pin state transitions without exceeding the MAST? We will discuss this second problem first.

2.1. Output Serialization

Given that a particular logically-simultaneous output operation would exceed the MAST, hardware must intervene to insure that the limit is observed.
This can be done by permitting only a fraction of the output pins to change simultaneously in one cycle and performing the rest of the output on successive cycles which are inserted just for that purpose. We say that such an output operation has been serialized. Although output serialization requires only relatively simple hardware, some care must be taken; for example, strobe/ready bits must change state only once the corresponding data bits are in the correct state. When the MAST is not exceeded, the requested output is performed in a single cycle. (In this case, the additional circuitry has no effect.) This is an efficient technique because, for example, localities in instruction address space often correspond to minor bit changes in the address outputs.[1]

In some cases, an optimizing compiler/linker/loader can significantly enhance this kind of locality; these code transformations are discussed in detail in the full version of the paper. A simple example of the type of optimization possible is to generate code so that jump and call targets (labels and function/procedure entries) are placed at addresses which differ from the invoking instruction's address in only a few bits (more precisely, causing changes in fewer than MAST bits). Another example is that a loop whose code would normally span a high-order memory address bit change could be moved to a portion of the address map where fewer address bits change. Even data-related outputs sometimes can be transformed to minimize simultaneous bit changes, either by careful layout of data structures or by recognition of properties of the operations being performed.

2.2. MAST Violation Detection/Prediction

As discussed in section 2.1, compiler technology (e.g., flow and other static analyses [AhS86] [DiC88]) can be used to predict, and hence to alleviate by code motion, etc., possible violations of the MAST.
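The output serialization of section 2.1 might be sketched as follows (a minimal sketch; the function names and the MAST value of 8 are my own assumptions, not from the paper):

```python
# If a requested output would switch more pins than the MAST allows,
# flip the pins in groups over successive inserted cycles instead.
def serialize_output(current, requested, mast=8):
    """Return the sequence of bus states driven, each differing from
    the previous state by at most `mast` bits."""
    changed = current ^ requested
    states = []
    while changed:
        group = 0
        for _ in range(mast):
            if not changed:
                break
            lsb = changed & -changed   # lowest pin still to be flipped
            group |= lsb
            changed ^= lsb
        current ^= group               # flip at most `mast` pins this cycle
        states.append(current)
    return states

# A full 32-bit swing is spread over four inserted cycles at MAST = 8;
# a small change (a sequential address step) completes in one cycle.
print([hex(s) for s in serialize_output(0x00000000, 0xFFFFFFFF)])
print([hex(s) for s in serialize_output(0x1000, 0x1004)])   # ['0x1004']
```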
This same compiler technology can be used to predict when the MAST will be violated and to directly encode that information in the instructions it generates; hardware would simply serialize any operation which the compiler tagged as suspect.[2] Of course, the compiler must conservatively assume that any operation which it can't prove is less than the MAST is actually greater than the MAST. This isn't always true: some output changes are unknown until runtime, and the compiler must assume that all of these change.

The more hardware-intensive alternative is to simply use a circuit to detect, at runtime, when a proposed output would actually exceed the MAST, and to invoke serialization only then. In the full-length paper, several techniques are presented for constructing such a circuit. Compared to software prediction, hardware detection insures that all outputs that can be done in a single cycle are so accomplished, whereas compiler tagging may cause some to be unnecessarily serialized. The trade-off is that the hardware is fairly complex and that the compiler cannot know precisely how long each instruction will take to execute (which reduces the effectiveness of many conventional compiler optimizations).

3. Conclusion

Using either the software-intensive or the hardware-intensive technique proposed, the concept of directly managing output pin state changes can provide substantial performance increases with only minor impact on the processor design. Typically, a circuit using these techniques will be running at or near its MAST, thus making the best possible use of the available bandwidth.

_________________________
[1] Although it might not be practical, use of Gray-coded rather than 2's-complement integers to represent addresses would insure that sequential addresses differ by only a single bit.
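Footnote [1]'s observation is easy to demonstrate (a sketch; `gray` is the standard reflected-binary-code conversion, not code from the paper):

```python
def gray(n):
    """Convert a binary integer to its reflected Gray code."""
    return n ^ (n >> 1)

def bits_changed(a, b):
    """Number of output pins that would switch between two addresses."""
    return bin(a ^ b).count("1")

# Worst case for binary counting: 0x7FFFFFFF -> 0x80000000 flips all 32 bits.
print(bits_changed(0x7FFFFFFF, 0x80000000))               # 32
# The same sequential step in Gray code flips exactly one bit.
print(bits_changed(gray(0x7FFFFFFF), gray(0x80000000)))   # 1
```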
[2] For those who would rather not place such faith in the compiler, a simple circuit can detect a glitch on the power bus, thereby detecting an instruction which the compiler failed to tag but which exceeds the MAST. The circuit would simply initiate a cold-start.

References

[AhS86] Aho, A. V., Sethi, R., and Ullman, J. D., Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, Massachusetts, 1986.
[Car88] Carley, L. R., personal communication, Jan. 31, 1988.
[DiC88] Dietz, H. G. and Chi, C-H., "A Compiler-Writer's View of GaAs Computer System Design," IEEE Proc. of HICSS-21, pp. 256-265, Jan. 1988.
[GaT88] Gabara, T. and Thompson, D., "Ground Bounce Control in CMOS Integrated Circuits," to appear in IEEE Proc. of International Solid-State Circuits Conference, 1988.
andrew@frip.gwd.tek.com (Andrew Klossner) (07/12/88)
Earl Killian wrote: "When you build a R3000-based MP, you don't have to limit the amount of cache per processor (unlike, e.g., the 88000, which allows only 16KB per processor in a 4-processor system)."
Presumably Earl meant "data cache" where he wrote "cache." Just to keep the record straight: in a 4-processor 88k system on a single M-bus (the bus implemented by the CMMU pinouts), each processor has a 16KB data cache and a 16KB instruction cache, for a total of 32KB of cache. If you want more than that, you need more than one M-bus. Some system designers are doing this.
-=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
(andrew%tekecs.tek.com@relay.cs.net) [ARPA]
Earl Killian wrote: "When you build a R3000-based MP, you don't have to limit the amount of cache per processor (unlike, e.g., the 88000, which allows only 16KB per processor in a 4-processor system)." Presumably Earl meant "data cache" where he wrote "cache." Just to keep the record straight: in a 4-processor 88k system on a single M-bus (the bus implemented by the CMMU pinouts), each processor has 16KB data cache and 16KB instruction cache, for a total of 32KB cache. If you want more than that, you need more than one M-bus. Some system designers are doing this. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]