mash@mips.COM (John Mashey) (03/11/88)
Randall Jesup says the RPM40 stuff has been beaten to death, and I agree, but I do have some data that may be useful, which I'd have posted before, but has taken a while to write up, given that I've been busy. As requested a while back by Mr. O'Connor, an analysis of the "40-MIPS" RPM-40 for workstation/general-purpose use (and only that): Summary: 40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP: (matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks) 40MIPS (peak native mips, no load delays, pipeline hazards, etc) 20MIPS-cycles (assuming no MMU, cache, or other memory system degradation), i.e., what a 20MHz MIPS R2000 would be without those effects. 14-15VUP (assuming MMU, cache, memory degradation same as MIPS R2000, and compiler technology identical to MIPSco's) [VUP: VAX 11/780 with good optimizers = 1 VUP, integer+fp]. 12 VUP realistic expectation of what one would run like in an actual general-purpose system, with the more vanilla compiler technology that is likely to be available. (correct me if wrong) The rest of this long posting: 1. Issues of general-purpose UNIX systems 2. Doing an analysis of a machine you only have a spec of 3. Detailed analysis 4. Conclusion ------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS--- 1. Fundamental issues, in which general-purpose UNIX systems may differ from dedicated, small-memory, embedded systems: a) Programs are much bigger, i.e., they may easily have many megabytes of code, and 100's of megabytes of data, for single processes. (lots already = 16-32MB, at least) Since they have lots of data, they also do many loads and stores. b) The UNIX kernel is real nasty for high-performance machines: it has many sequences that are very hard to pipeline/reorg, it touches little bits of data here and there, and it sometimes has behavior that causes trouble for on-chip assists of otherwise reasoanble size (like branch-target buffers & register windows). The kernel also uses byte and halfword data, (both signed and unsigned), heavily-intermixed, because it uses heavily-packed data structures, which, for various reasons are difficult to change. (Most user programs use halfwords less often). c) For general use, you sometimes have to make worst-case assumptions on the sizes of addresses, for example. You usually must assume that addresses are "large", rather than "small", because you have no idea how big something is until it's linked, and because you may be compiling code for libraries where you can't know how big the final objects will be. Hence, shortcuts often good in dedicated environments (i.e., "short" addresses are OK) don't work. 2. Doing the analysis. How do you get an estimate of performance for a machine you have no actual measurement data for? Here's a possible process: 1) Find one or more machines that: -you have substantial data for (in this case, MIPS R2000) -are related to the machine to be estimated, where related means that it's fairly easy to do direct conversions to the machine to be estimated, and compute the architectural impacts. 2) Feature by feature, do the recalibration, assuming the same software. I.e., start with a base cycle count, and degrade as appropriate, or improve as appropriate. 3) Having done the instruction-cycle analysis, degrade for cache, MMU, and RISC->VAX ratio effects. 3) Finally, make some guesses on the impact on software of the architecture [after all, RISC design is half software.] In general, we'll make best-case assumptions, then come back and do guesses to recalibrate the optimism. 3. Detailed Analysis The following lists features, with assumptions about them (if I'm wrong about the assumptions, I'm happy to be corrected, and if somebody else wants to use the data here to redo an analysis, that's fine. I'm just using what's published, plus the additional comments). This analysis looks at the raw cycle counts of the RPM40 as compared with a MIPS R2000, (as the latter is the only one that I have detailed enough data for; I might have used Stanford MIPS, as it is somewhat closer, but I don't have the data). I used a set of substantial user programs (as, ccom, doducd, espresso, gnuchess, hspice, nroff, optimizer, code generator, timberwolf), written in C, PASCAL, or FORTRAN. I don't have kernel numbers handy, but I'll occasionally note how the kernel differs. Note that we first analyze the instruction cycle counts, without regard to MMU and cache-miss behavior. The "impact" number is the cycle-count degradation expected, i.e., the "expansion" in number of cycles (over 1) due to the feature, as compared with a "mythical" 40MHZ R2000. Items marked * have a little more (but educated) guesswork than the others, which are computed with high confidence from extensive data. Items marked "+" are relevant to the 16-vs-32 issue. A1. .336 <= 3 cycles of load latency R2000 data: 21% of instruction cycles are loads (16-29%) 68% of the (single) load-delay slots are filled (48-88%) (The usual numbers that I remember, but can't find the reference (and it wasn't bcase's posting), are that the expected fill rates for load-delays are: 1: 70% (we got 68% on this bunch, and it contained one nasty program that only filled 48%). 2: 30% 3: 10% A1A: assuming the RPM has no more loads than an R2000 A1B: assuming that similar-quality optimizers and reorganizers are used cycle-expansion = .21 * ( (1-.3) + (1-.1)) = .336 (remember, we already counted the R2000's load-nops: the RPM40 would have equivalent stalls, in the same places. These are the extras. (kernel is worse on this) A2.+ .279 <= Loads/stores use 4-bit immediates R2000 data: 30.1% of instructions are loads+stores (22-38%) 7.1% of the loads/stores could use a 4-bit offset (3.2-18%) R2000 load/stores have 16-bit offsets, which require the RPM40 2 cycles to obtain: cycle-expansion = .301 * (1-.071) = .279 Note: one might argue that there are clever code generation methods to reduce the number of places that addresses need to be calculated, and there are. The MIPS compilers already do most of them, such as computing addresses, or parts of addresses, and saving them in registers, moving them outside of loops, etc. The extra bits for larger offsets show up as LUIs (later). A3.+ .035 <= Add/sub immediates [.050] R2000 data: 8.4% (5.8-12.6%) are add/subtract immediates 3.5-5% of the total instructions are add/sub immediates whose constants don't fit into 4 bits. The 3.5% assumes there are both add/sub immediate (i.e., 4 bits + implied sign). If there is really just add-immed, and you get 3 bits + sign, use the number 5% cycle-expansion = .035 * 1 = .035 A4.+ .013 <= Compare immediates (COND) R2000 data: About 1.7% [.2 - 4.8%] are compare-immediates or equivalent About 1.3% (of the total instructions) need >4 bits. About 1.3% fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = .013 * 1 = .013 A5.+ .013 <= Logical immediates R2000 data: about 1.7% [.4-5.7%] are logical-immediates About 1.3% fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = 0.013 * 1 = 0.013 Note: kernel or other systems-type programs have higher rates of this. A6.+ .011 <= Load-immediate This is used to stick constants in registers for arguments, compares, etc. (it's actually an add to zero, or something like that). I assume the RPM-40 has an equivalent. R2000 data: about 2.1% (.5-4.7%) are load-immediates About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus require 1 PFX: cycle-expansion = 0.011 * 1 = 0.011 A7.+ .018 <= Load-upper-immediate This puts zeroes in low bits, and 16 bits in the top of a register. It's often used for setting up 32-bit address, for example, or a long constant. R2000 data: 1.9% (.5-6.9%) are LUIs About 1.8% (of the total) need the top 4 bits, and thus would need an 2 PFXs, rather than 1 (or 4, rather than 3): cycle-expansion = .018 * 1 = .018 A8. .024 <= Jump-and-link JAL is done in the RPM-40 by MOV (PC) somplace; BRA... R2000 data: 1.2% of the instruction cycles (.3->2.0%) are calls (JAL, JALR), which give 28-bits of byte addressing. On the RPM-40, a JAL is done by a 2-instruction pair, which I assume is a MOV of the PC to the return-address register, followed by a branch, or something equivalent, although I haven't seen the exact sequence. I assume that the return jump is a single cycle. Assuming programs of the size above, the natural code to generate would be: MOV (to save PC) PFX BRA i.e., because 12-bits of displacement are enough for local branches, (99+% on R2000), but not for globals. This adds 2 cycles/call. Assuming that branches get an extra bit of addressabiltiy (halfword alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code, which is adequate for all of the cited programs, although not so for others. If by some chance the RPM uses a byte offset, getting only 12+12, (16MB), then it would, in normal use in general environment, probably take a hit of another 1.2% for a 2nd PFX, since there are already programs bigger than that floating around. cycle-expansion: .012 * 2 = .024 A9*. .010 <= Load/store hazard The RPM40 is supposed to have a load/store hazard if a store occurs at the exact right number of cycles after a load. This is not data we keep (since we don't have this hazard), and it's hard to compute. Here's a quick guess: R2000 data: about 10% of the instructions are stores, and 20% loads. Assume a 1-cycle hazard and a random distribution of instructions. Then, on the average, a store would find a load in the wrong spot 20% of the time, i.e. cycle-expansion: .1 (fraction of stores) * .2 (fraction of loads) = .02 Let's assume the reorganizer can fix half of these (optimistic, given that we've found loads/stores are the hardest to move around). Thus, we get the cycle penalty as .01. If the hazard is multi-cycle, this goes up, or if the reorganizer has a hard time, it goes up. A10*.+ .098 <= Conditional branch Since the RPM40 has no conditional branch that has an address, the R2000's equivalents would be done with COND/BRA pairs. This requires a little estimation, as it is not something we track directly, i.e., those cases where the COND is able to skip just 1-instruction-equivalent (in the RPM40 sense), without having to change into the COND/BRA pair. R2000 data: 11.4% (6-17%) of instruction cycles are conditional branches. Approximately 30% of branches are backwards branches, which of course cannot be done with just COND. To get a quick estimate of the rest, I disassembled a few programs, and eyeballed the forward conditional branches, and assumed that conditional forward branches that jumped no more than 8 instructions might end up being CONDs that just skipped. A few quick looks convinced me this was less than 10%. However, I'll be conservative and allow 20%, i.e., .8 (of the 70% forward branches) cost a cycle. cycle expansion: .114 * (.3 + .7 * .8) * 1 = .098 A11*.+ .010 <= Partial-word load/store The RPM40 does load/store of partial-word by setting a register (SR2) with a field that gives the type of access, then doing the load/store with "non-word" type. R2000 data: 6% of instruction cycles [0-19% !!-varies heavily] are partial-word load/stores. Systems-type code is usually heavier, numerical code lighter on these. Assuming that it costs just 1 cycle to set SR2 (and it might be 2, if it requires an OR/XOR pair or equivalent), we can guess how often this has to happen: Simple code generator: every time: 6% hit. Smarter code generator: track within basic block Global optimizer: track everywhere in a function. Under the best case, a function must treat SR2 as a register to be saved/restored if modified (they may well already do this), or else, it can be considered "dead" across function calls, and reset appropriately. Systems code, especially with lots of small functions, will feel a hit from this more than numeric code. I'll guess that there is a cycle hit (either from doing unnecessary save/restore, or having to reset this 1/6 times): cycle expansion: .01 * 1 = .01 Note: this would definitely be worse in the kernel. A12*. .033 <= Misses due to branch-target-cache misse According to the talk, there was supposed to be a hit-rate >90%. I have no data on the kinds of programs used to calibrate that. Big programs are clearly worse than little programs in this regard; for example, a call thru the UNIX kernel (like read) misses most of the time, because it often goes 10 levels deep (10 function entries + 10 return-sites = 20, leaving only 12 others for all other branches). R2000 data: over a set of similar benchmarks (including most of those above), 11.4% of the instructions are non-sequential, i.e., taken conditional branches plus all unconditional branches. Assuming a 3-cycle penalty for a miss (10% of the time), this gives: cycle expansion = .114 * (.1 * 3) = .033 A13*. .050 <= Lack of ALU forwarding From all of the discussion, I can't tell how much bypassing the RPM40 does, or doesn't do. From the various hints, it sounds like: a) you can store the output of an ALU op with no delay b) you can't otherwise use the result of an immediately-preceding ALU op. It would be nice if people said what is really going on, but given the sprited defense of non-bypassing, this might be a reasonable guess. This is another one we don't keep detailed statistics on (since we bypass everything). I disassembled several programs, picked randomly- chosen sequences of code (typically about 200 instructions), and eyeballed them to see what % of them would take 1-cycle hits if there were no bypassing, and otherwise following what would be the RPM40 rules, i.e., using PFX instead of R2000's immediates, etc. I eliminated cases where it looked like a reorganizer could get rid of them (not many, this is already heavily reorged code, and since many programs have only 5-10 instructions per basic block, and it is difficult to rearrange this stuff thru basic blocks, it is fairly easy to make a good guess on local information). Of course the numbers are static, not dynamic. On the other hand, they're probably a reasonable guess (and note: this is not the bypassing hit an R2000 would take, it's the hit it would take if its code were transformed into an RPM40's): The 3 samples I took gave 2.5, 5.6, and 7% hits from this, mostly because code doesn't easily migrate amongst basic blocks, other than at the boundaries. NOTE: after I did this, I see Earl Killian got the dynamic numbers for the R2000, which are as expected, much higher, given that we have forwarding and use it everywhere. I'll guess 5%, which I think is reasonably conservative: cycle expansion: .05 A14*. .020 <= Multiply-divide I'm not exactly sure how this is implemented on the RPM, although it probably doesn't have a fast multiplier on the CPU. (Maybe it's on the FPU, in which case some of this would go away). R2000 data: across the benchmarks, about 2.7% of the time was spent in multiply/divide interlock cycles [we have a 12/35 cycle multiple/divider], which includes the effects of having some instructions scheduled into the latency of an asynchronous unit. Assume that the interlock cycles are split 50/50 (faster mults, but more mults than divides, and that RPM's mults take about 3X, i.e., +2 factor, you get: cycle expansion: .013 * 2 = .026, which I'll round down to 2. A15*.+ .040 <= 2-address registers, rather than 3-address ones This is another one that took some guessing, and is especially hard to know, given the willingness of our optimizers to keep things in registers rather than refetching them. From a static eyeballing of dis-assembled code, I saw 3-7% of 3-address ALU ops that looked like they needed to be that way, i.e., where a 2-address op would take a 1-cycle hit. I'll guess conservatively, 4%. cycle expansion: .04 A16*. .039 <= Less registers (21 instead of 32) This definitely takes some guessing, as it has been a while since we bounced the compiler's register allocation around. We do know we were getting good use of the extra registers up through 24-28, which is the number we actually left for normal code. (We hear that similar results were gotten at IBM & HP, but have no reference, as the "hearing" was from bar conversations.) Guess: assume having less registers (and not as symmetric) costs 5% more loads/stores, which may well be (the R2000 has just enough registers that it seldom saves/restores any on a call to a leaf). Assuming 20% loads and 10% stores (close), we get: cycle expansion = .05 * (.2 + .1) = .015 (basic instrs) plus (the load-delays as in A1, except now the .3 unfilled 1st delay slots are extra): .05 * (.3 + .7 + .9) = .01 plus the 93% of ld+st that need >4 bits offset: .015 * .93 = .014 Total: .015 + .01 + .014= .039 Note: this is more of an effect for large, complex programs than small ones. A cross-check is that you get the same number if you assume even 1 more register is saved/restored on the average per function call, (PFX+SW, PFX+LW for 4 cycles), with an average of 100 instructions/call (representative). A17*. .020 <= Architectural Reorganization issues Several of the times above related to reorganization:(A1, A9, A13). A number of factors appear to make the RPM40 more difficult to reorganize for: 1) PFX instructions are difficult, if not impossible to move around away from the instructions prefixed (unlike the R2000's style of using a bypassed GP register). 2) The instruction that sets the SR2 to get partial-word operations is hard to move "too far" away from the instruction(s) that need it. 3) The load/store pipeline hazard must be taken care of. 4) If there is no forwarding, that has to be reorganized also. Reorganization is very important for many RISC processors. The RPM40 has a some extra things to worry about, and one less (R2000 branch delay slot). I'd guess the overall hit to be 2%. (Note: this is the ripple effect of all of this together, not of the individual pieces, which we've already counted.) A18*. ?? <= Coprocessor issues I haven't really touched on this very much, as we don't know much, except that even a few cycles extra latency getting to a floating-point unit can hurt a lot, except in applications that naturally pipeline very well, or if the FPU has long cycle-count operations in the first place. XPLD without XPST is a little puzzling. A19*. -.067 <= Contraction issue (R2000 branch nops) The R2000 loses 6-7% to unfilled branch delay slots, which the RPM40 does not. (of course, the RPM40 takes hits in other areas of branching, but we've already included them). A20* .010 <= Miscellaneous There are a bunch of integer-related issues that I can only guess at, but observing that there are 4 bits in the opcode field for ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's ops are found in the RPM, although I don't know which ones they might be. Also, if the immediate field encodes 16-bit shifts, that will help, and hurt, if not. Bottom line, given everything I know: A1. .336 <= 3 cycles of load latency @ A2.+ .279 <= Loads/stores use 4-bit immediates @ A3.+ .035 <= Add/sub immediates @ A4.+ .013 <= Compare immediates @ A5.+ .013 <= Logical immediates @ A6.+ .011 <= Load-immediate @ A7.+ .018 <= Load-upper-immediate @ A8.+ .024 <= Jump-and-link @ A9*. .010 <= Load/store hazard A10*.+ .098 <= Conditional branch A11*.+ .010 <= Partial-word load/store @ A12*. .033 <= Misses due to branch-target-cache misse A13*. .050 <= Lack of ALU forwarding A14*. .020 <= Multiply-divide A15*.+ .040 <= 2-address registers, rather than 3-address ones A16*.+ .039 <= Less registers (21 instead of 32, sort of A17*. .020 <= Architectural Reorganization issues A18*. ?? <= Coprocessor issues A19*. -.067 <= Contraction issue (R2000 branch nops) A20* .010 <= Miscellaneous Total 0.992 cycle expansion As can be seen, a few items account for big pieces, but the little ones add up to a lot of cycles if there are enough of them (as Everett Dirksen once said, referring to the US budget, "a billion here, a billion there, sooner or later it adds up to real money.") (or real cycles) Grossly, what this means is that an RPM40, with equivalent compiler technology, takes about (1+.99) (i.e. about 2, well within the accuracy of this analysis!) instruction cycles more than an R2000, RUNNING LARGE PROGRAMS. Some of the analysis may be wrong, but the fundamentals are there. Thus, for cycle counts (ignoring cache-miss & MMU overhead), a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e., it would run twice as many (instruction cycles + delay cycles). Now, a 16.7Hz R2000 [the fastest we've actually shipped], with 64K I + 64K D-caches, acts like a 12-VUP (VAX-mips, versus 11/780 with good compilers, integer+float) machine, running realistic programs. Let us note that these were in production 4Q87, and do require a low-latency memory system (as in the Whitechapel Workstation version, for example). Thus, the ratio is 16.7MHz/12 VUP = 1.39 cycles/VUP, to go from raw cycles (including delays, but not MMU or cache-overhead), to actual measured performance. To estimate the performance of either a 20MHz R2000 or 40MHz RPM40, and assuming that the RPM40's floating-point is as good as the R2000+R2010, and that it handles MMU and cache control as well as the R2000 (which does have these built in), and generally ignoring the scale-up effects of relatively slower main memory for both of them: 20 / 1.39 = 14.4 VUP which is well inside "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates given by various of the GE folks. So far, all of this has been architectural, i.e., assuming that the RPM40's software was as close to the R2000's as possible, i.e., what would it be if they had our compilers. It is hard to compare against something you don't know, but I would observe: a) Our compiler suite took years of effort by people with tons of experience, including people on their 2nd or 3rd set of RISC compilers. b) This is not an advertisement, but those who've seen our compiler technology have some respect for it. In addition, the RPM40 will be slightly more difficult to do a good global optimizer for: assymetries in the register set(i.e., the ones above 15) the flag bits for word-type turn into something that needs to be tracked and some of the trickeries we used (like the global-register pointer) are less useful on a machine with less registers and shorter offsets. I don't think anyone who's actually seen our compilers go would complain too much if I knocked off 25% for more vanilla compiler technology [this is well within our experiences looking at different levels of optimization.] This one is arguable, but gets at the "if you had this machine today, how would it really act? comparison. 14.4 VUP (for RPM40) * .75 = 12 VUP Given that architectural tradeoffs are what they are, it's always hard to separate a single feature out, but I trust this has given some ideas on the 16vs32-bit instruction issue, and backs up Craig Hansen's thoughts about cycle-count increases on 16-bitters. 4. Conclusion This analysis is only relevant to running substantial programs of the kinds found on general-purpose machines. The RPM may well do relatively better in more embedded-systems environments where the tradeoffs work better, and there is nothing wrong/irrelevant with those environments; they just aren't workstation environments, and whatever one learns in either one doesn't necessarily translate to the other, BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS. I apologize for any errors in this analysis, which took a fair amount of time to put together, given the sketchiness of the info, but which, I believe, is accurate to within the correctness of the assumptions, and does reflect a modest amount of experience with such things. Feel free to fix it if it's wrong, and if it matters, but as Mr. Jesup says, this has been beaten to death. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jesup@pawl8.pawl.rpi.edu (Randell E. Jesup) (03/11/88)
In article <1840@winchester.mips.COM> mash@mips.COM (John Mashey) writes: >Randall Jesup says the RPM40 stuff has been beaten to death, and I agree, >but I do have some data that may be useful, which I'd have posted before, >but has taken a while to write up, given that I've been busy. And I thank you for it. Numbers/real comparison stuff is great, "my chip is better than yours" stuff get old fast. I'll keep my comments to a minimum, given the length of the original posting. >Summary: > 40MIPS peak -> 20MIPS-cycles -> 14-15 VUP -> 12 VUP: > (matches 7-9X 68020, or 2-6X Sun-3/260 estimates given by GE folks) >------STOP NOW UNLESS YOU ARE A GLUTTON FOR DETAILED ARCHITECTURAL ANALYSIS--- > a) Programs are much bigger, i.e., they may easily have > many megabytes of code, and 100's of megabytes of data, for > single processes. (lots already = 16-32MB, at least) The average program run isn't anywhere near that big, though. > c) For general use, you sometimes have to make worst-case > assumptions on the sizes of addresses, for example. You usually > must assume that addresses are "large", rather than "small", > because you have no idea how big something is until it's linked, > and because you may be compiling code for libraries where you > can't know how big the final objects will be. Hence, shortcuts > often good in dedicated environments (i.e., "short" addresses are > OK) don't work. Here I have to disagree with you. How about this: The object code is compressed assembler, the linker links (with compressed assembler libraries) and does reorganization/assembly. This allows you to always know the process instruction memory size (though the data side is a bit tougher). >Items marked * have a little more (but educated) guesswork than the others, >which are computed with high confidence from extensive data. >Items marked "+" are relevant to the 16-vs-32 issue. >A1. .336 <= 3 cycles of load latency > R2000 data: 21% of instruction cycles are loads (16-29%) > expected fill rates for load-delays are: > 1: 70% (we got 68% on this bunch, and it contained one nasty > program that only filled 48%). > 2: 30% > 3: 10% Here another affect come into play: two address ALU instructions. When the compiler wants 3-address, it has to generate 2 ALU instructions. This means a slightly higher percentage of ALU ops to load/store ops, and therefor better filling of load delays. That should increase the 30% and 10% a fair amount, but I don't know how much. Also, ALU ops with prefixes will help here to, filling otherwise useless cycles. > A1B: assuming that similar-quality optimizers and reorganizers are used See note above re: linker; otherwise OK. >A2.+ .279 <= Loads/stores use 4-bit immediates > 7.1% of the loads/stores could use a 4-bit offset (3.2-18%) > R2000 load/stores have 16-bit offsets, which require the RPM40 > 2 cycles to obtain: Note: for word load/stores, that 4 bit immediate is shifted left 2, effectively 6 bits in byte addressing. Halfwords (minor) get shift left 1. Also, what about >16 bit offsets? Obviously, the R2000 will do a mov; add immediate 32-bit; load sequence (well, a guess), so it hard to compare here. Luckily, >16 bit offsets are rare (note the shifting comes into play again here.) >A3.+ .035 <= Add/sub immediates [.050] > The 3.5% assumes there are both add/sub immediate (i.e., > 4 bits + implied sign). If there is really just add-immed, > and you get 3 bits + sign, use the number 5% I'm not sure I understand you, for reference the leftmost bit of the immediate is extended. All ALU ops can use immediates. >A4.+ .013 <= Compare immediates (COND) > About 1.3% fit in 16-bits, but not 4-bits, and thus > require 1 PFX: Once again, what about >16 bits? (Not TOO common, but not totally uncommon either. >A6.+ .011 <= Load-immediate > This is used to stick constants in registers for arguments, > compares, etc. (it's actually an add to zero, or something like that). > I assume the RPM-40 has an equivalent. It's MOV with the source immediate. (Remember, no 3-addr ALU ops) > About 1.1% (of total) fit in 16-bits, but not 4-bits, and thus > require 1 PFX: > 16 Bits? >A7.+ .018 <= Load-upper-immediate > This puts zeroes in low bits, and 16 bits in the top of a register. > It's often used for setting up 32-bit address, for example, > or a long constant. > R2000 data: 1.9% (.5-6.9%) are LUIs > About 1.8% (of the total) need the top 4 bits, and thus > would need an 2 PFXs, rather than 1 (or 4, rather than 3): One NEVER needs more than 3. 3 PFX's + 4 bits in the instruction = 40 bits. 2 PFX's + 4 bits = 28. 1 PFX + 4 bits = 16. (this is for ALU ops and loads/stores. For branches/XP stuff, the instruction has more immediate bits available (12). >A8. .024 <= Jump-and-link > JAL is done in the RPM-40 by MOV (PC) somplace; BRA... Close, no cigar. BRA; <fill>; MOV or STW. This helps fill branch delay, which otherwise might be tough (subroutine calls are worse than other types of branches, usually). > R2000 data: 1.2% of the instruction cycles (.3->2.0%) are > calls (JAL, JALR), which give 28-bits of byte addressing. > On the RPM-40, a JAL is done by a 2-instruction pair, which > I assume is a MOV of the PC to the return-address register, > followed by a branch, or something equivalent, although I > haven't seen the exact sequence. I assume that the return > jump is a single cycle. > Assuming programs of the size above, the natural code to > generate would be: > MOV (to save PC) > PFX > BRA > i.e., because 12-bits of displacement are enough for local branches, > (99+% on R2000), but not for globals. This adds 2 cycles/call. For internal calls, it's often enough. Do you have data on call distances? I would suspect even in moderate sized programs it's at least 10%, maybe as high as 25%; and in small utilities 25%+. > Assuming that branches get an extra bit of addressabiltiy (halfword > alignment), the RPM gets 13 (BRA) + 12 (PFX) bits, for 32MB of code, > which is adequate for all of the cited programs, although not so > for others. If by some chance the RPM uses a byte offset, getting Yup, 32 Meg, you got it. Why generate bits that are always 0? > cycle-expansion: .012 * 2 = .024 Net result: .012 * 1 = .012 >A9*. .010 <= Load/store hazard > The RPM40 is supposed to have a load/store hazard if a store occurs > at the exact right number of cycles after a load. This is not data > we keep (since we don't have this hazard), and it's hard to compute. > Here's a quick guess: It's a bit wierder than you suppose, but the number sounds OK. >A12*. .033 <= Misses due to branch-target-cache misse > According to the talk, there was supposed to be a hit-rate >90%. > I have no data on the kinds of programs used to calibrate that. > Big programs are clearly worse than little programs in this regard; Our data coimes from BIG vax application programs, if I remember correctly. (BIG!) With small programs it approaches 99%. I'm assuming this was talked about at ISSCC, since the cache in general was: with a good optimizer/reorganizer, you can override the hardware control of replacement of targets, and take over <some, all> of the targets for whatever addresses you like. What this actually gets you in practice is a good question. MAJOR BUG: Your number left something out: you must subtract whatever the R2000's cache miss times penalty from the RPM40's. I'll assume it all equals out (we MAY win, may not). >A13*. .050 <= Lack of ALU forwarding > From all of the discussion, I can't tell how much bypassing the > RPM40 does, or doesn't do. From the various hints, it sounds like: > a) you can store the output of an ALU op with no delay > b) you can't otherwise use the result of an immediately-preceding > ALU op. Correct. > I'll guess 5%, which I think is reasonably conservative: > cycle expansion: .05 Seems OK. As you said, prefixes help here. >A14*. .020 <= Multiply-divide > I'm not exactly sure how this is implemented on the RPM, > although it probably doesn't have a fast multiplier on the CPU. > (Maybe it's on the FPU, in which case some of this would go away). I won't say how we do it (yet), but 32x32 multiplies are 15 or 16 (can't remember exactly) cycles. Divides take longer but about equivalent to R2000. > R2000 data: across the benchmarks, about 2.7% of the time was > spent in multiply/divide interlock cycles [we have a 12/35 cycle > multiple/divider], which includes the effects of having some > instructions scheduled into the latency of an asynchronous unit. > Assume that the interlock cycles are split 50/50 (faster mults, > but more mults than divides, and that RPM's mults take about 3X, > i.e., +2 factor, you get: > cycle expansion: .013 * 2 = .026, which I'll round down to 2. So this drops to just a little. >A15*.+ .040 <= 2-address registers, rather than 3-address ones Sounds OK, tough to figure numbers on (basic architecture diff). >A16*. .039 <= Less registers (21 instead of 32) > Note: this is more of an effect for large, complex programs than > small ones. A cross-check is that you get the same number if > you assume even 1 more register is saved/restored on the average per > function call, (PFX+SW, PFX+LW for 4 cycles), with an average > of 100 instructions/call (representative). This also seems reasonable. >A17*. .020 <= Architectural Reorganization issues > Several of the times above related to reorganization:(A1, A9, A13). > A number of factors appear to make the RPM40 more difficult to > reorganize for: > 1) PFX instructions are difficult, if not impossible to move > around away from the instructions prefixed (unlike the R2000's > style of using a bypassed GP register). True, though they do help a bit in filling load delays. > 2) The instruction that sets the SR2 to get partial-word operations > is hard to move "too far" away from the instruction(s) that need it. Also true, luckily most subroutines only use one of {signed,unsigned} {halfword,byte}, if they use any at all. Also, global optimizers allow better knowlege of what subroutine calls do to Sr2. > 3) The load/store pipeline hazard must be taken care of. True, but it's easy. In practice it doesn't hurt much. > 4) If there is no forwarding, that has to be reorganized also. > Reorganization is very important for many RISC processors. > The RPM40 has a some extra things to worry about, and one less > (R2000 branch delay slot). I'd guess the overall hit to be 2%. Huh? We have a two-cycle branch delay to fill (did you think we had none???) Experience on RPM40 shows it can be filled fairly well (good place for stores, for example.) >A18*. ?? <= Coprocessor issues > I haven't really touched on this very much, as we don't know much, > except that even a few cycles extra latency getting to a floating-point > unit can hurt a lot, except in applications that naturally pipeline > very well, or if the FPU has long cycle-count operations in the first > place. XPLD without XPST is a little puzzling. Note that load = approx 2x store frequency. Should be obvious (and there are other ways to load/store XP stuff.) Suffice it to say, given the envirionment we designed for, XP stuff was VERY carefully designed. There should be no loss, and probably even a gain, vs R2000. >A19*. -.067 <= Contraction issue (R2000 branch nops) > The R2000 loses 6-7% to unfilled branch delay slots, > which the RPM40 does not. (of course, the RPM40 takes hits in other > areas of branching, but we've already included them). Drop this (though we do get better than average stats for branch fills, I predict, especially on CALL's). >A20* .010 <= Miscellaneous > There are a bunch of integer-related issues that I can only guess > at, but observing that there are 4 bits in the opcode field for > ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's > ops are found in the RPM, although I don't know which ones they > might be. Also, if the immediate field encodes 16-bit shifts, > that will help, and hurt, if not. Actually, we had leftovers we had to figure out what to use for. (one became RADD (I think this was my idea; Dennis, do you remember?)) Shifts/rotates are like all other ALU ops re: immediates. >Bottom line, given everything I know: I'll show results modified re: the above comments. >A1. .336 <= 3 cycles of load latency @ .300 guess >A2.+ .279 <= Loads/stores use 4-bit immediates @ .240 guessed at % that could use a 6 bit immediate (effective) >A3.+ .035 <= Add/sub immediates @ >A4.+ .013 <= Compare immediates @ >A5.+ .013 <= Logical immediates @ >A6.+ .011 <= Load-immediate @ >A7.+ .018 <= Load-upper-immediate @ .001 for a process smaller than 28 bits of instructions, we don't ever need to load the top 4 bits for address constants. For integer constants, if the top 5 are all 1 or 0, we don't need the top 4 (almost always the case). >A8.+ .024 <= Jump-and-link @ .012 see above >A9*. .010 <= Load/store hazard .005 I think we can do better at avoiding the problem, loads tend to cluster near beginnings of block, stores at the end. Not big either way. >A10*.+ .098 <= Conditional branch >A11*.+ .010 <= Partial-word load/store @ >A12*. .033 <= Misses due to branch-target-cache misse .000 mistake - R2000 has misses too. >A13*. .050 <= Lack of ALU forwarding >A14*. .020 <= Multiply-divide .005 See above >A15*.+ .040 <= 2-address registers, rather than 3-address ones >A16*.+ .039 <= Less registers (21 instead of 32, sort of >A17*. .020 <= Architectural Reorganization issues >A18*. ?? <= Coprocessor issues >A19*. -.067 <= Contraction issue (R2000 branch nops) .000 Misconception >A20* .010 <= Miscellaneous .000 I don't really think we lose anything signifigant here > >Total 0.992 cycle expansion Amusing: I came out with (after my mods above): 0.892 Pretty close to what you have, all in all. > Thus, for cycle counts (ignoring cache-miss & MMU overhead), > a 40MHz RPM would act more-or-less like a 20MHz R2000, i.e., > it would run twice as many (instruction cycles + delay cycles). Effectively, I get close to the same (1.9 RPM cycles = 1 R2000 cycle). > 20 / 1.39 = 14.4 VUP > >which is well inside "7-9X a 16.7MHz 68020 or 2-6X a Sun-3/260" estimates >given by various of the GE folks. Wow. I guess we are pretty good guessers (actually, the numbers came from things like this, but against 68000, 1750, etc, etc.) >So far, all of this has been architectural, i.e., assuming that the >RPM40's software was as close to the R2000's as possible, i.e., >what would it be if they had our compilers. It is hard to compare >against something you don't know, but I would observe: [stuff about the good MIPS compilers, estimates that equals 25% loss to rpm40] Well, since "GE is not in the computer business", I doubt we'll ever know for sure. But I'm not interested in what GE/USG does with it, just what the architecture wins or loses. >4. Conclusion >This analysis is only relevant to running substantial programs >of the kinds found on general-purpose machines. The RPM may well >do relatively better in more embedded-systems environments where the >tradeoffs work better, and there is nothing wrong/irrelevant with those >environments; they just aren't workstation environments, and whatever >one learns in either one doesn't necessarily translate to the other, >BECAUSE THE KINDS OF PROGRAMS YOU'RE RUNNING MAY HAVE DIFFERENT STATISTICS. Yup! I suspect that for the instruction mixes we optimized for, we do somewhat better than the 1.9x I arrived at vs R2000 (BIG programs, but different types of BIG programs (hint hint)). Maybe much better (1x). >I apologize for any errors in this analysis, which took a fair >amount of time to put together, given the sketchiness of the info, >but which, I believe, is accurate to within the correctness >of the assumptions, and does reflect a modest amount of experience >with such things. Feel free to fix it if it's wrong, and if it matters, >but as Mr. Jesup says, this has been beaten to death. The errors were minor, and helped to correct a few misconceptions about what the rpm-40 instruction set does (ex: branches and CALL). I thank you for all the work you did on this, this is the type of thing I read comp.arch for. It is pretty definitive, no need to quibble over minor points. As I said in a previous message, we didn't optimize for workstations. We can do them, but it's a toss-up (confirmed by this article) whether we're better in a workstation envirionment than the R2000. Of course, in different cases we do better (even large cases, just different problems/instruction mixes). No suprise. The R2000 does very well at what it was designed for, Unix boxes and workstations, and we read whatever articles you had published when we were designing the Rpm-40. >-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> >UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com >DDD: 408-991-0253 or 408-720-1700, x253 >USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 // Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// beowulf!lunge!jesup@steinmetz.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/11/88)
This posting is gonna be short on real info, but... I just had to thank John Mashey for his analysis. It looks very good. I'll go over it in detail and e-mail any comments I have back to him. Nice work, John. WE should have done it. Unfortunately, since GE "is not in the computer business", the broad-based and thorough knowledge of UNIX and compilers of which you are possesed is not directly available to us : We only know what we read in the papers, essentially. Sigh. And yes, I have to be truthful : although I'd like to think that the architecture has SOMETHING to do with RPM40 performance, maybe just a little comes from the VHSIC-compatable 1.25 micron AVLSI process. Well, maybe more than a little :-) -- Dennis O'Connor oconnor%sungod@steinmetz.UUCP ARPA: OCONNORDM@ge-crd.arpa ( I wish I could be civil all the time, like Eugene Miya ) (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
mash@mips.COM (John Mashey) (03/14/88)
In article <9895@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes: >This posting is gonna be short on real info, but... > I just had to thank John Mashey for his analysis. It looks > very good. I'll go over it in detail and e-mail any > comments I have back to him.... thanx for the kind words. I'll be looking forward to the comments. You may want to send me a test message though: I've tried several times to send mail to ....!steinmetz!.., by various routes, and the net demons are still winning. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (03/16/88)
In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes: ... > And I thank you for it. Numbers/real comparison stuff is great, >"my chip is better than yours" stuff get old fast. I'll keep my comments >to a minimum, given the length of the original posting. Yes, this is much more useful. Thanx for the clarifications. This posting is mostly comments in response to your comments where I wasn't sufficiently clear on the first posting. >>A1. .336 <= 3 cycles of load latency .. > Here another affect come into play: two address ALU instructions. When >the compiler wants 3-address, it has to generate 2 ALU instructions. This >means a slightly higher percentage of ALU ops to load/store ops, and therefor >better filling of load delays. That should increase the 30% and 10% a fair >amount, but I don't know how much. Also, ALU ops with prefixes will help >here to, filling otherwise useless cycles. Yes, this should help a little. The specific cases are the R2000 ones where an lnop is immediately followed by: a) a 3-register ALU op b) an ALU op that uses an immediate >4 bits. I have no data to figure out how often those are. From looking at code, it is sad, but true, that many lnops are structurally there almost no matter what the architecture&compiler system are (with possible exceptions of the VLIW systems). I.e., they are the things that result from code like: if ((p != NULL) && (p->thing)) p = p->nextlink; which gives code like: lw reg1,p nop beq reg1,0,1f nop lw reg2,thingoffset(p1) nop beq reg2,0,1f nop lw reg1,nextlink(reg1) nop 1: maybe you get to fill the branch delays by grabbing the instruction at 1: in, (in fact, we often find a LUI or LI there and move it), but after looking at a lot of code, I can't for the life of me figure out how you frequently get to fill most of the load-nops, in an R2000, an RPM-40, or most other RISCs I've seen...UNIX kernel code is just filled with this kind of thing. >>A2.+ .279 <= Loads/stores use 4-bit immediates > Note: for word load/stores, that 4 bit immediate is shifted left 2, >effectively 6 bits in byte addressing. Halfwords (minor) get shift left 1. >Also, what about >16 bit offsets? Obviously, the R2000 will do a mov; add >immediate 32-bit; load sequence (well, a guess), so it hard to compare here. >Luckily, >16 bit offsets are rare (note the shifting comes into play again >here.) (Actually, what we do is: lui reg1,address<<16; lw reg2,address>>16(reg1). There was an article in ASPLOS this summer that talked about addressing). I already counted this in the LUI analysis that was later, i.e., I had no way to disentangle why LUI's were there, and only have the data on an instruction-by-instruction basis for the offsets. The extra 2 bits for words definitely help (especially in non-stack, non-GP references), so I'd buy the .240. >>A3.+ .035 <= Add/sub immediates [.050] >> The 3.5% assumes there are both add/sub immediate (i.e., >> 4 bits + implied sign). If there is really just add-immed, >> and you get 3 bits + sign, use the number 5% > I'm not sure I understand you, for reference the leftmost bit of the >immediate is extended. All ALU ops can use immediates. That helps clarify it: the number should be .050. The .035 included the possibility that for add/subtract, you interpreted the 4-bit immediate as a zero-extended (i.e., positive) number, in order to get 1 more bit of immediate field, since the add/sub immediates overlap if you sign-extend the immediate. The .050 guess assumes you can add [-8..7] and subtract [-8..7], or add/subtract [-8..8] by flipping add/sub as needed. The .035 number assumed that you could get to [-15..15]. > One NEVER needs more than 3. 3 PFX's + 4 bits in the instruction = >40 bits. 2 PFX's + 4 bits = 28. 1 PFX + 4 bits = 16. (this is for ALU ops >and loads/stores.... Oops, I computed it right, but said it wrong: (I was thinking of the 4-bits as a PFX).... >>A8. .024 <= Jump-and-link >> JAL is done in the RPM-40 by MOV (PC) somplace; BRA... > Close, no cigar. BRA; <fill>; MOV or STW. This helps fill branch >delay, which otherwise might be tough (subroutine calls are worse than >other types of branches, usually). (Actually, JAL's fill pretty well with argument-prep instructions.) Thanx for the clarification: am I still wrong to think that there are 3 cycles (of actual work) needed for a large-size branch-and-link? (Ignoring branch-delay slots for the moment). ... > For internal calls, it's often enough. Do you have data on call >distances? I would suspect even in moderate sized programs it's at least >10%, maybe as high as 25%; and in small utilities 25%+. Call distances is something I don't have in the standard reports. ... > MAJOR BUG: > Your number left something out: you must subtract whatever the R2000's >cache miss times penalty from the RPM40's. > I'll assume it all equals out (we MAY win, may not). (I included the generic cache-miss overhead in the factor that converted cycles->VUPS, and assuming that the RPM was using SRAM as a cache, as needed in a general-purpose environment. See the summary for more info. The R2000's single branch-delay slot covers the time to access to I-cache.) ... >> 2) The instruction that sets the SR2 to get partial-word operations >> is hard to move "too far" away from the instruction(s) that need it. > > Also true, luckily most subroutines only use one of {signed,unsigned} >{halfword,byte}, if they use any at all. Also, global optimizers allow >better knowlege of what subroutine calls do to Sr2. Many user programs don't use halfwords. Systems-type programs mix bytes and halfwords a lot. (experience, but no data, unfortunately). >> 4) If there is no forwarding, that has to be reorganized also. >> Reorganization is very important for many RISC processors. >> The RPM40 has a some extra things to worry about, and one less >> (R2000 branch delay slot). I'd guess the overall hit to be 2%. > Huh? We have a two-cycle branch delay to fill (did you think we >had none???) Experience on RPM40 shows it can be filled fairly well (good >place for stores, for example.) (I thought you might have one, but I didn't have any data. What I've heard (from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly hard to fill. We agree that stores often go into the branch delay.) As can be seen, reorganization statistics are nontrivial to estimate! ... > Note that load = approx 2x store frequency. Should be obvious (and >there are other ways to load/store XP stuff.) Suffice it to say, given the >envirionment we designed for, XP stuff was VERY carefully designed. There >should be no loss, and probably even a gain, vs R2000. (Possible: we'll know when we see it, although the R2000/R2010 is a very low latency design, with the coprocessor watching the instruction stream, and doing direct loads/stores at the right times.) >>A20* .010 <= Miscellaneous >> There are a bunch of integer-related issues that I can only guess >> at, but observing that there are 4 bits in the opcode field for >> ALU ops, (not the R2000's 5), I'd guess that not all of the R2000's >> ops are found in the RPM, although I don't know which ones they >> might be. Also, if the immediate field encodes 16-bit shifts, >> that will help, and hurt, if not. > > Actually, we had leftovers we had to figure out what to use for. >(one became RADD (I think this was my idea; Dennis, do you remember?)) > Shifts/rotates are like all other ALU ops re: immediates. What I meant was: can you do left-shift 16 or right-shift 16 in a single cycle, with no PFX (depends on how you interpret the immediate field)? Not that it matters a lot, but it probably goes up to .005 if you can do shifts by 15, but not 16, in one cycle. Rather than wasting space on torturous reasoning on the opcode field, I'll wait until the opcode list can be published. >>A1. .336 <= 3 cycles of load latency @ > .300 guess [OK] >>A2.+ .279 <= Loads/stores use 4-bit immediates @ > .240 guessed at % that could use a 6 bit immediate (effective) [OK] >>A3.+ .050 <= Add/sub immediates @ >>A4.+ .013 <= Compare immediates @ >>A5.+ .013 <= Logical immediates @ >>A6.+ .011 <= Load-immediate @ >>A7.+ .018 <= Load-upper-immediate @ > .001 for a process smaller than 28 bits of instructions, we don't > ever need to load the top 4 bits for address constants. For > integer constants, if the top 5 are all 1 or 0, we don't need the > top 4 (almost always the case). Actually, almost all of these are for data-address, or logical constants/masks, etc. They're almost never for instructions, so I'd stick with the original number: .018. >>A8.+ .024 <= Jump-and-link @ > .012 see above I still didn't quite understand the change here, as I still think there are 3 cycles (PFX;BRA;fill;MOV) where the R2000 would have (JAL;fill), with both architectures having the same sorts of things moving into the fill. There is room for discussion on the fill issue, and on if you could use short branches in practice. Try: .020 >>A9*. .010 <= Load/store hazard > .005 I think we can do better at avoiding the problem, loads tend to > cluster near beginnings of block, stores at the end. Not big > either way. OK, hard to guess. >>A10*.+ .098 <= Conditional branch >>A11*.+ .010 <= Partial-word load/store @ >>A12*. .033 <= Misses due to branch-target-cache misse > .000 mistake - R2000 has misses too. Actually not, or rather: the R2000 has cache misses in its external cache; the RPM has cache misses in its internal cache + misses in the external cache (this was assuming a design where the 64KI+64KD memories were caches, rather than the only memory. The .033 here was the EXTRA penalty for taking internal cache misses (which might well be external cache hits). I subsumed all of the external cache missing into the cycles->VUP conversion, since I didn't have a better way to get at it, and all of the cycle counts so far were independent of external cache designs. (Ask again if this doesn't make sense). As noted, the .033 number depended on the 90% rate, and would go up or down depending on the environment, but the number definitely is not zero, unless I misunderstand how the RPM works. .033 >>A13*. .050 <= Lack of ALU forwarding >>A14*. .020 <= Multiply-divide > .005 See above OK, I can buy this. >>A15*.+ .040 <= 2-address registers, rather than 3-address ones >>A16*.+ .039 <= Less registers (21 instead of 32, sort of >>A17*. .020 <= Architectural Reorganization issues >>A18*. ?? <= Coprocessor issues >>A19*. -.067 <= Contraction issue (R2000 branch nops) > .000 Misconception OK. >>A20* .010 <= Miscellaneous > .000 I don't really think we lose anything signifigant here [OK, if you can do shift left/right 16 in 1 cycle, else .05]. >>Total 0.992 cycle expansion >Amusing: I came out with (after my mods above): > 0.892 >Pretty close to what you have, all in all. Your revisions, plus my revisions to your revisions come out at .965. My guess is that, unless there's some fundamental misunderstanding left, that there is probably 90% confidence that the most programs of the sorts modeled, would lie in the range 0.9 - 1.0 (to 1 sig. digit!). Thanx for the corrections; I think A8 & A12 are the only spots that might need more clarification, if this last wasn't right. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jesup@pawl17.pawl.rpi.edu (Randell E. Jesup) (03/18/88)
In article <1878@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: :In article <514@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes: ... :>>A1. .336 <= 3 cycles of load latency :.. :> Here another affect come into play: two address ALU instructions. When :>the compiler wants 3-address, it has to generate 2 ALU instructions. ... :Yes, this should help a little. The specific cases are the R2000 ones where :an lnop is immediately followed by: : a) a 3-register ALU op : b) an ALU op that uses an immediate >4 bits. I think you mean a load followed by an a) or b) and a nop. :I have no data to figure out how often those are. From looking at code, :it is sad, but true, that many lnops are structurally there almost no :matter what the architecture&compiler system are (with possible exceptions :of the VLIW systems). I.e., they are the things that result from code like: : if ((p != NULL) && (p->thing)) p = p->nextlink; :which gives code like: : lw reg1,p : nop : beq reg1,0,1f : nop : lw reg2,thingoffset(p1) : nop : beq reg2,0,1f : nop : lw reg1,nextlink(reg1) : nop I think you left out: sw p,reg1 :1: Here's how I'd do it on the rpm40 (and I hope the reorganizer would to) ldw .1, p nop nop nop cond neq, .1, .0 ldw .2, thingoffset[.1] cond neq, .1, .0 ldw .1, nextlink[.1] nop cond neq, .2, .0 pfx <for the store, if not needed, move cond down one> stw p, .1 Note that there are NO branches in it. The first load is hard to fill, you have to hope that you can do something for the next statement in those nops (I would bet you could use 1 of them, maybe 2, depends a lot on the code. Or you could move stuff down from above, like a store.) This is a good example how the cond instruction can help avoid branches (and thus branch delays and branch misses.) :>>A2.+ .279 <= Loads/stores use 4-bit immediates : :> Note: for word load/stores, that 4 bit immediate is shifted left 2, : The extra 2 bits for :words definitely help (especially in non-stack, non-GP references), Why non-stack? Stack refs are usually small offsets. I agree about globals, even with a global pointer to reduce the number of PFXs. :>>A8. .024 <= Jump-and-link :> Close, no cigar. BRA; <fill>; MOV or STW. This helps fill branch :Thanx for the clarification: am I still wrong to think that there :are 3 cycles (of actual work) needed for a large-size branch-and-link? :(Ignoring branch-delay slots for the moment). You're correct, if the call is to greater than 2K instructions away from the current address. :The R2000's single branch-delay slot covers the time to access to I-cache.) Does that mean you can get an instruction 1 cycle after you output it's address on the address lines? :(I thought you might have one, but I didn't have any data. What I've heard :(from Stanford MIPS-X) is that the 2nd branch-delay slot is fairly :hard to fill. We agree that stores often go into the branch delay.) One thing about the PFXs in the rpm40 is that often you see this: BRA; PFX; STW. The stanford numbers make a number of assumptions that aren't valid here, though 2 slots are harder to fill than one in general. :>>A20* .010 <= Miscellaneous :What I meant was: can you do left-shift 16 or right-shift 16 in a single :cycle, with no PFX (depends on how you interpret the immediate field)? :Not that it matters a lot, but it probably goes up to .005 if you can do :shifts by 15, but not 16, in one cycle. Ok, I understand now. I think for shifts it acts as an unsigned number if I remember correctly (Dennis?) :>>A7.+ .018 <= Load-upper-immediate @ :> .001 for a process smaller than 28 bits of instructions, we don't :> ever need to load the top 4 bits for address constants. For :> integer constants, if the top 5 are all 1 or 0, we don't need the :> top 4 (almost always the case). : Actually, almost all of these are for data-address, or logical : constants/masks, etc. They're almost never for instructions, so I'd : stick with the original number: .018. Once again, if the process has less than 28 bits of data memory, you don't need it. Few programs require 256 Meg of data. And of course 99+% of constants are less than 28 bits long, if you sign-extend. So I'll stick with the low number. :>>A12*. .033 <= Misses due to branch-target-cache misse :> .000 mistake - R2000 has misses too. : Actually not, or rather: the R2000 has cache misses in its external : cache; the RPM has cache misses in its internal cache + misses in : the external cache (this was assuming a design where the 64KI+64KD : memories were caches, rather than the only memory. The .033 here was : the EXTRA penalty for taking internal cache misses (which might well : be external cache hits). I subsumed all of the external cache missing : into the cycles->VUP conversion, since I didn't have a better way : to get at it, and all of the cycle counts so far were independent : of external cache designs. (Ask again if this doesn't make sense). : As noted, the .033 number depended on the 90% rate, and would go up : or down depending on the environment, but the number definitely is : not zero, unless I misunderstand how the RPM works. : .033 90% is something of lower bound, typical number for large programs should be more like 95%, 99%+ for small ones. I take it the r2000 never misses on a branch if the destination is in it's 64K cache? :Your revisions, plus my revisions to your revisions come out at .965. :My guess is that, unless there's some fundamental misunderstanding left, :that there is probably 90% confidence that the most programs of the sorts :modeled, would lie in the range 0.9 - 1.0 (to 1 sig. digit!). I'll agree with you here, though I'll emphasize (as Dennis and I have said before) that the rpm-40 was optimized for a different range of applications than the r2000, and we're comparing here in the r2000's specialty. I suspect that for embedded systems and their programs, they would be close to equal (on a per-cycle basis), which would mean the 40Mhz RPM-40 would perform as well as a 35+Mhz r2000 (note: total specualtion here!) The RPM-40 can do large systems, but may not be the ultimate choice for such, though it can do well. :-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> :UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com :DDD: 408-991-0253 or 408-720-1700, x253 :USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 Thanks again for such good analysis. I think this has been pretty well covered now; if we continue, lets start discussing specific RISC features and their plus/minuses (abstract, not rpm-40 vs r2000). // Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// beowulf!lunge!jesup@steinmetz.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
kers@otter.hple.hp.com (Christopher Dollin) (03/22/88)
"davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen)" said both: |As nearly as I can determine, workstations are used for graphics, |software development, word processing, ... and: | ... generally the programs run are less than 1 min cpu, less than | 2MB memory ... I'm doing software development on a workstation. There are two or three Poplog windows (mail, X-window playthings, project work); each of these is at least 1.5Mb [overkill for mail and X, but there it is]. Compiling the source of the program[*1] (Lisp -> loaded binary) takes about 5min - of course that's only when I want to start afresh - and running the program on a reasonably large piece of data has, on occasion, taken as long as four hours; a rapid bout of optimisation followed! The program can eat VM like C eats sanity. A quick check: popmemused=523244 longwords; that's 2Mb heap in use, not counting the Poplog kernel, the Common Lisp susbsystem, and the loaded code of the program, all of which are locked down and don't count as active heap. Does using LaTeX count as word processing? If so, I have a document here that takes about 5min elapsed (last time I counted) to format. That's probably over 1min CPU - at least it's not far off. I wouldn't say that this was "really large" stuff, but it beats the 1min 2Mb limit. [*1] Sorry to be vague, but you know how it is. Regards, Kers | "Why Lisp if you can talk Poperly?"