mash@mips.UUCP (04/12/87)
In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>If you are saying "Computers must have memory systems which allow
>arbitrary accesses, including unaligned and smaller-than-natural" then I
>disagree completely.  If you are simply saying that there must be some
>way to get at smaller-than-natural and unaligned things, then clearly,
>almost tautologically, you are right.  While the Am29000's ability to
>support a word-oriented external memory system and still allow byte and
>half-word accessibility isn't perfect (things still get tough (probably
>have to use the on-chip funnel shifter) when words are spanned), it is a
>step in the right direction (and, BTW, not really original: a very
>similar thing was proposed (and I suppose implemented) for the original
>Stanford MIPS processor).
>
>Ok, I think maybe it *is* time to start the discussion of word-addressed
>vs. byte-addressed memory systems.  John Mashey, care to start us off?

Well, I've been staying away from this, but since you asked, OK.  But
Brian, I'm sad to say you may not be happy when you're finished reading
this.

First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
analysis of the issues, so I won't repeat that, except:
"Re bus width, byte extraction, unaligned operands, and memory speed:
 1) Byte extraction from words should be free in time; it'll cost a few
    gates.  Basically this requires one or more cross-couplings in the
    memory path."

Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable
cycle-time penalty for having load-byte, load-byte unsigned, load-half
(signed and unsigned), and even load-partial-word (left and right) for
dealing with unaligned operands.  It takes some silicon, but it didn't
add to the critical path.  [Don't ask me how this is done, but I assure
you it is possible.]
Now, for some history: Brian earlier cited the 1982 paper
"Hardware/Software Tradeoffs for Increased Performance" by Hennessy, et
al, which argued fairly heavily for word-addressed memory with byte
extract/insert.  Now, there are the following facts:

a) Although Stanford MIPS did this, MIPSco didn't, even though it
certainly had some people in common.  There were specific reasons for
that choice, including some observations I'd made while tuning C
compilers for 68Ks.  Also, the Stanford MIPS crew was mostly VLSI &
compiler folks; the MIPSco crew included more board designers and
operating systems people....

b) Another offshoot of all of this was word-addressed DEC Titans [again,
with some Stanford-related personnel].  Most of them have since been
converted to byte-addressed ones...

c) There is a "final MIPS evaluation report", which has been submitted
to TOCS, I think.  Briefly, here's what it concludes on this topic:
   1. Memory should be byte-addressed.  Word-addressing [as opposed to
      word-paths to memory] is just too painful for software.
   2. Newer statistics seem to say that in the choice of word +
      extract/insert operations versus byte/half operations, the
      (previous) belief that the word + extract/insert choice is
      superior is probably wrong, especially because current compiler
      technology just isn't able to take advantage of those
      instructions well enough to make up the difference.
I.e., they have changed their minds, at least somewhat.

The remainder of this will deal with some structural reasons why word
addressing with extract/insert is painful in certain environments,
followed by a bunch of statistical analyses that describe the
performance loss MIPS machines would suffer if we did it that way.

A. Structural reasons: (this is mostly systems, maybe not some
controllers)

1) Memory system design:

a) Memory system designs with dual-porting of memory, including an I/O
bus, usually have to respond to partial-word operations to keep I/O
controllers happy.
Having done that, it is perfectly feasible to deal with byte writes,
either cheaply, with parity, or somewhat more expensively with ECC.

b) Some systems use block-oriented buses, often with write-back caches.
If the system is doing write-back for you, doing
	load-word	[causing WB-cache to fetch the cache line, if needed]
	insert byte
	store-word
VERSUS JUST:
	store-byte	[causing cache line to be fetched, if necessary]
sure looks like there is at least a 2-cycle hit, maybe a 3-cycle hit, if
you don't have 1-cycle cache accesses.

c) Some systems use write-thru caches with write-buffers [VAX-780, I
assume 8700s, etc, although not 8600/8650].  Sometimes the write-buffers
gather contiguous bytes, then send a whole word to memory.  Again,
having code that does lw/insert/sw just adds cycles.

2) I/O system design.  This is clearly not true of all systems, but you
run into it often enough: IT IS OFTEN EXCEEDINGLY INCONVENIENT TO BE
REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH DEVICE
CONTROLLERS.  Device controllers and interface chips often have all
sorts of weird layouts of their control registers [read Lyon &
Skudlarek's wonderful "All the Chips That Fit" article, in the Feb 86
UNIX Review, or earlier USENIX proceedings, for example].
Unfortunately, such things do NOT have the semantics of normal memory:
sometimes you cannot do extra fetches or stores without screwing things
up.  Here's a simple example:

	struct ... {
		ushort	silo;
		ushort	status;
		....

The silo gives you a new character each time it's read.  Now, if the
natural C compiler behavior, when you write:
	if (status & FLAG)
is to fetch the word that contains both silo and status, you are in
trouble.  Sometimes there are ways to code around this junk, and
sometimes not, depending on the device.  In any case, devices are
already awful enough to deal with, without making one drop into
assembler, rather than C.
Remember, the layouts of these are whatever they are, and if you want to
use what may be otherwise very desirable controllers, this is what you
get.  Is this a weird example?  I found several in 5 minutes.  Other OS
folks might comment on whether we just happened to pick controllers that
are like this, or not.  In any case, people want to describe register
layouts in C structures, and be done with it.

As I read the 29000 specs, maybe it would be possible to use both modes,
where main memory uses word + insert/extract, and the I/O path has the
alignment network, and uses the load/store control fields to yield
partial-word ops.  It will be interesting to see how the C compiler
compiles a device driver that uses both memory and I/O addresses...
There's probably some way around it, but I do believe that it's more
than picking up an off-the-shelf controller and its associated driver,
making a few tweaks, and running it.

B. Performance reasons.  Domain: running UNIX and UNIX programs well.

1. Some qualitative observations: when I was at CT, I spent a bunch of
time tuning 68K C compilers.  In particular, I looked at the prevalence
of code like:
	move byte to register, extend, extend
	move byte to register, and with 255 to get byte alone
   OR	clear register, move byte to register
I was able to get noticeable improvements in at least some programs by
optimizing away some of the unnecessary cases.  It was sure clear that a
lot of cycles were burnt by the extends, or the and/clear, i.e., one
really wished for load byte [signed or unsigned].

If simulations are based only on user-level programs, you can get some
horrible surprises when you see what UNIX kernels do.  For example, are
halfword operations really necessary?  ANS: not if you look at their
frequency in most UNIX C programs.  ANS: if you look at the kernel: you
bet!  Many kernel structures are packed for efficiency, some are packed
for necessity (you should see the pile of halfword operations in
Ethernet code...
and you CANNOT sanely get rid of them without rewriting everything).

2. Some quantitative observations.  As most people in this newsgroup
know, we do a huge amount of simulation on very large programs to
analyze performance, look at different board designs and future chip
tradeoffs.  We get complete instruction traces, so we get outputs that
look like:

Summary:
  15179647 cycles (1.9s @ 8MHz)
  14912988 instructions
   2950828 basic blocks
    176361 calls
   2514780 loads
   1521285 stores
    900335 byte references
    266659 multiply/divide interlock cycles
         0 flops (0 mflop/s @ 8MHz)
         0 floating point interlocks
         0 1 cycle interlocks that become 2 cycle stalls
         0 overlapped floating point cycles
      0.17 loads per instruction
      0.1  stores per instruction
      0.38 stores per memory reference
      0.22 byte references per reference
      0.15 branches per instruction
      0.2  backward branches per branch
      5.1  instructions per basic block
      85   instructions per call
      0.12 nops per instruction

spec
   5529383  37%   lw
   1783177  12%   sw
   1299709  8.7%  addu
   1170190  7.8%  ......
	(plus a lot of stats)

Thus, we have really precise statistics on what's going on, at least on
our machines, at the user level, for anything from typical UNIX programs
(like nroff), to large simulators [spice, espresso], parts of the
compiler system [assembler, optimizer, debugger], to benchmarks like
whetstone, dhrystone, linpack.

I think one can find a gross cost [to us, in our architecture, with no
necessary applicability to others] in user programs, as follows, if we
had done byte extract/insert, instead of what we did:
	For each partial-word load, add 1 cycle (for the extract).
	For each partial-word store, add 2+N cycles (you have a load and
	an insert added, and N (which might be 0) is the extra actual
	cycle cost to get data from the cache, noting that some of the
	cost might be taken care of by pipeline scheduling).
So here are a few examples: I'll give the % of instruction cycles for
each instruction, and compute the penalty by using N=0.
I'll ignore numbers too small to matter much.

as1 (assembler 1st pass)
opcode	%	penalty (dynamic)
lbu	4.6%	4.6%
sb	1.5%	3.0%
lh	0.27%	0.27%
lhu	0.07%	0.07%
sh	0.02%	0.02%
TOT		8% penalty in instruction cycles, assuming N=0 (best case)

There is also a static code-size penalty.  I'll only do one since I
don't think this is a major issue, but it is interesting:
opcode	%	penalty (static)
lbu	4.7%	4.7%
sb	3.2%	6.4%
lh	0.27%	0.27%
sh	0.14%	0.28%
TOT		11.6%

Note the significance of the static numbers: the byte operations are
all over the place, i.e., the dynamic counts aren't substantial just
because they're in strcpy or something like that [actually we have
tuned routines anyway], but because there's partial-word code all over
the place.  Now, this is an ultra-simplistic analysis, because there
are things like: write-buffer effects, cache effects, memory system
interference, pipeline scheduling, etc, etc.  Consider this a first
approximation.

Now, a few more examples:

Dhrystone:
lbu	6.9%	6.9%
lb	4.7%	4.7%
lwl	1.2%	1.2%	(unaligned word stuff)
lwr	1.2%	1.2%
sb	0.43%	0.8%
swl	0.14%	0.3%
TOT		14.1%

static:
lbu	1.6%	1.6%
lb	1.1%	1.1%
sb	1.1%	2.2%
lh	0.54%	1.1%
lwl	0.27%	0.5%
lwr	0.27%	0.5%

Note what's going on: Dhrystone does a lot of character manipulation
(dynamically), but statically, it has very few partial-word
instructions, i.e., unlike more normal programs, it does NOT have
character-pushing sprinkled around.

Observations on a few other benchmarks: most of the floating-point ones
didn't do much partial-word computing, although I ran into one that had
9.9% (!) lh's, plus 2.2% sh's, i.e., for at least a 15% penalty.  Most
of those that did any character-pushing at all would take a 5-10% hit
[ccom, compress, terse (a simulator)].  nroff was only about a 2% hit,
surprisingly.  In general, I'd expect the UNIX kernel to have a
reasonably high (both static and dynamic) use of partial-word ops.
We'll see soon.
(This has nothing to do with word-vs-byte, but I ran across it in
looking at these numbers.)  QUIZ: how many load/stores use
0-displacements off the base register, rather than non-zero ones?
ANS:	a few were around 50%.
	most were in the 10-20% range.
	some were down in the 5-10% range.
	Dhrystone: 50%
I.e., Dhrystone uses zero-offset addressing considerably more than most
programs, although not more than all programs.  [Relevant to 29000
discussion, if you remember how they did things.]

WHEW.  That was a lot of info.  Sorry about that, but architectural
arguments cannot be settled by intuition.  Note again that these are the
numbers we get, and you cannot analyze choices in a vacuum, so they may
or may not be relevant to other architectures and software.  In our
case, this does say:
a) Byte instructions are a substantial win on many real programs.
b) Non-zero offsets are frequently used.
and finally, for everybody:
c) Be very, very careful WHICH benchmarks you use to tune your
architecture.  DON'T use Dhrystone.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
howard@cpocd2.UUCP (04/15/87)
In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
> 1) Byte extraction from words should be free in time; it'll cost a few
>    gates.  Basically this requires one or more cross-couplings in the
>    memory path."
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable
>cycle-time penalty for having load-byte, load-byte unsigned, load-half
>(signed and unsigned), and even load-partial-word (left and right) for
>dealing with unaligned operands.  It takes some silicon, but it didn't
>add to the critical path.

I don't wish to dispute John's excellent (and voluminous) analysis, but
it is worth mentioning that Geoff seems to be suffering from the normal
logic designer's misconception that you measure complexity/cost in
gates.  On ICs, gates are extremely cheap and small; it is wires
(communication over distance) that are expensive.  Byte extraction can
frequently be performed with the same circuitry that does shifts, but
fast shifters are area-expensive (expansive?) in 2 dimensions.  You
don't want to use them without serious consideration.  Circular shifts
are particularly expensive (they can double shifter size), but
fortunately neither byte extraction nor C requires them.
-- 
Copyright (c) 1987 by Howard A. Landman.  You may copy this material for
any non-commercial purpose, or transmit this material to others and
charge for such transmission, as long as this notice is retained and you
place no additional restrictions on retransmission of the material by
the recipients.
hansen@mips.UUCP (Craig Hansen) (04/17/87)
In article <576@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
> >First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
> >analysis of the issues, so I won't repeat that, except:
> >"Re bus width, byte extraction, unaligned operands, and memory speed:
> > 1) Byte extraction from words should be free in time; it'll cost a few
> >    gates.  Basically this requires one or more cross-couplings in the
> >    memory path."
> I don't wish to dispute John's excellent (and voluminous) analysis, but it
> is worth mentioning that Geoff seems to be suffering from the normal logic
> designer's misconception that you measure complexity/cost in gates.  On
> ICs, gates are extremely cheap and small; it is wires (communication over
> distance) that are expensive.  Byte extraction can frequently be performed
> with the same circuitry that does shifts, but fast shifters are
> area-expensive (expansive?) in 2 dimensions.  You don't want to use them
> without serious consideration.

As it turns out, the load aligner didn't pay an area penalty for the
horizontal shift wires, as the two relevant buses cross each other, for
reasons unrelated to alignment, right next to the load aligner.  The
store alignment is accomplished by the shift unit already, so it didn't
cost extra area either.  (It did cost a few gates, but it was well worth
it.)  Howard's statement is true in general: wires are expensive, and
gates are relatively cheap, but clever layout makes the best possible
use of expensive resources.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen