bcase@amdcad.UUCP (04/13/87)
In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Well, I've been staying away from this, but since you asked, OK.
>But Brian, I'm sad to say you may not be happy when you're finished
>reading this.

Well, I guess I am not overjoyed or anything, but if you meant "I think
you will regret designing the Am29000 with the emphasis on
word-addressing" you are wrong.  If you meant "You'll be sorry you
asked", never!  Let's have some lively discussions!  I'll try to make my
points with as few words as possible.

>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
> 1) Byte extraction from words should be free in time; it'll cost a few gates.
>    Basically this requires one or more cross-couplings in the memory path.
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
>penalty for having load-byte, load-byte unsigned, load-half (signed and
>unsigned), and even load-partial-word (left and right) for dealing with
>unaligned operands.  It takes some silicon, but it didn't add to the
>critical path.  [Don't ask me how this is done, but I assure you it is
>possible.]

Sigh, byte extraction *does* take only a few gates.  Silicon area was
never the issue.  Depending, I guess greatly, on implementation details,
it is *not* free in time.  I'll try to get our circuit designer(s) to
post a comment on this one.  All I can say is that we designed the
Am29000 to *start* at 25 MHz and go from there.  There is a critical
path in the chip from the pins to the execute unit.  Yes, I know this
circuitry will scale as well as any on the chip, but the set-up and
hold time effects may not.
>Now, for some history: Brian earlier cited the 1982 paper "Hardware/Software
>Tradeoffs for Increased Performance" by Hennessy, et al, which argued
>fairly heavily for word-addressed memory with byte extract/insert.
>Now, there are the following facts:
>

I.e., they have changed their minds, at least somewhat.  I don't know
how to reconcile the facts that some of the best, most important
computer scientists think word addressing is wrong with the fact that
it seems so right for us.  All I can say is that Titan and MIPS machines
have the advantage of being designed as a "closed" system; i.e., nearly
all (system) details are controllable.

>The remainder of this will deal with some structural reasons why word
>addressing with extract/insert is painful in certain environments,
>followed by a bunch of statistical analyses that describe the performance
>loss MIPS machines would suffer if we did it that way.
>
>A. Structural reasons: (this is mostly systems, maybe not some controllers)
>1) Memory system design:
>	a) Memory system designs with dual-porting of memory, including
>	an I/O bus, usually have to respond to partial-word operations
>	to keep I/O controllers happy.  Having done that, it is perfectly
>	feasible to deal with byte writes, either cheaply, with parity,
>	or somewhat more expensively with ECC.

I/O, especially where older chips (serial ports, etc.) are concerned,
is a grungy issue.  But I would most *certainly* not slow the memory
system with respect to the processor just for the sake of very
infrequent interchanges with I/O chips.  Notice: an on-chip alignment
network for byte extraction/insertion (the only alignment network
implementation that makes sense in the majority of instances, and the
one that we are debating here, I am assuming) does *not* solve this
problem; it cannot do the alignment needed by the I/O devices when they
deal directly with memory.  Besides, the bulk of I/O-to-memory
transactions are block transfers from disk and tape devices.
Why can't these be word-oriented?  For the cheapest systems, it might
be nice to hang the old serial port chip right on the processor bus,
but I don't think you want to buy an Am29000 (or a MIPS chip) just so
you can slow the thing down with stupid system design.  Don't get me
wrong, I am not flaming: just trying to point out that I/O is something
to be dealt with separately from the processor-memory channel design.
Dual-ported memory is *not* the only way: how about a DMA chip to do
all the alignment/bus-isolation?

>	b) Some systems use block-oriented buses, often with write-back
>	caches.  If the system is doing write-back for you, doing
>		load-word	[causing WB-cache to fetch the cache line, if needed]
>		insert byte
>		store-word
>	VERSUS JUST:
>		store-byte	[causing cache line to be fetched, if necessary]
>	sure looks like there is at least a 2-cycle hit, maybe a 3-cycle
>	hit, if you don't have 1-cycle cache accesses.

I think you have a good point here.  Caches are nice in that they often
don't have ECC, so byte writing is much more feasible.  However, this
is only one possible memory system design.  The Am29000 will be
interfaced to many different kinds of memory systems.  At 30 MHz and
beyond (where the Am29000 is intended to be), word-addressing is
thought by us to be beneficial in many of these environments.

>	c) Some systems use write-thru caches with write-buffers
>	[VAX-780, I assume 8700s, etc, although not 8600/8650].
>	Sometimes the write-buffers gather contiguous bytes, then send
>	a whole word to memory.  Again, having code that does lw/insert/sw
>	just adds cycles.

Another good point.  Same comments in general.

>2) I/O system design.  This is clearly not true of all systems, but
>you run into it often enough.  IT IS OFTEN EXCEEDINGLY INCONVENIENT
>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>DEVICE CONTROLLERS.
>[other stuff]

Agreed, but again, why not solve the problem (with an interface chip or
design approach) instead of propagating it to the processor-memory
channel?  I have sympathy for OS people: I was an OS person for just a
short while.  The choice between dumb system design and creating
problems for the OS people when they must deal with older chips/boards
is a tough one (really, because the OS is part of the system design
too).

>As I read the 29000 specs, maybe it would be possible to use both modes,
>where main memory uses word+insert/extract, and the I/O path has the
>alignment network, and uses the load/store control fields to yield
>partial-word ops.  It will be interesting to see how the C compiler
>compiles a device driver that uses both memory and I/O addresses...
>There's probably some way around it, but I do believe that it's more than
>picking up an off-the-shelf controller and its associated driver,
>making a few tweaks, and running it.

Well, we don't have any plans right now for a compiler which would
allow "mixed-mode" memory orientation.  More likely, some (a
significant amount?  Just a little?  In between?  I don't know...)
assembly language programming will have to be done.  Perhaps the OS
guys will start forcing the hardware guys to design in only
coherently-designed peripheral chips (if they exist) or forcing them to
design hardware to hide the problems (is this possible?  in some cases
it is).

>B. Performance reasons.
>Domain: running UNIX and UNIX programs well.
>
>1. Some qualitative observations:
>
>When I was at CT, I spent a bunch of time tuning 68K C compilers.
>In particular, I looked at the prevalence of code like:
>	move byte to register, extend, extend
>	move byte to register, and with 255 to get byte alone OR
>	clear register, move byte to register
>I was able to get noticeable improvements in at least some programs
>by optimizing away some of the unnecessary cases.
>It was sure clear that
>a lot of cycles were burnt by the extends, or the and/clear, i.e.,
>one really wished for load byte [signed or unsigned].

Sigh, please don't tell me about how a *vastly* different processor
with *vastly* different time/instruction tradeoffs behaved.  I believe
every word with respect to the 68K, and it would be naive of me to say
that there is *nothing* valuable to be learned from your experience in
that experiment.  But to say that the results of that experiment have
binding implications for a processor like the Am29000 (and I am tempted
to say the MIPS, but I am certainly not qualified to do so) seems just
wrong to me.

>If simulations are based only on user-level programs, you can get
>some horrible surprises when you see what UNIX kernels do.  For example,
>are halfword operations really necessary?
>ANS: not if you look at their frequency in most UNIX C programs.
>ANS: if you look at kernel: you bet!  many kernel structures are packed
>for efficiency, some are packed for necessity (you should see the pile
>of halfword operations in Ethernet code... and you CANNOT sanely get
>rid of them without rewriting everything).

I am sure that you are right; I really can't speak too well from
experience.  The fact that we were simply ill-equipped to do
kernel-level simulations was one of our biggest weaknesses.  But again,
in light of the fact that the kernel does lots of sub-word-size stuff,
does this really mean that the Am29000 should assume a
byte-oriented/halfword-oriented memory?

>2. Some quantitative observations.  As most people in this newsgroup
>know, we do a huge amount of simulation on very large programs
>to analyze performance, look at different board designs and future
>chip tradeoffs.  We get complete instruction traces, so we get outputs
>that look like:
>Summary:

Wow, our simulation output looks much the same, with some of the
numbers being represented differently.  Great minds think alike.
:-)

>Thus, we have really precise statistics on what's going on, at least on
>our machines, at the user-level, for anything from typical UNIX programs
>(like nroff), to large simulators [spice, espresso],
>parts of the compiler system [assembler, optimizer, debugger],
>to benchmarks like whetstone, dhrystone, linpack.

Sigh, I wish we could do such simulations.

>I think one can find a gross cost [to us, in our architecture, no
>necessary applicability to others] in user programs, as follows,
>if we had done byte extract/insert, instead of what we did:
>
>For each partial-word load, add 1 cycle.  (for the extract)
>For each partial-word store, add 2+N cycles (where you have a load,
>insert added, and where N (might be 0) is the extra actual cycle cost
>to get data from the cache, noting that some of the cost might be
>taken care of by pipeline scheduling).

This seems valid, at first glance, for your situation.  But it is not
directly applicable to the Am29000 because there is a *cost* associated
with on-chip byte support.  Thus, you gain some, you lose some.  We see
about twice as many loads as stores.  Plus, the stack cache decreases
the load/store percentage overall with respect to a machine (like the
MIPS) with "only" 32 fixed registers.  We seem to have about half as
many loads/stores, but it varies (and my compiler ain't the best, e.g.,
no register coloring for memory-resident stuff).  This lower load/store
percentage might be another reason that word orientation is more
appropriate for the Am29000.  (But note that a given system need not
implement a stack cache in the local register file -- register banking
for fast context switching may be a better use of the registers; in
this case, the load/store percentage will go back up and bets are off.
Sigh, what's a computer architect to do?)

>So here are a few examples: I'll give the % of instruction cycles
>for each instruction, and compute the penalty by using N=0.
>I'll ignore numbers too small to matter much.
>
>as1 (assembler 1st pass)
>opcode	%	penalty (dynamic)
>lbu	4.6%	4.6%
>sb	1.5%	3.0%
>lh	0.27%	0.27%
>lhu	0.07%	0.07%
>sh	0.02%	0.02%
>TOT		8% penalty in instruction cycles, assuming N=0 (best case)

This is OK assuming that byte/halfword alignment costs nothing.  Again,
I am just drawing attention to this missing side of the argument.

>There is also a static code-size penalty.  I'll only do one since I don't
>think this is a major issue, but it is interesting:
>opcode	%	penalty (static)
>lbu	4.7%	4.7%
>sb	3.2%	6.4%
>lh	0.27%	0.27%
>sh	0.14%	0.28%
>TOT		11.6%

Unquestionably there is a code size penalty.  This may or may not be an
issue given ROM/RAM constraints in some environments.

>Note the significance of the static numbers: the byte operations are all over
>the place, i.e., the dynamic counts aren't substantial just because they're
>in strcpy or something like that [actually we have tuned routines anyway],
>but because there's partial word code all over the place.

You are so right in pointing out that there is partial word code
sprinkled throughout many existing applications.  As an after-the-fact
observation, I guess that many Am29000 applications will be running
"new" code.  Now, whether or not the coders will know the right things
to do (use the fast library routines, etc.) is not knowable but
nonetheless critical.  I guess that means that we need to print some
sort of "Am29000 Programming Suggestions."

>
>Now, this is an ultra-simplistic analysis, because there are things like:
>write-buffer effects, cache effects, memory system interference,
>pipeline scheduling, etc, etc.  Consider this a first approximation.
>
>Now, a few more examples:
>
>Dhrystone:
>lbu	6.9%	6.9%
>lb	4.7%	4.7%
>lwl	1.2%	1.2%	(unaligned word stuff)
>lwr	1.2%	1.2%
>sb	0.43%	0.8%
>swl	0.14%	0.3%
>TOT		14.1%

But, just a few lines later you'll point out how having a word-oriented
processor-memory channel *helps* (artificially, since dhrystone is
artificial) dhrystone performance.
I'm sorry, but you must stick to one argument.  :-)

>(This has nothing to do with word-vs-byte, but I ran across it in
>looking at these numbers).
>QUIZ: how many load/stores use 0-displacements off the base register,
>rather than non-zero ones?
>
>ANS:	a few were around 50%.
>	most were in the 10-20% range.
>	some were down in the 5-10% range.
>	Dhrystone: 50%
>I.e., Dhrystone uses zero-offset addressing considerably more than
>most programs, although not more than all programs.  [Relevant to 29000
>discussion, if you remember how they did things.]

Just in case you are trying to make a subtle intimation: WE DID NOT
"OPTIMIZE" THE AM29000 ARCHITECTURE FOR ANY PARTICULAR PROGRAM.  The
architecture was pretty much fixed before we had significant simulation
results (I know, I know; that was the wrong way to do things, but we
had no choice).  We *did* add the now-infamous compare-bytes
instruction very late (after we had simulation results).  I wanted the
load/store instructions to have only register-indirect addressing mode
from the beginning, but only for the sake of simplicity and
optimization opportunities.  In the end, we realized that we had done a
great thing: as far as normal instruction execution is concerned, there
cannot be contention between jump and load/store instructions for the
TLB.  With our pipeline, an addressing mode would have been a minor
disaster.

>WHEW.  That was a lot of info.  Sorry about that, but architectural
>arguments cannot be settled by intuition.  Note again that these are
>the numbers we get, and you cannot analyze choices in a vacuum,
>so they may or may not be relevant to other architectures and software.

Yes; this is an important point.  Rarely, if ever, does a team
implement in the same technology two versions of a processor with just
one variable (e.g., byte alignment/no byte alignment) changed.  That
would be nice.

>In our case, this does say:
>	a) Byte instructions are a substantial win on many real programs.
>	b) Non-zero offsets are frequently used.

(But less frequently when there is a stack cache.)

>and finally, for everybody:
>	c) Be very, very careful on WHICH benchmarks you use to tune
>	your architecture.  DON'T use Dhrystone.

This is good advice.  Thanks, John, for taking the time to post.

	bcase
phil@amdcad.UUCP (04/14/87)
In article <16122@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>2) I/O system design.  This is clearly not true of all systems, but
>>you run into it often enough.  IT IS OFTEN EXCEEDINGLY INCONVENIENT
>>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>>DEVICE CONTROLLERS.  [other stuff]
>
>Agreed, but again, why not solve the problem (with an interface chip
>or design approach) instead of propagating it to the processor-memory
>channel?  I have sympathy for OS people: I was an OS person for just
>a short while.  The choice between dumb system design and creating
>problems for the OS people when they must deal with older chips/boards
>is a tough one (really, because the OS is part of the system design
>too).

If we are talking about trying to use existing controllers, such as
(particularly, actually) Unibus controllers, it's likely they jammed
the 16-bit registers one after another, and a 32-bit word machine will
find it hard to cope with these controllers.

If we're talking about building new controllers, there's no reason why
you couldn't give each register its own word.  It uses a little more
address space, but only a few bytes more, nothing really.  What the
chips do is irrelevant.  You can always set the chip decode logic up
(at the board level) to make the chip think word addresses are byte
addresses.

>>Thus, we have really precise statistics on what's going on, at least on
>>our machines, at the user-level, for anything from typical UNIX programs
>>(like nroff), to large simulators [spice, espresso],
>>parts of the compiler system [assembler, optimizer, debugger],
>>to benchmarks like whetstone, dhrystone, linpack.
>
>Sigh, I wish we could do such simulations.

Brian, there are plenty of faster machines available at AMD and you
ought to consider using them if CPU time is the only constraint.  Or
(hee hee) buy an R2000 to do simulations on.
Limiting yourself to your existing resources is very silly, I think.
(I won't post that Brian does his simulations on an IBM-PC, to avoid
embarrassing him.)
-- 
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com
tim@amdcad.UUCP (04/14/87)
In article <16122@amdcad.AMD.COM>, bcase@amdcad.AMD.COM (Brian Case) writes:
> In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
> >Thus, we have really precise statistics on what's going on, at least on
> >our machines, at the user-level, for anything from typical UNIX programs
> >(like nroff), to large simulators [spice, espresso],
> >parts of the compiler system [assembler, optimizer, debugger],
> >to benchmarks like whetstone, dhrystone, linpack.
>
> Sigh, I wish we could do such simulations.

I think Brian misread the previous paragraph to mean that the MIPS
simulator is able to run these programs in a simulated UNIX environment
(i.e., simulating the entire UNIX kernel), but I see only user-level
mentioned, above.  Note that we *are* able to perform such simulations,
but only in a single-tasking, stand-alone environment.

John -- does the MIPS simulator incorporate a simulated UNIX kernel,
and have you performed multiprogramming simulations with it?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.AMD.COM)
hansen@mips.UUCP (04/15/87)
In article <16126@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <16122@amdcad.AMD.COM>, bcase@amdcad.AMD.COM (Brian Case) writes:
> > In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
> > >Thus, we have really precise statistics on what's going on, at least on
> > >our machines, at the user-level, for anything from typical UNIX programs
> > >(like nroff), to large simulators [spice, espresso],
> > >parts of the compiler system [assembler, optimizer, debugger],
> > >to benchmarks like whetstone, dhrystone, linpack.
> > Sigh, I wish we could do such simulations.
> I think Brian misread the previous paragraph to mean that the MIPS simulator
> is able to run these programs in a simulated UNIX environment (i.e.
> simulating the entire UNIX kernel), but I see only user-level mentioned, above.
> Note that we *are* able to perform such simulations, but only in a
> single-tasking, stand-alone environment.
> John -- does the MIPS simulator incorporate a simulated UNIX kernel, and have
> you performed multiprogramming simulations with it?

Yes -- for example, we've run the Byte multi-shell benchmark on our
simulator.  Actually, we have several MIPS simulators, and in addition
to the user-level simulations, we've been running kernel-level
simulations as well.  The kernel-level simulations take just a little
longer, since we use an instruction-level simulator to generate the
address trace.  (The user-level simulations generate the address trace
by an object-code recompilation technique that is also employed by our
profiler.  As a side note, by using this technique, we can simulate
and/or profile any MIPS object code, without having separate profiling
libraries.)  The user-level simulations take into account the effects
of multiprogramming and kernel code execution in simple ways.  We have
found that our simulations have matched the actual run-time, as
reported by the csh time command on our M-series systems, to better
than 5%.
Now that we've got all our MIPS systems connected by NFS, I've been
able to run plenty of simulations.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen
jesup@steinmetz.UUCP (04/15/87)
[re: discussion on word-addressed memory vs. byte-addressed w/ alignment net]

Having an alignment network on chip does not necessarily cost you in
critical path, depending on your design.  In the one design I am
familiar with, the net doesn't cost us anything, even at considerably
more than 30 MHz.  It is done at the end of the cycle that latches the
data onto the chip (if I remember correctly).  In any case, it is not
on the critical path.  In the other direction, it goes through a
network again, and 4 lines are driven as appropriate (write lines for
each byte).

According to our figures, load/stores are about 40-50% of instructions,
with about 2 loads per store.  Lack of direct byte support can
(depending on application) cost you a fair amount.  It all comes down
to the hardware: if it costs you more cycles (on average) to add the
alignment net than it would cost to synthesize the byte/halfword
load/stores from word load/stores and byte insert/extract instructions,
then don't use the net.  But if it's even, or in favor of the net,
definitely use the net.  If you don't, you'll need to decode at least
two more instructions.

Picking numbers out of the air: if the alignment net costs 1 cycle on
loads and stores, and 65% of load/stores are loads, and 40% of
instructions are loads or stores, and the extra cycle can't be filled
50% of the time, it will cost you:

	40% * 65% * 1 cycle * 50% ~= .1 cycles/instruction

(The extra cycle on a store doesn't block later instructions, it just
takes longer for the data to get to memory.)  If you must do byte
insert/extract, each of which costs one cycle (I assume there are
halfword insert/extract, otherwise it'll be worse), and assuming 80% of
accesses are word, 20% non-word, it will cost you:

	40% * 100% * 1 cycle * 20% ~= .1 cycles/instruction

If there are destination/source interlocks, and 50% of the interlocks
are fillable, add .5 cycles/instruction to that, making it .6
cycles/instruction.
Now, all these numbers are fiction, but they aren't far from the actual
numbers we see in our data (I'm at home now).  From my point of view,
any work that might reduce the penalty of the alignment net to less
than 1 cycle is a big win (and as I said, 0 is definitely possible,
even above 30 MHz, depending on design).  Also, you reduce the decode
complexity (maybe) by having a smaller number of instructions.  If you
can save .1 cycles/instruction, you should get about a 10% performance
increase.  Worth a lot of work and silicon, if you ask me.

	Randell Jesup
	jesup@steinmetz.uucp
	jesup@ge-crd.arpa
mash@mips.UUCP (04/15/87)
In article <16125@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
> ...sequence begun by inconvenience of I/O controllers and
> word-addressing, with rebuttals, and comments....
>If we are talking about trying to use existing controllers, such as
>(particularly, actually) Unibus controllers, it's likely they jammed
>the 16-bit registers one after another and a 32-bit word machine
>will find it hard to cope with these controllers.

Yes.

>If we're talking about building new controllers there's no reason why
>you couldn't give each register its own word.  It uses a little more
>address space but only a few bytes more, nothing really.

Yes, this is clearly the thing to do.  I've been assuming that part of
the logic behind all of this is to expect AMD to come out with
carefully-designed controller chips that do this [which will help us
all anyway].  However, it is sad but true that when you go to build
densely-packed high-performance systems, your choice is often limited.
I.e., the original posting was a current and near-term reality
analysis, not a "how it should really be" discussion.

>>... (bcase)
>>Sigh, I wish we could do such simulations.
> (phil again)
>Brian, there are plenty of faster machines available at AMD and you
>ought to consider using them if CPU time is the only constraint.  Or
>(hee hee) buy an R2000 to do simulations on.  Limiting yourself to your
>existing resources is very silly, I think.  (I won't post that Brian
>does his simulations on an IBM-PC to avoid embarrassing him.)

I'm sure MIPS would be happy to sell Brian a nice M/500 suitable for
running lots of simulations.  :-)  Doug VanLeuven is the local sales
guy...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
phil@amdcad.UUCP (04/15/87)
In article <305@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>If we're talking about building new controllers there's no reason why
>>you couldn't give each register its own word.  It uses a little more
>>address space but only a few bytes more, nothing really.
>
>Yes, this is clearly the thing to do.  I've been assuming that part of the
>logic behind all of this is to expect AMD to come out with carefully-
>designed controller chips that do this [which will help us all anyway].
>However, it is sad but true, that when you go to build densely-packed
>high-performance systems, your choice is often limited.
>I.e., the original posting was a current and near-term reality analysis,
>not a "how it should really be" discussion.

I think my point has been overlooked here.  The question of whether a
chip's registers appear at byte, 16-bit, or 32-bit boundaries is
outside of the control of the chip designer.  The board designer
determines this.

To be boringly explicit about this, consider a chip with 8 registers.
You'll get 3 address pins (call them CA0, CA1, CA2) to select one of
the 8 registers and a Chip Select line (CS) to select the chip.  Now,
if the board designer connects these three address lines to the low 3
address lines on the board (CA0-BA0, CA1-BA1, CA2-BA2), the registers
will appear at byte boundaries.  If the board designer skips the bottom
address line and instead hooks up (CA0-BA1, CA1-BA2, CA2-BA3), the
registers will appear at 16-bit boundaries.  Finally, (CA0-BA2,
CA1-BA3, CA2-BA4) will space the registers on 32-bit boundaries.  There
are other problems associated with trying to fit all the needed
functionality into a limited number of pins, but register placement is
not one of them.

If you're not tired of reading about hardware yet, consider the
interface of the Z8530, a dual serial communications controller.  Let
us consider just one half of the device; the other half is essentially
identical.
It has only one control "register" at the chip interface level (it has
a data register too).  First, you load the register with a pointer
value in the range 0-15, and then access the actual register, one of
15.  The problem comes in dealing with interrupts.  If one comes in
after the pointer is loaded and before the actual register is used, and
the interrupt handler needs to use the SCC, two bad things happen:
1) the IH thinks it is writing a pointer when it is really writing a
register; 2) after the IH returns, the routine thinks it is writing a
register but the chip thinks it is receiving a pointer.  The only way
to deal with this is to use a software locking mechanism to reserve the
SCC, since it has this hidden state.  Unfortunately, this makes using a
DMA controller rather hard, since it won't respect any software locks.
Rather, the DMA controller must be turned off before accessing the SCC.
This interface saves three pins.  Even though I design hardware, I
think this is incredibly ugly.

Let's not talk about write-only registers, chips with weird timing
dependencies (the SCC has a cycle recovery time requirement.  When I
first used it, I thought I could just warn the programmers about the
problem.  But they don't read the board manual, so I put in extra
hardware to hide the cycle recovery time.), registers with inverted
logic, and other atrocities perpetrated on helpless programmers by
narrow-minded or otherwise misguided hardware engineers.
-- 
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or amdcad!phil@decwrl.dec.com
rpw3@amdcad.UUCP (04/16/87)
In article <305@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16125@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
>>[...sequence begun by inconvenience of I/O Controllers and
>> word-addressing, with rebuttals, and comments....]
[...then some bemoaning the difficulties of interfacing old chips...]
>>If we're talking about building new controllers there's no reason why
>>you couldn't give each register its own word.  It uses a little more
>>address space but only a few bytes more, nothing really.

Note that you can do the same thing with the older chips, by using
"partial address decode" in your I/O address decoders.  Say you have a
chip with an 8-bit data path, and it has 16 internal registers
(byte-wide) which are addressed with A<3:0>.  Simply connect chip
D<7:0> to bus D<7:0>, and connect chip A<3:0> to bus A<5:2>, and there
you are -- "word-addressed" registers.

Of course, you'll read garbage on D<31:8>, so if you want to avoid
masking that in software, you should add some gates to drive zeros on
D<31:8> whenever an 8-bit device is being read.  This need exist only
once per system, and can be enabled by the central I/O decode system
or, if you have plug-in cards with 8-bit I/O on them, can be a common
bus line that such devices pull when they are being read.  (Exercise
for the reader: make this work if you have a mix of 8- & 16-bit
devices, without adding any extra "zero" drivers.  Hint: you may have
to add another common bus line.)

If you put on your hardware/software tradeoff hat, you'll remember that
we have all been through this before, in one form or another.  Nobody
"waited" for Motorola to come out with 68k peripheral chips; we just
took our old favorite Z-80 (or whatever) peripherals and glued them on.

<<begin bit-level digression>>

Sometimes the solutions get really weird, such as once when I had to
use Z-80 PIO chips to build a 16-bit PIO.
But I needed to be able to talk to the Z-80 PIOs separately, since only
one set of registers on each was involved in the 16-bit thing.
Solution?  Use "unary decoding" in the address decoder.  (Now that
address space is really cheap, at least in the I/O subsystems, we
shouldn't forget this ancient but useful technique.)

"Unary decode" or "unit select" or "linear selection" (it never really
had a standard name) means using a single bit of the address space to
select a device (usually using some high-order binary-decode field to
say we were doing this, so as not to chew up the WHOLE address
space!).  In this way, if you turn on more than one address bit at a
time (in the affected range), you select ALL of the addressed devices!
(Obviously, not good for reading, unless they drive different bits on
the data bus.)

So for example, let's say we have a 32-bit word machine, and we want to
create a 32-bit parallel register for some I/O function, and for some
godawful reason we want to use Z-80 PIOs to do this, rather than
74F374's.  (Look, contrived examples are contrived, o.k.?)  So use four
Z-80 PIOs, and put each PIO on a separate byte of the data bus.  Then
let's say that address 0x87650000 enables this whole mess as a unit
(that is, we tie up 64K of address space -- no biggy).  Then give the
upper unit (the one on D<31:24>) a unit select address of 0x8000, the
next 0x4000, and so on.  Finally, wire each chip's address <1:0> to bus
A<3:2>.  (There is another, even weirder way to do this, described
below.)

So to simultaneously access Register 2 of all the chips, use the
address 0x8765F008.  To access Register 1 of the D<23:16> chip (but no
others), use the address 0x87654004 (but remember that the data is
going in and out on bits <23:16>, so your code has to shift/mask).  If
you wanted to, this arrangement could give you three 32-bit registers,
or six 16-bit ones, or twelve 8-bit registers, OR... one 32-bit
register plus two 16-bit plus four 8-bit, etc. (permutations ad
nauseam).
If you want to get even weirder (as mentioned above), wire the address lines of the PIOs to separate bit fields (we've got plenty); for example, the <31:24> PIO's A<1:0> could be wired to bus A<11:10> (mask 0xC00), the <23:16> PIO's A<1:0> to A<9:8> (mask 0x300), etc. Then to simultaneously access register 0 of the high byte, register 1 of the next, register 2 of the next, and register 3 of the low (<7:0>) byte, use address 0x8765F1B0. [I know, I know, PIOs only have 3 data regs.]

Actually, if I were to do this again, I would probably allocate the addresses a bit differently, putting the unit select and the register address in a separate nybble for each PIO, making hex debugging easier. In that case, you'd use address 0x87658888 to access reg0 of all chips, and 0x876589AB to get the weird ripple addressing of the previous example.

<<end of bit-level digression>>

Finally, the existence of 8- or 16- or 32-bit "peripheral chips" is not going to make any difference at all to the deployment of the new 32-bit processors, since any good-sized system is going to have complete CPUs as I/O processors anyway, and it's on the I/O processor that such games as the above examples will be played. The real factor will be the bandwidth *and* latency of the channel between the "main" CPU (or CPUs, in an mP system) and the I/O system as a whole. Note that for development reasons one may choose to use the same processor chip on the I/O system as for the "main" CPU(s), but it's on the I/O system that the compatibility interface with 8-bit chips will take place. [Counter-point: We may find these new 32-bit chips also being used in "embedded controller" applications for traditional mainframes, where some of the bit-tweaking above still applies.]

Rob Warnock
Systems Architecture Consultant

UUCP:  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:  (415)572-2607
USPS:  627 26th Ave, San Mateo, CA 94403
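As a footnote to the bit-level digression in the previous article: the "one nybble per PIO" allocation preferred there can be sketched as below. The function name and the "value > 3 means unselected" convention are mine; the addresses are the ones from the example:

```c
#include <stdint.h>

/* Nybble-per-PIO allocation from the digression above: each chip owns
   one 4-bit field of A<15:0>, with the field's high bit as that
   chip's unit select and the low two bits as its register number.
   regs[0] is the D<31:24> chip (highest nybble), regs[3] the D<7:0>
   chip; any value > 3 means "leave that chip unselected" (my
   convention, for illustration). */
#define GROUP_BASE 0x87650000u

uint32_t nybble_addr(const unsigned regs[4])
{
    uint32_t a = GROUP_BASE;
    for (int i = 0; i < 4; i++)
        if (regs[i] <= 3)
            a |= (0x8u | regs[i]) << (4 * (3 - i));  /* select bit + reg */
    return a;
}
```

With this, {0,0,0,0} yields 0x87658888 (reg 0 of all chips) and {0,1,2,3} yields 0x876589AB, the "ripple" pattern -- and each chip's select and register read straight off the hex digits in a debugger, which was the point.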
henry@utzoo.UUCP (Henry Spencer) (04/18/87)
> I think my point has been overlooked here. The question of whether a
> chip's registers appear at byte, 16-bit, or 32-bit boundaries is
> outside of the control of the chip designer. The board designer
> determines this.

However, the chip designer is within his rights to put constraints on this (e.g. "memory-mapped I/O will be in big trouble if you don't put the registers at 32-bit boundaries, however much this offends your esthetic sense"). Doesn't mean the board designer will listen, of course... and chip-company management might object on the grounds that the restriction might reduce sales 0.01%.
-- 
"We must choose: the stars or the dust. Which shall it be?"
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,decvax,pyramid}!utzoo!henry
haas@msudoc.UUCP (Paul R Haas) (04/18/87)
In article <16122@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Unquestionably there is a code size penalty. This may or may not be an
>issue given ROM/RAM constraints in some environments.

A code size penalty can become a performance penalty if a loop won't fit in the cache. There is also a cost for the time taken to move the extra code around on various data paths, delaying other users of that data path (delayed writes, other processors, DMA devices, etc...).
- Paul Haas, haas@msudoc.egr.mich-state.edu ...!ihnp4!msudoc!haas
steckel@alliant.UUCP (Geoff Steckel) (04/19/87)
One point that I believe needs to be emphasized in the word-vs-byte discussion: many existing and useful boards (VME, Multibus I & (yuck) II, QBUS, etc, etc, not excepting IBMPCbus) cannot be used except by a CPU with bus-level byte addressing. The life-cycle cost of redesigning an entire system of peripherals just to be able to use the newest CPU seems very high. If I/O space is separate from memory space, one can kluge some very clumsy fixes, but memory-mapped peripherals have problems...

Under my OS and I/O hat, I have quite a stock of horror stories about CPUs with this problem. Given the actual cost of CPUs vs peripherals, guess which one got canned when the problems showed up? Hint - we HAD to keep the peripherals.

Geoff Steckel (steckel@alliant.uucp, gwes@wjh12.uucp)