firth@sei.cmu.edu (Robert Firth) (05/05/88)
This post is a fairly long review of the WM machine recently described in Computer Architecture News. It is about 280 lines long, and follows this form feed. It assumes (a) you have read the paper, and (b) care tuppence about my thoughts. The opinions, misunderstandings, errors &c are mine alone. The WM Computer Archicecture ============================ [Wm A Wulf, Computer Architecture News, 16 (1988) pp 70..84] Comments by Robert Firth, CMU-SEI, May 1988 Introduction ------------ The WM computer architecture, described in the cited paper, is a sketch of part of the order code of a new computer processor. This processor is designed, it its author's words, to retain the RISC philosophy while achieving significantly greater performance than other RISC designs. This Note comments on the major features of the architecture as described, and reflects on their possible effectiveness. Major Features -------------- The major features of the WM architecture are (a) a load/store design (b) loads and stores use implied FIFO operand and result queues (c) operations use registers (or FIFO) for operands and results (d) each instruction performs TWO operations on THREE operadns, after the pattern result := rand1 rator1 (rand2 rator2 rand3) (e) relational operators return their right operand as result and load a condition into a FIFO; conditional jumps read the condition FIFO There are also facilities for a floating point coprocessor, a "stream mode" load or store, and multiple load FIFOs. As shown, the design allows for 32 named integer registers (of which one or two may in fact be FIFOs), 32 floating registers (similarly), 16 general integer operations, 16 general floating operations, and 32 "special" operations. The paper does not describe any of these operations, but by implication they include arithmetic, logical, and relational. General Observations -------------------- The first observation is that the paper presents only a sketch. 
There is no detail of any of the operations, no detail on the special
instructions other than a brief account of the "stream" loads, and no
description of even so important an operation as the subroutine call.
Moreover, while the claims for the machine are made in terms of
execution speed, data access speed, and so on, the presentation is
purely at the level of the instruction set architecture; there are no
timing diagrams, data flow diagrams, or other lower-level detail.  This
incompleteness makes it hard to assess the design; nevertheless, the
following sections offer some critical commentary, on specific and more
general features.

The next observation is clear from Figure 1 of the reference: the
instruction set space is almost saturated.  As Bell and Newell remark,
this is a mistake in the design of a new machine, since it leaves no
room for extensions or afterthoughts.  In detail: the instruction class
is indicated by the leading four bits, and all combinations are used.
In all classes except 2#1110# (special), the remaining bit patterns are
almost fully allocated.  The special class admits 32 operations; the
text implies that a large number are already allocated.

Load/Store
----------

A load/store design serves two purposes

(a) it simplifies the instruction set by allowing general addresses to
    appear in only a few places.  In particular, it usually allows the
    instruction encoding to use a single uniform size.

(b) it improves performance by uncoupling the fetch of a value from
    its use, thereby allowing the fetch delay to be overlapped with
    useful work.

Previous RISC experience shows that these purposes are well served by
this design, and there seems little reason to doubt the WM design will
work as well.

FIFOs
-----

The use of FIFO queues, however, is more problematical.  There is some
gain in instruction density, since the destination of a load, or the
source of a store, need not be specified.
But this is a very small gain, since most of a load or store
instruction is taken up with the memory address.

I do not believe there is any performance gain.  If loads or stores can
be overlapped, it is relatively easy for a compiler to target
successive loads into different registers.  With 32 general registers
available, there will surely be enough for any reasonable sequence of
loads or stores.  The one exception - of which there is an example in
the paper - is when the loads are in a tight loop and the load delay
runs round the loop.  But there the rate-determining step is the load,
regardless of how it is buffered, and loop unrolling can almost always
avoid load destination clashes.

Moreover, loads into explicit registers offer several advantages.
First, the operands need not be used in the order in which they are
loaded, which is often helpful, especially if operations have to be
reordered to take advantage of parallel ALUs.  Secondly, and far more
important, the use of an operand does not destroy it.  It is
significant that none of the examples given in the paper ever uses the
same operand twice; such an example would show clearly how a load FIFO
forces an extra instruction to be generated.

Finally, FIFOs cause severe implementation problems, one of which at
least - efficient saving and restoring of FIFO contents on a task
context switch - has never to my knowledge been satisfactorily solved.
Other serious problems include resetting the FIFO after an interrupt
or exception, and resuming an orderly flow of loads and stores after a
memory-management trap.

In sum, I believe the use of FIFOs offers no advantages over loads and
stores into and out of the general registers, and incurs several
disadvantages.

Operations use Registers
------------------------

This is another piece of the normal RISC approach, and is again
probably a good feature.
However, even a RISC machine must provide operations directly on
memory, for processor synchronization; the absence of any such
instructions in WM might be thought a major oversight.  But it is clear
from the paper as a whole that the author is describing really only
that part of the machine that a compiler writer would see.

Two Operations per Instruction
------------------------------

This feature seems to me quite unwarranted.  One major point of a RISC
design is that instructions are simple enough to be executed quickly.
But here is a machine where, within one instruction, there are two
operations, the one gated on the result of the other.  This forces the
instruction to take twice as long, regardless of how many execution
units might be available.

A design in which twice as many instructions are executed, but
successive instructions are independent, clearly cannot perform worse,
and will perform better if several execution units can be run in
parallel.  One main lesson of prior designs - achieve higher
parallelism by uncoupling mutually dependent operations - has simply
been ignored.

The WM sketch intends the first half of each instruction to be
overlapped with the second half of the preceding instruction, and there
is a dependency rule about that.  But the same apparatus is all that is
needed to execute independent instructions in parallel, so the
implementation complexity of WM is no less, but its performance is
impeded by the extra dependency within each instruction.

Relational Operations and Conditions
------------------------------------

It is a matter of debate whether condition codes are justified, rather
than having the comparison yield a Boolean result or even drive the
conditional branch directly.  Proponents of RISC would probably tend to
regard condition codes as a needless complication.  If that is so,
then, a fortiori, a condition-code FIFO is a multiply needless
complication.
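To make the mechanism concrete, here is a rough model of the condition
FIFO as the paper describes it: a relational operator returns its right
operand and enqueues a condition, and a conditional jump consumes the
oldest condition.  (A sketch in Python; the names less_than and
branch_if_true are mine, not WM's.)

```python
# Illustrative model of the WM condition FIFO (invented names, not the
# paper's).  Per feature (e): a relational operator returns its right
# operand as result and enqueues a condition; a conditional jump
# dequeues the oldest condition.  Note each condition can be consumed
# exactly once.

from collections import deque

cond_fifo = deque()

def less_than(a, b):
    """WM-style relational op: enqueue the condition, return right operand."""
    cond_fifo.append(a < b)
    return b

def branch_if_true():
    """Conditional jump: consumes (dequeues) the oldest enqueued condition."""
    return cond_fifo.popleft()

less_than(1, 2)                    # enqueues True
less_than(5, 3)                    # enqueues False
assert branch_if_true() is True    # first comparison, FIFO order
assert branch_if_true() is False   # second comparison
assert not cond_fifo               # conditions are gone once consumed
```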
In many years of practical programming, I can recall only a couple of
cases where it would have been useful to enqueue two sets of condition
codes for successive testing.  Accordingly, I must pronounce this
feature useless.  Moreover, there are several RISC machines where a
comparison, test, and jump decision can be done in one cycle.
Therefore, no purpose is served by separating the comparison from the
jump decision, since there is no latency to be overlapped.  And the
branch-delay optimisation issue is equally present, under a different
guise, in the WM machine, as its author admits.

The presence of yet another FIFO adds yet more complexity to the
handling of traps, interrupts, or context switches.

Address Modes
-------------

As a final point of detail, let us look not at a feature, but at a
non-feature.  The WM machine has no address modes.  Instead, the load
and store instructions perform an integer computation, again of the
form

	rand1 rator1 (rand2 rator2 rand3)

and the result becomes the effective address for the memory reference.

Consider first, how one might access a simple static variable, at a
given (32-bit absolute) address.  Clearly, one cannot say the
equivalent of

	Load @16#abcdefgh#

since the machine does not permit an absolute address as an operand.
Another way would be to store the address in a literal pool, and access
the operand indirectly, by a PC-relative or register-based mode.  This
is the strategy on the CA LSI-2, for instance.  But WM has no
indirection, and no PC-relative mode.  Alternatively, one could load
the address into a register, using a load literal instruction (as is
done on the M/500).  But there is no such instruction.

As far as can be discovered, an address can be composed only of one,
two, or three registers, and one or two 5-bit literals.  This means it
is almost impossible, in one instruction, to construct the address of
anything efficiently.  To generate a 32-bit static address, by my
count, takes three instructions.
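To make the 5-bit-literal limitation concrete, here is a rough model of
the effective-address computation rand1 rator1 (rand2 rator2 rand3),
under my reading of the paper.  The function names (ea, add, noop) are
mine, for illustration only.

```python
# Rough model (my notation, not WM's encoding) of the WM address
# computation  rand1 rator1 (rand2 rator2 rand3), where each "rand" is
# either a register value or a 5-bit literal in 0..31.

def ea(rand1, rator1, rand2, rator2, rand3):
    """Effective address: rand1 rator1 (rand2 rator2 rand3)."""
    return rator1(rand1, rator2(rand2, rand3))

add  = lambda a, b: a + b
noop = lambda a, b: a        # "noop noreg": pass the left operand through

# With two 5-bit literals and no registers, one computation with add
# reaches at most 31 + 31 = 62 -- nowhere near a 32-bit absolute
# address, which is why composing one takes several instructions:
assert ea(31, add, 31, noop, 0) == 62

# Whereas a register plus a small displacement is easy:
fp = 0x1000
assert ea(fp, add, 24, noop, 0) == fp + 24    # (fp) + (24 noop noreg)
```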
To generate a 16-bit unsigned displacement takes two instructions, and
a third is then required to access the based variable so addressed.

As another example, consider accessing a local variable.  If it is
within 32 bytes of the frame pointer, we are lucky:

	(fp) + (displacement noop noreg)

will serve.  If it is, say, at local address 40, then we might say

	(fp) + (8 + (rx))

where rx holds the useful constant 32, and at the price of an extra
addition (and surely an extra cycle) we can now encode a six-bit
displacement.  (Actually, it would be marginally better to reverse the
registers).  But how are we to load 32 into rx?  Perhaps by

	rx := 16 + (16 noop noreg)

To summarise: the addressing capability of the machine is desperately
inadequate.  It might, as its author says, allow "scaled-index" modes
to be generated, but one pays a high price for this fairly negligible
capability.  It would have been far better to use the bits of this
instruction mode to encode, for instance, base register and signed
16-bit displacement.  In fact, they could have encoded two base
registers and such a displacement.

Conclusion
----------

The WM machine contains some standard RISC features and some novel
features.  Almost all the novel features are both complicated and bad;
the claims that they will enhance performance are, in this author's
opinion, unfounded:

. FIFO load and store queues give inferior performance, at far higher
  cost, than loads and stores that use general registers

. executing two coupled operations per instruction creates needless
  dependency, and so reduces performance

. a FIFO condition code queue serves no purpose and solves no problem

. the machine is crippled by inadequate addressing capability

Finally, any implementation of this design will be faced with serious
problems in building correct synchronous traps, efficient task context
switches, and transparent memory management.

--------
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/05/88)
I thought I would add a little technical support to a couple of the
points.

In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In sum, I believe the use of FIFOs offers no advantages over loads and
>stores into and out of the general registers, and incurs several
>disadvantages.

Also, the FIFO that the results of load requests return to will in
general have to sort the arriving data, so that any scrambling of
arrival order - which is very frequent in a multiprocessor - is fixed
up.  The internal structure used to do this looketh very much like
registers, so why not just stay with registers!

>
>. the machine is crippled by inadequate addressing capability

No scatter/gather capability for vectors, something that has become
very popular on supercomputers, with good reason.
baum@apple.UUCP (Allen J. Baum) (05/05/88)
--------
[]
>In article <6762 (Eugene D. Brooks III) writes:
>I thought I would add a little technical support to a couple of the
>points.

Me, too.  It seems to me that it is entirely likely that a stream
operation will get interrupted, by a page fault, if nothing else.
Saving and restoring is a pain, and unlikely to be done without some
very special instructions, or a bunch of 'CISC' like "save the state of
the machine on the ?stack?" operations.  Furthermore, if I have a two
operand stream operation (both fifo's busy), then I'm locked out of
memory until the streams are finished!

Overall, I see very little here that a VLIW machine couldn't do,
including the stream operations, which would probably be trivial in a
VLIW.  The WM has, as far as I can see, only the virtue of being
compact, and it isn't very clear that all the extra instructions you
need to solve some of its problems won't expand the instruction stream
enough to make that marginal as well.

Since a lot of the criticisms are fairly obvious, it could be that we
just don't know all the details.  Wulf has been around a while, and
must have considered some of these.  Maybe we'll hear from him, and
find out what the real story is.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
jangr@microsoft.UUCP (Jan Gray) (05/06/88)
In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>
>Address Modes
>-------------
>

For the WM computation O1 op1 (O2 op2 O3), assume that we have:

two inner operators:
	1lit10	::= ((O2 << 5) | O3 | (1 << 10))
	0lit10	::= ((O2 << 5) | O3)

two outer operators:
	lli	::= Rd[15-0]  <- (O1 << 11) | (O2 op2 O3)   ("load lower immediate")
	lui	::= Rd[31-16] <- (O1 << 11) | (O2 op2 O3)   ("load upper immediate")

(I know that instructions are actually of the form Rd := O1 op1 (O2 op2 O3)
so that "lui" isn't *strictly* legal given the description he presents,
but also assume Wulf isn't telling us all the little details...a safe
assumption!)

>Consider first, how one might access a simple static variable, at a given
>(32-bit absolute) address.  Clearly, one cannot say the equivalent of
>
>	Load @16#abcdefgh#
>
>since the machine does not permit an absolute address as an operand.

If you allow my assumption above, you can load a 32 bit constant into a
register using lli and lui.  This still takes two cycles but saves you
a read from the literal pool.  You still have to issue a load on this
address...

>Alternatively, one could load the address into a register, using a load
>literal instruction (as is done on the M/500).  But there is no such
>instruction.

No instruction as such, but it can be fabricated from the appropriate
operators...

>As another example, consider accessing a local variable.  If it is
>within 32 bytes of the frame pointer, we are lucky:
>
>	(fp) + (displacement noop noreg)
>
>will serve.  If it is, say, at local address 40, then we might say
>
>	(fp) + (8 + (rx))
>
>where rx holds the useful constant 32, and at the price of an extra
>addition (and surely an extra cycle) we can now encode a six-bit
>displacement.  (Actually, it would be marginally better to reverse the
>registers).  But how are we to load 32 into rx?
>Perhaps by
>
>	rx := 16 + (16 noop noreg)

With the 1lit10 and 0lit10 operations above, you can access 2K bytes
of the frame:

	(fp) + (#10101 0lit10 #010101)

>Finally, any implementation of this design will be faced with serious
>problems in building correct synchronous traps, efficient task context
>switches, and transparent memory management.

That's for sure!

According to the paper, Wulf has been simulating this design on "a
fairly large set of benchmark fragments".  He says this architecture
can do about 4 RISC-like instructions per cycle.  I believe him.  What
I'm less sure of is whether there is a memory architecture that will
keep this machine running.  Hopefully someone will implement this
architecture (or simulate a real-world memory architecture for it) and
we'll all know for sure.

These are my opinions, but not necessarily my employer's.

Jan Gray    uunet!microsoft!jangr    Microsoft Corp., Redmond Wash.
206-882-8080

------------------------------
"Application for patent protection has been made for portions of the
material described here".  Right.
bcase@Apple.COM (Brian Case) (05/06/88)
In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>Two Operations per Instruction
>------------------------------
>This feature seems to me quite unwarranted.  One major point of a RISC
>design is that instructions are simple enough to be executed quickly.
>But here is a machine where, within one instruction, there are two
>operations, the one gated on the result of the other.  This forces the
>instruction to take twice as long, regardless of how many execution
>units might be available.

No, no, no.  The data dependency rule is included specifically to allow
a pipeline stage between the two ALUs.  Wulf is clearly assuming the
pipestage is implemented (at least it is clear to me).  The basic cycle
of the WM machine should be no longer than that of any other RISC
machine.

>A design in which twice as many instructions are executed, but successive
>instructions are independent, clearly cannot perform worse, and will perform
>better if several execution units can be run in parallel.  One main lesson
>of prior designs - achieve higher parallelism by uncoupling mutually dependent
>operations - has simply been ignored.

I don't think Wulf is ignoring anything.  The complexity of an
n-instruction-at-a-time machine (n > 1) is much greater than that of a
WM machine.  I don't know how much greater, but it is greater.  Maybe
significantly.

>The WM sketch intends the first half of each instruction to be overlapped with
>the second half of the preceding instruction, and there is a dependency
>rule about that.  But the same apparatus is all that is needed to execute
>independent instructions in parallel, so the implementation complexity of
>WM is no less, but its performance is impeded by the extra dependency
>within each instruction.

The implementation complexity of WM is A LOT LESS than that of a
machine that tries to execute two instructions at the same time (two
fewer register file ports, no "scoreboarding" needed, etc.).
At least *I* would much rather implement a WM than a
two-instruction-at-a-time conventional RISC.  However, the performance
of a two-instruction-at-a-time conventional RISC might be greater, and
the approach is generalizable (but then, just implement a
two-instruction-at-a-time version of the WM...).

>Moreover, there are several RISC machines where a comparison,
>test, and jump decision can be done in one cycle.  Therefore, no purpose
>is served by separating the comparison from the jump decision, since
>there is no latency to be overlapped.  And the branch-delay optimisation
>issue is equally present, under a different guise, in the WM machine, as
>its author admits.

Those RISC machines that implement compare/branch in one instruction
are fudging a bit (unless the compare is very simple, like == 0,
== reg, != 0, != reg).  The purpose served by separating the compare
from the branch is quite important!  This is called, by some, branch
spreading, and it allows early evaluation of the branch so that latency
is reduced to zero.  This is my one big complaint with Wulf's paper: he
didn't reference previous branch spreading work done on the CRISP and
RIDGE machines.
wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf) (05/07/88)
[As I was editing my last message, our VAX went down -- somehow the air
conditioning had gotten turned off, and the VAX was at risk of a
melt-down.  Anyway, I was editing the middle of the message, and had
just said that I didn't have space to describe the OS support features.
What I would have said next was ...]

... I am putting the finishing touches on the arch. definition, and
hope to make it available as a UVa report in the next few weeks.  I'll
be happy to mail it to folks who send me a request -- it will tell you
a lot more than you probably want to know.

Briefly, however, there are really two aspects that I might mention
here.

First, note the class of "special" instructions isn't described in the
paper.  Basically the instructions in this class (which includes the OS
support stuff) take multiple cycles and are sorta CISC-ish (I HATE
these labels; people get all hung up on the labels at the expense of
the substance).  They are segregated this way so that the normal
decode-ex path is not delayed by them.  They, in turn, translate into a
sequence of the simpler (RISC-ish) instructions.  They take as long as
they take!  They could not be implemented faster another way, they do
not slow down the other, more common, instructions, and they can safely
manipulate state (in prescribed, safe ways) that is not normally
accessible in another way (eg, the FIFO state).

Second, I wanted to provide a set of support for OS functions that
provides an overall speedup without impacting the cycle time -- I
didn't want to build a 100MIPS processor, and then throw away 30-50% of
it on OS related stuff.  The best way I know how to do that is to make
more OS-ish operations directly, and SAFELY, available from normal user
code (as opposed to "kernel mode"), thus avoiding the considerable
overhead of kernel entry/exit.  I think we have found a simple and
elegant way to do that -- one that, in fact, eliminates kernel mode
altogether.
Basically, the scheme involves typing pages, interpreting the rights
bits differently for each type, and restricting certain operations to
operate on certain types of pages.  For example, TCB (Task Control
Block) is a hardware-understood page type.  There is a Context-Swap
instruction that must name a TCB page as its operand -- and, to be
executed, must have the rights bits set appropriately.  The net effect
is that it's possible (NOT required) to allow an ordinary user program
to act as a scheduler for a set of tasks (eg, it might be an Ada
runtime system), even though that program has no other privileges wrt
the tasks it schedules (eg, it can't access the code/data of the tasks,
modify their TCBs, etc.)

Bill
wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf) (05/07/88)
[..... Aaaaarrrrrgggghhhhh.  The perversity of inanimate objects!  This
is the message that should have gotten sent BEFORE the last one.
Grumble, mutter, sputter, fume .... ]

I am new to this ... I only began reading this newsgroup a few days ago
when one of the folks here told me that some articles had appeared on
WM.  Alas, I haven't seen them all (like Firth's), so forgive me for
jumping into the middle of this without full information, and forgive
any misunderstandings I may have as a consequence.

Re Baum's comments about streaming state and context swaps:

There are indeed a number of instructions in the full WM defn for
saving/restoring state, including that of the FIFOs -- though perhaps
not so many as he fears.  As should be obvious, I made a conscious
decision to increase the amount of state (witness 64 registers, 32
integer and 32 floating point) to improve performance at the expense of
context-swapping.  I am not especially concerned about the number of
instructions this involves (it's only a handful in any case), but the
size of the state is a legit concern.

For what it's worth, the rationale was as follows:  Processors are
getting ever faster.  Interrupt rates have gotten higher too, but not
as much so, since they are governed more by the external world -- and
people don't type faster, disks don't rotate especially faster, etc.  I
think this trend will continue, so in planning processors for the
future, it's reasonable to consider increasing the amount of state if
that speeds up the processor.  One can't be silly about it, like
blowing up the state by orders of magnitude, since interrupt rates and
the required response time to some of the interrupts are increasing,
but a modest increase is a reasonable thing to consider.  I did, and
opted for (some) more state.

I don't want to try to describe all the architectural features of WM
related to OS issues (of which state save/restore is just one) here --
it would just take too much space.
<<insert previous article here>>

Re Case's reply to Firth (which, again, I haven't seen):

Right on.  The two ops/instruction add nothing to the cycle time.  But
note that they DO add to the latency.  Considering the simple 3-stage
pipe model of RISCs (decode, op, write-back), WM has 4 stages (decode,
op, op, write-back) -- so a result is available 4/3's later.  Whether
this wins or loses depends on whether you've done one or two operations
in the instruction.  If 1, and you couldn't dispatch an unrelated
operation on the next cycle, you lose by 30%.  If you use 2, you win by
50% (4/3s rather than 6/3s).  The data says you win!

As I said in the CAN article, I was very surprised by the frequency of
use of 2 ops/cycle.  When I first started this, I was really only
exploring a number of ways to make all instructions exactly 32 bits.
The 2 ops/instr was just one idea, and I expected it to be a loser --
so I built a quick and dirty compiler that generated this format, and
was blown away by the results.  (Aside -- After getting the preliminary
results, I tried 3, 4, ..., ops/instr (ignoring the fact that I
couldn't encode them in 32 bits) and found that 2 is magic; you get
very little benefit from more than 2.)

An added (perhaps obvious) point to Case's reply, re add-comp-br vs.
the way WM does things:  Since the IFU is decoupled from the arithmetic
units, and hence the cond branches can be executed early (assuming the
CC is available), this means that you can start filling the instruction
buffer(s) early too, thus eliminating instruction-fetch latency.  You
can statistically get a similar effect with branch-prediction, but the
WM scheme works independent of the way the branch is taken -- so,
specifically, works for the if-then-else case that is the bug-a-boo of
branch-prediction schemes.
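For what it's worth, the 30%/50% latency figures above can be checked
mechanically.  This is a sketch under the stated assumptions only: a
3-stage RISC pipe, a 4-stage WM pipe, and fully dependent operations
with no unrelated work to overlap.

```python
# Check of the latency arithmetic in the post: a conventional RISC
# result is ready after a 3-stage pipe (decode, op, write-back); a WM
# result after 4 stages (decode, op, op, write-back).

RISC_STAGES = 3
WM_STAGES = 4

# One operation per WM instruction, nothing unrelated to dispatch:
# latency grows by 4/3, i.e. roughly 30% slower.
slowdown = WM_STAGES / RISC_STAGES
assert abs(slowdown - 4/3) < 1e-9      # the "lose by 30%" case

# Two operations per WM instruction vs. two dependent RISC
# instructions: 4 stages vs 6, so WM is 1.5x faster.
speedup = (2 * RISC_STAGES) / WM_STAGES
assert speedup == 1.5                  # the "win by 50%" case
```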
General caveat -- at the time I wrote the WM article, I hadn't seen any designs that handle conditionals like WM (CRISP gets some of the same effect, but with an entirely different, and I think more expensive, mechanism). Since then, the ZS-1 design was pointed out to me, which has a very similar mechanism, has a FIFO interface to memory, and decouples the int & fp units in a similar way. Some ideas just seem to have a time to be ripe.