aglew@urbsdc.Urbana.Gould.COM (04/19/88)
I expected to see something here by now, but maybe I'll have to start: what do y'all think of the WM processor, as described in Computer Architecture News Vol 16 No 1, March 1988, "The WM Computer Architecture", William Wulf? Normally CAN contains a lot of crank articles, but when somebody like Wulf publishes there you have to pay attention.

In brief, the machine is divided into an Instruction Fetch Unit, an Integer Execution Unit, and a Floating Execution Unit (IFU, IEU, and FEU). An instruction may be dispatched to each of these units in a cycle; all instructions are 32 bits, so the decoder is looking at 96 bits at a time. Each IEU and FEU instruction is of the form A op (B op C), so you could dispatch as many as 5 <operations> per cycle. The double op is useful for inner products, range checks, etc. There is a dependency rule that says the result of one instruction is not seen at the inner op of the next. Condition codes are placed in a FIFO, and are taken off the FIFO when the IFU actually needs to branch. Similarly, LOAD and STORE instructions deal with data from FIFOs, which are accessed by writing to and reading from R0 and/or R1. Address computation is decoupled from generation of the data to be stored. You can, for example, set up a stream of data from memory with only one instruction, and thereafter access it with purely scalar ops.

As usual, many of these ideas have been seen before: FIFO load/stores and CCs, the A op (B op C) form in DSP processors. But they're put together in a nice package here. Somebody whom I very much respect says that this is the most important paper in computer architecture since the early papers on RISC.

I'll be making subsequent postings in a little while about issues WM brings out, e.g. the use of FIFOs rather than register renaming - remember Patterson's paper "What do you do with 1000 registers?".
Well, I think that we have gone past the point of diminishing returns for number of registers (although maybe not for highly parallel machines, eh Larry?), and now we have to think of something else to do with land area. How about "What do you do with a dozen ALUs?"
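To make the double-op idea above concrete, here is a sketch in C of why A op (B op C) pays off on an inner product; the WM-style mnemonic in the comment is invented for illustration and is not taken from Wulf's paper.

```c
#include <assert.h>

/* Each inner-product step needs one multiply and one add.  On a
 * conventional RISC that is two instructions per element; a fused
 * A op (B op C) instruction - hypothetically something like
 *     fadd f3, f3, (f1 fmul f2)
 * - would issue the whole step as a single arithmetic instruction.
 */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum = sum + (a[i] * b[i]);   /* A op (B op C): add(sum, mul(a, b)) */
    return sum;
}
```

The same shape covers range checks: `(lo cmp x) and (x cmp hi)` is again one outer op applied to the result of an inner op.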
agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) (04/24/88)
Wulf presented the WM architecture here at a recent CS seminar, and it appeared to be a very clean and efficient system. Most design decisions were backed up with data gathered during his work on high-quality optimizing compilers and appeared to be sound. Of particular appeal to me was the introduction of FIFOs into the load/store instructions (decoupling the time when an address is issued from the time when the data is accessed), as it has the *potential* of allowing more latency in the memory system without degrading the throughput.

However, there are a few dark areas. The WM generates a tremendous load on the memory system: in each cycle, it can generate 9 memory references, half of which could be for more than 32 bits. It seems to me that this implies a multiported *virtual* cache. Cache efficiency could be quite low, as the stream instructions are begging to be used as a powerful, general-purpose vector facility that has the potential to sweep over large portions of memory (caches don't help too much if you are periodically accessing a sizable fraction of the address space, as seen in scientific code).

The other dark area is exceptions/interrupts/pagefaults/context-swaps: there is a lot of state in WM and a fair amount of asynchronicity to be sorted out. Wulf entertained the notion of an imprecise exception (of IBM 360/91 fame) that I don't find too attractive. However, I'm not convinced that this is the only answer to the problem.

-- Andreas Nowatzyk (DC5ZV)
   Carnegie-Mellon University, Computer Science Department
   Arpa-net: agn@unh.cs.cmu.edu
   Usenet: ...!seismo!unh.cs.cmu.edu!agn
ram%shukra@Sun.COM (Renu Raman (Sun Microsystems)) (04/25/88)
In article <1508@pt.cs.cmu.edu> agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:
>Of particular appeal to me was the introduction of fifo's into the
>load/store instructions (decoupling the time when an address is issued
>from the time when the data is accessed) as it has the *potential* of
>allowing more latency in the memory system without degrading the throughput.

This is not new. Read about the ZS-1 in the previous ASPLOS conference proceedings. A real machine exists with queues/FIFOs as the interface to the memory system. For more details read "The ZS-1 Central Processor" by Smith et al., ASPLOS '87.
hankd@pur-ee.UUCP (Hank Dietz) (04/26/88)
In article <50669@sun.uucp>, ram%shukra@Sun.COM (Renu Raman (Sun Microsystems)) writes:
> In article <1508@pt.cs.cmu.edu> agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:
> >Of particular appeal to me was the introduction of fifo's into the
> >load/store instructions (decoupling the time when an address is issued
> >from the time when the data is accessed) as it has the *potential* of
> >allowing more latency in the memory system without degrading the throughput.
> This is not new. Read about the ZS-1 in the previous ASPLOS conference

Actually, it has been common for a while now. I don't remember the model, but I know at least one of CSPI's array processors used FIFO interfaces not only to decouple memory references, but also to decouple interactions between control, address generation, and arithmetic hardware. You are right about the ZS-1... it looks very similar to WM. Besides, for the past 6 or 7 years, at least a few hardware people I know have used "dataflow microarchitecture" (i.e., FIFO-interconnected functional units) in building conventional-looking special-purpose machines.

There is a catch, however, in that FIFOs to shared resources start showing the usual dataflow problems: operand-to-operation matching and high bus bandwidth requirements. These are non-trivial problems to solve dynamically. WM solves them by forcing operations of each type to be executed in the original sequence, although different types of operations can execute in a variable order relative to each other. This is quite a strong restriction. Personally, I'd rather see these problems solved by static scheduling at compile time, a la VLIW (but not a VLIW machine).
Burton Smith has a design which uses static information to control dynamic out-of-sequence evaluation (he's been telling me about this for some years now, but he's building it, not publishing on it), and I've been involved in a couple of designs with similar properties (e.g., SBMs -- Static Barrier MIMDs -- papers/references available upon request). Statically scheduled, but not necessarily fixed-sequence, machines seem to have the benefit of very simple hardware supporting a very general execution mechanism, and the compiler technology to get good results is easy enough (for us compiler gurus :-).

-- Hank Dietz, Compiler-oriented Architecture Researcher from Purdue
   hankd@ee.ecn.purdue.edu
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (04/26/88)
In article <7992@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Actually, it has been common for a while now. I don't remember the model,
>but I know at least one of CSPI's array processors used FIFO interfaces not
>only to decouple memory references, but also to decouple interactions
>between control, address generation, and arithmetic hardware.

Yes, CSPI had such a machine in the mid-70's. They published about it, in IEEE Computer (I think). The address unit computed addresses and put them into an address FIFO (FIFOs?). The data unit could load and store from a pair of data FIFOs. However, the CSPI machine was different from WM in that the two decoupled units executed from two different instruction streams. Data-dependent branching was not an option. This extreme decoupling was limited by the FIFO depth (3, to avoid fall-through time). I believe that CSPI simulated the architecture first, and were going for a win of about 2x speedup (on FFT-ish problems).

-- Don
   lindsay@k.gp.cs.cmu.edu  CMU Computer Science
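A toy software model of the decoupling being discussed, assuming a single load FIFO of depth 3 as in the CSPI description; the names and structure are invented for illustration, not taken from any of the machines mentioned.

```c
#include <assert.h>

#define FIFO_DEPTH 3   /* CSPI-style depth; bounds how far ahead the */
                       /* address unit can run before it must wait   */

typedef struct {
    int buf[FIFO_DEPTH];
    int head, tail, count;
} fifo_t;

/* Address unit side: enqueue a loaded value.  Returns 0 when the
 * FIFO is full, i.e. the issuing unit would stall this cycle. */
int fifo_push(fifo_t *f, int v)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
    return 1;
}

/* Arithmetic unit side: dequeue the oldest value, strictly in
 * order.  Returns 0 when the FIFO is empty (consumer stalls). */
int fifo_pop(fifo_t *f, int *v)
{
    if (f->count == 0)
        return 0;
    *v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return 1;
}
```

The point of the exercise: the two "units" share no state except the queue, so the producer can run several references ahead of the consumer, and the in-order discipline of the queue is what sidesteps the operand-to-operation matching problem Hank mentions.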
jjb@sequent.UUCP (Jeff Berkowitz) (04/27/88)
In article <1508@pt.cs.cmu.edu>, agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:
> Of particular appeal to me was the introduction of fifo's into the
> load/store instructions (decoupling the time when an address is issued
> from the time when the data is accessed) as it has the *potential* of
> allowing more latency in the memory system without degrading the throughput.

Culler Scientific implemented this feature. It is visible at the instruction set level, so the compiler could emit code like

	...
	load	(r2)
	<arbitrary instructions during latency>
	add	fifo, r0, r1
	...

Note that this is pseudocode, not Culler assembler. The depth of the FIFO is four. It is possible to issue three or four reads in succession, then begin using the data as it arrives from memory. The performance gain is substantial; in practice, the compiler is very often able to schedule reads in advance and make use of the latency.

Making this design interruptible was complex. Culler implemented hardware interlocks to stall the processor on an attempt to read an empty FIFO. The machine included a hardware deadlock detector which would generate a trap in case of an infinite stall. (I'm not advocating anything, just reporting on a real implementation.) I thought this tidbit would be interesting despite the machine's well-known commercial failure.

Late in our design, we also became aware of a group - maybe they were at U. of Wisconsin? - which had independently designed a machine called PIPE. This machine made heavy use of queues, and the architects had done some worthwhile simulations. Sorry I can't be more specific; my article collection has not yet followed me to Oregon.

-- Jeff Berkowitz			...!tektronix!sequent!jjb
   Sequent Computer Systems		Beaverton OR
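A sketch of the stall-on-empty interlock and deadlock trap described above, in the spirit of a simulator rather than the Culler hardware; the cycle limit standing in for the hardware deadlock detector is an invented parameter.

```c
#include <assert.h>

#define DEADLOCK_LIMIT 1000   /* invented stand-in for the hardware detector */

/* One slot of the load FIFO: valid is set when memory delivers data. */
typedef struct {
    int data;
    int valid;
} load_fifo_t;

/* The consumer reads the FIFO head.  If it is empty, the interlock
 * stalls the "processor" one cycle at a time; if no data ever
 * arrives, the stall counter hits the limit and we trap, as the
 * Culler deadlock detector would.  Returns 0 on success, -1 on trap. */
int read_fifo(load_fifo_t *f, int *out)
{
    int stalled = 0;
    while (!f->valid) {
        if (++stalled >= DEADLOCK_LIMIT)
            return -1;   /* deadlock trap */
        /* a fuller simulator would advance the memory model here */
    }
    *out = f->data;
    f->valid = 0;        /* slot consumed */
    return 0;
}
```

The awkward part the post alludes to is that on an interrupt all of this in-flight state (the FIFO contents plus any outstanding reads) must be saved and restored, which is exactly what made the design hard to make interruptible.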