[comp.arch] Wulf's WM

aglew@urbsdc.Urbana.Gould.COM (04/19/88)

I expected to see something here by now, but maybe I'll have to start
it myself - what do y'all think of the WM processor, as described in
Computer Architecture News Vol 16 No 1, March 1988, "The WM
Computer Architecture", William Wulf. Normally CAN contains a lot
of crank articles, but when somebody like Wulf publishes there
you have to pay attention.

In brief, it is divided into an Instruction Fetch Unit, an Integer
Execution Unit, and a Floating-point Execution Unit - IFU, IEU, and FEU.
One instruction may be dispatched to each of these units per cycle; all
instructions are 32 bits, so the decoder is looking at 96 bits at a time.

Each IEU and FEU instruction is of the form A op (B op C), so you
could dispatch as many as 5 <operations> per cycle. The double op
is useful for inner products, range checks, etc. There is a dependency
rule which says that the result of one instruction is not visible to
the inner op of the next.
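
To make the double-op form concrete, here is a minimal C sketch of the
sort of inner-product loop it targets; the mapping in the comment is my
own reading of the paper's description, not Wulf's notation.

    #include <stddef.h>

    /* Dot product: each iteration computes sum = sum + (a[i] * b[i]),
       which fits the A op1 (B op2 C) pattern -- A = sum, op1 = +,
       B = a[i], op2 = *, C = b[i] -- i.e. one FEU instruction per
       iteration (my reading, not Wulf's syntax). */
    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum = sum + (a[i] * b[i]);   /* A = A + (B * C) */
        return sum;
    }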

Condition codes are placed in a FIFO, and are taken off the FIFO 
when the IFU actually needs to branch. Similarly, LOAD and STORE
instructions move data through FIFOs: data is enqueued by writing to
R0 and/or R1 and dequeued by reading from them. Address computation is
decoupled from generation of the data to be stored. You can, for example,
set up a stream of data from memory with only one instruction, and
thereafter access it with purely scalar ops.
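
As a rough software model of what that streaming interface buys you (the
struct and function names below are mine and purely illustrative - the
actual mechanism is just reads of R0/R1):

    #include <stddef.h>

    /* Illustrative model only, not the WM ISA: a stream is set up once,
       after which each read delivers the next element, much as reading
       R0 would on WM. */
    struct stream {
        const double *next;    /* next element to deliver */
        ptrdiff_t     stride;  /* distance between elements */
    };

    static void stream_open(struct stream *s, const double *base,
                            ptrdiff_t stride)
    {
        s->next = base;
        s->stride = stride;
    }

    static double stream_read(struct stream *s)  /* like reading R0 */
    {
        double v = *s->next;
        s->next += s->stride;
        return v;
    }

    /* After stream_open(), the loop body is purely scalar operations. */
    double sum_stream(const double *a, size_t n)
    {
        struct stream s;
        double total = 0.0;
        stream_open(&s, a, 1);
        for (size_t i = 0; i < n; i++)
            total += stream_read(&s);
        return total;
    }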

As usual, many of these ideas have been seen before: FIFO load/stores
and CCs, the A op (B op C) in DSP processors. But they're put together
in a nice package here. Somebody whom I very much respect says that
this is the most important paper in computer architecture since 
the early papers on RISC.

I'll be making subsequent postings in a little while about issues
WM brings out, e.g. the use of FIFOs rather than register renaming
- remember Patterson's paper "What do you do with 1000 registers?".
Well, I think that we have gone past the point of diminishing
returns for number of registers (although maybe not for highly
parallel machines, eh Larry?), and now we have to think of something
else to do with land area. How about "What do you do with a dozen
ALUs?"

agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) (04/24/88)

Wulf presented the WM architecture here at a recent CS seminar, and it
appeared to be a very clean and efficient system. Most design decisions
were backed up with data gathered during his work on high-quality
optimizing compilers and appeared to be sound.

Of particular appeal to me was the introduction of FIFOs into the
load/store instructions (decoupling the time when an address is issued
from the time when the data is accessed), as it has the *potential* of
allowing more latency in the memory system without degrading throughput.
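
To see why the decoupling helps, here is a toy C simulation (entirely my
own model, not something from the WM or ZS-1 papers): one address can be
issued per cycle, data returns LATENCY cycles later, and the consumer uses
at most one datum per cycle.  With DEPTH requests allowed in flight,
steady-state throughput stays at one element per cycle whenever
LATENCY <= DEPTH; shrink DEPTH below LATENCY and the consumer stalls.

    #include <stdio.h>

    /* Toy model of a decoupled load FIFO; the numbers are arbitrary. */
    enum { N = 1000, LATENCY = 8, DEPTH = 8 };

    int main(void)
    {
        int issue_cycle[N];
        int issued = 0, consumed = 0, cycle = 0;

        while (consumed < N) {
            /* use the oldest datum once it has arrived */
            if (consumed < issued && cycle >= issue_cycle[consumed] + LATENCY)
                consumed++;
            /* issue the next address if a FIFO slot is free */
            if (issued < N && issued - consumed < DEPTH)
                issue_cycle[issued++] = cycle;
            cycle++;
        }
        printf("%d elements in %d cycles (%.2f per cycle)\n",
               N, cycle, (double)N / cycle);
        return 0;
    }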

However, there are a few dark areas:

The WM generates a tremendous load on the memory system: in each cycle,
it can generate 9 memory references, half of which could be for more than
32 bits. It seems to me that this implies a multiported *virtual* cache.
Cache efficiency could be quite low, as the stream instructions are begging
to be used as a powerful, general-purpose vector facility that has the
potential to sweep over large portions of memory (caches don't help
too much if you are accessing a sizable fraction of the address space
periodically, as seen in scientific code).

The other dark area is exceptions/interrupts/page faults/context swaps:
there is a lot of state in WM and there is a fair amount of asynchronicity
to be sorted out. Wulf entertained the notion of an imprecise exception
(of IBM 360/91 fame) that I don't find too attractive. However, I'm not
convinced that this is the only answer to this problem.

-- 
   --  Andreas Nowatzyk  (DC5ZV)

   Carnegie-Mellon University	     Arpa-net:   agn@unh.cs.cmu.edu
   Computer Science Department         Usenet:   ...!seismo!unh.cs.cmu.edu!agn

ram%shukra@Sun.COM (Renu Raman (Sun Microsystems)) (04/25/88)

In article <1508@pt.cs.cmu.edu> agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:
>
>Of particular appeal to me was the introduction of fifo's into the
>load/store instructions (decoupling the time when an address is issued
>from the time when the data is accessed) as it has the *potential*  of
>allowing more latency in the memory system without degrading the throughput.
>
   This is not new.  Read about the ZS-1 in the previous ASPLOS conference
   proceedings.  A real machine exists with queues/FIFOs as the interface to
   the memory system.  For more details read "The ZS-1 Central Processor"
   by Smith et al., ASPLOS '87.

hankd@pur-ee.UUCP (Hank Dietz) (04/26/88)

In article <50669@sun.uucp>, ram%shukra@Sun.COM (Renu Raman (Sun Microsystems)) writes:
> In article <1508@pt.cs.cmu.edu> agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:
> >Of particular appeal to me was the introduction of fifo's into the
> >load/store instructions (decoupling the time when an address is issued
> >from the time when the data is accessed) as it has the *potential*  of
> >allowing more latency in the memory system without degrading the throughput.
>    This is not new.  Read about the ZS-1 in the previous ASPLOS conference

Actually, it has been common for a while now.  I don't remember the model,
but I know at least one of CSPI's array processors used FIFO interfaces not
only to decouple memory references, but also to decouple interactions
between control, address generation, and arithmetic hardware.  You are right
about the ZS-1... it looks very similar to WM.  Besides, for the past 6 or 7
years, at least a few hardware people I know have used "dataflow
microarchitecture" (i.e., FIFO-interconnected functional units) in building
conventional-looking special-purpose machines.

There is a catch, however, in that FIFOs to shared resources start showing
the usual dataflow problems: operand-to-operation matching and high bus
bandwidth requirements.  These are non-trivial problems to solve dynamically.
WM solves them by forcing operations of each type to be executed in the
original sequence, although different types of operations can execute in a
variable order relative to each other.  This is quite a strong restriction.
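
A minimal sketch of why that restriction buys you anything (my own
illustration, not from the WM paper): if each unit's results come back
through its own FIFO in issue order, the consumer simply pops the next
entry, and no result tags or associative matching are needed.

    #include <stdio.h>

    #define QSIZE 16

    /* One result FIFO per functional unit; entries leave in issue order. */
    struct fifo { double buf[QSIZE]; int head, tail; };

    static void push(struct fifo *q, double v) { q->buf[q->tail++ % QSIZE] = v; }
    static double pop(struct fifo *q)          { return q->buf[q->head++ % QSIZE]; }

    int main(void)
    {
        struct fifo int_results = {0}, fp_results = {0};

        /* Order *across* units is arbitrary, but each unit produces its
           results in the original program order. */
        push(&fp_results, 1.5 * 2.0);    /* fp op 1  */
        push(&int_results, 3 + 4);       /* int op 1 */
        push(&fp_results, 10.0 - 0.5);   /* fp op 2  */

        /* The consumer knows statically which queue the next operand comes
           from, so it just pops -- no tag matching against waiting ops. */
        double fp1 = pop(&fp_results);
        double i1  = pop(&int_results);
        double fp2 = pop(&fp_results);
        printf("fp1=%g  int1=%g  fp2=%g\n", fp1, i1, fp2);
        return 0;
    }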

Personally, I'd rather see these problems solved by static scheduling at
compile time: a la VLIW (but not a VLIW machine).  Burton Smith has a design
which uses static information to control dynamic out-of-sequence evaluation
(he's been telling me about this for some years now, but he's building it,
not publishing on it), and I've been involved in a couple of designs with
similar properties (e.g., SBMs -- Static Barrier MIMDs -- papers/references
available upon request).  Statically scheduled, but not necessarily fixed
sequence, machines seem to have the benefit of very simple hardware
supporting a very general execution mechanism, and the compiler technology
to get good results is easy enough (for us compiler gurus :-).

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   hankd@ee.ecn.purdue.edu

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (04/26/88)

In article <7992@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Actually, it has been common for a while now.  I don't remember the model,
>but I know at least one of CSPI's array processors used FIFO interfaces not
>only to decouple memory references, but also to decouple interactions
>between control, address generation, and arithmetic hardware.

Yes, CSPI had such a machine in the mid-'70s. They published about it in
IEEE Computer (I think).  The address unit computed addresses and put them
into an address FIFO (FIFOs?). The data unit could load and store from a
pair of data FIFOs.

However, the CSPI machine was different from WM in that the two decoupled
units executed from two different instruction streams. Data-dependent
branching was not an option. This extreme decoupling was limited by the FIFO
depth (3 - to avoid fallthrough time). I believe that CSPI simulated the
architecture first, and was going for a speedup of about 2x (on FFTish
problems).



-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

jjb@sequent.UUCP (Jeff Berkowitz) (04/27/88)

In article <1508@pt.cs.cmu.edu>, agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) writes:

> Of particular appeal to me was the introduction of fifo's into the
> load/store instructions (decoupling the time when an address is issued
> from the time when the data is accessed) as it has the *potential*  of
> allowing more latency in the memory system without degrading the throughput.

Culler Scientific implemented this feature.  It is visible at the instruction
set level, so the compiler could emit code like
		...
		load	(r2)
		<arbitrary instructions during latency>
		add	fifo, r0, r1
		...
Note that this is pseudocode, not Culler assembler.  The depth of the FIFO
is four.  It is possible to issue three or four reads in succession, then
begin using the data as it arrives from memory.  The performance gain is
substantial; in practice, the compiler is very often able to schedule
reads in advance and make use of the latency.  Making this design
interruptible was complex.  Culler implemented hardware interlocks to stall
the processor on an attempt to read an empty FIFO.  The machine included
a deadlock detector in hardware which would generate a trap in case of an
infinite stall.  (I'm not advocating anything, just reporting on a real
implementation.)

I thought this tidbit would be interesting despite the machine's well-known
commercial failure.  Late in our design, we also became aware of a group
- maybe they were at U. of Wisconsin? - which had independently designed
a machine called PIPE.  That machine made heavy use of queues, and the
architects had done some worthwhile simulations.  Sorry I can't be more
specific; my article collection has not yet followed me to Oregon.

-- 

Jeff Berkowitz                        ...!tektronix!sequent!jjb
Sequent Computer Systems              Beaverton OR