[comp.arch] The WM Machine

firth@sei.cmu.edu (Robert Firth) (05/05/88)

This post is a fairly long review of the WM machine recently
described in Computer Architecture News.

It is about 280 lines long, and follows this form feed.  It
assumes (a) you have read the paper, and (b) you care tuppence
about my thoughts.

The opinions, misunderstandings, errors &c are mine alone.

	The WM Computer Architecture
	============================

	[Wm A Wulf, Computer Architecture News, 16 (1988) pp 70..84]

	Comments by Robert Firth, CMU-SEI, May 1988


Introduction
------------

The WM computer architecture, described in the cited paper, is a sketch
of part of the order code of a new computer processor.  This processor
is designed, in its author's words, to retain the RISC philosophy while
achieving significantly greater performance than other RISC designs.

This Note comments on the major features of the architecture as described,
and reflects on their possible effectiveness.


Major Features
--------------

The major features of the WM architecture are

(a) a load/store design

(b) loads and stores use implied FIFO operand and result queues

(c) operations use registers (or FIFO) for operands and results

(d) each instruction performs TWO operations on THREE operands,
    after the pattern

	result := rand1 rator1 (rand2 rator2 rand3)

(e) relational operators return their right operand as result and
    load a condition into a FIFO; conditional jumps read the condition FIFO

There are also facilities for a floating point coprocessor, a "stream
mode" load or store, and multiple load FIFOs.

As shown, the design allows for 32 named integer registers (of which one
or two may in fact be FIFOs), 32 floating registers (similarly), 16
general integer operations, 16 general floating operations, and 32 "special"
operations.  The paper does not describe any of these operations, but by
implication they include arithmetic, logical, and relational.
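
For concreteness (the paper gives no mnemonics, so the register names
below are invented), a source statement such as a := b + c*d would, by
pattern (d), compile into the single WM instruction

	ra := rb + (rc * rd)

where a conventional RISC would need two:

	rt := rc * rd
	ra := rb + rt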


General Observations
--------------------

The first observation is that the paper presents only a sketch.  There is
no detail of any of the operations, no detail on the special instructions
other than a brief account of the "stream" loads, and no description of
even so important an operation as the subroutine call.

Moreover, while the claims for the machine are made in terms of execution
speed, data access speed, and so on, the presentation is purely at the
level of the instruction set architecture; there are no timing diagrams,
data flow diagrams, or other lower-level detail.

This incompleteness makes it hard to assess the design; nevertheless, the
following sections offer some critical commentary, on specific and more
general features.

The next observation is clear from Figure 1 of the reference: the instruction
set space is almost saturated.  As Bell and Newell remark, this is a mistake
in the design of a new machine, since it leaves no room for extensions or
afterthoughts.  In detail: the instruction class is indicated by the
leading four bits, and all combinations are used.  In all classes except
2#1110# (special), the remaining bit patterns are almost fully allocated.
The special class admits 32 operations; the text implies that a large
number are already allocated.


Load/Store
----------

A load/store design serves two purposes

(a) it simplifies the instruction set by allowing general addresses to
    appear in only a few places. In particular, it usually allows the
    instruction encoding to use a single uniform size.

(b) it improves performance by uncoupling the fetch of a value from its
    use, thereby allowing the fetch delay to be overlapped with useful
    work.

Previous RISC experience shows that these purposes are well served by this
design, and there seems little reason to doubt the WM design will work as
well.
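
To recall the effect of purpose (b), with invented mnemonics: the
compiler schedules the load early and covers the memory latency with
unrelated work:

	LOAD r1, x		; start the fetch of x
	r3 := r4 + (r5 * r6)	; unrelated work overlaps the delay
	r7 := r1 + (r8 * r9)	; first use of x, after the latency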


FIFOs
-----

The use of FIFO queues, however, is more problematical.  There is some
gain in instruction density, since the destination of a load, or the
source of a store, need not be specified.  But this is a very small gain,
since most of a load or store instruction is taken up with the memory
address.

I do not believe there is any performance gain.  If loads or stores can
be overlapped, it is relatively easy for a compiler to target successive
loads into different registers.  With 32 general registers available,
there will surely be enough for any reasonable sequence of loads or
stores.  The one exception - of which there is an example in the paper -
is when the loads are in a tight loop and the load delay runs round the
loop.  But there the rate-determining step is the load, regardless of
how it is buffered, and loop unrolling can almost always avoid load
destination clashes.
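
To illustrate the unrolling point (mnemonics again invented): unrolled
by two, successive loads are targeted at different registers, so each
load's delay is covered by work on the other element:

	L:	LOAD r1, a(i)		; this element
		LOAD r2, a(i+1)		; next element, different register
		r3 := r3 + (r1 * r5)	; use r1 while a(i+1) is in flight
		r4 := r4 + (r2 * r5)
		i := i + 2
		branch L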

Moreover, loads into explicit registers offer several advantages.  First,
the operands need not be used in the order in which they are loaded,
which is often helpful, especially if operations have to be reordered to
take advantage of parallel ALUs.  Secondly, and far more important, the
use of an operand does not destroy it.  It is significant that none of
the examples given in the paper ever uses the same operand twice; such
an example would show clearly how a load FIFO forces an extra instruction
to be generated.
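
Such an example is easily constructed (mnemonics invented): to compute
x*x, a register-targeted load loads once and uses twice, while a load
FIFO, whose head is consumed on use, needs an extra copy or a second
load:

	LOAD r1, x		; with a register:
	r2 := r1 * r1		;   the operand is reused freely

	LOAD x			; with a FIFO:
	r1 := fifo		;   extra copy to save the head
	r2 := r1 * r1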

Finally, FIFOs cause severe implementation problems, one of which at least
- efficient saving and restoring of FIFO contents on a task context switch -
has never to my knowledge been satisfactorily solved.  Other serious
problems include resetting the FIFO after an interrupt or exception, and
resuming an orderly flow of loads and stores after a memory-management trap.

In sum, I believe the use of FIFOs offers no advantages over loads and
stores into and out of the general registers, and incurs several
disadvantages.


Operations use Registers
------------------------

This is another piece of the normal RISC approach, and is again probably
a good feature.  However, even a RISC machine must provide operations
directly on memory, for processor synchronization; the absence of any
such instructions in WM might be thought a major oversight.  But it is
clear from the paper as a whole that the author is really describing
only that part of the machine that a compiler writer would see.


Two Operations per Instruction
------------------------------

This feature seems to me quite unwarranted.  One major point of a RISC
design is that instructions are simple enough to be executed quickly.
But here is a machine where, within one instruction, there are two
operations, the one gated on the result of the other.  This forces the
instruction to take twice as long, regardless of how many execution
units might be available.

A design in which twice as many instructions are executed, but successive
instructions are independent, clearly cannot perform worse, and will perform
better if several execution units can be run in parallel.  One main lesson
of prior designs - achieve higher parallelism by uncoupling mutually dependent
operations - has simply been ignored.
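
The contrast, using the paper's own instruction pattern with invented
register names:

	r1 := r2 + (r3 * r4)	; one WM instruction: the add is
				;   gated on the multiply

	r5 := r3 * r4		; two independent instructions:
	r6 := r7 + r8		;   two ALUs can retire both at once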

The WM sketch intends the first half of each instruction to be overlapped with
the second half of the preceding instruction, and there is a dependency
rule about that.  But the same apparatus is all that is needed to execute
independent instructions in parallel, so the implementation complexity of
WM is no less, but its performance is impeded by the extra dependency
within each instruction.


Relational Operations and Conditions
------------------------------------

It is a matter of debate whether condition codes are justified, rather
than having the comparison yield a Boolean result or even drive the
conditional branch directly.  Proponents of RISC would probably tend
to regard condition codes as a needless complication.  If that is so,
then, a fortiori, a condition-code FIFO is a multiply needless complication.

In many years of practical programming, I can recall only a couple of
cases where it would have been useful to enqueue two sets of condition
codes for successive testing.  Accordingly, I must pronounce this feature
useless.  Moreover, there are several RISC machines where a comparison,
test, and jump decision can be done in one cycle.  Therefore, no purpose
is served by separating the comparison from the jump decision, since
there is no latency to be overlapped.  And the branch-delay optimisation
issue is equally present, under a different guise, in the WM machine, as
its author admits.
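
The contrast, with invented mnemonics (the one-cycle form is in the
style of several current RISCs):

	BEQ r1, r2, label		; compare and branch in one cycle

	r2 := r1 = (r2 noop noreg)	; WM: relational op queues a condition
	JT  label			; conditional jump dequeues it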

The presence of yet another FIFO adds yet more complexity to the handling
of traps, interrupts, or context switches.


Address Modes
-------------

As a final point of detail, let us look not at a feature, but at a
non-feature.  The WM machine has no address modes.  Instead, the load and
store instructions perform an integer computation, again of the form

	rand1 rator1 (rand2 rator2 rand3)

and the result becomes the effective address for the memory reference.

Consider first, how one might access a simple static variable, at a given
(32-bit absolute) address.  Clearly, one cannot say the equivalent of

	Load @16#abcdefgh#

since the machine does not permit an absolute address as an operand.

Another way would be to store the address in a literal pool, and access
the operand indirectly, by a PC-relative or register-based mode.  This
is the strategy on the CA LSI-2, for instance.  But WM has no indirection,
and no PC-relative mode.

Alternatively, one could load the address into a register, using a load
literal instruction (as is done on the M/500).  But there is no such
instruction.

As far as can be discovered, an address can be composed only of one,
two, or three registers, and one or two 5-bit literals.  This means it
is almost impossible, in one instruction, to construct the address of
anything efficiently.  To generate a 32-bit static address, by my
count, takes three instructions.  To generate a 16-bit unsigned
displacement takes two instructions, and a third is then required to
access the based variable so addressed.

As another example, consider accessing a local variable.  If it is
within 32 bytes of the frame pointer, we are lucky:

	(fp) + (displacement noop noreg)

will serve.  If it is, say, at local address 40, then we might say

	(fp) + (8 + (rx))

where rx holds the useful constant 32, and at the price of an extra
addition (and surely an extra cycle) we can now encode a six-bit
displacement.  (Actually, it would be marginally better to reverse the
registers).  But how are we to load 32 into rx?  Perhaps by

	rx := 16 + (16 noop noreg)

To summarise: the addressing capability of the machine is desperately
inadequate.  It might, as its author says, allow "scaled-index" modes
to be generated, but one pays a high price for this fairly negligible
capability. It would have been far better to use the bits of this
instruction mode to encode, for instance, base register and signed
16-bit displacement.  In fact, they could have encoded two base registers
and such a displacement.


Conclusion
----------

The WM machine contains some standard RISC features and some novel features.

Almost all the novel features are both complicated and bad; the claims that
they will enhance performance are, in this author's opinion, unfounded:

. FIFO load and store queues give inferior performance, at far higher
  cost, than loads and stores that use general registers

. executing two coupled operations per instruction creates needless
  dependency, and so reduces performance

. a FIFO condition code queue serves no purpose and solves no problem

. the machine is crippled by inadequate addressing capability

Finally, any implementation of this design will be faced with serious
problems in building correct synchronous traps, efficient task context
switches, and transparent memory management.



--------

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/05/88)

I thought I would add a little technical support to a couple of the points.

In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In sum, I believe the use of FIFOs offers no advantages over loads and
>stores into and out of the general registers, and incurs several
>disadvantages.
Also, the FIFO that the results of load requests return to will in
general have to sort the arriving data so that any scrambling of
arrival order, which is very frequent in a multiprocessor, is fixed up.  The
internal structure used to do this looketh very much like registers,
so why not just stay with registers!
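
To illustrate:

	issue order:	load a; load b; load c
	arrival order:	b, c, a

b and c must be buffered, addressable by tag, until a arrives and is
delivered first -- and that buffer is a small register file with
reorder logic in front of it.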
>
>. the machine is crippled by inadequate addressing capability
No scatter/gather capability for vectors, something that has become
very popular on supercomputers, with good reason.

baum@apple.UUCP (Allen J. Baum) (05/05/88)

--------
[]
In article <6762 (Eugene D. Brooks III) writes:
>I thought I would add a little technical support to a couple of the points.

Me, too.

It seems to me that it is entirely likely that a stream operation will get
interrupted, by a page fault if nothing else. Saving and restoring is a
pain, and unlikely to be done without some very special instructions, or
a bunch of 'CISC'-like "save the state of the machine on the ?stack?"
operations. Furthermore, if I have a two-operand stream operation (both
FIFOs busy), then I'm locked out of memory until the streams are finished!

Overall, I see very little that a VLIW machine couldn't do, including the
stream operations, which would probably be trivial in a VLIW. The WM has,
as far as I can see, only the virtue of being compact, and it isn't very
clear that all the extra instructions you need to solve some of its problems
won't expand the instruction stream to make that marginal as well.

Since a lot of the criticisms are fairly obvious, it could be that we just
don't know all the details. Wulf has been around a while, and must have
considered some of these. Maybe we'll hear from him, and find out what the
real story is.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

jangr@microsoft.UUCP (Jan Gray) (05/06/88)

In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>
>Address Modes
>-------------
>
For the WM computation O1 op1 (O2 op2 O3), assume that we have:

    two inner operators:
	1lit10 ::= ((O2 << 5) | O3 | (1 << 10))
	0lit10 ::= ((O2 << 5) | O3)

    two outer operators:
	lli ::= Rd[15-0] <- (O1 << 11) | (O2 op2 O3)	("load lower immediate")
	lui ::= Rd[31-16] <- (O1 << 11) | (O2 op2 O3)	("load upper immediate")

(I know that instructions are actually of the form
	Rd := O1 op1 (O2 op2 O3)
so that "lui" isn't *strictly* legal given the description he presents, but
also assume Wulf isn't telling us all the little details...a safe assumption!)

>Consider first, how one might access a simple static variable, at a given
>(32-bit absolute) address.  Clearly, one cannot say the equivalent of
>
>	Load @16#abcdefgh#
>
>since the machine does not permit an absolute address as an operand.

If you allow my assumption above, you can load a 32-bit constant into a
register using lli and lui.  This still takes two cycles but saves you
a read from the literal pool.  You still have to issue a load on this
address...
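
Spelled out, with hi5/mid5/lo5 standing for the 5-bit pieces of each
halfword (the choice of 1lit10 vs 0lit10 supplies the halfword's bit 10):

	r1 := hi5 lui (mid5 1lit10 lo5)	; Rd[31-16] <- upper halfword
	r1 := hi5 lli (mid5 0lit10 lo5)	; Rd[15-0]  <- lower halfword
	Load (r1)			; then the load itself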

>Alternatively, one could load the address into a register, using a load
>literal instruction (as is done on the M/500).  But there is no such
>instruction.

No instruction as such, but it can be fabricated from the appropriate
operators...

>As another example, consider accessing a local variable.  If it is
>within 32 bytes of the frame pointer, we are lucky:
>
>	(fp) + (displacement noop noreg)
>
>will serve.  If it is, say, at local address 40, then we might say
>
>	(fp) + (8 + (rx))
>
>where rx holds the useful constant 32, and at the price of an extra
>addition (and surely an extra cycle) we can now encode a six-bit
>displacement.  (Actually, it would be marginally better to reverse the
>registers).  But how are we to load 32 into rx?  Perhaps by

With the 1lit10 and 0lit10 operations above, you can access 2K bytes of the
frame:
	(fp) + (#10101 0lit10 #010101)


>Finally, any implementation of this design will be faced with serious
>problems in building correct synchronous traps, efficient task context
>switches, and transparent memory management.

That's for sure!  According to the paper, Wulf has been simulating this design
on "a fairly large set of benchmark fragments".  He says this architecture can
do about 4 RISC-like instructions per cycle.  I believe him.  What I'm less
sure of is whether there is a memory architecture that will keep this machine
running.  Hopefully someone will implement this architecture (or simulate
a real-world memory architecture for it) and we'll all know for sure.


These are my opinions, but not necessarily my employer's.

Jan Gray   uunet!microsoft!jangr   Microsoft Corp., Redmond Wash.   206-882-8080

------------------------------
"Application for patent protection has been made for portions of the material
described here".  Right.

bcase@Apple.COM (Brian Case) (05/06/88)

In article <5339@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>Two Operations per Instruction
>------------------------------
>This feature seems to me quite unwarranted.  One major point of a RISC
>design is that instructions are simple enough to be executed quickly.
>But here is a machine where, within one instruction, there are two
>operations, the one gated on the result of the other.  This forces the
>instruction to take twice as long, regardless of how many execution
>units might be available.

No, no, no.  The data dependency rule is included specifically to allow
a pipeline stage between the two ALUs.  Wulf is clearly assuming the 
pipestage is implemented (at least it is clear to me).  The basic cycle
of the WM machine should be no longer than that of any other RISC machine.

>A design in which twice as many instructions are executed, but successive
>instructions are independent, clearly cannot perform worse, and will perform
>better if several execution units can be run in parallel.  One main lesson
>of prior designs - achieve higher parallelism by uncoupling mutually dependent
>operations - has simply been ignored.

I don't think Wulf is ignoring anything.  The complexity of an n-instruction-
at-a-time machine (n > 1) is much greater than that of a WM machine.  I don't know
how much greater, but it is greater.  Maybe significantly.

>The WM sketch intends the first half of each instruction to be overlapped with
>the second half of the preceding instruction, and there is a dependency
>rule about that.  But the same apparatus is all that is needed to execute
>independent instructions in parallel, so the implementation complexity of
>WM is no less, but its performance is impeded by the extra dependency
>within each instruction.

The implementation complexity of WM is A LOT LESS than a machine that tries
to execute two instructions at the same time (two less register file ports,
no "scoreboarding" needed, etc.).  At least *I* would much rather implement
a WM than a two-instruction-at-a-time conventional RISC.  However, the
performance of a two-instruction-at-a-time conventional RISC might be greater,
and is generalizable (but then, just implement a two-instruction-at-a-time
version of the WM...).

>Moreover, there are several RISC machines where a comparison,
>test, and jump decision can be done in one cycle.  Therefore, no purpose
>is served by separating the comparison from the jump decision, since
>there is no latency to be overlapped.  And the branch-delay optimisation
>issue is equally present, under a different guise, in the WM machine, as
>its author admits.

Those RISC machines that implement compare/branch in one instruction are
fudging a bit (unless the compare is very simple, like == 0, == reg, != 0,
!= reg).  The purpose served by separating the compare from the branch is
quite important!  This is called, by some, branch spreading, and it allows
early evaluation of the branch so that latency is reduced to zero.  This
is my one big complaint with Wulf's paper:  he didn't reference previous
branch spreading work done on CRISP and RIDGE machines.

wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf) (05/07/88)

[As I was editing my last message, our VAX went down -- somehow the
air conditioning had gotten turned off, and the VAX was at risk of a 
melt-down. Anyway, I was editing the middle of the message, and had
just said that I didn't have space to describe the OS support
features. What I would have said next was ...]

... I am putting the finishing touches on the arch. definition, and
hope to make it available as a UVa report in the next few weeks. I'll
be happy to mail it to folks who send me a request -- it will tell you
a lot more than you probably want to know. Briefly, however, there are
really two aspects that I might mention here.

First, note the class of "special" instructions isn't described in
the paper. Basically the instructions in this class (which includes the
OS support stuff) take multiple cycles and are sorta CISC-ish (I HATE
these labels; people get all hung up on the labels at the expense of
the substance). They are segregated this way so that the normal decode-ex
path is not delayed by them. They, in turn, translate into a sequence
of the simpler (RISC-ish) instructions. They take as long as they take!
They could not be implemented faster another way, they do not slow down
the other, more common, instructions, and they can safely manipulate
state (in prescribed, safe ways) that is not normally accessible
in another way (eg, the FIFO state).

Second, I wanted to provide support for OS functions that
provides an overall speedup without impacting the cycle time -- I didn't
want to build a 100MIPS processor, and then throw away 30-50% of it
on OS related stuff. The best way I know how to do that is to make
more OS-ish operations directly, and SAFELY, available from normal
user code (as opposed to "kernel mode"), thus avoiding the considerable
overhead of kernel entry/exit. I think we have found a simple and elegant
way to do that -- one that, in fact, eliminates kernel mode altogether.

Basically, the scheme involves typing pages, interpreting the rights
bits differently for each type, and restricting certain operations to
operate on certain types of pages. For example, the TCB (Task Control
Block) is a hardware-understood page type. There is a Context-Swap
instruction that must name a TCB page as its operand -- and, to be
executed, must have the rights bits set appropriately. The net effect
is that it's possible (NOT required) to allow an ordinary user program
to act as a scheduler for a set of tasks (eg, it might be an Ada
runtime system), even though that program has no other privileges wrt
the tasks it schedules (eg, it can't access the code/data of the
tasks, modify their TCBs, etc.)
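
In pseudo-code, the effect of Context-Swap is roughly this (much
simplified, and the exact state saved isn't spelled out here):

	Context-Swap(p):
		if page-type(p) /= TCB		then trap
		if rights(p) lack swap-right	then trap
		save processor state (registers, FIFOs) into the current TCB
		load processor state from p, and resume that task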

Bill

wulf@uvacs.CS.VIRGINIA.EDU (Bill Wulf) (05/07/88)

[..... Aaaaarrrrrgggghhhhh. The perversity of inanimate objects!
This is the message that should have gotten sent BEFORE the last one.
Grumble, mutter, sputter, fume .... ]

I am new to this ... I only began reading this newsgroup a few
days ago when one of the folks here told me that some articles 
had appeared on WM. Alas, I haven't seen them all (like Firth's),
so forgive me for jumping into the middle of this without full
information, and forgive any misunderstandings I may have as a
consequence.

Re Baum's comments about streaming state and context swaps:  There
are indeed a number of instructions in the full WM defn for saving/
restoring state, including that of the FIFOs -- though perhaps not
so many as he fears. As should be obvious, I made a conscious decision
to increase the amount of state (witness 64 registers, 32 integer and 32
floating point) to improve performance at the expense of context-swapping.
I am not especially concerned about the number of instructions this
involves (it's only a handful in any case), but the size of the state
is a legit concern.

For what it's worth, the rationale was as follows:  Processors are getting
ever faster. Interrupt rates have gotten higher too, but not as much,
since they are governed more by the external world -- and people don't
type faster, disks don't rotate especially faster, etc. I think this trend
will continue, so in planning processors for the future, it's reasonable
to consider increasing the amount of state if that speeds up the processor.
One can't be silly about it, like blowing up the state by orders of magnitude,
since interrupt rates and the required response time to some of the interrupts
are increasing, but a modest increase is a reasonable thing to consider. I did,
and opted for (some) more state.

I don't want to try to describe all the architectural features of WM related
to OS issues (of which state save/restore is just one) here -- it would just
take too much space. <<insert previous article here>>

Re Case's reply to Firth (which, again, I haven't seen): Right on.

The two ops/instruction add nothing to the cycle time. But note that they DO
add to the latency. Considering the simple 3-stage pipe model of RISCs (decode,
op, write-back), WM has 4 stages (decode, op, op, write-back) -- so a result
is available 4/3 as late. Whether this wins or loses depends on whether
you've done one or two operations in the instruction. If one, and you couldn't
dispatch an unrelated operation on the next cycle, you lose by 30%. If
you use two, you win by 50% (4/3 rather than 6/3). The data says you
win!
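
Spelling out the arithmetic:

	one op used:	latency 4 stages vs 3		->  4/3, ~30% loss
	two ops used:	latency 4 stages vs 3 + 3	->  4/6, a 50% win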

As I said in the CAN article, I was very surprised by the frequency of
use of 2 ops/cycle. When I first started this, I was really only exploring
a number of ways to make all instructions exactly 32 bits. The 2ops/instr
was just one idea, and I expected it to be a loser -- so I built a quick
and dirty compiler that generated this format and was blown away by the
results. (Aside -- After getting the preliminary results, I tried 3, 4,...,
ops/instr (ignoring the fact that I couldn't encode them in 32 bits) and
found that 2 is magic; you get very little benefit from more than 2).

A further (perhaps obvious) addition to Case's reply on add-comp-br vs. the
way WM does things: since the IFU is decoupled from the arithmetic units, and
hence the cond branches can be executed early (assuming the CC is available),
you can start filling the instruction buffer(s) early too, thus eliminating
instruction-fetch latency. You can statistically get a similar effect with
branch-prediction, but the WM scheme works independent of the way the branch
is taken -- so, specifically, it works for the if-then-else case that is the
bug-a-boo of branch-prediction schemes.

General caveat -- at the time I wrote the WM article, I hadn't seen any designs
that handle conditionals like WM (CRISP gets some of the same effect, but with
an entirely different, and I think more expensive, mechanism). Since then, the
ZS-1 design was pointed out to me, which has a very similar mechanism, has
a FIFO interface to memory, and decouples the int & fp units in a similar way.
Some ideas just seem to have a time to be ripe.