[comp.arch] penalty for microcode

aglew@urbsdc.Urbana.Gould.COM (11/22/88)

>But what about doing microcode prefetch?  I haven't seen anyone account
>for that possibility in this particular question (RISC-vs.-CISC), though
>I seem to recall it's been done on some machines (S-1?).
>
>   Norm

Sounds like someone's got a research topic. Tell us when you publish.

dkirk@k.gp.cs.cmu.edu (Dave Kirk) (11/22/88)

Distribution: usa
References: <3290@ucdavis.ucdavis.edu> <7746@aw.sei.cmu.edu> <3634@pt.cs.cmu.edu> <sXW7FJy00VsEI1xUcS@andrew.cmu.edu>
Organization: Carnegie-Mellon University, CS/RI

In article <sXW7FJy00VsEI1xUcS@andrew.cmu.edu> bader+@andrew.cmu.edu (Miles Bader) writes:
>> >
>> >Now pass that microcode through a peephole optimiser, and trim it down.
>> >If one operand is already in a microcode register, don't move it there;
>>                                                     ^^^^^^^^^^^^^^^^^^^
>> Any reasonable microcoder would have performed this optimization in his
>> original code.  If you are referring to your "new" expanded code, then
>> please don't include it in your reduction costs over the original code.
>But the two cases aren't the same; conventional microcode can't be optimized
>with knowledge of the microcode due to previous (macro) instructions (because
>they aren't known until run-time).  If you were fetching inlined microcode 
>from ram, you could do this, and it probably would be an improvement 
>(how much?  I dono...) over cisc microcode in rom.
>
Let me explain my comment, and our microarchitecture, a little better.

Microcode registers look no different to the microcoder than
macrocode registers.  If there are 16 macro registers and 256
micro registers, the only difference in accessing the two is the actual
address.  Hence, the only time we ever "move anything to a register"
is when the instruction says go to memory and get this operand.  There is
never a case when you move an operand from a macro register to a micro 
register, and THEN use it.  As a result, advance knowledge that a value
may have been in a micro register during the previous macro instruction
does not save any time or code.  
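
To make the addressing point concrete, here is a rough C sketch of the
scheme (purely an illustration, not our actual datapath; the uop_add
helper and the register-file layout are invented).  Macro and micro
registers sit in one flat file, and a micro-op names either kind by
address alone:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical flat register file: the 16 "macro" (architectural)
 * registers at addresses 0-15, and 256 "micro" scratch registers
 * above them.  The microcoder addresses both the same way. */
#define NUM_MACRO  16
#define NUM_MICRO  256
static uint32_t regfile[NUM_MACRO + NUM_MICRO];

#define MACRO(n)  (n)                /* macro register n, 0..15  */
#define MICRO(n)  (NUM_MACRO + (n))  /* micro scratch register n */

/* A micro-op names source and destination by address; whether an
 * operand lives in a macro or a micro register changes nothing but
 * that address, so there is never a "move it to a micro register
 * first" step. */
static void uop_add(int dst, int srca, int srcb)
{
    regfile[dst] = regfile[srca] + regfile[srcb];
}

int main(void)
{
    regfile[MACRO(3)] = 40;   /* operand already in a macro register     */
    regfile[MICRO(7)] = 2;    /* scratch value left by earlier microcode */

    uop_add(MICRO(0), MACRO(3), MICRO(7));   /* mixes the two freely */
    printf("result = %u\n", (unsigned)regfile[MICRO(0)]);
    return 0;
}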

-Dave
-- 

aglew@urbsdc.Urbana.Gould.COM (11/24/88)

>There's a definite appeal to having one-cycle instructions, but i think it's
>mostly illusory.  If an in-place complement takes less time than a three-operand
>add-with-shift, they shouldn't be forced to take the same amount of time.  In
>other words, if most of your instructions take one cycle, your cycles are too
>long.
>
>So what do y'all think about this?  Are one-cycle instructions a good thing?
>
>--Joe

Cycles are a bad thing! The universe is not discrete.
All instructions should be self-timed, to precisely the length of
time required to do the operation.

:-)

pausv@smidefix.liu.se (Paul Svensson) (11/25/88)

In article <28200241@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>Cycles are a bad thing! The universe is not discrete.
>All instructions should be self-timed, to precisely the length of
>time required to do the operation.
>
>:-)


Hear, hear!

I couldn't resist following up on this one, since we actually have such a
beast down in the basement.  The FCPU (Flexible CPU), built by DataSAAB in the
early seventies, is completely asynchronous.  The control unit delivers
instructions to various computation modules, including main memory, each of
which then runs until ready.  Communication between modules is through
"validated registers" (queues of length one), because the control unit does
not await completion of an instruction before starting the next one.
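
For the curious: a "validated register" is in effect a one-slot queue with
a full/empty flag, so a producer can fire and forget and a consumer waits
until the value shows up.  A single-threaded C sketch of the idea (my own
illustration with made-up names, not the FCPU hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One-deep "validated register": a value plus a valid flag.  The
 * producing module may only write when the slot is empty; the
 * consuming module may only read when it is full.  In the real
 * machine the handshake is asynchronous hardware; here it is just
 * a pair of checked flags. */
struct vreg {
    uint32_t value;
    bool     valid;
};

static bool vreg_put(struct vreg *r, uint32_t v)
{
    if (r->valid)             /* slot still occupied: producer must wait */
        return false;
    r->value = v;
    r->valid = true;
    return true;
}

static bool vreg_get(struct vreg *r, uint32_t *out)
{
    if (!r->valid)            /* nothing there yet: consumer must wait */
        return false;
    *out = r->value;
    r->valid = false;
    return true;
}

int main(void)
{
    struct vreg r = { 0, false };
    uint32_t v;

    vreg_put(&r, 42);         /* control unit hands off a result and moves on */
    if (vreg_get(&r, &v))     /* the next module picks it up when it is ready */
        printf("consumed %u\n", (unsigned)v);
    return 0;
}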

It's a truly amazing machine.  At the moment we're running a FORTH in the
control unit only, with the rest of the machine, including main memory,
powered off.  But just wait 'til next week, when we've had cooling installed!
				:-)

DataSAAB couldn't sell more than about half a dozen of them, I guess partly
because they never used it to its full capacity.  They only used it to emulate
their previous design, a conventional mid-sixties mainframe. :-(
---
		Paul Svensson		psv@ida.liu.se

colwell@mfci.UUCP (Robert Colwell) (11/29/88)

In article <28200241@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>>There's a definite appeal to having one-cycle instructions, but i think it's
>>mostly illusory.  If an in-place complement takes less time than a three-operand
>>add-with-shift, they shouldn't be forced to take the same amount of time.  In
>>other words, if most of your instructions take one cycle, your cycles are too
>>long.
>>So what do y'all think about this?  Are one-cycle instructions a good thing?
>
>Cycles are a bad thing! The universe is not discrete.
>All instructions should be self-timed, to precisely the length of
>time required to do the operation.
>
>:-)

I left in your smiley, Andy, but it's not all that nutty.  There
is a certain appeal to making a machine where the clock cycle doesn't
have to be "one-size-fits-all".  Apart from the added complexity,
the problem is that when you're juggling multiple CPU pipes, you
either have to control each pipe individually in this fashion, or
control them all together (allowing enough time in the current clock
cycle for the slowest pending pipe stage to complete).  That's a
mess, and I doubt it would ever pay.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

rpw3@amdcad.AMD.COM (Rob Warnock) (12/02/88)

+---------------
| Cycles are a bad thing! The universe is not discrete.
| All instructions should be self-timed, to precisely the length of
| time required to do the operation.
+---------------

Uh, ever look at a DEC PDP-10? (The original KA-10 CPU, circa 1967, based
on the earlier DEC PDP-6 [1965?].)

The internal implementation was exactly what you seem to be asking
for: Each "time-state" was a pulse regenerator which strobed the results
of the previous time-state to its target register, conditionally set
or cleared various bits to choose what to do next (generally enabling
operands onto the input busses of the ALU, or enabling result registers
to accept the output of the ALU), and conditionally fed its output pulse
into one of several delay lines. The specific delay line was chosen so
that when the pulse came out the other end (and got regenerated as the
next "time-state" pulse), the operation was done.

There were hardware "subroutines", for example, "memory read cycle".
For every potential "caller", there was one flip-flop. The caller pulse
set that "return" flop, and also went into an inclusive-OR gate with
all of the other "callers" pulses. At the bottom of the "subroutine",
the last time pulse was fanned out to a bunch of AND gates, one per caller,
whose other input was the "return-PC" ;-}   ...that is, the flip-flop that
had been set when the subroutine was called. The output wire of the selected
AND gate fed back to the continuation point of the caller.
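
In software terms, the "return PC" is a one-hot set of per-caller
flip-flops rather than a saved address.  A crude C model of the fan-in
and fan-out (my own sketch; the caller count and the memory_read_cycle
stub are invented):

#include <stdio.h>

/* Each potential caller owns one "return" flip-flop.  Calling sets
 * your flop and ORs your pulse into the subroutine's entry; the
 * subroutine's last time pulse fans out through per-caller AND
 * gates, and only the caller whose flop is set sees it. */
#define NUM_CALLERS 3

static int return_flop[NUM_CALLERS];    /* one flip-flop per caller */

static void memory_read_cycle(void)
{
    /* ...the shared "subroutine" time-states would run here... */
}

static void call_subroutine(int caller)
{
    return_flop[caller] = 1;            /* remember who called */
    memory_read_cycle();                /* pulse ORed into the entry */
}

static int subroutine_done_pulse(void)
{
    /* Final time pulse: route it back to whichever caller's flop is
     * set, clearing the flop on the way out. */
    for (int i = 0; i < NUM_CALLERS; i++) {
        if (return_flop[i]) {
            return_flop[i] = 0;
            return i;                   /* continuation point of caller i */
        }
    }
    return -1;                          /* no caller pending */
}

int main(void)
{
    call_subroutine(2);
    printf("pulse returned to caller %d\n", subroutine_done_pulse());
    return 0;
}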

There was no centralized "clock", nor were the delay lines bunched up in
some central place and shared. There was exactly one delay line for each
event in the CPU.  In other words, the "PC" of the microengine was expressed
by which delay line the pulse was hiding in at any given time. (Think of
the micro-PC as being in unary, rather than binary!) The "clock ticks" were
those instants when the pulse could be seen between delay lines, as it got
regenerated, when it could "do things", and get routed around before hopping
into another delay line.
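
If you like, here is a loose C simulation of that one-hot micro-PC (the
states and delay-line lengths are invented, purely to show the shape of
the thing):

#include <stdio.h>

/* Unary (one-hot) micro-PC: the "program counter" is simply which
 * delay line currently holds the pulse.  Each state does its work
 * when the pulse emerges, then hands the pulse to the next delay
 * line, whose length matches that operation's real duration. */
enum state { FETCH, DECODE, EXECUTE, DONE };

struct step {
    const char *name;
    double      delay_ns;    /* invented: length of that delay line */
    enum state  next;
};

static const struct step ucode[] = {
    [FETCH]   = { "fetch",   180.0, DECODE  },
    [DECODE]  = { "decode",   90.0, EXECUTE },
    [EXECUTE] = { "execute", 250.0, DONE    },
};

int main(void)
{
    enum state pc = FETCH;       /* the pulse starts in the fetch line */
    double t = 0.0;

    while (pc != DONE) {
        printf("t=%6.1f ns: pulse emerges, doing %s\n", t, ucode[pc].name);
        t += ucode[pc].delay_ns; /* pulse hides in the delay line... */
        pc = ucode[pc].next;     /* ...and pops out at the next state */
    }
    printf("t=%6.1f ns: instruction complete\n", t);
    return 0;
}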

In fact, the micro-PC could travel between cabinets. The memories, you see,
were in external boxes (a whole 16k words each!), and during a memory-cycle
subroutine the timing pulse travelled out to the selected memory module and
ran the timing routines of the memory itself out there, and then travelled
back to the CPU in the form of the "memory done" pulse, only to be routed to
whichever part of the CPU had called the memory-cycle subroutine.

It was simply *lovely* to 'scope! The flow-charts of the instruction
interpreter were virtually one-to-one with the delay lines and pulse
regenerators of the hardware. And since the micro-PC was unary, it was
simplicity itself to trigger an oscilloscope on any desired micro-step.

(Ah! Nostalgia...)

p.s.
A later version of this technique -- called "Chartware" by DEC, for the ease
with which you could go from the flowchart to the wiring diagram -- was used
in the PDP-14 industrial controller modules (a sort of Tinker Toy build-your-
own-CPU family -- there was never a general-purpose computer built out of it
that I know of). It did use a centralized clock, but still had a unary micro-PC,
and still left the timing of the operations to the various functional units.
It used a scheme similar to the HP-IB (a.k.a. IEEE-488) bus. There was a
common wired-OR "ready" line, and as the clock ticked the selected functional
units pulled down on "ready" (made it false) until their operation was complete.
The last one to let go allowed the clock to tick again, thus strobing the
results into the destination, and at the same time clocking the "PC" from
its previous location to the next. The "PC" in this case was flip-flops instead
of delay lines, and only one "control" flip-flop in the system should be set
at a time.  (A unary PC, again.) It might be better to say that the "PC" was
a huge shift register, with loops and branches.
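
A crude software analogue of that wired-OR handshake (unit latencies
invented; this only shows the "last one to let go" behavior):

#include <stdio.h>

/* Each selected functional unit "pulls down" the shared ready line
 * until its own operation finishes; the next clock tick happens only
 * when the slowest one releases it, which is when results are
 * strobed and the one-hot "PC" shifts to its next flip-flop. */
int main(void)
{
    double unit_busy_ns[] = { 120.0, 300.0, 75.0 };   /* invented */
    int n = sizeof unit_busy_ns / sizeof unit_busy_ns[0];

    double release = 0.0;
    for (int i = 0; i < n; i++)
        if (unit_busy_ns[i] > release)
            release = unit_busy_ns[i];  /* last unit to let go of "ready" */

    printf("clock ticks again after %.1f ns; results strobed, PC shifts\n",
           release);
    return 0;
}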

John Alderman (founder of Dig. Comm. Assoc.) and I developed a still simpler
version we called "synchronous chartware" (though it owed as much to the PDP-10
style as to the Chartware style), which was just a shift register (with loops
and branches) driven from a single system clock, wherein the operations were
timed by how many shift register stages (flip flops) lay between the one that
started the operation and the one that used the result. Still, operations could
be of different lengths, and even of variable lengths. (Long variable-length
delays were implemented with a "while loop" which waited for the completion
signal from the functional unit.) We found this design technique to be of
great utility for things like magtape and disk controllers. (Cheap fast ROMs
weren't yet available [circa 1971], nor was the now-common bit-slice microcode
controller, e.g. the Am2911.) The technique, though for most uses hopelessly
low-density by today's standards, still comes in handy for very-high-speed
state machines with a lot of multi-way transition edges.
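
Here is a skeleton of the "shift register with loops and branches" style,
as best I can render it in software (stage count, wait stage, and
completion tick are all invented):

#include <stdbool.h>
#include <stdio.h>

/* The controller is a chain of flip-flops driven by one clock, with
 * exactly one flop set at a time (a unary PC).  An operation's
 * latency is the number of stages between the flop that starts it
 * and the flop that uses the result; a "while loop" stage re-sets
 * itself until a completion signal arrives. */
#define STAGES     6
#define WAIT_STAGE 3               /* invented: stage that waits on a unit */

static bool unit_done(int tick)    /* stand-in for a real completion line */
{
    return tick >= 7;              /* invented: unit finishes at tick 7 */
}

int main(void)
{
    bool pc[STAGES] = { true };    /* one-hot: stage 0 holds the token */

    for (int tick = 0; ; tick++) {
        int cur = 0;
        while (cur < STAGES && !pc[cur])
            cur++;                 /* find the single set flip-flop */
        printf("tick %2d: stage %d active\n", tick, cur);
        if (cur == STAGES - 1)
            break;                 /* last stage: the result is in use */

        pc[cur] = false;
        if (cur == WAIT_STAGE && !unit_done(tick))
            pc[cur] = true;        /* loop in place until the unit is done */
        else
            pc[cur + 1] = true;    /* otherwise the token shifts forward */
    }
    return 0;
}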


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403