[comp.arch] Self timed processors

hunt@spar.SPAR.SLB.COM (Neil Hunt) (02/19/88)

In article <1232@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey) writes:
>Actually, I once heard a proposal to make a microprocessor totally 
>asynchronous, with logic added to determine when each stage of logic was
>complete, and use that to start the next stage.  It would take advantage of
>the fact that an ALU might be done sooner when adding small numbers, and lots
>of times the numbers added are small (compared to the total size of the 
>data path).  "Self-timed" is what it was called.

>An interesting idea, but likely wouldn't work too well in a pipeline, and
>would be difficult to interface to.  -Stan

I think that it would actually work rather well in a pipeline, with a little
care.

First, to recap on asynchronous signalling: an event is indicated
by a signal transition on a wire (in either direction). In the simplest
form of signalling, two wires are used for each bit of data. A transition
on one wire indicates the transmission of a one bit, while a transition
on the other wire indicates the transmission of a zero bit.
Thus a single transition signals not only the arrival of an event
but also its type.
The receiving unit signals back along a single wire that the data
has been accepted, and more may be sent. To conserve wires, a data bundle
is sometimes used. Here the bits of data are put on a bundle
of wires in the conventional manner, using level signalling, and a
single event wire transition signals the arrival of new stable data
to the next stage.  Again, an acknowledgement transition on a return
wire is used.
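The two-wires-per-bit scheme can be modelled in a few lines of code. This is a toy sketch, not real hardware description; the class and method names are purely illustrative:

```python
# Toy model of dual-rail (two-wires-per-bit) event signalling.
# A bit is sent by toggling one of two wires: the "zero" wire for a 0,
# the "one" wire for a 1. The receiver decodes by noticing which wire
# changed level since the last event; the transition itself carries
# both the arrival of the event and its type.

class DualRailSender:
    def __init__(self):
        self.wire0 = 0   # current level of the "zero" wire
        self.wire1 = 0   # current level of the "one" wire

    def send(self, bit):
        # A transition (in either direction) on one wire signals the bit.
        if bit:
            self.wire1 ^= 1
        else:
            self.wire0 ^= 1
        return (self.wire0, self.wire1)

class DualRailReceiver:
    def __init__(self):
        self.last = (0, 0)

    def receive(self, wires):
        w0, w1 = wires
        l0, l1 = self.last
        self.last = wires
        if w0 != l0:
            return 0     # transition on the zero wire
        if w1 != l1:
            return 1     # transition on the one wire
        return None      # no transition, no event

tx, rx = DualRailSender(), DualRailReceiver()
decoded = [rx.receive(tx.send(b)) for b in [1, 0, 1, 1, 0]]
print(decoded)   # → [1, 0, 1, 1, 0]
```

The acknowledgement wire back to the sender is omitted here; in a real circuit the sender would wait for an acknowledge transition before toggling either wire again.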

Each section of the pipeline has event connections to the preceding
and following units, which signal the availability and consumption
of each data item. Consider a linear pipeline of processing elements.
Data enters at one end, and propagates through the stages. Its speed
of propagation is limited by the speed of the processing stages, and by
the need to wait until the next stage is available. This means that
the pipeline will run correctly at the speed of the slowest component;
this would have been the clock frequency of a synchronous system.
But if the slowest component is speeded up, perhaps by processing data
which involves less propagation up the carry chain, the whole pipeline
speeds up to take advantage of the smaller delays.
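The throughput argument can be checked with a small timing recurrence. This is a sketch under simplifying assumptions (a buffer between stages, so a stage is free as soon as it finishes its previous item); the numbers below are invented for illustration:

```python
# Why a self-timed pipeline tracks actual delays rather than the worst
# case: item k finishes in stage i once it has arrived from stage i-1
# AND stage i is done with item k-1, plus that item's actual delay.

def pipeline_finish_time(delays):
    # delays[i][k] = time stage i takes on item k
    stages, items = len(delays), len(delays[0])
    done = [[0.0] * items for _ in range(stages)]
    for k in range(items):
        for i in range(stages):
            arrived = done[i - 1][k] if i > 0 else 0.0
            stage_free = done[i][k - 1] if k > 0 else 0.0
            done[i][k] = max(arrived, stage_free) + delays[i][k]
    return done[-1][-1]

# Three stages, four items; mostly fast, one slow operation
# (e.g. one addition with a long carry propagation).
delays = [[1, 1, 1, 1],
          [1, 5, 1, 1],
          [1, 1, 1, 1]]
self_timed = pipeline_finish_time(delays)

# A synchronous pipeline must clock at the slowest stage delay (5),
# paying it on every cycle: (stages + items - 1) cycles in all.
worst = max(max(row) for row in delays)
synchronous = worst * (len(delays) + len(delays[0]) - 1)
print(self_timed, synchronous)   # → 10.0 30
```

The slow item delays its neighbours, but every other handoff proceeds at the fast rate, which is exactly the advantage claimed above.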

The problem with pipelines running in a self-timed fashion concerns
external conditions. The obvious example is in a branch instruction;
in a synchronous system, there are a known number of branch delay slots,
which can be filled or empty, squashed, predicted, etc. The machine is
designed to throw away the wasted cycles in an incorrectly predicted
branch. But in a self-timed system, it is not possible to say
how many instructions could be in the pipeline when the branch takes
the unpredicted direction. (A slower instruction could have entered the
pipeline, and be lagging behind a fast branch instruction, or
several fast instructions could all be bumper-to-bumper behind
the branch instruction.)

The answer is to make the relationships between the stages explicit,
and represent them as additional signalling connections. For example,
we could have some logic maintaining a state for the pipeline: either
running normally, or flushing discarded instructions.
When a taken branch is encountered, this is set to flushing mode.
A signal which arrives with the new stream of instructions from
the memory system resets this to the running state.
The state of this unit controls whether the results of
computations are written or discarded. In this way, regardless of the
number of instructions actually in the pipeline when the branch was
taken, the processor can start to execute the new stream as soon as
it starts to arrive in the processor; there is no need to wait
for the longest possible time which it might take for the pipeline
to flush itself, as the entire processor is self-timed.
Appropriate use of FIFOs and signal acknowledgements takes care of
the situation where the processor might have more than one
taken branch in the pipeline at once, which might, without care,
lead to the signal for the earlier branch being interpreted prematurely
as the OK to start using instructions after the second branch.
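One way to see how this works, including the multiple-branches case, is to tag each fetched instruction with an "epoch" that a taken branch advances; a stale tag at writeback means the result is discarded. This is a sketch of the idea, not the exact running/flushing signal scheme described above, and all names in it are illustrative:

```python
# Sketch of pipeline flushing via epoch tags. Every fetched instruction
# carries the epoch current at fetch time; a taken branch bumps the
# epoch, so instructions fetched down the wrong path carry a stale tag
# and are silently discarded at writeback. Because the epoch is a
# counter, two taken branches in flight cannot be confused: the first
# branch's bump does not make the second branch's wrong-path
# instructions look valid.

class FlushingPipeline:
    def __init__(self):
        self.fetch_epoch = 0    # stamped onto fetched instructions
        self.commit_epoch = 0   # epoch whose results may be written
        self.results = []

    def fetch(self, instr):
        return (instr, self.fetch_epoch)

    def taken_branch(self):
        # Enter "flushing" mode: anything still arriving from the old
        # stream carries a stale tag.
        self.fetch_epoch += 1
        self.commit_epoch = self.fetch_epoch

    def writeback(self, tagged):
        instr, epoch = tagged
        if epoch == self.commit_epoch:
            self.results.append(instr)   # running: write the result
        # else: stale epoch -> discard (flushing)

p = FlushingPipeline()
stream = [p.fetch(i) for i in ("add", "branch", "old1", "old2")]
# In-order writeback; the branch resolves taken at its own writeback.
p.writeback(stream[0])            # "add" commits
p.writeback(stream[1])            # "branch" commits...
p.taken_branch()                  # ...and redirects fetch
p.writeback(stream[2])            # "old1": stale epoch, discarded
p.writeback(stream[3])            # "old2": discarded
for t in (p.fetch("new1"), p.fetch("new2")):
    p.writeback(t)
print(p.results)   # → ['add', 'branch', 'new1', 'new2']
```

Note that writeback never waits on a worst-case flush time: new-stream instructions commit as soon as they arrive, just as the text argues.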

Concerning interfacing: many system busses are currently asynchronous,
offering the same advantage of using the speed of the cheaper
operations while not being limited by the slowness of the more expensive
operations except when they are actually being performed.
Some of this advantage is lost with a synchronous processor, as the
asynchronous delays on the bus must be quantised to clock cycles at
the processor interface. Would it not be better to have the
entire system running in an asynchronous manner?

I think that this is in fact rather an exciting possibility.

Neil/.

					hunt@spar.slb.com
					...{amdahl|decwrl|hplabs}!spar!hunt

steckel@Alliant.COM (Geoff Steckel) (02/20/88)

The ancient DEC 36-bit machines were asynchronous.  The PDP-6 and PDP-10
(KA-10) processors had chains of delays to time cycles.  The concept of
'hardware subroutines' was used to perform memory cycles, ALU operations,
and other arbitrarily timed operations.  It worked remarkably well, except
when one of the subroutines failed to return.  Result: a hung machine,
with busy people chasing around the estimated 2000+ lights looking for
the one representing the condition being waited for.

A later 36-bit processor, the KI-10, had a central clock, but also used
the hardware subroutine concept, resynchronizing at every clock edge.
It also used cycle stretching, with 3 lengths of cycles depending on what
was to be performed.

Since these were large (physically), had long cables to memory, and used
core (200 ns access was BLINDINGLY fast), they showed performance changes
if you recabled the memory bus or changed the physical layout of memory
boxes - even 10 feet of cable showed up as a percent or two.

Note neither of these machines had microcode - all direct logic.
With 360+ instructions, they were almost impossible to compile for,
but the easiest machines to code in assembler I've ever seen.

They had registers mapped as low memory.  This was useful to simplify the
addressing (no special register-to-register instructions).  In the PDP-6
the registers were actually part of low core.

The 'delay chain' architecture could be built; it would require someone
to adequately address synchronization (indeterminacy) with other clocked
systems, but that problem arises in every system, so what else is new.

pauls@nsc.nsc.com (Paul Sweazey) (02/22/88)

Although asynchronous computing is attractive, it isn't likely to be
commercially attempted until a production VLSI design methodology
exists that synthesizes asynchronous state machines that are correct
by construction, and that analyzes them to eliminate the races and
hazards without also eliminating the performance advantages.

If someone out there is qualified to build such tools, send me email.

Paul Sweazey,  M/S D3678
National Semiconductor Corporation
2900 Semiconductor Drive, PO Box 58090
Santa Clara, CA 95052		Work: 408-721-5860
{decwrl,hplabs,ihnp4,sun,pyramid,amdahl}!nsc!pauls

malcolm@spar.SPAR.SLB.COM (Malcolm Slaney) (02/22/88)

In article <4979@nsc.nsc.com> pauls@nsc.UUCP (Paul Sweazey) writes:
>Although asynchronous computing is attractive, it isn't likely to be
>commercially attempted until a production VLSI design methodology
>exists that synthesizes asynchronous state machines that are correct
>by construction, and that analyzes them to eliminate the races and
>hazards without also eliminating the performance advantages.
>
>If someone out there is qualified to build such tools, send me email.

I think that Bob Sproull (of CMU) and Ivan Sutherland (as in the graphics
company) have the methodology down.  I attended a short course they taught 
on their ideas last year, and while the stuff is messy it does seem to work.
Their ideas were written up in Electronics last year.

Cheers.

							Malcolm

grosen@amadeus.ucsb.edu (Mark D. Grosen) (02/22/88)

People are working on asynchronous processors at UC Berkeley.  We had
Teresa Meng visit our department a couple of weeks ago.  She (and others) have
developed a methodology for designing asynchronous processors that
eliminates races and hazards using handshaking.  Most of her work was aimed
at DSP processors.  She reported a 2x speedup of the TMS32010 using her
async design instead of the original clocked scheme. She should be finishing
her PhD dissertation soon, so that would be a reference to start with.

Mark


Mark D. Grosen		ARPA: grosen%filter@hub.ucsb.edu
Signal Processing Lab
ECE Dept.
University of California
Santa Barbara, CA  93106

grunwald@uiucdcsm.cs.uiuc.edu (02/23/88)

There is a recent Caltech CS tech report on synthesis of self-timed circuits.
You might send them a letter & ask for it.