[comp.arch] Cycle stretching

david@daisy.UUCP (David Schachter) (02/16/88)

Here is a dumb question.  Say I have a CPU where 99 percent of the instructions
take, say, one clock.  The remaining instructions need just a little longer--
one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
uting those instructions, instead of wasting most of a second clock period?

					-- David Schachter

Opinions herein are an artifact of pulse dialing, not a carbon-based life form.

wesommer@athena.mit.edu (William E. Sommerfeld) (02/16/88)

In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>Say I have a CPU where 99 percent of the instructions
>take, say, one clock.  The remaining instructions need just a little longer--
>one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
>uting those instructions, instead of wasting most of a second clock period?

Rumour has it that the original Lisp Machines (the `CADR's) did just
this; there were two clocks, and one bit of the microcode selected
which clock would be used.  The story goes that both clocks were
hand-adjustable, and that special microcode diagnostics were used to
tune each one (speed it up until it crashes, and then back off a 1/4
turn..) so each machine would run as fast as possible..

				-- Bill Sommerfeld
				   wesommer@athena.mit.edu

mac3n@babbage.acc.virginia.edu (Alex Colvin) (02/16/88)

I believe that the 64-bit microcode of the Honeywell Level6 = DPS6 also
had a cycle width bit.  Anyone from Billerica care to confirm this?

baum@apple.UUCP (Allen J. Baum) (02/17/88)

>In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>
>Here is a dumb question.  Say I have a CPU where 99% of the instructions
>take, say, one clock.  The remaining instructions need just a little longer--
>one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
>uting those instructions, instead of wasting most of a second clock period?

Not a dumb question. Lots of older microcoded minis did exactly this in their
microcode. They had a control field to slow down the clock (from 150 ns to
180 ns, for instance) when something slow came up, like a branch. I believe
the first Prime was a machine that did this.
This does complicate the world, especially synchronizing to the outside world.
It's easier to just take a full cycle in the 1% of cases.
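
Baum's figures make the trade-off easy to quantify. A back-of-the-envelope
sketch (the 150 ns / 180 ns numbers are his, the 1% slow-instruction mix is
from the original question; everything else is illustrative):

```python
# Average time per instruction: stretch the clock for the slow 1% vs.
# spending a full second cycle on them.  Numbers are illustrative, taken
# from the 150 ns / 180 ns figures above and the 1% mix in the original
# question.
BASE_NS = 150.0
SLOW_NS = 180.0
SLOW_FRACTION = 0.01

def avg_ns(slow_cost_ns):
    """Average time per instruction given the cost of a slow instruction."""
    return (1 - SLOW_FRACTION) * BASE_NS + SLOW_FRACTION * slow_cost_ns

stretched = avg_ns(SLOW_NS)         # stretched clock: slow ops take 180 ns
extra_cycle = avg_ns(2 * BASE_NS)   # full extra cycle: slow ops take 300 ns

print(f"stretched:   {stretched:.1f} ns/instr")    # 150.3
print(f"extra cycle: {extra_cycle:.1f} ns/instr")  # 151.5
```

The difference is under one percent at a 1% slow-instruction mix, which is
why taking the full extra cycle is usually an acceptable simplification.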

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

philip@amdcad.AMD.COM (Philip Freidin) (02/17/88)

In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>
>Here is a dumb question.  Say I have a CPU where 99 percent of the instructions
>take, say, one clock.  The remaining instructions need just a little longer--
>one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
>uting those instructions, instead of wasting most of a second clock period?
>
>					-- David Schachter
>


This is not a dumb question, because I have the answer! The technique you
describe is a common one, and has been dealt with many times in various
ways in different implementations of assorted architectures. The ratio of
quick versus not-so-quick instructions, though, is not 99:1 but more in the
50:50 region (give or take 20%). The quick instructions are the logical
ops, because there is no inter-bit communication. The not-so-quick ones are
the primitive arithmetic ops such as add, inc, sub, and dec, because the
inter-bit communication (the carry chain) slows things down. Longer
instructions use multiple cycles (and in some systems, a mix of long- and
short-cycle instructions).

SALES MODE ON  :-)

The implementation of this variable period clock can be done with the

AMD  AM2925  clock generator chip.

It is specifically designed to do exactly what you asked about. In a micro-
programmed system, each microcycle's execution duration is a function of
what the critical path will be for the specified operation. This is
known at assembly time of the microcode. An extra field is added to the
microcode, which controls the Am2925 and thus sets the duration of the
clock for that microcycle. The chip has a 3-bit control field, so it can
generate 8 different clock periods. With wait states, this can be extended.

SALES MODE OFF :-) :-)  (double smiley because I prefer to be out of sales
			 mode)
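
The mechanism Freidin describes can be sketched in a few lines. The 3-bit
field and eight selectable periods follow his description of the Am2925; the
actual period table and the sample microprogram below are invented for
illustration:

```python
# Sketch of per-microinstruction clock-period selection in the style of
# the Am2925's 3-bit length field.  The period table and the sample
# microcode are invented; only the mechanism -- a microword field picks
# the cycle length, known at microassembly time -- follows the text above.
PERIODS_NS = [100, 125, 150, 175, 200, 250, 300, 400]  # 8 periods, 3-bit field

# Hypothetical microprogram: (operation, 3-bit clock-field value)
microcode = [
    ("fetch",      2),  # 150 ns
    ("logical-op", 0),  # 100 ns: no carry chain, shortest cycle
    ("add",        3),  # 175 ns: carry chain is the critical path
    ("branch",     4),  # 200 ns
]

total_ns = sum(PERIODS_NS[field] for _, field in microcode)
worst_case_ns = len(microcode) * max(PERIODS_NS[f] for _, f in microcode)

print(total_ns)        # 625 ns with per-cycle periods
print(worst_case_ns)   # 800 ns if every cycle ran at the slowest period
```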



Philip Freidin @ AMD SUNNYVALE on {favorite path}!amdcad!philip
Section Manager of Product Planning for Microprogrammable Processors
(you know.... all that 2900 stuff...)
"We Plan Products; not lunches" (a quote from a group that has been standing
				 around for an hour trying to decide where
				 to go for lunch)

crabb@cadsys.dec.com (Charlie, SEG/CAD, HLO2-2/G13, (dtn 225)(617)568-5739) (02/17/88)

>>>>...The remaining instructions need just a little longer--
>>>>one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
>>>>uting those instructions, instead of wasting most of a second clock period?

>>>Rumour has it that the original Lisp Machines (the `CADR's) did just
>>>this; there were two clocks, and one bit of the microcode selected
>>>which clock would be used.  

 
>>Not a dumb question. Lots of older microcoded minis did exactly this in their
>>microcode. They had a control field to slow down the clock (from 150ns. to
>>180ns., for instance) when something slow came up, like a branch. I believe
>>the first Prime was a machine that did this.
 
>The implementation of this variable period clock can be done with the
>AMD  AM2925  clock generator chip.
 
 
	I third the motion for non-dumbness.  The PR1ME (note
	old logo :-) ) did indeed have a clock field in the microcode
	word to adjust the clock on a per-instruction granularity.
	Time was spent tuning the machine for various (macro)
	instruction times in the 4xx-7xx series (pre-pipeline era). 
 
	/Charlie Crabb !decwrl!cadsys.dec.com!crabb

jesup@pawl20.pawl.rpi.edu (Randell E. Jesup) (02/17/88)

In article <20409@amdcad.AMD.COM> philip@amdcad.UUCP (Philip Freidin) writes:
>The implementation of this variable period clock can be done with the

>AMD  AM2925  clock generator chip.

>It is specifically designed to do exactly what you asked about. In a micro-
>programmed system, each micro cycle execution duration is a function of
>what the critical path will be for the specified operation. This is
>known at assembly time of the microcode. An extra field is added to the
>microcode, which controls the Am2925, and thus sets the duration of the
>clock for that microcycle. The chip has a 3 bit control field, so it can
>generate 8 different clock periods. With wait states, this can be extended.

	This will probably only work at relatively slow speeds.  At higher
clock rates, you will find that the inter-chip latency is magnitudes greater
than the amount you want to adjust the clock by.

	If, in some micro-programmed CISC (yuch! :-), you want to stretch a
cycle by 20%, you could just use a 4 or 5 times faster clock and take another
cycle.  In a RISC, unless it's awfully slow, you might as well take the
extra cycle, because chip-edge capacitance slows things down so much.  That
is the real next hurdle that state of the art stuff has to beat.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

lackey@Alliant.COM (Stan Lackey) (02/18/88)

>In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>>
>>Say I have a CPU where 99 percent of the instructions
>>take, say, one clock.  The remaining instructions need just a little longer--
>>one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
>>uting those instructions, instead of wasting most of a second clock period?

Actually, I once heard a proposal to make a microprocessor totally 
asynchronous, with logic added to determine when each stage of logic was
complete, and use that to start the next stage.  It would take advantage of
the fact that an ALU might be done sooner when adding small numbers, and lots
of times the numbers added are small (compared to the total size of the 
data path).  "Self-timed" is what it was called.

An interesting idea, but likely wouldn't work too well in a pipeline, and
would be difficult to interface to.  -Stan

vandys@hpindda.HP.COM (Andy Valencia) (02/18/88)

	If you're going to do this, why not take it all the way and
make your computer "event driven" instead of clocked?  Then your
computation can continue at the highest speed available from your
components (gee, and you could even replace slow components with faster
ones....)

	So an "add register to memory" would go like:

Events		Sequencer
		<Request register #x>
		<Request memory location #N>
<Register x>
<Mem N>
		... pipeline in: <Request next instr location>
		<Request ADD>
<ADD done>
		<Request write memory location #N>
... <Next instr>
		... start next instruction
<Write done>

	Bunches of asynchronously executing components... I wonder
what it would be like to diagnose the microcode :-<.
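
A minimal sketch of the completion-event style in the trace above; the
component latencies and event names here are invented:

```python
# Toy completion-event model of the "add register to memory" trace above.
# Each step starts as soon as the events it waits on are done; the
# latencies are invented for illustration.
LATENCY_NS = {"reg_read": 10, "mem_read": 70, "add": 25, "mem_write": 60}

done = {}  # event name -> completion time in ns

def fire(event, op, waits_on=()):
    """Start `op` once every event in `waits_on` has completed."""
    start = max((done[w] for w in waits_on), default=0)
    done[event] = start + LATENCY_NS[op]

fire("reg_x", "reg_read")                                # <Request register #x>
fire("mem_n", "mem_read")                                # <Request memory #N>
fire("sum",   "add",       waits_on=("reg_x", "mem_n"))  # <Request ADD>
fire("write", "mem_write", waits_on=("sum",))            # <Request write #N>

print(done["write"])  # 155: the reads overlap, so only the slower one counts
```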

			Andy Valencia
			vandys%hpindda.UUCP@hplabs.hp.com

przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (02/18/88)

In article <8802162251.AA20090@decwrl.dec.com> crabb@cadsys.dec.com (Charlie, SEG/CAD, HLO2-2/G13, (dtn 225)(617)568-5739) writes:
>>>Not a dumb question. Lots of older microcoded minis did exactly this in their
>>>microcode. They had a control field to slow down the clock (from 150ns. to
>>>180ns., for instance) when something slow came up, like a branch. I believe
>>>the first Prime was a machine that did this.
>	/Charlie Crabb !decwrl!cadsys.dec.com!crabb

Hey, I saw an old PDP (was it 8?) with a knob on the front panel, regulating
the clock frequency! you are pressed for time? turn it clockwise! (probably
at the expense of the error rate). I personally would rather implement it as
a foot operated lever under the operator console...   :^)
				przemek@psuvaxg.bitnet
				psuvax1!gondor!przemek

henry@utzoo.uucp (Henry Spencer) (02/19/88)

> ... The remaining instructions need just a little longer--
> one clock plus a few nanoseconds.  Why not stretch the clock a bit when exec-
> uting those instructions, instead of wasting most of a second clock period?

Having several different clock periods used to be fairly routine in the
days when minis were built from TTL.  Some of the PDP11 family, for example,
had three different clock periods selectable on a microcycle-by-microcycle
basis.  It's gotten less popular nowadays because everything tends to be in
one chip that's invariably short of pins, and it's not as easy to just tap
one or two bits of the microword to control the clock.  It still can be 
done -- the Sun 3/100 series really does have 1.5 wait states for memory
accesses, even though the 68020 has no notion of fractional wait states,
because the clock generator knows about memory accesses and lengthens the
cycle.
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

bjj@psueclb.BITNET (02/19/88)

> Hey, I saw an old PDP (was it 8?) with a knob on the front panel, regulating
> the clock frequency!

That was a PDP-10, KA10 processor.  Sorry, that knob only changes the speed
of the front panel repeat function.  Like when you turn on repeat, hit "deposit next",
and have the CPU fill all of memory.  Handy for memory tests when the CPU
won't run anything.

>        If you're going to do this, why not take it all the way and
> make your computer "event driven" instead of clocked?

The KA10 is asynchronous.  There really is no clock.  All timing is
determined by delay lines (and wire length).  It has subroutines
(like the memory cycle subroutine) which accept parameters (read
and write).  All done by sending the pulse off in various directions
and gating the returning pulse with a flag to indicate who's waiting
for the subroutine to finish.  Very nice for doing things in parallel
as you can have separate streams executing independently and wait
for the last to finish.

>        Bunches of asynchronously executing components... I wonder
> what it would be like to diagnose the microcode :-<.

Who knows.  The microcode works, why diagnose it?  Fixing it is easy,
just takes a scope.  Occasionally our KA10 will have a pulse amplifier
go bad resulting in a lost pulse.  You just check the state of the
machine (there are lights on nearly every register) to get a good
idea where the pulse was lost.

jk3k+@andrew.cmu.edu (Joseph G. Keane) (02/19/88)

I've been thinking about making an asynchronous processor for a while.  You 
need a lot of extra timing circuitry (I'd guess about double), but it mostly 
runs in parallel.  I think eventually this idea will win out.  You don't need 
any safety margin (`turn till it dies, then back off a quarter turn'); the 
thing will always run as fast as possible.  But can you imagine trying to 
benchmark the thing?!

A couple of weeks ago there was a talk here by someone who had apparently done 
just this.  I'm kicking myself because I missed it.  I suppose I could get a 
reference if anyone wants it.

--Joe

grunwald@uiucdcsm.cs.uiuc.edu (02/19/88)

There is a recent tech report from CalTech discussing synthesis of self-
timed circuits. CalTech has historically promoted the use of self-timed
circuitry for reliability. Heretofore, the main problem has been
design complexity.

As an example, the AMRD (Async. Message Routing Device) of the Ametek 2010
machine uses a self-timed network. When I saw the machine, they had half the
parts running at 8 MHz (I think) and the other half at 12 MHz. They wanted to
get to 20 MHz eventually. However, the key point is that when you communicated
between the 12 MHz parts, you ran at 12 MHz. The 12 MHz parts only ran at 8 MHz
when there was an 8 MHz part in the chain.  As long as the complexity is
manageable, it certainly appears to be a good design method.

baum@apple.UUCP (Allen J. Baum) (02/20/88)

>In article <1232@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey)writes:
>
>Actually, I once heard a proposal to make a microprocessor totally 
>asynchronous, with logic added to determine when each stage of logic was
>complete, and use that to start the next stage.  It would take advantage of
>the fact that an ALU might be done sooner when adding small numbers, and lots
>of times the numbers added are small (compared to the total size of the 
>data path).  "Self-timed" is what it was called.
>
>An interesting idea, but likely wouldn't work too well in a pipeline, and
>would be difficult to interface to.  -Stan

Machines like this have been built (e.g. the Illiac II), but none recently.
Although people have talked about self-timed processors, I'm not aware of
any that have been built. Pieces of microprocessors are sometimes self-timed,
like register files and caches, but that's the extent of it.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mitch@Stride.COM (Thomas Mitchell) (02/20/88)

In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>
>Here is a dumb question.  Say I have a CPU where 99 percent of 
           ^^^ Not dumb.
>the instructions
>take, say, one clock.  The remaining instructions need just a little longer--
>one clock plus a few nanoseconds.  Why not stretch the clock 
>a bit when executing those instructions, instead of wasting most
>of a second clock period?

David;

That is the same question we asked when we were selecting a clock
rate for our 400 series custom MMU.  We could have clocked that
processor (MC68010) at 12 MHz but that would have required an
extra state on each memory cycle.  At 10MHz that extra state was
not required.  The result was the same throughput at 10MHz as
12MHz in most cases.

The hard part is finding the knee in the curve.  If the
instructions needing the extra state are used rarely, then adding
the state to all operations is a loss.  If they are few but
commonly used, then it is a gain.
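
Mitchell's 10 MHz vs. 12 MHz choice reduces to simple arithmetic. A sketch,
assuming a 4-clock base bus cycle (the 68010's minimum) and that every cycle
goes to memory, both of which are simplifications:

```python
# Mitchell's 10 MHz vs. 12 MHz choice, as arithmetic.  Assumes a 4-clock
# base bus cycle (the 68010's minimum) and that every cycle goes to
# memory -- both simplifications for illustration.
def bus_cycle_ns(mhz, clocks):
    """Duration of one bus cycle of `clocks` processor clocks at `mhz`."""
    return clocks * 1000.0 / mhz

at_10mhz = bus_cycle_ns(10, 4)      # no extra state needed
at_12mhz = bus_cycle_ns(12, 4 + 1)  # one extra state per memory cycle

print(f"{at_10mhz:.1f} ns")  # 400.0: 10 MHz, 4 clocks
print(f"{at_12mhz:.1f} ns")  # 416.7: 12 MHz, 5 clocks
```

Under these assumptions the nominally faster clock is slightly slower once
the extra state is charged to every memory cycle, which is the knee Mitchell
describes.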

Well thanks for the Soap Box,

kyriazis@pawl11.pawl.rpi.edu (George Kyriazis) (02/20/88)

In article <1232@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey) writes:
>>In article <844@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>>>
>>>Say I have a CPU where 99 percent of the instructions
>>>take, say, one clock.  The remaining instructions need just a little longer-
>>>one clock plus a few nanoseconds...
>
>Actually, I once heard a proposal to make a microprocessor totally 
>asynchronous, with logic added to determine when each stage of logic was
>complete, and use that to start the next stage.  It would take advantage of
>the fact that an ALU might be done sooner when adding small numbers, and lots
>of times the numbers added are small (compared to the total size of the 
>data path).  "Self-timed" is what it was called.

  Yes.  There is a chapter in Mead and Conway's book 'Introduction to VLSI
Systems' describing self-timed circuits.  The concept is pretty interesting,
since (for example) a circuit can be built using self-timed circuits and the
interface can be built to communicate with normal clocked circuitry.  It
starts being interesting when you realize that if you want to make the chip
(or the CPU) faster, you can just lower the temperature... No adjustable
clocks, no nothing.

>
>An interesting idea, but likely wouldn't work too well in a pipeline, and
>would be difficult to interface to.  -Stan

  Yes, interfacing is more difficult, but there are standard ways to overcome
the difficulty.  The problem is elsewhere.  Since there cannot be a BUS ENABLE
signal for internal buses (you would have to wait for the last signal to change
state before you toggle ENABLE), the only solution is to have two wires for
every bit: one to say 'OK, I have a 1' and another saying 'OK, I have a 0'.
That doubles the number of wires required for every datapath, which can
easily lower the transistor density of the chip.
  Another interesting thing about self-timed circuits is that they look as
though they have a lot in common with the principles used in dataflow
computers, like 'This operation won't be executed unless I get results from
"there" and "there"'.


*******************************************************
*George C. Kyriazis                                   *    Gravity is a myth
*userfe0e@mts.rpi.edu or userfe0e@rpitsmts.bitnet     *        \    /
*Electrical and Computer Systems Engineering Dept.    *         \  /
*Rensselaer Polytechnic Institute, Troy, NY 12180     *          ||
*******************************************************      Earth sucks.

aglew@ccvaxa.UUCP (02/22/88)

..> Variable clock rates

(1) There's a company that sells a box for the VAX (I think 750, but not
    sure) that varies the clock rate according to processor activity.
    They say that they can get an extra 15% out of your old tired VAX.

    Not sure of details - just crossed this in some DEC magazine.

(2) I've always liked the idea of self timed circuitry (note that this
    is an order of scale different from varying clock rates), but have
    a question that someone with more experience with self timed techniques
    can answer.

    Am I correct in saying that a trivial way of obtaining a self timed
    circuit is to take a "normal" circuit, say an adder, and put a timing
    circuit beside it that will produce a pulse when the adder is finished?
    And that there are "transformations" that will more closely intertwine
    the timing circuit with the function, so that they share gates?
    Doesn't this require extremely accurate parametrization of the device's
    performance, more than is required for non-self-timed systems?

rick@svedberg.bcm.tmc.edu (Richard H. Miller) (02/23/88)

In article <3297@psuvax1.psu.edu>, przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) writes:
> 
> Hey, I saw an old PDP (was it 8?) with a knob on the front panel, regulating
> the clock frequency! you are pressed for time? turn it clockwise! (probably
> at the expense of the error rate). I personally would rather implement it as
> a foot operated lever under the operator console...   :^)

We have a clock speed switch (two actually, a coarse speed and a fine speed pot)
on the console of our KI-10 (PDP-10). The documentation indicates that the
speed control is used only for maintenance. It is always kept in the fastest
position during production. From reading the schematics of the processor, this
switch is usually used to diagnose clock problems or timing problems in the
processor. 


Richard H. Miller                 Email: rick@svedberg.bcm.tmc.edu
Head, System Support              Voice: (713)799-4511
Baylor College of Medicine        US Mail: One Baylor Plaza, 302H
                                           Houston, Texas 77030

cantrell@Alliant.COM (Paul Cantrell) (02/24/88)

In article <3297@psuvax1.psu.edu> przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) writes:
>Hey, I saw an old PDP (was it 8?) with a knob on the front panel, regulating
>the clock frequency! you are pressed for time? turn it clockwise! (probably
>at the expense of the error rate). I personally would rather implement it as
>a foot operated lever under the operator console...   :^)
>				przemek@psuvaxg.bitnet
>				psuvax1!gondor!przemek


This was almost certainly a KA-10 processor, part of a DECSystem-10
computer system. The knob is used during debugging (of hardware or
software) to control the speed of single-step operation, not the
speed during normal operation. This was actually pretty nice if you
were debugging a problem with the operating system. You could place
the system in single step mode, crank the knob around to get desired
speed of single step, and watch the lights on the processor and memory
until you saw the condition you were looking for.

The KA-10 was indeed an asynchronous machine (I always heard it referred
to as a 'race' machine). The machine would run at different rates
depending on environmental conditions, which memory was being
accessed, etc. Different machines would run at different rates
from each other. It made it difficult to do good benchmarks...

					PC

bobc%wings@Sun.COM (Bob Clark) (02/24/88)

In article <28200107@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>    Am I correct in saying that a trivial way of obtaining a self timed
>    circuit is to take a "normal" circuit, say an adder, and put a timing
>    circuit beside it that will produce a pulse when the adder is finished?
>    And that there are "transformations" that will more closely intertwine
>    the timing circuit with the function, so that they share gates?
>    Doesn't this require extremely accurate parametrization of the device's
>    performance, more than is required for non-self-timed systems?

You have defined two approaches:

1) Take a functional module, characterize the worst case delay, and
add a delay line in parallel with the function.  Use the output of
the delay line to determine that the function is complete.

	This is the trivial approach, and buys you nothing over standard
	synchronous design.  It is a way of modifying a synchronous
	circuit to work within an otherwise self-timed system.

2) Design an entirely new circuit to implement the function, whose
state changes are controlled in such a way that the final state change
indicates completion of the function.

	This is the truly self-timed approach, and requires careful
	definition of the state changes.  One approach is to design
	an asynchronous state machine, whose states are carefully
	designed so that only a single bit of the state code can
	change at a time.  This requires no characterization of the
	circuit speed, and is referred to in the literature as
	the "one-hot" approach.

	An alternative is to assemble your macro self-timed circuit
	out of micro-self-timed modules, such as C-elements.  As
	others have mentioned, Ivan Sutherland and others have been
	working in this area recently, and I would guess that some
	work has gone on sporadically since the early days of
	computing.

It is possible to design circuits whose functional completion does
not require parametrization of the device's performance.
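
The C-element Clark mentions has a simple behavioral model: the output
follows the inputs when they agree and holds its last value when they
disagree. A sketch:

```python
# Behavioral model of the Muller C-element mentioned above: output goes
# high when both inputs are high, low when both are low, and otherwise
# holds its previous value -- the basic rendezvous of self-timed logic.
class CElement:
    def __init__(self):
        self.out = 0

    def step(self, a, b):
        if a == b:           # inputs agree: output follows them
            self.out = a
        return self.out      # inputs disagree: hold the old value

c = CElement()
assert c.step(1, 0) == 0  # disagreement: holds the initial 0
assert c.step(1, 1) == 1  # both high: output rises
assert c.step(1, 0) == 1  # disagreement: holds the 1
assert c.step(0, 0) == 0  # both low: output falls
print("C-element rendezvous behaves as described")
```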

Bob Clark
Sun Microsystems

blarson@skat.usc.edu (Bob Larson) (02/24/88)

In article <42976@sun.uucp> bobc@sun.UUCP (Bob Clark) writes:
[In reply to <28200107@ccvaxa> aglew@ccvaxa.UUCP]

>1) Take a functional module, characterize the worst case delay, and
>add a delay line in parallel with the function.  Use the output of
>the delay line to determine that the function is complete.

or

>2) Design an entirely new circuit to implement the function, whose
>state changes are controlled in such a way that the final state change
>indicates completion of the function.

Why not something intermediate?  Rather than having a fixed delay,
have one that is a function of the inputs.  An adder that has an extra
output indicating that the maximum carry propagation will be N bits
could be designed.  (Probably sharing some gates with the look-ahead
carry generator.)  The output may be stable earlier than predicted
(therefore wasting time), but it is still better than always waiting
for the worst case on any input, and it possibly uses fewer gates than the
fully deterministic "I tell you exactly when I'm ready" circuits.
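
The extra output Larson proposes can be approximated by the longest run of
propagate bits (positions where a XOR b is 1), since a carry only ripples
through propagate positions. A sketch of just the arithmetic (a real design
would tap the lookahead P signals, as he suggests):

```python
# Sketch of Larson's data-dependent delay estimate.  A carry ripples
# through bit i only when a_i XOR b_i is 1 (a "propagate" bit), so the
# longest run of propagate bits bounds how far any carry can travel.
def max_carry_run(a, b, width):
    """Length of the longest run of propagate bits in a + b."""
    longest = run = 0
    for i in range(width):
        if ((a >> i) ^ (b >> i)) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

print(max_carry_run(0x0001, 0xFFFF, 16))  # 15: carry can ripple nearly the full width
print(max_carry_run(3, 5, 16))            # 2: small operands settle quickly
```

The bound is conservative: the sum may settle sooner than predicted but never
later, which is exactly the trade described in the post.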
--
Bob Larson	Arpa: Blarson@Ecla.Usc.Edu	blarson@skat.usc.edu
Uucp: {sdcrdcf,cit-vax}!oberon!skat!blarson
Prime mailing list:	info-prime-request%fns1@ecla.usc.edu
			oberon!fns1!info-prime-request

przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (02/27/88)

In article <1268@alliant.Alliant.COM> cantrell@alliant.UUCP (Paul Cantrell) writes:
>In article <3297@psuvax1.psu.edu> przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) writes:
>>Hey, I saw an old PDP (was it 8?) with a knob on the front panel, regulating
>>the clock frequency! you are pressed for time? turn it clockwise! (probably
>
>This was almost certainly a KA-10 processor, part of a DECSystem-10
>					PC

I went to our dungeon of defunct equipment and I found out that it was a
PDP-15. Another machine that used this was one built by the Polish pioneer
of minicomputers, Karpinski, around 1965 (?); it was a contract for the physics
department of the University of Warsaw. At that time they couldn't afford 
anything commercial, so they hired Karpinski. This machine still works, 
maintained by a dedicated engineer, even though they are getting some IBM PCs
that are comparable to it in computational power (there was a front-page article
in the Wall Street Journal about the PC revolution in Poland).


				przemek@psuvaxg.bitnet
				psuvax1!gondor!przemek