hunt@spar.SPAR.SLB.COM (Neil Hunt) (02/19/88)
In article <1232@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey) writes:
>Actually, I once heard a proposal to make a microprocessor totally
>asynchronous, with logic added to determine when each stage of logic was
>complete, and use that to start the next stage. It would take advantage of
>the fact that an ALU might be done sooner when adding small numbers, and lots
>of times the numbers added are small (compared to the total size of the
>data path). "Self-timed" is what it was called.
>An interesting idea, but likely wouldn't work too well in a pipeline, and
>would be difficult to interface to.  -Stan

I think that it would actually work rather well in a pipeline, with a little care.

First, to recap asynchronous signalling: an event is indicated by a signal transition on a wire (of either sign). In the simplest form of signalling, two wires are used for each bit of data. A transition on one wire indicates the transmission of a one bit, while a transition on the other wire indicates the transmission of a zero bit. Thus a single transition signals not only the arrival of an event, but also its type. The receiving unit signals back along a single wire that the data has been accepted and more may be sent.

To conserve wires, a data bundle is sometimes used. Here the bits of data are put on a bundle of wires in the conventional manner, using level signalling, and a single transition on an event wire signals the arrival of new, stable data at the next stage. Again, an acknowledgement transition on a return wire is used. Each section of the pipeline has event connections to the units preceding and following it, which signal the availability and consumption of each data item.

Consider a linear pipeline of processing elements. Data enters at one end and propagates through the stages. Its speed of propagation is limited by the speed of the processing stages, and by the need to wait until the next stage is available.
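The request/acknowledge discipline sketched above can be modelled with a simple timing recurrence: stage s may start item i once the previous stage has delivered it AND the stage has handed its own previous item onward. A small Python sketch (the function names and the delay table are my own illustration, not from the article) shows why the pipeline runs at data-dependent speed while a clocked pipeline is quantised to the worst-case stage:

```python
# Self-timed pipeline: t[i][s] = max(data ready, stage free) + delay(i, s).
# Synchronous pipeline: every stage takes one clock, and the clock must
# cover the slowest stage ever seen.

def self_timed_finish(delays):
    """delays[i][s] = time stage s needs for item i; returns total time."""
    n, stages = len(delays), len(delays[0])
    t = [[0.0] * stages for _ in range(n)]
    for i in range(n):
        for s in range(stages):
            ready = t[i][s - 1] if s > 0 else 0.0  # predecessor's request
            free = t[i - 1][s] if i > 0 else 0.0   # ack from this stage
            t[i][s] = max(ready, free) + delays[i][s]
    return t[n - 1][stages - 1]

def synchronous_finish(delays):
    """Clock period is the worst-case stage delay over all items."""
    n, stages = len(delays), len(delays[0])
    clock = max(max(row) for row in delays)
    return (n - 1 + stages) * clock

# One slow ALU operation (long carry propagation); the rest are fast.
delays = [[1, 1, 1], [1, 1, 1], [4, 1, 1], [1, 1, 1]]
print(self_timed_finish(delays))   # 9.0
print(synchronous_finish(delays))  # 24
```

Only the item with the long carry chain pays for it; in the clocked version every stage of every item is stretched to the worst case.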
This means that the pipeline will run correctly at the speed of the slowest component; in a synchronous system that speed would have set the clock frequency. But if the slowest component speeds up, perhaps by processing data which involves less propagation up the carry chain, the whole pipeline speeds up to take advantage of the smaller delays.

The problem with pipelines running in a self-timed fashion concerns external conditions. The obvious example is a branch instruction: in a synchronous system there is a known number of branch delay slots, which can be filled or left empty, squashed, predicted, etc. The machine is designed to throw away the wasted cycles of an incorrectly predicted branch. But in a self-timed system, it is not possible to say how many instructions could be in the pipeline when the branch takes the unpredicted direction. (A slower instruction could have entered the pipeline and be lagging behind a fast branch instruction, or several fast instructions could all be bumper-to-bumper behind the branch instruction.)

The answer is to make the relationships between the stages explicit, and represent them as additional signalling connections. For example, we could have some logic maintaining a state for the pipeline: either full and running, or flushing discarded instructions. When a taken branch is encountered, this state is set to flushing mode. A signal which arrives with the new stream of instructions from the memory system resets it to the running state. The state of this unit controls whether the results of computations are written or discarded. In this way, regardless of the number of instructions actually in the pipeline when the branch was taken, the processor can start to execute the new stream as soon as it starts to arrive; there is no need to wait out the longest possible time the pipeline might take to flush itself, as the entire processor is self-timed.
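The flush-state unit described above can be sketched in a few lines of Python. All the names here (`WritebackUnit`, `retire`, `new_stream`) are my own illustration of the idea, not anything from the article: a taken branch flips the state to flushing, and the marker that arrives with the redirected instruction stream flips it back, so whether a result is written depends only on this state, never on how many stragglers were in the pipe:

```python
RUNNING, FLUSHING = "running", "flushing"

class WritebackUnit:
    def __init__(self):
        self.state = RUNNING
        self.results = []

    def branch_taken(self):
        self.state = FLUSHING          # start discarding stale instructions

    def retire(self, result, new_stream=False):
        if new_stream:                 # marker fetched with the target stream
            self.state = RUNNING
        if self.state == RUNNING:
            self.results.append(result)
        # else: a stale instruction still draining out is discarded

wb = WritebackUnit()
wb.retire("i1")
wb.branch_taken()                 # unknown number of stragglers behind it
wb.retire("i2")                   # stale -> discarded
wb.retire("i3")                   # stale -> discarded
wb.retire("t1", new_stream=True)  # first instruction from the branch target
wb.retire("t2")
print(wb.results)                 # ['i1', 't1', 't2']
```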
Appropriate use of FIFOs and signal acknowledgements takes care of the situation where the processor has more than one taken branch in the pipeline at once, which might otherwise lead to the signal for the earlier branch being interpreted prematurely as the OK to start using instructions after the second branch.

Concerning interfacing: many system busses are already asynchronous, offering the same advantage of running the cheaper operations at full speed while paying for the slower, more expensive operations only when they are actually performed. With a synchronous processor, some of this advantage is lost, as the asynchronous delays on the bus must be quantised to clock cycles at the processor interface. Would it not be better to have the entire system running in an asynchronous manner? I think that this is in fact rather an exciting possibility.

Neil/.
hunt@spar.slb.com
...{amdahl|decwrl|hplabs}!spar!hunt
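The FIFO discipline for overlapping branches that Neil mentions can be sketched as a small extension of the same idea. This is again my own hypothetical illustration (the class and method names are invented): instead of a single running/flushing bit, keep one token per unresolved taken branch, and let each arriving new-stream marker answer only the oldest one, so the first marker cannot prematurely re-enable writes while a second flush is still draining:

```python
from collections import deque

class WritebackUnit:
    def __init__(self):
        self.pending = deque()   # one token per unresolved taken branch
        self.results = []

    def branch_taken(self):
        self.pending.append(object())

    def retire(self, result, new_stream=False):
        if new_stream and self.pending:
            self.pending.popleft()       # answers the OLDEST branch only
        if not self.pending:             # write only when no flush pending
            self.results.append(result)

wb = WritebackUnit()
wb.retire("a")
wb.branch_taken()
wb.branch_taken()                 # second taken branch in flight
wb.retire("x")                    # stale -> discarded
wb.retire("y", new_stream=True)   # answers first branch; one still pending
wb.retire("z")                    # still stale -> discarded
wb.retire("b", new_stream=True)   # answers second branch
wb.retire("c")
print(wb.results)                 # ['a', 'b', 'c']
```

Note that "y", the first instruction of the stream after the first branch, is itself correctly discarded, because that stream was in turn flushed by the second taken branch.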
steckel@Alliant.COM (Geoff Steckel) (02/20/88)
The ancient DEC 36-bit machines were asynchronous. The PDP-6 and PDP-10 (KA-10) processors had chains of delays to time cycles. The concept of 'hardware subroutines' was used to perform memory cycles, ALU operations, and other arbitrarily timed operations. It worked remarkably well, except when one of the subroutines failed to return. Result: a hung machine, with busy people chasing around the estimated 2000+ lights looking for the one representing the condition being waited for.

A later 36-bit processor, the KI-10, had a central clock, but also used the hardware subroutine concept, resynchronizing at every clock edge. It also used cycle stretching, with 3 lengths of cycles depending on what was to be performed.

Since these machines were large (physically), had long cables to memory, and used core (200 ns access was BLINDINGLY fast), they showed performance changes if you recabled the memory bus or changed the physical layout of memory boxes - even 10 feet of cable showed up as a percent or two.

Note that neither of these machines had microcode - all direct logic. With 360+ instructions, they were almost impossible to compile for, but the easiest machines to code in assembler I've ever seen. They had registers mapped as low memory, which simplified the addressing (no special register-to-register instructions). In the PDP-6 the registers were actually part of low core.

The 'delay chain' architecture could be built; it would require someone to adequately address synchronization (indeterminacy) with other clocked systems, but that problem arises in every system, so what else is new.
pauls@nsc.nsc.com (Paul Sweazey) (02/22/88)
Although asynchronous computing is attractive, it isn't likely to be commercially attempted until a production VLSI design methodology exists that synthesizes asynchronous state machines that are correct by construction, and that analyzes them to eliminate the races and hazards without also eliminating the performance advantages.

If someone out there is qualified to build such tools, send me email.

Paul Sweazey, M/S D3678
National Semiconductor Corporation
2900 Semiconductor Drive, PO Box 58090
Santa Clara, CA 95052
Work: 408-721-5860
{decwrl,hplabs,ihnp4,sun,pyramid,amdahl}!nsc!pauls
malcolm@spar.SPAR.SLB.COM (Malcolm Slaney) (02/22/88)
In article <4979@nsc.nsc.com> pauls@nsc.UUCP (Paul Sweazey) writes:
>Although asynchronous computing is attractive, it isn't likely to be
>commercially attempted until a production VLSI design methodology
>exists that synthesizes asynchronous state machines that are correct
>by construction, and that analyzes them to eliminate the races and
>hazards, without also eliminating the performance advantages.
>
>If someone out there is qualified to build such tools, send me email.

I think that Bob Sproull (of CMU) and Ivan Sutherland (as in the graphics company) have the methodology down. I attended a short course they taught on their ideas last year, and while the stuff is messy it does seem to work. Their ideas were written up in Electronics last year.

Cheers.

Malcolm
grosen@amadeus.ucsb.edu (Mark D. Grosen) (02/22/88)
People are working on asynchronous processors at UC Berkeley. We had Teresa Meng visit our department a couple of weeks ago. She (and others) have developed a methodology for designing asynchronous processors that eliminates races and hazards using handshaking. Most of her work was aimed at DSP processors. She reported a 2x speedup of the TMS32010 using her async design instead of the original clocked scheme. She should be finishing her PhD dissertation soon, so that would be a reference to start with.

Mark

Mark D. Grosen                  ARPA: grosen%filter@hub.ucsb.edu
Signal Processing Lab
ECE Dept.
University of California
Santa Barbara, CA 93106
grunwald@uiucdcsm.cs.uiuc.edu (02/23/88)
There is a recent CalTech CS tech report on synthesis of self-timed circuits. You might send them a letter & ask for it.