preston@titan.rice.edu (Preston Briggs) (10/09/90)
In article <kb4Rxwa00VpI1JbUxl@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:

>can anyone say anything about architectures for speculative execution?
>I think the Microprocessor Report mentioned that a new sparc
>implementation will do it. isn't the 88000 (over-)due for a major
>overhaul, too?

Well sure.  Compilers can schedule instructions to be executed
speculatively (or optimistically) on all sorts of machines.  Wide
instruction words offer opportunities, as do long pipelines.

You could probably do it in hardware too (that is, go shooting off down
some code path before knowing a condition), but I expect compilers can
do a better job.  Think in terms of trace scheduling.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) (10/10/90)
In article <1990Oct9.162639.23516@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>You could probably do it in hardware too (that is, go shooting off
>down some code path before knowing a condition), but I expect
>compilers can do a better job.  Think in terms of trace scheduling.
At the Hot Chips forum (and at M/S too, from the preprogram notes) LSI
gave some details of its "Lightning" chip.  By having hw support one can
execute instructions before knowing if they are legal.  The MF/trace
machine lost big time when they predicted wrong .... this limits the
length of speculative traces, and the applications for which you get
speedup.
Compilers can't do everything ;>
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
preston@titan.rice.edu (Preston Briggs) (10/10/90)
In article <KHB.90Oct9105243@chiba.Eng.Sun.COM> khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) writes:

>By having hw support one can
>execute instructions before knowing if they are legal.  The MF/trace
>machine lost big time when they predicted wrong .... this limits the
>length of speculative traces, and the applications for which you get
>speedup.
>
>Compilers can't do everything ;>

Sure they can :-)

And if we could build these good compilers, think how cheaply we could
build the hardware.  In particular, we can arrange all sorts of
optimistic execution.  Imagine some hypothetical machine that can issue
one FLOAT and one INT instruction per cycle.  If we've got code like
this

	int-1
	int-2
	int-3
	int-4
	if (something) {
	    float-1
	    float-2
	    float-3
	    store result
	}
	float-4
	int-5
	int-6
	...

we could rearrange it (perhaps) into

	int-1, float-1
	int-2, float-2
	int-3, float-3
	int-4, float-4
	if (something) {
	    store result
	}
	int-5
	int-6

Generally, we can hoist computations from one branch that set registers
that are dead on the "other" branch, or that store into a dead location
(on the other branch).  We need to be careful to avoid traps.  Note
that there's no cost if the branch isn't taken, provided we only hoist
operations to fill unused slots.  This (I believe) differs from the
trace scheduling philosophy, which is more aggressive.

In general, we need to be careful about fatally increasing register
pressure.  The i860's exposed pipeline provides an elegant way out,
allowing simple aborts of optimistic computations by ignoring what's
partially computed in the pipe.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/10/90)
In article <1990Oct9.162639.23516@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>
>You could probably do it in hardware too (that is, go shooting off
>down some code path before knowing a condition), but I expect
>compilers can do a better job.  Think in terms of trace scheduling.
>--
>Preston Briggs				looking for the great leap forward
>preston@titan.rice.edu

Hmm, you mean your compiler can schedule code so that five instructions
are issued to one function unit on every clock?

There are some occasions when multiple copies of h/w can do better than
a compiler.  I believe speculative execution is one.

Stanley Chow	BitNet:  schow@BNR.CA
BNR		UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831		 ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
preston@titan.rice.edu (Preston Briggs) (10/10/90)
I wrote:

>>You could probably do it in hardware too (that is, go shooting off
>>down some code path before knowing a condition), but I expect
>>compilers can do a better job.  Think in terms of trace scheduling.

In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:

>Hmm, you mean your compiler can schedule code so that five instructions
>are issued to one function unit on every clock?
>
>There are some occasions when multiple copies of h/w can do better than
>compiler. I believe speculative execution is one.

Given the same number of functional units, with the same latencies, the
same number of registers, ... I believe I can write a compiler that
will make code run faster than hardware that does speculative
execution.  My version of the hardware should also be cheaper, in that
it doesn't have to coordinate the multiple paths of speculation.

Why can I win?  I can look farther ahead at compilation than the
hardware can while running.

Is there some variation of hardware and software that's even quicker?
I dunno.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/10/90)
In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>Hmm, you mean your compiler can schedule code so that five instructions
>>are issued to one function unit on every clock?
>>There are some occasions when multiple copies of h/w can do better than
>>compiler. I believe speculative execution is one.
>
>Given the same number of functional units, with same latencies,
>same number of registers, ...
>I believe I can write a compiler that will make code run
>faster than hardware that does speculative execution.
>My version of the hardware should also be cheaper
>in that it doesn't have to coordinate the multiple paths of speculation.

I assume you mean that, given two machines, otherwise identical,
differing only in that M1 does hardware speculative execution and M2
does not, you can write a compiler that will make M2 run faster than M1.

This sounds wrong.  I do not doubt that you can speed up M2 by
speculative execution (with hoisting, etc.).  But surely the same
technology can be applied to M1 with the same result.  A priori, I
would expect the benefit to be the same for both machines.

On the other hand, there are many things that *need* H/W, e.g., pre-
fetching (and executing) multiple paths.  Surely these gains are
additive to those of the compiler.

(This is not to say that H/W speculative execution is easy or even
possible, and it certainly says nothing about the relative gains or the
relative difficulty of H/W vs. S/W.)

Stanley Chow	BitNet:  schow@BNR.CA
BNR		UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831		 ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
jkenton@pinocchio.encore.com (Jeff Kenton) (10/10/90)
From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs):
>
> In general, we need to be careful about fatally increasing
> register pressure.  The i860's exposed pipeline provides an
> elegant way out, allowing simple aborts of optimistic
> computations by ignoring what's partially computed in
> the pipe.
>

It would take a lot to convince me that the i860 is an elegant solution
to anything.  No one has produced a compiler which can take advantage
of the theoretically possible parallelism of the i860.  It's a very
fast chip for certain kinds of applications, but I wouldn't call it
elegant, or general purpose.

----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
-----  jeff kenton:  consulting at jkenton@pinocchio.encore.com     -----
-----		     always at (617) 894-4508			    -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
mshute@cs.man.ac.uk (Malcolm Shute) (10/10/90)
>In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>There are some occasions when multiple copies of h/w can do better than
>>compiler. I believe speculative execution is one.

In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>Given the same number of functional units, with same latencies,
>same number of registers, ...
>I believe I can write a compiler that will make code run
>faster than hardware that does speculative execution.

On the face of it, since we are all using Turing Machines with a nifty
user interface, we can all do the same thing.  If I can do 'X' in a lot
of hardware, then you can also do the same 'X' with a little bit of
hardware, but with a software simulation of my hardware for your
machine.

However, the hardware-ish (execution-time) solution has a major
advantage over the software-ish (compile-time) solution: it can make
all of its scheduling decisions dynamically, as the needs arise.  A
batch system can schedule its work load optimally before embarking upon
the run; but a real-time system can do almost as good a job on the fly,
and be far more flexible with it.

Speculative execution surely is just an extension of scheduling.  When
the scheduler has handled every job which it has been asked to
schedule, and finds that there are still processors lying idle, it
looks around for jobs which aren't yet scheduled but which might be
needed in the near future.  (It's a bit like thinking of a scheduler
working down from the jobs with really high positive priorities, all
the way down to those with low (zero) priority, and, finding that there
are still processing resources to spare, going on to schedule those
with the least negative of priorities!)

-- 
Malcolm SHUTE.  (The AM Mollusc:   v_@_ )        Disclaimer: all
preston@titan.rice.edu (Preston Briggs) (10/10/90)
I wrote

>> In general, we need to be careful about fatally increasing
>> register pressure.  The i860's exposed pipeline provides an
>> elegant way out, allowing simple aborts of optimistic
>> computations by ignoring what's partially computed in
>> the pipe.

and in article <12905@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:

>It would take a lot to convince me that the i860 is an elegant solution
>to anything. No one has produced a compiler which can take advantage of
>the theoretically possible parallelism of the i860. It's a very fast
>chip for certain kinds of applications, but I wouldn't call it elegant,
>or general purpose.

Lots of complaints here...

First, the exposed pipeline stuff.  If we've got an if-then that looks
like this

	int-1
	int-2
	int-3
	if (something) {
	    pfmul.ss f3,f4,f0
	    pfmul.ss f0,f0,f0
	    pfmul.ss f0,f0,f0
	    pfmul.ss f0,f0,f5
	}
	fst.l f5,somewhere

the idea is that if "something" is true, we multiply f3 and f4
together, putting the result in f5.  Then we store f5.  So we can't
optimistically perform the entire multiply before knowing the value of
"something", since f5 is live on the false branch.  We can, however,
hoist the initial pipeline stages (perhaps overlapping them with
earlier pipeline computations).

	int-1, pfmul.ss f3,f4,f0
	int-2, pfmul.ss f0,f0,f0
	int-3, pfmul.ss f0,f0,f0
	if (something) {
	    pfmul.ss f0,f0,f5
	}
	fst.l f5,somewhere

The true path gets much shorter.  No increase in the path length of the
false path.  And no extra register required.  It's perhaps a dirty
trick rather than elegant, but I try to describe my ideas glowingly and
reserve disparaging terms for other peoples' work.

The point, though, is that the exposed pipeline scheme requires fewer
registers, because result registers are not frozen at the beginning of
a pipelined sequence, but at the end.  Similarly, the source registers
become available immediately after they are used.  In the example
above, f3 and f4 are available immediately after the 1st instruction,
and f5 isn't required until the result pops out of the pipe.  Renaming
helps, but requires more hidden registers that might be used profitably
by the compiler for other work.

Regarding compilers, I believe The Portland Group and Ardent both have
compilers that will take advantage of the pipelined instructions.
Besides that, the i860 is a wonderful source of thesis topics.  The
i860 may not be your ideal chip, but it's chock full of ideas.  The
good and useful ones shouldn't be ignored.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
preston@titan.rice.edu (Preston Briggs) (10/11/90)
In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:

>I assume you mean given two machines, otherwise identical, differing
>only in that M1 does hardware speculative execution and M2 does not,
>you can write a compiler that will make M2 run faster than M1.

That's what I believe.

>This sounds wrong. I do not doubt that you can speedup M2 by speculative
>execution (with hoisting, etc). But surely the same technology can be
>applied to M1 with the same result. A priori, I would expect the benefit
>to be the same for both machines.

Well, I wasn't going to let M1 use my fabulous scheduling ideas.  It
had to be satisfied with hardware.  Further, M2 ought to have a higher
clock speed, since its hardware is simpler.

>On the other hand, there are many things that *need* H/W. E.g., pre-
>fetching (and executing) multiple paths. Surely these gains are additive
>to those of the compiler.

I feel like I haven't made myself clear.  I'm advocating a wide
instruction word so that I can, in a single instruction, specify enough
work to keep all the resources busy.  Ways of accomplishing this
include trace scheduling, global compaction, and software pipelining.

Speculative execution hardware notices that it has resources that
aren't fully utilized and tries to find work for them to do.  I suppose
the advantage of speculative hardware is that you can use a skinny
instruction word to get some of the same effect, non-deterministically.

So fancy hardware can get some parallelism without fancy software, and
fancy software can get some parallelism without fancy hardware.  Given
one, I don't think you need the other.  So is it cheaper to build the
compiler or the chip?  Don't forget to make them correct.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
chased@rbbb.Eng.Sun.COM (David Chase) (10/11/90)
In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>>Given the same number of functional units, with same latencies,
>>same number of registers, ...
>>I believe I can write a compiler that will make code run
>>faster than hardware that does speculative execution.
>I assume you mean given two machines, otherwise identical, differing
>only in that M1 does hardware speculative execution and M2 does not,
>you can write a compiler that will make M2 run faster than M1.
>This sounds wrong. I do not doubt that you can speedup M2 by speculative
>execution (with hoisting, etc). But surely the same technology can be
>applied to M1 with the same result. A priori, I would expect the benefit
>to be the same for both machines.

One point was not sufficiently emphasized in Preston's posting: "SAME
NUMBER OF REGISTERS".  A reasonable trick in hardware speculative
execution is to make up boatloads of secret registers to hold the
results of speculative computations.  Preston says, "expose those
registers to the compiler", which of course means "change your
architecture", which is typically not acceptable to a company with
customers committed to an existing architecture (be it VAX, MIPS,
SPARC, RS/6000, 370, 80386, whatever).

Preston is also assuming, probably correctly, that it requires less
hardware to expose the registers and let the compiler juggle them than
it does to hide the registers and manage the juggling on the fly in
hardware.  Less hardware means a number of things (possibly):

  a) higher yield
  b) more room for other stuff (cache, tlb, whatever)
  c) choice of a faster, less compact, or more power-hungry
     technology (e.g., GaAs or ECL)
  d) shorter critical path for clock cycle
  e) shorter pipeline.

Thus, if you sell enough chips to recover your investment in
high-powered compiler technology, then you win.  Big IF there, of
course, and you'll be selling a new architecture, and we haven't said
much about context switching or any other OS-related issues yet,
either.

David Chase
Sun Microsystems, Inc.
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/11/90)
In article <12905@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs):
>>
>> In general, we need to be careful about fatally increasing
>> register pressure. The i860's exposed pipeline provides an
>> elegant way out, allowing simple aborts of optimistic
>> computations by ignoring what's partially computed in
>> the pipe.
>>
>
>It would take a lot to convince me that the i860 is an elegant solution
>to anything. No one has produced a compiler which can take advantage of
>the theoretically possible parallelism of the i860. It's a very fast
>chip for certain kinds of applications, but I wouldn't call it elegant,
>or general purpose.

In the context of this discussion - speculative execution - exposing
the pipeline AND letting S/W control the flow of operands/results
through it makes sense.  As Preston Briggs points out, it is easy (on
the i860) to start execution; there is no cost in terms of register
usage or issuing instructions.  Other current micros will tie up the
result register and/or require a separate instruction issue slot.

Anyway, wasn't the i860 designed as a numeric coprocessor?  I think of
it as DSP-like, but with enough "general purpose" that it does not need
a "controller".

Stanley Chow	BitNet:  schow@BNR.CA
BNR		UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831		 ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/11/90)
In article <1990Oct10.170424.21489@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>I assume you mean given two machines, otherwise identical, differing
>>only in that M1 does hardware speculative execution and M2 does not,
>>you can write a compiler that will make M2 run faster than M1.
>
>That's what I believe.
>
>>This sounds wrong. I do not doubt that you can speedup M2 by speculative
>>execution (with hoisting, etc). But surely the same technology can be
>>applied to M1 with the same result. A priori, I would expect the benefit
>>to be the same for both machines.
>
>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>It had to be satisfied with hardware. Further, M2 ought to have a
>higher clock speed since its hardware is simpler.

Ah, but that is not very fair, is it?  If code scheduling works for
both M1 & M2, why restrict it to M2 only?

It sounds like we are starting to get into the philosophical aspects of
the problem - like the RISC/CISC wars of old (maybe even now? :-).
There are certainly many possible trade-offs and design points.  Some
will rely exclusively on H/W, others on S/W, and yet others on both.

It is not clear how the H/W and S/W efforts interact.  Do they go after
the same "parallelism", or do they exploit different types of
"parallelism"?  Which kinds of S/W speculative execution work well with
which kinds of H/W?  Do some H/W designs make the S/W impossible?

>I'm advocating a wide instruction word so that I can, in a single
>instruction, specify enough work to keep all the resources busy.
>Ways of accomplishing this include trace scheduling,
>global compaction, and software pipelining.
>
>Speculative execution hardware notices that it has resources
>that aren't fully utilized and tries to find work for them to do.

The difference is that the compiler has the "global" view and can do
"higher" level shuffling and optimization.  On the other hand, H/W has
the advantage that it actually "knows" the total instantaneous values
of all the registers.  As someone (Andy?) suggests, clever H/W could
resolve aliasing at run time.

>I suppose the advantage of speculative hardware is that you can use
>a skinny instruction word to get some of the same effect,
>non-deterministically.
>
>So fancy hardware can get some parallelism without fancy software,
>and fancy software can get some parallelism without fancy hardware.
>Given one, I don't think you need the other. So is it cheaper
>to build the compiler or the chip? Don't forget to make them
>correct.

It is not clear to me that the compiler would be able to extract all
the possible parallelism (even perfect aliasing analysis is not as good
as "instantaneous" alias detection).  Surely the H/W can contribute
some more parallelism by "knowing" all the current values!

Often, the actual time required is non-deterministic, e.g., the time to
fetch a word, or the time for a divide.  This looks like another area
where the H/W may augment the compiler.

Stanley Chow	BitNet:  schow@BNR.CA
BNR		UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831		 ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
preston@titan.rice.edu (Preston Briggs) (10/11/90)
In article <1990Oct10.164353.21070@rice.edu> I wrote, about the i860:

>Regarding compilers, I believe The Portland Group and Ardent both have
>compilers that will take advantage of the pipelined instructions.

Instead of Ardent, I should have said Alliant.  Ardent (or rather
Stardent) of course has a nice compiler, but it makes code for their
MIPS-based machine.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
gillies@m.cs.uiuc.edu (10/11/90)
> No one has produced a compiler which can take advantage of the
> theoretically possible parallelism of the i860. It's a very fast chip
> for certain kinds of applications, but I wouldn't call it elegant, or
> general purpose.

One of the problems in new CPU designs is that designers don't realize
which architecture enhancements are pointless, because we don't and may
never have the optimization technology to take advantage of them.  Many
enhancements to "increase parallelism" cannot be exploited by compilers
unless P = NP, or unless you are willing to wait {hours, days, weeks}
for your compiler to finish its job.  So don't just relegate EVERYTHING
to the compiler!

When you do something like add dedicated functional units to a CPU, you
are (potentially) increasing the complexity of the scheduling problem
from, say, triangle-inequality traveling salesman (for which a
heuristic with bounded performance exists) to full-generality traveling
salesman (for which no bounded heuristic exists unless P = NP).  I
believe you are much better off spending the extra real estate to
implement identical functional units, or to add vector hardware, since
the compiler technology for exploiting these features runs in
polynomial time.

Don W. Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801      ARPA: gillies@cs.uiuc.edu
UUCP: {uunet,harvard}!uiucdcs!gillies
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/11/90)
Software can look at the global structure of a program and make
decisions based on that.  Hardware can see what's happening at
execution time, which may include poor memory performance due to cycle
stealing, variable FPU performance for algorithms which are value
sensitive, and other things which can vary from run to run, or with
data.

If people would just drop the idea that either can solve all
performance problems, and start making hardware and software work
better together, then cost/performance would be improved with a lot
lower noise level.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.
philip@beeblebrox.dle.dg.com (Philip Gladstone) (10/11/90)
In article <1990Oct10.164353.21070@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

[stuff deleted]

> We can however, hoist the initial pipeline stages (perhaps overlapping
> them with earlier pipeline computations).
>
>	int-1, pfmul.ss f3,f4,f0    <--- These floats hoisted
>	int-2, pfmul.ss f0,f0,f0    <--- from the if
>	int-3, pfmul.ss f0,f0,f0
>	if (something) {
>	    pfmul.ss f0,f0,f5
>	}
>	fst.l f5,somewhere
>
> The true path gets much shorter.
> No increase in the path length of the false path.
> And no extra register required.

It seems to me that it is very difficult to spot when this is legal or
not.  It is assumed in this example that none of the hoisted
instructions will cause any problems if they are executed when the
programmer did not intend them to be executed.  Consider the case of
overflow!

I must confess to knowing little or nothing about the i860, but similar
examples exist for the 88100 (which I do understand).  Consider the
following code fragment:

	if (i > 0) {
	    volatile_global = b / i;
	}

On the 88000 at least, divides are very slow (36 cycles), and there is
a great temptation to move them back up the instruction stream so as to
get greater parallelism.  In this case, the divide could blow up (if i
was equal to 0) when the source would not have blown up.

This whole problem arises from the fact that most instructions have an
implicit conditional branch (interrupt) as part of their specification.
You can only (reliably) hoist instructions that cannot trap -- unless
you can perform major flow analysis to check that the trap conditions
never arise (such as adding two shorts in long registers with an
overflow-causing instruction).

Removing the implicit trapping action of these instructions would be a
way out of this problem, but you would have to introduce some sort of
'trap on previous overflow' instruction.  Even then you would have to
take great care that you only detected the right overflows.
Enough of this ---
-- 
Philip Gladstone			philip@dle.dg.com
Development Lab Europe			C=gb/AD=gold 400/PR=dgc/O=dle
Data General, Cambridge			/SN=gladstone/GN=philip
England.  +44 223-67600
dik@cwi.nl (Dik T. Winter) (10/11/90)
In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
> In article <1990Oct10.164353.21070@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
> [stuff deleted]
[more stuff deleted]
> >	int-1, pfmul.ss f3,f4,f0    <--- These floats hoisted
> >	int-2, pfmul.ss f0,f0,f0    <--- from the if
> >	int-3, pfmul.ss f0,f0,f0
> >	if (something) {
> >	    pfmul.ss f0,f0,f5
> >	}
> >	fst.l f5,somewhere
> > The true path gets much shorter.
> > No increase in the path length of the false path.
> > And no extra register required.
> It seems to me that it is very difficult to spot when this is legal or not.
> It is assumed in this example that none of the hoisted instructions
> will cause any problems if they are executed when the programmer did
> not intend them to be executed.

This raises a point, although for different reasons.  I will start with
some explanations: on the i860, if (for instance) a floating-point
instruction overflows, a status bit is set, and if this status bit is
set, the next floating-point instruction will trap.  If we assume the
pipeline was empty when this sequence started, the hoisted instructions
will not set status bits and do not deliver a result.  It is only the
last (conditional) instruction that can set a status bit and deliver a
result; but that instruction does not take any source operands.  So in
this aspect the example is valid.

There is another aspect: after the three pfmul instructions, the
pipeline is filled in an indeterminate state if the false path is to be
taken.  There might be an overflow status or somesuch pending in the
last stage, and that must be cleared.  The only ways I see to clear
this are either to perform a non-pipelined multiply after such a
sequence (which voids the pipeline), or else you need, in the false
path, an instruction to drain the pipeline (pfmul.ss f0,f0,f0) and an
instruction to clear status bits in the fsr.
As far as I see, in both cases you will lose in some cases what you
gain in others.

And yes, the Alliant compiler generates pipelined instructions and
dual-instruction-mode instructions.

-- 
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
chased@rbbb.Eng.Sun.COM (David Chase) (10/12/90)
> = bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow)
>> = preston@titan.rice.edu (Preston Briggs)

>>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>>It had to be satisfied with hardware. Further, M2 ought to have a
>>higher clock speed since its hardware is simpler.

>Ah, but that is not very fair, is it? If code scheduling works for both
>M1 & M2, why restrict it to M2 only?

Agreed, it is unfair to run the code scheduling only on M2.  However, I
think the claim is that scheduled code won't run any faster on M1 than
unscheduled code -- that's what you added all that hardware for, right?

You can think of M1 as a machine that looks far ahead for anything that
it can execute (and fires off stalled instructions in a dataflow
fashion), whereas M2 is a machine that will only suck up instructions
for which the operands are valid and the functional unit is idle (i.e.,
the first stall stops all subsequent execution), or (Stanford-MIPS
style) doesn't even stall -- it just relies on the compiler to insert
NOPs.

"Clearly" the hardware for M2 is simpler, but it will run just as
quickly if the compiler does the lookahead and reorders the
instructions into the order that M1 would have executed them.  Run that
same code through M1, and "in theory" it executes the instructions in
the same way as it did before they were reordered -- no speedup, but
lots more chip real estate, and a possibly slower cycle time.

Now, M1 does win in a couple of ways, including one very important one
that I forgot to mention:

1) M1 can look like a chip that people already use (customers are a
   minor detail).

2) M1 can take advantage of early instruction completion (e.g.,
   division, or a first-level-cache hit on load) instead of assuming
   the worst case.

3) M1 can do (dynamic) prefetch prediction in some limited way, and
   thus only needs to suck up instructions from *one* potential
   successor, instead of all of them.
Counter-proposals to (3) for M2 include profiling feedback to the
compiler, and the possibility (I haven't worked out the details) of
exposing the branch history as a value for the compiler to use.  I
suspect this involves a wee bit of code expansion, or the use of
"conditional execution" bits on every instruction in the style of the
Acorn Risc Machine.  Use your history of the impending branch to
determine which instructions to execute on your way to it -- if you
guessed wrong, then you use the history bits to figure out which fix-up
code to execute (actually, you could have several different conditional
branches, themselves conditional on the history bits, so that you'd
choose the right one with respect to fixup code).

David Chase
Sun Microsystems
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/12/90)
In article <1990Oct10.212713.28260@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>>Regarding compilers, I believe The Portland Group and Ardent both have
>>compilers that will take advantage of the pipelined instructions.
>Instead of Ardent, I should have said Alliant.

I've been waiting for someone else to rain on this parade.  Since no
one has, I will point out that Alliant's compiler was _not_ shipped on
schedule.  In general, i860 compilers so far have justified the
pessimists more than the optimists.

It would be interesting to find out _which_ of the i860's features
causes the most compiler difficulty.  The T, KI, KR registers?  The
non-self-draining pipeline?  Scheduling?  IEEE issues?

-- 
Don		D.C.Lindsay
preston@titan.rice.edu (Preston Briggs) (10/12/90)
I suggested the trickiness of hoisting pipeline-filling instructions above a conditional. The result looked like this:

>>> int-1, pfmul.ss f3,f4,f0   <--- These floats hoisted from
>>> int-2, pfmul.ss f0,f0,f0   <--- from the if
>>> int-3, pfmul.ss f0,f0,f0
>>> if (something) {
>>>     pfmul.ss f0,f0,f5
>>> }
>>> fst.l f5,somewhere
>>>
>>> The true path gets much shorter.
>>> No increase in the path length of the false path.
>>> And no extra register required.

>philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
>> It seems to me that it is very difficult to spot when this is legal or not.
>> It is assumed in this example that none of the hoisted instructions
>> will cause any problems if they are executed when the programmer did
>> not intend them to be executed.

But we *can* check for legality. Generally, I must not overwrite any registers or memory locations that are live on the other path. Further, I should trigger no traps. This means being careful about loads, etc...

and dik@cwi.nl (Dik T. Winter) writes:
[says that I'm reasonably ok on result traps]

>There is another aspect: after
>the three pfmul instructions the pipeline is filled in an indeterminate
>state if the false path is to be taken. There might be an overflow status
>or somesuch pending in the last stage, and that must be cleared.

The pipe would have to be drained, but we must always drain the pipe anyway. So, if at some later point in the program, I perform a multiply, the first 3 pfmul's must direct their (unspecified) values into f0. But we always have to do this anyway. And, as you point out, a scalar (non-pipelined) multiply will overwrite the pipe.

So, if we'd had an if-then-else instead, we might get this result.
int-1, pfmul.ss f3,f4,f0   <--- These floats hoisted from
int-2, pfmul.ss f0,f0,f0   <--- from the then clause
int-3, pfmul.ss f0,f0,f0
if (something) {
    pfmul.ss f0,f0,f5
}
else {
    pfmul.ss f7,f8,f0   -- draining the pipe
    pfmul.ss f0,f0,f0
    pfmul.ss f0,f0,f0
    pfmul.ss f0,f0,f5
}
fst.l f5,somewhere

The point is that the else clause is unchanged. Actually, we could build this form:

int-1, pfmul.ss f3,f4,f0   <--- These floats hoisted from
int-2, pfmul.ss f7,f8,f0   <--- from the then clause
int-3, pfmul.ss f0,f0,f0
if (something) {
    pfmul.ss f0,f0,f5
}
else {
    pfmul.ss f0,f0,f0
    pfmul.ss f0,f0,f5
}
fst.l f5,somewhere

Note also that the scalar multiply doesn't "clear the pipe"; it just overwrites it with new junk. We need to be careful to drain this new junk to avoid traps.

Recently I tried to use the "r2pt.ss f10,f0,f0" instruction to set the KR register to f10. Unfortunately, this may provoke a source exception trap from the adder if the multiply pipeline is not in some nice state.

--
Preston Briggs                          looking for the great leap forward
preston@titan.rice.edu
mash@mips.COM (John Mashey) (10/12/90)
In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
...
>It is assumed in this example that none of the hoisted instructions
>will cause any problems if they are executed when the programmer did
>not intend them to be executed. Consider the case of overflow! I must
>This whole problem arises from the fact that most instructions have an
>implicit conditional branch (interrupt) as part of their
>specification. You can only (reliably) hoist instructions that cannot
>trap -- unless you can perform major flow analysis to check that the
>trap conditions never arise (such as adding two shorts in long
>registers with an overflow causing instruction).

Yes, and there are many subtle cases. Mike O'Dell talked about some in his talks about the Prisma machine at the last USENIX, i.e., about the exact state presented to the user upon a signal.

As another example, consider the effects of hoisting a load instruction, which is something you'd like to do in a machine with out-of-order characteristics, especially if it has multiple memory paths. Here are a few cases of interest:

a) The load is an uncached load to a volatile device register. Hoisting is of course a no-no, as this will drive OS programmers nuts, and in fact, must be carefully synchronized with surrounding operations, or else. So, of course, you don't hoist these things, or allow the hardware to do any hoisting for you via speculation. For instance, in the code that looks like:

	if (flag)
		x = p->device1;
	else
		y = p->device2;

Any hardware that speculatively issues EITHER of the assignments before being sure which one, may well have trouble if the read actually gets to the memory system. In fact, I think this means that any uncached load stops any speculative path, because you dare not issue it for the side-effect.

b) Clearly stores can't be speculatively issued.
Even cached loads may require care, any time a load can cause a fault, such as an MMU miss, or an out-of-range access (like to location 0 on many machines.)

A good example to always consider, evil though it may be, is the Bourne shell's expectation of fixing up a memory fault and returning to the instruction that did it... I believe this can be handled, but if there is any case that isn't, somebody will get bitten.

--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
dik@cwi.nl (Dik T. Winter) (10/12/90)
I write:
>There is another aspect: after
>the three pfmul instructions the pipeline is filled in an indeterminate
>state if the false path is to be taken. There might be an overflow status
>or somesuch pending in the last stage, and that must be cleared.

Preston Briggs responds:
> The pipe would have to be drained, but we must always drain the
> pipe anyway. So, if at some later point in the program,
> I perform a multiply, the first 3 pfmul's must direct
> their (unspecified) values into f0. But we always have to
> do this anyway. And, as you point out, a scalar (non-pipelined)
> multiply will overwrite the pipe.

As far as I understand the processor this is still wrong:

> So, if we'd had an if-then-else instead, we might get this result.
> int-1, pfmul.ss f3,f4,f0   <--- These floats hoisted from
> int-2, pfmul.ss f0,f0,f0   <--- from the then clause
> int-3, pfmul.ss f0,f0,f0
> if (something) {
>     pfmul.ss f0,f0,f5
> }
> else {
>     pfmul.ss f7,f8,f0   -- draining the pipe

Note the above instruction may generate a result trap because the result of the multiplication of f3 and f4 is stored in this stage! (Nowhere in the documentation do I find that result traps are discarded if the result is stored in f0. Documentation even carefully tells us that 'f0 always reads as 0 whatever is stored in it.')

> Note also that the scalar multiply doesn't "clear the pipe"; it just
> overwrites it with new junk. We need to be careful to drain this
> new junk to avoid traps.

Yes, we have to drain this to avoid source traps but not to avoid result traps: 'After a scalar operation, the values of all pipeline stages of the affected unit are undefined. No spurious result-exception traps result when the undefined values are subsequently stored by pipelined operations; however, the values should not be referenced as source operands.'

> Recently I tried to use the "r2pt.ss f10,f0,f0"
> instruction to set the KR register to f10.
Unfortunately, this may > provoke a source exception trap from the adder if the multiply > pipeline is not in some nice state. Yes, source exceptions occur when you do use undefined results. So a scalar multiply does indeed not totally clear the pipe, but clears all status information. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
preston@titan.rice.edu (Preston Briggs) (10/12/90)
In article <2326@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>As far as I understand the processor this is still wrong:
>
> > So, if we'd had an if-then-else instead, we might get this result.
> > int-1, pfmul.ss f3,f4,f0   <--- These floats hoisted from
> > int-2, pfmul.ss f0,f0,f0   <--- from the then clause
> > int-3, pfmul.ss f0,f0,f0
> > if (something) {
> >     pfmul.ss f0,f0,f5
> > }
> > else {
> >     pfmul.ss f7,f8,f0   -- draining the pipe
>
>Note the above instruction may generate a result trap because the result of the
>multiplication of f3 and f4 is stored in this stage! (Nowhere in the
>documentation do I find that result traps are discarded if the result is
>stored in f0.

It looks like you're right. Silly me. (That's what I say instead of typing a long string of cuss words). If somebody at Intel knows differently, I'd appreciate hearing about it!

I guess I must therefore advocate interpreting stores to f0 as "discard everything". Any reason not to? (Besides the cost of changing the chip, documentation, ...)

--
Preston Briggs                          looking for the great leap forward
preston@titan.rice.edu
ckp@grebyn.com (Checkpoint Technologies) (10/12/90)
In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
>This whole problem arises from the fact that most instructions have an
>implicit conditional branch (interrupt) as part of their
>specification. You can only (reliably) hoist instructions that cannot
>trap -- unless you can perform major flow analysis to check that the
>trap conditions never arise (such as adding two shorts in long
>registers with an overflow causing instruction).
>
>Removing the implicit trapping action of these instructions would be a
>way out of this problem, but you would have to introduce some sort of
>'trap on previous overflow' instruction. Even then you would have to
>take great care that you only detected the right overflows.

#pragma MY_TWO_CENTS

Forgive my insolence - I'm not an architecture designer, be gentle with me, but this just occurred to me..

Why not include condition code bits with each register? Store these when the register value is stored, when it pops out the end of a pipeline. Include a flag for NaN (not a new idea, right?), and store this for things like divide-by-0. Then you can trap when the register is used as a source operand, and you can code explicit tests (and maybe traps) when you want to, on operations that could very well be taking place simultaneously.

--
First comes the logo: C H E C K P O I N T  T E C H N O L O G I E S   / /  \\ / /
Then, the disclaimer: All expressed opinions are, indeed, opinions.  \ /   o
Now for the witty part: I'm pink, therefore, I'm spam!                \/
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/12/90)
In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>As another example, consider the effects of hoisting a load
>instruction, which is something you'd like to do in a machine
>with out-of-order characteristics, especially if it has multiple
>memory paths. Here are a few cases of interest:
>
> a) The load is an uncached load to a volatile device register.
>    Hoisting is of course a no-no, as this will drive OS programmers
>    nuts, and in fact, must be carefully synchronized with
>    surrounding operations, or else.

This should not be particularly hard, just use "volatile" in C. (I know, the zillions of lines of existing code may break, but as some people say - those programs are brain-damaged :-).

It is also possible for (very clever) hardware to handle it with some sort of MMU attribute per page. Several existing processors already have this. Speculative loads could proceed blindly until one hits one of these stop bits. The affected load would be put "on-hold" pending a "really-do-it" or a "forget-it". (So it's hard, but we gotta use up those millions of transistors :-)

> So, of course, you don't hoist these things, or allow the hardware
> to do any hoisting for you via speculation. For instance, in
> the code that looks like:
>	if (flag)
>		x = p->device1;
>	else
>		y = p->device2;
> Any hardware that speculatively issues EITHER of the assignments
> before being sure which one, may well have trouble if the
> read actually gets to the memory system. In fact, I think this
> means that any uncached load stops any speculative path,
                                ^^^
> because you dare not issue it for the side-effect.

This is not so. An uncached load (at worst) stops *its* speculative path. In other words, paths without uncached loads can still proceed. E.g.,

	if (flag)
		x = p->device1;
	else
		x = somelocal;

The fetch of "somelocal" can happen before the "if".
In general, after figuring out side effects and aliasing, the only restriction should be that volatile accesses cannot "pass" each other. Also, as suggested in the paragraphs above, with sufficiently clever hardware, one can still "conservatively" speculatively fetch volatile addresses. (Hmm, is that an oxymoron?) > > b) Clearly stores can't be speculatively issued. > Even cached loads may take care, any time a load can cause > a fault, such as an MMU miss, or an out-of-range access > (like to location 0 on many machines.) [..] Again, issue the store early to get through the address translation and protection checking, then wait for the "really-do-it". Stanley Chow BitNet: schow@BNR.CA BNR UUCP: ..!uunet!bnrgate!bcarh185!schow (613) 763-2831 ..!psuvax1!BNR.CA.bitnet!schow Me? Represent other people? Don't make them laugh so hard.
spot@TR5.GP.CS.CMU.EDU (Scott Draves) (10/13/90)
In article <3445@bnr-rsc.UUCP>, schow@bcarh185.bnr.ca (Stanley T.H. Chow) writes: |> In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes: |> > |> > a) The load is an uncached load to a volatile device register. |> > Hoisting is of course a no-no, as this will drive OS programmers |> > nuts, and in fact, must be carefully synchronized with |> > surrounding operations, or else. |> |> This should not be particularly hard, just use "volatile" in C. (I know, |> the zillions of lines of existing code may break, but as some people say - |> those programs are brain-damaged :-). |> that looks like a pretty good argument that this should be done by the compiler rather than the hardware. the same goes for the exception stuff that other people have mentioned; a good compiler can determine when exceptions are impossible and proceed to speculate. Consume Scott Draves Be Silent spot@cs.cmu.edu Die
brett@cayman.amd.com (Brett Stewart) (10/13/90)
>From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs): > > In general, we need to be careful about fatally increasing > register pressure. The i860's exposed pipeline provides an > elegant way out, allowing simple aborts of optimistic > computations by ignoring what's partially computed in > the pipe. > One of the advantages of variable register-stack cache window frames of the 29000 family is the instant availability of relief of register pressure. Michael Tiemann, of Cygnus Support, also has some very interesting ideas about compiler techniques for superscalar architectures that would allow the possibly large numbers of registers of a procedure in the 29K to be used for optimistic computation in the gnu c or other Fraser-Davidson UVa type compiler schemes. Comments, Michael? Best Regards; Brett Stewart Advanced Micro Devices, Inc. 1-512-462-5321 FAX 5900 E. Ben White Blvd MS561 1-512-462-4336 Telephone Austin, Texas 78741 USA brett@cayman.amd.com
dik@cwi.nl (Dik T. Winter) (10/13/90)
In article <22587@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
> Why not include condition code bits with each register? Store these when
> the register value is stored, when it pops out the end of a pipeline.
> Include a flag for NAN (not a new idea, right?), and store this for
> things like divide-by-0. Then you can trap when the register is used as
> a source operand, and you can code explicit tests (and maybe traps) when
> you want to, on operations that could very well be taking place
> simultaneously.

Something similar is done in all machines designed by Seymour Cray. There is no instruction that generates a result trap; only source traps are generated. There are advantages and disadvantages.

Advantages:

1. It is possible to initialize all variables to a trapping value. This will very quickly eliminate the use of uninitialized values. (CDC users will know this: LDSET(PRESET=UNDEF), although this would not work with integer variables.)

2. It makes it possible for the compiler to generate spurious results. I.e., if the array B has never been initialized, the following loop:

	for I from 1 to N do
		if A[I] != 0.0 then B[I] = 1.0/A[I]

can be modified by omitting the test without adverse effects. This can be beneficial on vector processors. In fact the Fortran compiler on the NEC SX/2 generates spurious results in some cases; alas, they trigger a result trap.

Disadvantage:

1. Debugging is very problematical. I know, I have debugged a number of programs on the CDC Cyber. It is very difficult to see where a bad result originates. This is even worse if (as on the Cyber) there are multiple functional units, so that it is even problematical which unit generated the trap.

--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
mash@mips.COM (John Mashey) (10/14/90)
In article <3445@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>As another example, consider the effects of hoisting a load
>>instruction, which is something you'd like to do in a machine
>>with out-of-order characteristics, especially if it has multiple
>>memory paths. Here are a few cases of interest:
>>
>> a) The load is an uncached load to a volatile device register.
>>    Hoisting is of course a no-no, as this will drive OS programmers
>>    nuts, and in fact, must be carefully synchronized with
>>    surrounding operations, or else.
>
>This should not be particularly hard, just use "volatile" in C. (I know,
>the zillions of lines of existing code may break, but as some people say -
>those programs are brain-damaged :-).
>It is also possible for (very clever) hardware to handle it with some
>sort of MMU attribute per page. Several existing processors already have
>this. Speculative loads could proceed blindly until it hits one of these
>stop bits. The affected load would be put "on-hold" pending a "really-do-it"
>or a "forget-it". (So it's hard, but we gotta use up those millions of
>transistors :-)

I think we have some terminology confusion here, at least partly my fault for use of the term "issue". I've observed that people often note that one cannot speculatively issue stores, but can so issue loads. I observed that one cannot in general "issue" loads, if you believe that "issue" actually means: run it through the cache/MMU system, and fetch from memory. I agree with the style that says: run it to the point where you realize it's unsafe, then hold it until you're sure.

>>	if (flag)
>>		x = p->device1;
>>	else
>>		y = p->device2;
>> Any hardware that speculatively issues EITHER of the assignments
>> before being sure which one, may well have trouble if the
>> read actually gets to the memory system. In fact, I think this
>> means that any uncached load stops any speculative path,
                                 ^^^
>> because you dare not issue it for the side-effect.
>This is not so.

Sorry, wording again; we agree: I meant what you said ("its path"), and would have said "all paths" otherwise; "any" was imprecise.

>An uncached load (at worse) stops *its* speculative path. In other words,
>paths without uncached loads can still proceed. E.g.,
>The fetch of "somelocal" can happen before the "if". In general, after
>figuring out side effects and aliasing, the only restriction should be
>that volatile accesses cannot "pass" each other.
>
>Also, as suggested in the paragraphs above, with sufficiently clever
>hardware, one can still "conservatively" speculatively fetch volatile
>addresses. (Hmm, is that an oxymoron?)
>>
>> b) Clearly stores can't be speculatively issued.
>>    Even cached loads may take care, any time a load can cause
>>    a fault, such as an MMU miss, or an out-of-range access
>>    (like to location 0 on many machines.) [..]

Again, I don't think we disagree. The only real point was that there were loads, and then there were loads, and you must have some way to distinguish the ones that can be safely completed, versus the ones that cause catastrophe if they get to memory. In particular, you must especially be careful if doing a speculative-issue machine supposed to be binary-compatible at kernel level with existing machines. Such may, or may not be, a goal.

thanx for the clarification; some of my language was sloppy.

--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (10/15/90)
In article <22587@grebyn> ckp@grebyn.com says >#pragma MY_TWO_CENTS >Forgive my insolence - I'm not an architecture designer, be gentle >with me, but this just occured to me.. >Why not include condition code bits with each register? Store these when >the register value is stored, when it pops out the end of a pipeline. >Include a flag for NAN (not a new idea, right?), and store this for >things like divide-by-0. Then you can trap when the register is used as >a source operand, and you can code explicit tests (and maybe traps) when >you want to, on operations that could very well be taking place >simultaneously. >-- >First comes the logo: C H E C K P O I N T T E C H N O L O G I E S / / \\ / / >Then, the disclaimer: All expressed opinions are, indeed, opinions. \ / o >Now for the witty part: I'm pink, therefore, I'm spam! \/ Please NO. Save me from deferred traps. The CDC 6600 and some others could produce several FP results that were illegal values FOR ANY OTHER OPERATIONS. So exception traps took place not when/where you produced the result but later when you tried to use it. Then the game was to guess where you produced this value. Well, I finally got used to coding explicit tests (usually not needed) after every potentially violative FP operation (that might produce +- infinity or +- indefinite). This is highly wasteful of code space for something that is most often rare - I can always check in cases where I need to know but please give me the trap when I create the bad result. PRECISELY. -- Dick Wilmot | I declaim that Amdahl might disclaim any of my claims. (408) 746-6108
jkenton@pinocchio.encore.com (Jeff Kenton) (10/17/90)
From article <cf2f02Ea02o001@JUTS.ccc.amdahl.com>, by rbw00@ccc.amdahl.com ( 213 Richard Wilmot): > > Please NO. Save me from deferred traps. The CDC 6600 and some others > could produce several FP results that were illegal values FOR ANY OTHER > OPERATIONS. So exception traps took place not when/where you produced the > result but later when you tried to use it. Then the game was to guess where > you produced this value. > What you are describing is fairly close to the idea of NaN (Not a Number) required by the IEEE spec for floating point calculations. The operation which produces a NaN ( 0/0, infinity - infinity, inf / inf ) *will* trap when it occurs, but the NaNs will trap when they are used (if the trap is enabled). ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ----- ----- jeff kenton: consulting at jkenton@pinocchio.encore.com ----- ----- always at (617) 894-4508 ----- ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/18/90)
In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com ( 213 Richard Wilmot) writes: | Please NO. Save me from deferred traps. The CDC 6600 and some others | could produce several FP results that were illegal values FOR ANY OTHER | OPERATIONS. So exception traps took place not when/where you produced the | result but later when you tried to use it. Then the game was to guess where | you produced this value. It depends on how the system is designed. If I were designing a CPU (and I don't), I would allow parallelism by allowing the FPU to have a delay mode in which an instruction, let's call it DTRAP, sets a flag such that if a trap would occur, the FPU enters a wait state until the CPU issues the sync instruction, say WAIT, at which point the trap occurs. This assumes that either you don't care about pinpointing the problem, or you have some hook in the FPU to report it in a meaningful way. Obviously you want an ITRAP for immediate trapping in other cases. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) VMS is a text-only adventure game. If you win you can use unix.
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/18/90)
In article <2772@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com ( 213 Richard Wilmot) writes:
>
>| Please NO. Save me from deferred traps. The CDC 6600 and some others
>| could produce several FP results that were illegal values FOR ANY OTHER
>| OPERATIONS. So exception traps took place not when/where you produced the
>| result but later when you tried to use it. Then the game was to guess where
>| you produced this value.
>
> It depends on how the system is designed. If I were designing a CPU
>(and I don't), I would allow parallelism by allowing the FPU to have a
>delay mode in which an instruction, let's call it DTRAP, sets a flag
>such that if a trap would occur, the FPU enters a wait state until the
>CPU issues the sync instruction, say WAIT, at which point the trap
>occurs.
>
> This assumes that either you don't care about pinpointing the problem,
>or you have some hook in the FPU to report it in a meaningful way.
>Obviously you want an ITRAP for immediate trapping in other cases.

There is another way. Assuming one wants to allow speculative execution and doesn't want traps gumming up the works, one could have traps happen at *assignment*. This means each stage of the pipe (say, the FPU's) could generate a "Trap Flag". This flag would be propagated through each stage, finally causing a trap when you try to catch the value from the exit end of the pipe.

Possible refinements include not trapping if catching into r0 (or other "non-real" registers); making the trap flag include the trap reason and/or the stage that caused the trap.

It seems to me this solves the problem (of traps gumming up speculative execution) and is more debuggable than many architectures.

Stanley Chow                BitNet:  schow@BNR.CA
BNR                         UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831                       ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.
aglew@crhc.uiuc.edu (Andy Glew) (10/19/90)
>| Please NO. Save me from deferred traps. The CDC 6600 and some others
>| could produce several FP results that were illegal values FOR ANY OTHER
>| OPERATIONS. So exception traps took place not when/where you produced the
>| result but later when you tried to use it. Then the game was to guess where
>| you produced this value.

Why not split the expensive, possibly trap-producing instruction into two parts: (1) an expensive, deferred-trap-producing instruction that is percolated upwards and speculatively executed; and (2) an inexpensive instruction that takes the deferred trap and actually produces an immediate trap, and is not subjected to code motion?

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
cik@l.cc.purdue.edu (Herman Rubin) (10/19/90)
In article <12976@encore.Encore.COM>, jkenton@pinocchio.encore.com (Jeff Kenton) writes: > From article <cf2f02Ea02o001@JUTS.ccc.amdahl.com>, by rbw00@ccc.amdahl.com ( 213 Richard Wilmot): > > Please NO. Save me from deferred traps. The CDC 6600 and some others > > could produce several FP results that were illegal values FOR ANY OTHER > > OPERATIONS. So exception traps took place not when/where you produced the > > result but later when you tried to use it. Then the game was to guess where > > you produced this value. > What you are describing is fairly close to the idea of NaN (Not a Number) > required by the IEEE spec for floating point calculations. The operation > which produces a NaN ( 0/0, infinity - infinity, inf / inf ) *will* trap > when it occurs, but the NaNs will trap when they are used (if the trap > is enabled). The 6600 was really RISCy, and had no provision whatever for optional traps. These conditions were testable, however, and many algorithms did such tests. It was also hoped that the increased exponent range would decrease the occurrence. These results were only illegal in FP operations. One way this trap was used as a feature was to initialize memory to negative indefinite (not produced by any FP operation) to detect when it was read before being written. BTW, these traps could be turned off globally. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik(UUCP)
cik@l.cc.purdue.edu (Herman Rubin) (10/23/90)
In article <3483@bnr-rsc.UUCP>, schow@bcarh185.bnr.ca (Stanley T.H. Chow) writes:
> In article <2772@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com ( 213 Richard Wilmot) writes:
> >
> >| Please NO. Save me from deferred traps. The CDC 6600 and some others
> >| could produce several FP results that were illegal values FOR ANY OTHER
> >| OPERATIONS. So exception traps took place not when/where you produced the
> >| result but later when you tried to use it. Then the game was to guess where
> >| you produced this value.
....................
> There is another way. Assuming one wanted to allow speculative execution
> and don't want traps gumming up the works, one could have traps happen
> at *assignment*. This means each stage of the (say the FPU) pipe could
> generate a "Trap Flag". This flag would be propagated through each stage
> and finally causing a trap when you try to catch value from the exit end
> of the pipe.
>
> Possible refinements include not trapping if catching into r0 (or other
> "non-real" registers); making the trap flag include the trap reason and
> or stage that caused the trap.
>
> It seems to me this solves the problem (of traps gumming up speculative
> execution) and is more debuggable than many architectures.

We need more, not less, user control. The reasonable programmer knows what the program is doing, what odd things should occur, and what apparently odd things should be ignored. Not being omniscient, the programmer will miss a few. The "optimal" thing to do would be to have these handled efficiently in accordance with the available information. To do this efficiently would require the appropriate hardware for traceback, and a rarely used part of the instruction to handle the enabling and disabling of traps, what to do if they occur, etc.
Many architectural problems and considerations are involved here, but every simple solution proposed runs into problems. It is quite possible, even, that some traps can be anticipated and thereby avoided. I have not seen any such architectural features. One example of this is buffer read and call a refill procedure when empty. The reads are sporadic. Now an anticipatory procedure would issue a weak interrupt when the last item is read, but a strong interrupt if an attempt is made to read when the buffer is empty. Similar things can occur with arithmetic and logical conditions. Attempts to oversimplify the problem are not likely to be optimal. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik(UUCP)
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (10/24/90)
Well, first off let me say that speculative execution consists of two components: getting the instructions started and making sure the program semantics are preserved. Only a static mechanism, e.g., a compiler, can look arbitrarily far ahead for instructions to start; hence, compiler technology beats hardware on this by a large margin. However, preserving program semantics can be a dynamic problem: for example, suppose that you get a floating-point exception in one of a bunch of speculatively-executed instructions... you really need a dynamic mechanism, e.g. hardware, to deal with this -- as in the Multiflow machines. Hence, the basic problem is static (compilers), but some secondary problems could use dynamic (hardware) support. In article <3300194@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes: >Many enhancements to "increase parallelism" cannot be exploited by >compilers unless P = NP, or unless you are willing to wait {hours, >days, weeks} for your compiler to finish its job. So don't just >relegate EVERYTHING to the compiler! ... >I believe you are much better off spending the extra real estate to >implement identical functional units, or add vector hardware, since >the compiler technology for exploiting these features runs in >polynomial time. This sounds reasonable, but isn't really correct. The trick is to recognize that the comparison here is between spending lots of compile time to generate *OPTIMAL* solutions and spending no compile time to implement no solution (i.e., simply discarding the structure). The best compromise is generally to invest enough compile time to get most of the benefits... and that usually isn't excessive compile time. As for vector, etc., being polynomial time compiler problems, that simply isn't true for optimal solutions -- but nobody cares because you can get good solutions quickly enough. 
Remember that O() complexity measures don't tell you actual times -- i.e., high O() complexity doesn't necessarily imply long runtime for real cases. For example, who cares if compile time "blows up" when given input with 30-level nested loops? If it really bothers you in such a rare case, you just turn off the expensive optimization for that particular piece of code... no big deal. My rule is simply to figure out where each thing should be done (static vs. dynamic mechanisms) and then to do the best we can in implementing it that way. In this approach, I have yet to come across a compiler problem that defied reasonably quick, approximate, solution. -hankd@ecn.purdue.edu