[comp.arch] speculative execution

preston@titan.rice.edu (Preston Briggs) (10/09/90)

In article <kb4Rxwa00VpI1JbUxl@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:

>can anyone say anything about architectures for speculative execution?
>I think the Microprocessor Report mentioned that a new sparc
>implementation will do it.  isn't the 88000 (over-)due for a major
>overhaul, too?  

Well sure.  Compilers can schedule instructions to be executed
speculatively (or optimistically) on all sorts of machines.
Wide instruction words offer opportunities, as do long pipelines.

You could probably do it in hardware too (that is, go shooting off
down some code path before knowing a condition), but I expect
compilers can do a better job.   Think in terms of trace scheduling.
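
For concreteness, here's trace scheduling in miniature, as a
hypothetical C-level sketch (the names are invented; this is not
the output of any real compiler):

	/* before: the compiler bets that "rare" is rarely true */
	a = x + y;
	if (rare)
		fixup(&a);	/* off-trace path; may modify a */
	b = a * 2;

	/* after scheduling the main trace, the multiply moves up to
	   fill an idle slot; compensation code on the off-trace path
	   keeps the program correct */
	a = x + y;
	b = a * 2;		/* executed before the branch resolves */
	if (rare) {
		fixup(&a);
		b = a * 2;	/* compensation: recompute with the new a */
	}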
-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) (10/10/90)

In article <1990Oct9.162639.23516@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:


   You could probably do it in hardware too (that is, go shooting off
   down some code path before knowing a condition), but I expect
   compilers can do a better job.   Think in terms of trace scheduling.

At the Hot Chips forum (and at M/S too, from the preprogram notes) LSI
gave some details of its "Lightning" chip. By having hw support one can
execute instructions before knowing if they are legal. The MF/trace
machine lost big time when they predicted wrong .... this limits the
length of speculative traces, and the applications for which you get
speedup. 

Compilers can't do everything ;>
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

preston@titan.rice.edu (Preston Briggs) (10/10/90)

In article <KHB.90Oct9105243@chiba.Eng.Sun.COM> khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) writes:

>By having hw support one can
>execute instructions before knowing if they are legal. The MF/trace
>machine lost big time when they predicted wrong .... this limits the
>length of speculative traces, and the applications for which you get
>speedup. 

>Compilers can't do everything ;>

Sure they can :-)  And if we could build these good compilers,
think how cheaply we could build the hardware.

In particular, we can arrange all sorts of optimistic execution.
Imagine some hypothetical machine that can issue one FLOAT and
one INT instruction per cycle.
If we've got code like this

	int-1
	int-2
	int-3
	int-4
	if (something) {
		float-1
		float-2
		float-3
		store result
	}
	float-4
	int-5
	int-6
	...

We could rearrange it (perhaps) into

	int-1, float-1
	int-2, float-2
	int-3, float-3
	int-4, float-4
	if (something) {
		store result
	}
	int-5
	int-6

Generally, we can hoist computations from one branch that 
set registers that are dead on the "other" branch or store 
into a dead location (on the other branch).
We need to be careful to avoid traps.

Note that there's no cost when the branch isn't taken, provided
we only hoist operations to fill unused slots.
This (I believe) differs from the trace scheduling philosophy
which is more aggressive.
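
In C-ish terms, the transformation looks like this (a made-up
example; assume t is dead on the implicit else path and that an
unsigned add cannot trap):

	if (something) {
		t = x + y;		/* t used only on this path */
		*result = t;
	}

	/* becomes */

	t = x + y;			/* hoisted into an otherwise idle slot */
	if (something)
		*result = t;		/* only the store stays guarded */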

In general, we need to be careful about fatally increasing
register pressure.  The i860's exposed pipeline provides an
elegant way out, allowing simple aborts of optimistic 
computations by ignoring what's partially computed in
the pipe.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/10/90)

In article <1990Oct9.162639.23516@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>
>You could probably do it in hardware too (that is, go shooting off
>down some code path before knowing a condition), but I expect
>compilers can do a better job.   Think in terms of trace scheduling.
>-- 
>Preston Briggs				looking for the great leap forward
>preston@titan.rice.edu

Hmm, you mean your compiler can schedule code so that five instructions
are issued to one function unit on every clock?

There are some occasions when multiple copies of h/w can do better than
a compiler. I believe speculative execution is one.


Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

preston@titan.rice.edu (Preston Briggs) (10/10/90)

I wrote:
>>You could probably do it in hardware too (that is, go shooting off
>>down some code path before knowing a condition), but I expect
>>compilers can do a better job.   Think in terms of trace scheduling.

In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>Hmm, you mean your compiler can schedule code so that five instructions
>are issued to one function unit on every clock?
>There are some occasions when multiple copies of h/w can do better than
>a compiler. I believe speculative execution is one.

Given the same number of functional units, with same latencies,
same number of registers, ...
I believe I can write a compiler that will make code run
faster than hardware that does speculative execution.
My version of the hardware should also be cheaper
in that it doesn't have to coordinate the multiple paths of speculation.

Why can I win?  I can look farther ahead at compilation than the
hardware can while running.

Is there some variation of hardware and software that's even quicker?
I dunno.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/10/90)

In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>Hmm, you mean your compiler can schedule code so that five instructions
>>are issued to one function unit on every clock?
>>There are some occasions when multiple copies of h/w can do better than
>>a compiler. I believe speculative execution is one.
>
>Given the same number of functional units, with same latencies,
>same number of registers, ...
>I believe I can write a compiler that will make code run
>faster than hardware that does speculative execution.
>My version of the hardware should also be cheaper
>in that it doesn't have to coordinate the multiple paths of speculation.

I assume you mean given two machines, otherwise identical, differing
only in that M1 does hardware speculative execution and M2 does not,
you can write a compiler that will make M2 run faster than M1.

This sounds wrong. I do not doubt that you can speed up M2 by speculative
execution (with hoisting, etc). But surely the same technology can be
applied to M1 with the same result. A priori, I would expect the benefit
to be the same for both machines.

On the other hand, there are many things that *need* H/W. E.g., pre-
fetching (and executing) multiple paths. Surely these gains are additive
to those of the compiler.

(This is not to say that H/W speculative execution is easy or even
possible, and certainly says nothing about the relative gains or the
relative difficulty of H/W vs S/W).

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

jkenton@pinocchio.encore.com (Jeff Kenton) (10/10/90)

From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs):
> 
> In general, we need to be careful about fatally increasing
> register pressure.  The i860's exposed pipeline provides an
> elegant way out, allowing simple aborts of optimistic 
> computations by ignoring what's partially computed in
> the pipe.
> 

It would take a lot to convince me that the i860 is an elegant solution
to anything.  No one has produced a compiler which can take advantage of
the theoretically possible parallelism of the i860.  It's a very fast
chip for certain kinds of applications, but I wouldn't call it elegant,
or general purpose.


----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
-----  jeff kenton:    	consulting at jkenton@pinocchio.encore.com  -----
-----		                 always at (617) 894-4508 	    -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----

mshute@cs.man.ac.uk (Malcolm Shute) (10/10/90)

>In article <3431@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>There are some occasions when multiple copies of h/w can do better than
>>a compiler. I believe speculative execution is one.

In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>Given the same number of functional units, with same latencies,
>same number of registers, ...
>I believe I can write a compiler that will make code run
>faster than hardware that does speculative execution.

On the face of it, since we are all using Turing Machines with a nifty
user interface, we can all do the same thing.  If I can do 'X' in
a lot of hardware, then you can also do the same 'X' with a little
bit of hardware, but with a software simulation of my hardware
for your machine.

However, the hardware-ish (execution-time) solution has a major advantage
over the software-ish (compile-time) solution: it can make all of its
scheduling decisions dynamically, as the needs arise.

A batch system can schedule its work load optimally before embarking
upon the run; but a real-time system can do almost as good a job on the
fly, and be far more flexible with it.

Speculative execution surely is just an extension of scheduling.
When the scheduler has handled every job which it has been asked
to schedule, and finds that there are still processors lying idle,
it looks around for jobs which aren't yet scheduled
but which might be needed in the near future.

(It's a bit like thinking of a scheduler working down from the jobs
with really high positive priorities, all the way down to those with
low (zero) priority, and finding that there are still processing resources
to spare, going on to schedule those with the least negative of priorities!)
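
Here's that analogy as a toy C program (purely illustrative; the
job mix and priorities are invented):

	#include <stdio.h>

	/* jobs with priority >= 0 must run; negative priorities mark
	   speculative work, dispatched only while units remain idle */
	struct job { const char *name; int priority; };

	int main(void)
	{
		struct job q[] = {	/* already sorted by priority */
			{ "required-1", 5 }, { "required-2", 1 },
			{ "maybe-needed-soon", -1 }, { "long-shot", -3 },
		};
		int i, units = 3;	/* three idle processors */

		for (i = 0; i < 4 && units > 0; i++, units--)
			printf("dispatch %s (priority %d)\n",
				q[i].name, q[i].priority);
		return 0;
	}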
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all

preston@titan.rice.edu (Preston Briggs) (10/10/90)

I wrote
>> In general, we need to be careful about fatally increasing
>> register pressure.  The i860's exposed pipeline provides an
>> elegant way out, allowing simple aborts of optimistic 
>> computations by ignoring what's partially computed in
>> the pipe.

and
In article <12905@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>It would take a lot to convince me that the i860 is an elegant solution
>to anything.  No one has produced a compiler which can take advantage of
>the theoretically possible parallelism of the i860.  It's a very fast
>chip for certain kinds of applications, but I wouldn't call it elegant,
>or general purpose.

Lots of complaints here...
First, the exposed pipeline stuff.  If we've got an if-then
that looks like this

		int-1
		int-2
		int-3
		if (something) {
		    pfmul.ss f3,f4,f0
		    pfmul.ss f0,f0,f0
		    pfmul.ss f0,f0,f0
		    pfmul.ss f0,f0,f5
		}
		fst.l	f5,somewhere

The idea is that if something is true we multiply f3 and f4 together,
putting the result in f5.  Then we store f5.  So we can't optimistically
perform the entire multiply before knowing the value of "something"
since f5 is live on the false branch.

We can, however, hoist the initial pipeline stages (perhaps overlapping
them with earlier pipeline computations).

		int-1, pfmul.ss f3,f4,f0
		int-2, pfmul.ss f0,f0,f0
		int-3, pfmul.ss f0,f0,f0
		if (something) {
		    pfmul.ss f0,f0,f5
		}
		fst.l	f5,somewhere

The true path gets much shorter.
No increase in the path length of the false path.
And no extra register required.

It's perhaps a dirty trick rather than elegant, but I try to
describe my ideas glowingly and reserve disparaging terms for other
people's work.

The point, though, is that the exposed pipeline scheme requires fewer
registers because result registers are not frozen at the beginning of
a pipelined sequence, but at the end.  Similarly, the source registers
become available immediately after they are used.  In the example above,
f3 and f4 are immediately available after the 1st instruction and f5
isn't required until the result pops out of the pipe.

Renaming helps, but requires more hidden registers that might be
used profitably by the compiler for other work.

Regarding compilers, I believe The Portland Group and Ardent both have
compilers that will take advantage of the pipelined instructions.
Besides that, the i860 is a wonderful source of thesis topics.

The i860 may not be your ideal chip, but it's chock full of ideas.
The good and useful ones shouldn't be ignored.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

preston@titan.rice.edu (Preston Briggs) (10/11/90)

In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>I assume you mean given two machines, otherwise identical, differing
>only in that M1 does hardware speculative execution and M2 does not,
>you can write a compiler that will make M2 run faster than M1.

That's what I believe.

>This sounds wrong. I do not doubt that you can speed up M2 by speculative
>execution (with hoisting, etc). But surely the same technology can be
>applied to M1 with the same result. A priori, I would expect the benefit
>to be the same for both machines.

Well, I wasn't going to let M1 use my fabulous scheduling ideas.
It had to be satisfied with hardware.  Further, M2 ought to have a
higher clock speed since its hardware is simpler.

>On the other hand, there are many things that *need* H/W. E.g., pre-
>fetching (and executing) multiple paths. Surely these gains are additive
>to those of the compiler.

I feel like I haven't made myself clear.

I'm advocating a wide instruction word so that I can, in a single
instruction, specify enough work to keep all the resources busy.
Ways of accomplishing this include trace scheduling,
global compaction, and software pipelining.

Speculative execution hardware notices that it has resources
that aren't fully utilized and tries to find work for them to do.

I suppose the advantage of speculative hardware is that you can use
a skinny instruction word to get some of the same effect, 
non-deterministically.

So fancy hardware can get some parallelism without fancy software,
and fancy software can get some parallelism without fancy hardware.
Given one, I don't think you need the other.  So is it cheaper
to build the compiler or the chip?  Don't forget to make them
correct.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

chased@rbbb.Eng.Sun.COM (David Chase) (10/11/90)

In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>In article <1990Oct9.224312.2031@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>>Given the same number of functional units, with same latencies,
>>same number of registers, ...
>>I believe I can write a compiler that will make code run
>>faster than hardware that does speculative execution.

>I assume you mean given two machines, otherwise identical, differing
>only in that M1 does hardware speculative execution and M2 does not,
>you can write a compiler that will make M2 run faster than M1.

>This sounds wrong. I do not doubt that you can speed up M2 by speculative
>execution (with hoisting, etc). But surely the same technology can be
>applied to M1 with the same result. A priori, I would expect the benefit
>to be the same for both machines.

One point was not sufficiently emphasized in Preston's posting: "SAME
NUMBER OF REGISTERS".  A reasonable trick in hardware speculative
execution is to make up boatloads of secret registers to hold the
results of speculative computations.  Preston says, "expose those
registers to the compiler", which of course means "change your
architecture", which is typically not acceptable to a company with
customers committed to an existing architecture (be it VAX, MIPS,
SPARC, RS/6000, 370, 80386, whatever).

Preston is also assuming, probably correctly, that it requires less
hardware to expose the registers and let the compiler juggle them than
it does to hide the registers and manage the juggling on the fly in
hardware.  Less hardware means a number of things (possibly):

  a) higher yield
  b) more room for other stuff (cache, tlb, whatever)
  c) choice of a faster, less compact, or more power-hungry
     technology (i.e., GaAs or ECL)
  d) shorter critical path for clock cycle
  e) shorter pipeline.

Thus, if you sell enough chips to recover your investment in
high-powered compiler technology, then you win.  Big IF there, of
course, and you'll be selling a new architecture, and we haven't said
much about context switching or any other OS-related issues yet,
either.

David Chase
Sun Microsystems, Inc.

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/11/90)

In article <12905@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
>From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs):
>> 
>> In general, we need to be careful about fatally increasing
>> register pressure.  The i860's exposed pipeline provides an
>> elegant way out, allowing simple aborts of optimistic 
>> computations by ignoring what's partially computed in
>> the pipe.
>> 
>
>It would take a lot to convince me that the i860 is an elegant solution
>to anything.  No one has produced a compiler which can take advantage of
>the theoretically possible parallelism of the i860.  It's a very fast
>chip for certain kinds of applications, but I wouldn't call it elegant,
>or general purpose.

In the context of this discussion - Speculative Execution - exposing the
pipeline AND letting S/W control the flow of operand/result through it
makes sense. As Preston Briggs points out, it is easy (on the i860) to
start execution; there is no cost in terms of register usage or issuing
instructions. Other current micros will tie up the result register 
and/or require a separate instruction issue slot.

Anyway, wasn't the i860 designed as a numeric coprocessor? I think of
it as DSP-like but with enough "General Purpose" that it does not need
a "controller".

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/11/90)

In article <1990Oct10.170424.21489@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <3432@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>>I assume you mean given two machines, otherwise identical, differing
>>only in that M1 does hardware speculative execution and M2 does not,
>>you can write a compiler that will make M2 run faster than M1.
>
>That's what I believe.
>
>>This sounds wrong. I do not doubt that you can speed up M2 by speculative
>>execution (with hoisting, etc). But surely the same technology can be
>>applied to M1 with the same result. A priori, I would expect the benefit
>>to be the same for both machines.
>
>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>It had to be satisfied with hardware.  Further, M2 ought to have a
>higher clock speed since its hardware is simpler.

Ah, but that is not very fair, is it? If code scheduling works for both
M1 & M2, why restrict it to M2 only?

It sounds like we are starting to get into the philosophical aspects of
the problem - like the RISC/CISC wars of old (may be even now? :-).

There are certainly many possible trade-offs and design points. Some will
rely exclusively on H/W, others on S/W, and yet others on both. It is not
clear how the H/W and S/W efforts interact. Do they go after the same
"parallelism" or do they exploit different types of "parallelism"? Which
kinds of S/W speculative execution work well with which kinds of H/W?
Do some H/W designs make the S/W impossible?

>I'm advocating a wide instruction word so that I can, in a single
>instruction, specify enough work to keep all the resources busy.
>Ways of accomplishing this include trace scheduling,
>global compaction, and software pipelining.
>
>Speculative execution hardware notices that it has resources
>that aren't fully utilized and tries to find work for them to do.

The difference is that the compiler has the "global" view and can do
"higher" level shuffling and optimization. On the other hand, H/W has
the advantage that it actually "knows" the total instantaneous values
of all the registers. As someone (Andy?) suggests, clever H/W could
resolve aliasing at run time.

>
>I suppose the advantage of speculative hardware is that you can use
>a skinny instruction word to get some of the same effect, 
>non-deterministically.
>
>So fancy hardware can get some parallelism without fancy software,
>and fancy software can get some parallelism without fancy hardware.
>Given one, I don't think you need the other.  So is it cheaper
>to build the compiler or the chip?  Don't forget to make them
>correct.

It is not clear to me that the compiler would be able to extract all
the possible parallelism (even perfect aliasing analysis is not as good
as "instantaneous" alias detection). Surely the H/W can contribute some
more parallelism by "knowing" all the current values!

Often, the actual time required is non-deterministic, e.g., time to fetch
a word, time for a divide. This looks like another area where the H/W may
augment the compiler.



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

preston@titan.rice.edu (Preston Briggs) (10/11/90)

In article <1990Oct10.164353.21070@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>I wrote, about the i860

>Regarding compilers, I believe The Portland Group and Ardent both have
>compilers that will take advantage of the pipelined instructions.

Instead of Ardent, I should have said Alliant.
Ardent (or rather Stardent) of course has a nice compiler,
but it makes code for their MIPS-based machine.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

gillies@m.cs.uiuc.edu (10/11/90)

> No one has produced a compiler which can take advantage of the
> theoretically possible parallelism of the i860.  It's a very fast chip
> for certain kinds of applications, but I wouldn't call it elegant, or
> general purpose.

One of the problems in new CPU designs is that designers don't realize
which architecture enhancements are pointless, because we don't and
may never have the optimization technology to take advantage of them.

Many enhancements to "increase parallelism" cannot be exploited by
compilers unless P = NP, or unless you are willing to wait {hours,
days, weeks} for your compiler to finish its job.  So don't just
relegate EVERYTHING to the compiler!

When you do something like add dedicated functional units to a CPU,
you are (potentially) increasing the complexity of the scheduling from
say, triangle-inequality traveling salesman (for which a heuristic
with bounded performance exists), to full generality traveling
salesman (for which no bounded heuristic exists unless P = NP).

I believe you are much better off spending the extra real estate to
implement identical functional units, or add vector hardware, since
the compiler technology for exploiting these features runs in
polynomial time.


Don W. Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801      
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,harvard}!uiucdcs!gillies

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/11/90)

  Software can look at the global structure of a program and make
decisions based on that.

  Hardware can see what's happening at execution time, which may include
poor memory performance due to cycle stealing, variable FPU performance
for algorithms which are value sensitive, and other things which can
vary from run to run, or with data.

  If people would just drop the idea that either can solve all
performance problems, and start making hardware and software work better
together, then cost/performance would be improved with a lot lower noise
level.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

philip@beeblebrox.dle.dg.com (Philip Gladstone) (10/11/90)

In article <1990Oct10.164353.21070@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
[stuff deleted]
>   We can, however, hoist the initial pipeline stages (perhaps overlapping
>   them with earlier pipeline computations).
>
>           int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
>           int-2, pfmul.ss f0,f0,f0     <--- from the if
>           int-3, pfmul.ss f0,f0,f0
>           if (something) {
>               pfmul.ss f0,f0,f5
>           }
>           fst.l	f5,somewhere
>
>   The true path gets much shorter.
>   No increase in the path length of the false path.
>   And no extra register required.

It seems to me that it is very difficult to spot when this is legal or not.
It is assumed in this example that none of the hoisted instructions
will cause any problems if they are executed when the programmer did
not intend them to be executed. Consider the case of overflow! I must
confess to knowing little or nothing about the i860, but similar
examples exist for the 88100 (which I do understand). Consider the
following code fragment:

            if (i > 0)
            {
                volatile_global = b / i;
            }

On the 88000 at least, divides are very slow (36 cycles) and there is
a great temptation to move them back up the instruction stream so as
to get greater parallelism. In this case, the divide could blow up (if
i was equal to 0) when the source would not have blown up.

This whole problem arises from the fact that most instructions have an
implicit conditional branch (interrupt) as part of their
specification. You can only (reliably) hoist instructions that cannot
trap -- unless you can perform major flow analysis to check that the
trap conditions never arise (such as adding two shorts in long
registers with an overflow causing instruction).

Removing the implicit trapping action of these instructions would be a
way out of this problem, but you would have to introduce some sort of
'trap on previous overflow' instruction. Even then you would have to
take great care that you only detected the right overflows.
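
One way to picture that proposal in C (div_nt, check_trap, and
take_trap are hypothetical stand-ins, not 88100 instructions):

	static int pending_trap;	/* sticky "previous overflow" flag */

	static int div_nt(int a, int b)	/* non-trapping divide */
	{
		if (b == 0) {
			pending_trap = 1;	/* defer, don't trap */
			return 0;
		}
		return a / b;
	}

	extern void take_trap(void);

	static void check_trap(void)	/* explicit "trap on previous" test */
	{
		if (pending_trap)
			take_trap();
	}

	int volatile_global;

	void f(int b, int i)
	{
		int t = div_nt(b, i);	/* now safe to hoist above the test */
		if (i > 0) {
			check_trap();	/* only the committed path can fault */
			volatile_global = t;
		}
	}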

Enough of this
---
--
Philip Gladstone                        philip@dle.dg.com
Development Lab Europe                  C=gb/AD=gold 400/PR=dgc/O=dle
Data General, Cambridge                   /SN=gladstone/GN=philip
England.  +44 223-67600

dik@cwi.nl (Dik T. Winter) (10/11/90)

In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
 > In article <1990Oct10.164353.21070@rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
 > [stuff deleted]
[more stuff deleted]
 > >           int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
 > >           int-2, pfmul.ss f0,f0,f0     <--- from the if
 > >           int-3, pfmul.ss f0,f0,f0
 > >           if (something) {
 > >               pfmul.ss f0,f0,f5
 > >           }
 > >           fst.l	f5,somewhere
 > >   The true path gets much shorter.
 > >   No increase in the path length of the false path.
 > >   And no extra register required.
 > It seems to me that it is very difficult to spot when this is legal or not.
 > It is assumed in this example that none of the hoisted instructions
 > will cause any problems if they are executed when the programmer did
 > not intend them to be executed.
This raises a point, although for different reasons.
I will start with some explanations:  on the i860, if (for instance) a
floating-point instruction overflows, a status bit is set and if this
status bit is set the next floating-point instruction will trap.
If we assume the pipeline was empty when this sequence started, the
hoisted instructions will not set status bits and do not deliver a result.  It
is only the last (conditional) instruction that can set a status bit and
deliver a result; but that instruction does not take any source operands.
So in this aspect the example is valid.  There is another aspect: after
the three pfmul instructions the pipeline is filled in an indeterminate
state if the false path is to be taken.  There might be an overflow status
or somesuch pending in the last stage, and that must be cleared.  The only
ways I see to clear this are either to perform a non-pipelined multiply
after such a sequence (which voids the pipeline) or else to put in the
false path an instruction to drain the pipeline (pfmul.ss f0,f0,f0) and an
instruction to clear the status bits in the fsr.

As far as I can see, either way you will lose in some cases what you gain
in others.

And yes, the Alliant compiler generates pipelined instructions and dual
instruction mode instructions.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

chased@rbbb.Eng.Sun.COM (David Chase) (10/12/90)

>  = bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow)
>> = preston@titan.rice.edu (Preston Briggs)

>>Well, I wasn't going to let M1 use my fabulous scheduling ideas.
>>It had to be satisfied with hardware.  Further, M2 ought to have a
>>higher clock speed since its hardware is simpler.

>Ah, but that is not very fair, is it? If code scheduling works for both
>M1 & M2, why restrict it to M2 only?

Agreed, it is unfair to run the code scheduling only on M2.  However,
I think the claim is that scheduled code won't run any faster on M1
than unscheduled code -- that's what you added all that hardware for,
right?  You can think of M1 as a machine that looks far ahead for
anything that it can execute (and fires off stalled instructions in a
dataflow fashion), whereas M2 is a machine that will only suck up
instructions for which the operands are valid and the functional unit
is idle (i.e., the first stall stops all subsequent execution) or
(Stanford-MIPS style) doesn't even stall -- just rely on the compiler
to insert NOPS.

"Clearly" the hardware for M2 is simpler, but it will run just as
quickly if the compiler does the lookahead and reorders the
instructions into the order that M1 would have executed them.  Run
that same code through M1, and "in theory" it executes the instructions
in the same way as it did before they were reordered -- no speedup,
but lots more chip real estate, and a possibly slower cycle time.

Now, M1 does win in a couple of ways, including one very important one
that I forgot to mention:

1) M1 can look like a chip that people already use (customers are a
   minor detail). 

2) M1 can take advantage of early instruction completion (i.e.,
   division, or a first-level-cache hit on load) instead of assuming
   worst case.

3) M1 can do (dynamic) prefetch prediction in some limited way, and
   thus only needs to suck up instructions from *one* potential
   successor, instead of all of them.

Counter-proposals to (3) for M2 include profiling feedback to the
compiler, and the possibility (I haven't worked out the details) of
exposing the branch history as a value for the compiler to use.  I
suspect this involves a wee bit of code expansion, or the use of
"conditional execution" bits on every instruction in the style of the
Acorn Risc Machine.  Use your history of the impending branch to
determine which instructions to execute on your way to it -- if you
guessed wrong, then you use the history bits to figure out which
fix-up code to execute (actually, you could have several different
conditional branches, themselves conditional on the history bits, so
that you'd choose the right one with respect to fixup code).
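
As a C-level caricature of the conditional-execution half of this
(illustrative only; real ARM predication is applied per instruction
in hardware):

	/* both arms are cheap and trap-free, so compute both and
	   select; a compiler can turn this into predicated
	   instructions or a conditional move, leaving no branch to
	   predict at all */
	r = cond ? x + y : x - y;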

David Chase
Sun Microsystems

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (10/12/90)

In article <1990Oct10.212713.28260@rice.edu> preston@titan.rice.edu 
	(Preston Briggs) writes:
>>Regarding compilers, I believe The Portland Group and Ardent both have
>>compilers that will take advantage of the pipelined instructions.
>Instead of Ardent, I should have said Alliant.

I've been waiting for someone else to rain on this parade.  Since no
one has, I will point out that Alliant's compiler was _not_ shipped
on schedule.  In general, i860 compilers so far have justified the
pessimists more than the optimists.

It would be interesting to find out _which_ of the i860's features
causes the most compiler difficulty.  The T, KI, KR registers?  The
non-self-draining pipeline?  Scheduling? IEEE issues?
-- 
Don		D.C.Lindsay

preston@titan.rice.edu (Preston Briggs) (10/12/90)

I suggested the trickiness of hoisting pipeline-filling instructions
above a conditional.  The result looked like this

>>>           int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
>>>           int-2, pfmul.ss f0,f0,f0     <--- from the if
>>>           int-3, pfmul.ss f0,f0,f0
>>>           if (something) {
>>>               pfmul.ss f0,f0,f5
>>>           }
>>>           fst.l	f5,somewhere
>>>   The true path gets much shorter.
>>>   No increase in the path length of the false path.
>>>   And no extra register required.

>philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:

>> It seems to me that it is very difficult to spot when this is legal or not.
>> It is assumed in this example that none of the hoisted instructions
>> will cause any problems if they are executed when the programmer did
>> not intend them to be executed.

But we *can* check for legality.  Generally, I must not overwrite any
registers or memory locations that are live on the other path.
Further, I should trigger no traps.  This means being careful
about loads, etc...

and
dik@cwi.nl (Dik T. Winter) writes:

[says that I'm reasonably ok on result traps]

>There is another aspect: after
>the three pfmul instructions the pipeline is filled in an indeterminate
>state if the false path is to be taken.  There might be an overflow status
>or somesuch pending in the last stage, and that must be cleared.

The pipe would have to be drained, but we must always drain the
pipe anyway.  So, if at some later point in the program,
I perform a multiply, the first 3 pfmul's must direct
their (unspecified) values into f0.  But we always have to
do this anyway.  And, as you point out, a scalar (non-pipelined)
multiply will overwrite the pipe.

So, if we'd had an if-then-else instead, we might get this result.

           int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
           int-2, pfmul.ss f0,f0,f0     <--- from the then clause
           int-3, pfmul.ss f0,f0,f0
           if (something) {
               pfmul.ss f0,f0,f5
           }
	   else {
               pfmul.ss f7,f8,f0	-- draining the pipe
               pfmul.ss f0,f0,f0
               pfmul.ss f0,f0,f0
               pfmul.ss f0,f0,f5
	   }
           fst.l	f5,somewhere

The point is that the else clause is unchanged.  Actually,
we could build this form

           int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
           int-2, pfmul.ss f7,f8,f0     <--- from the then clause
           int-3, pfmul.ss f0,f0,f0
           if (something) {
               pfmul.ss f0,f0,f5
           }
	   else {
               pfmul.ss f0,f0,f0
               pfmul.ss f0,f0,f5
	   }
           fst.l	f5,somewhere

Note also that the scalar multiply doesn't "clear the pipe"; it just
overwrites it with new junk.  We need to be careful to drain this
new junk to avoid traps.  Recently I tried to use the "r2pt.ss f10,f0,f0"
instruction to set the KR register to f10.  Unfortunately, this may
provoke a source exception trap from the adder if the multiply
pipeline is not in some nice state.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

mash@mips.COM (John Mashey) (10/12/90)

In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
...
>It is assumed in this example that none of the hoisted instructions
>will cause any problems if they are executed when the programmer did
>not intend them to be executed. Consider the case of overflow! I must

>This whole problem arises from the fact that most instructions have an
>implicit conditional branch (interrupt) as part of their
>specification. You can only (reliably) hoist instructions that cannot
>trap -- unless you can perform major flow analysis to check that the
>trap conditions never arise (such as adding two shorts in long
>registers with an overflow causing instruction).

Yes, and there are many subtle cases.  Mike O'Dell talked about some
in his talks about the Prisma machine at the last USENIX, i.e.,
about the exact state presented to the user upon a signal.

As another example, consider the effects of hoisting a load
instruction, which is something you'd like to do in a machine
with out-of-order characteristics, especially if it has multiple
memory paths.  Here are a few cases of interest:

	a) The load is an uncached load to a volatile device register.
	Hoisting is of course a no-no, as this will drive OS programmers
	nuts, and in fact, must be carefully synchronized with
	surrounding operations, or else.

	So, of course, you don't hoist these things, or allow the hardware
	to do any hoisting for you via speculation.  For instance, in
	the code that looks like:
	if (flag)
		x = p->device1;
	else
		y = p->device2;
	Any hardware that speculatively issues EITHER of the assignments
	before being sure which one, may well have trouble if the
	read actually gets to the memory system.  In fact, I think this
	means that any uncached load stops any speculative path,
	because you dare not issue it for the side-effect.

	b) Clearly stores can't be speculatively issued.
	Even cached loads may take care, any time a load can cause
	a fault, such as an MMU miss, or an out-of-range access
	(like to location 0 on many machines.) A good example to 
	always consider, evil though it may be, is the Bourne shell's
	expectation of fixing up a memory fault and returning to
	the instruction that did it...  I believe this can be handled,
	but if there is any case that doesn't, somebody will get bitten.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

dik@cwi.nl (Dik T. Winter) (10/12/90)

I write:
 >There is another aspect: after
 >the three pfmul instructions the pipeline is filled in an indeterminate
 >state if the false path is to be taken.  There might be an overflow status
 >or somesuch pending in the last stage, and that must be cleared.
 > 
Preston Briggs responds:
 > The pipe would have to be drained, but we must always drain the
 > pipe anyway.  So, if at some later point in the program,
 > I perform a multiply, the first 3 pfmul's must direct
 > their (unspecified) values into f0.  But we always have to
 > do this anyway.  And, as you point out, a scalar (non-pipelined)
 > multiply will overwrite the pipe.
As far as I understand the processor this is still wrong:
 > So, if we'd had an if-then-else instead, we might get this result.
 >            int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
 >            int-2, pfmul.ss f0,f0,f0     <--- from the then clause
 >            int-3, pfmul.ss f0,f0,f0
 >            if (something) {
 >                pfmul.ss f0,f0,f5
 >            }
 > 	   else {
 >                pfmul.ss f7,f8,f0	-- draining the pipe
Note the above instruction may generate a result trap because the result of the
multiplication of f3 and f4 is stored in this stage!  (Nowhere in the
documentation do I find that result traps are discarded if the result is
stored in f0.  Documentation even carefully tells us that 'f0 always reads
as 0 whatever is stored in it.')
 > 
 > Note also that the scalar multiply doesn't "clear the pipe"; it just
 > overwrites it with new junk.  We need to be careful to drain this
 > new junk to avoid traps.
Yes, we have to drain this to avoid source traps but not to avoid result
traps: 'After a scalar operation, the values of all pipeline stages of the
affected unit are undefined.  No spurious result-exception traps result
when the undefined values are subsequently stored by pipelined operations;
however, the values should not be referenced as source operands.'

 >                           Recently I tried to use the "r2pt.ss f10,f0,f0"
 > instruction to set the KR register to f10.  Unfortunately, this may
 > provoke a source exception trap from the adder if the multiply
 > pipeline is not in some nice state.
Yes, source exceptions occur when you do use undefined results.

So a scalar multiply does indeed not totally clear the pipe, but clears all
status information.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

preston@titan.rice.edu (Preston Briggs) (10/12/90)

In article <2326@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:

>As far as I understand the processor this is still wrong:
> > So, if we'd had an if-then-else instead, we might get this result.
> >            int-1, pfmul.ss f3,f4,f0     <--- These floats hoisted from
> >            int-2, pfmul.ss f0,f0,f0     <--- from the then clause
> >            int-3, pfmul.ss f0,f0,f0
> >            if (something) {
> >                pfmul.ss f0,f0,f5
> >            }
> > 	   else {
> >                pfmul.ss f7,f8,f0	-- draining the pipe
>Note the above instruction may generate a result trap because the result of the
>multiplication of f3 and f4 is stored in this stage!  (Nowhere in the
>documentation do I find that result traps are discarded if the result is
>stored in f0.

It looks like you're right.  Silly me. (That's what I say instead of
typing a long string of cuss words).  If somebody at Intel knows differently,
I'd appreciate hearing about it!

I guess I must therefore advocate interpreting stores to f0 as "discard
everything".  Any reason not to?  (Besides the cost of changing the 
chip, documentation, ...)

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

ckp@grebyn.com (Checkpoint Technologies) (10/12/90)

In article <PHILIP.90Oct11102731@beeblebrox.dle.dg.com> philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
>This whole problem arises from the fact that most instructions have an
>implicit conditional branch (interrupt) as part of their
>specification. You can only (reliably) hoist instructions that cannot
>trap -- unless you can perform major flow analysis to check that the
>trap conditions never arise (such as adding two shorts in long
>registers with an overflow causing instruction).
>
>Removing the implicit trapping action of these instructions would be a
>way out of this problem, but you would have to introduce some sort of
>'trap on previous overflow' instruction. Even then you would have to
>take great care that you only detected the right overflows.

#pragma MY_TWO_CENTS

Forgive my insolence - I'm not an architecture designer, be gentle
with me, but this just occurred to me...

Why not include condition code bits with each register? Store these when
the register value is stored, when it pops out the end of a pipeline.
Include a flag for NAN (not a new idea, right?), and store this for
things like divide-by-0.  Then you can trap when the register is used as
a source operand, and you can code explicit tests (and maybe traps) when
you want to, on operations that could very well be taking place
simultaneously.
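
A toy model of the idea (illustrative; not a proposal for any real
architecture, and source_trap is a hypothetical hook):

	/* each value carries its exception bits: producing a bad
	   result merely sets a flag, and the trap fires only when the
	   value is later used as a source operand */
	struct tagged { double val; int nan; };

	extern void source_trap(void);

	static struct tagged fdiv(struct tagged a, struct tagged b)
	{
		struct tagged r;
		if (a.nan || b.nan)
			source_trap();		/* trap on use */
		r.nan = (b.val == 0.0);		/* e.g. divide-by-zero */
		r.val = r.nan ? 0.0 : a.val / b.val;
		return r;
	}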
-- 
First comes the logo: C H E C K P O I N T  T E C H N O L O G I E S      / /  
                                                                    \\ / /    
Then, the disclaimer:  All expressed opinions are, indeed, opinions. \  / o
Now for the witty part:    I'm pink, therefore, I'm spam!             \/

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/12/90)

In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>As another example, consider the effects of hoisting a load
>instruction, which is something you'd like to do in a machine
>with out-of-order characteristics, especially if it has multiple
>memory paths.  Here are a few cases of interest:
>
>	a) The load is an uncached load to a volatile device register.
>	Hoisting is of course a no-no, as this will drive OS programmers
>	nuts, and in fact, must be carefully synchronized with
>	surrounding operations, or else.

This should not be particularly hard: just use "volatile" in C. (I know,
the zillions of lines of existing code may break, but as some people say -
those programs are brain-damaged :-).

It is also possible for (very clever) hardware to handle it with some
sort of MMU attribute per page. Several existing processors already have
this. Speculative loads could proceed blindly until they hit one of these
stop bits. The affected load would be put "on-hold" pending a "really-do-it"
or a "forget-it". (So it's hard, but we gotta use up those millions of
transistors :-)
>
>	So, of course, you don't hoist these things, or allow the hardware
>	to do any hoisting for you via speculation.  For instance, in
>	the code that looks like:
>	if (flag)
>		x = p->device1;
>	else
>		y = p->device2;
>	Any hardware that speculatively issues EITHER of the assignments
>	before being sure which one, may well have trouble if the
>	read actually gets to the memory system.  In fact, I think this
>	means that any uncached load stops any speculative path,
					   ^^^
>	because you dare not issue it for the side-effect.

This is not so.

An uncached load (at worst) stops *its* speculative path. In other words,
paths without uncached loads can still proceed. E.g.,

if (flag)
	x = p->device1;
else
	x = somelocal;

The fetch of "somelocal" can happen before the "if". In general, after
figuring out side effects and aliasing, the only restriction should be
that volatile accesses cannot "pass" each other.

Also, as suggested in the paragraphs above, with sufficiently clever
hardware, one can still "conservatively" speculatively fetch volatile 
addresses. (Hmm, is that an oxymoron?)
>
>	b) Clearly stores can't be speculatively issued.
>	Even cached loads may take care, any time a load can cause
>	a fault, such as an MMU miss, or an out-of-range access
>	(like to location 0 on many machines.) [..]

Again, issue the store early to get through the address translation and
protection checking, then wait for the "really-do-it". 
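
In outline (a hypothetical rendering; "really-do-it" and
"forget-it" are the terms from above):

	/* the store is translated and protection-checked at issue
	   time, so faults surface early; memory changes only if the
	   speculated path commits */
	struct pending_store { unsigned *addr; unsigned val; int valid; };

	static struct pending_store ps;

	static void issue_store(unsigned *addr, unsigned val)
	{
		/* translation/protection faults would be taken here */
		ps.addr = addr;
		ps.val = val;
		ps.valid = 1;
	}

	static void really_do_it(void)	/* commit */
	{
		if (ps.valid) {
			*ps.addr = ps.val;
			ps.valid = 0;
		}
	}

	static void forget_it(void)	/* squash */
	{
		ps.valid = 0;
	}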



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

spot@TR5.GP.CS.CMU.EDU (Scott Draves) (10/13/90)

In article <3445@bnr-rsc.UUCP>, schow@bcarh185.bnr.ca (Stanley T.H.
Chow) writes:
|> In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes:
|> >
|> >	a) The load is an uncached load to a volatile device register.
|> >	Hoisting is of course a no-no, as this will drive OS programmers
|> >	nuts, and in fact, must be carefully synchronized with
|> >	surrounding operations, or else.
|> 
|> This should not be particularly hard: just use "volatile" in C. (I know,
|> the zillions of lines of existing code may break, but as some people say -
|> those programs are brain-damaged :-).
|> 

That looks like a pretty good argument that this should be done by the
compiler rather than the hardware.  The same goes for the exception
stuff that other people have mentioned; a good compiler can determine
when exceptions are impossible and proceed to speculate.
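
A small example of the sort of proof involved (illustrative):

	/* i >= 1 throughout the loop, so the divide can never trap
	   and the compiler is free to hoist or speculate it */
	for (i = 1; i <= n; i++)
		s += b / i;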

			Consume
Scott Draves		Be Silent
spot@cs.cmu.edu		Die

brett@cayman.amd.com (Brett Stewart) (10/13/90)

From article <1990Oct9.212103.363@rice.edu>, by preston@titan.rice.edu (Preston Briggs):
> 
> In general, we need to be careful about fatally increasing
> register pressure.  The i860's exposed pipeline provides an
> elegant way out, allowing simple aborts of optimistic 
> computations by ignoring what's partially computed in
> the pipe.
>
One of the advantages of the variable register-stack cache window frames
of the 29000 family is the instant relief of register pressure they
provide.  Michael Tiemann, of Cygnus Support, also has
some very interesting ideas about compiler techniques for
superscalar architectures that would allow the possibly large
numbers of registers of a procedure in the 29K to be used for
optimistic computation in the gnu c or other Fraser-Davidson UVa
type compiler schemes.  Comments, Michael?
Best Regards; Brett Stewart
Advanced Micro Devices, Inc.           1-512-462-5321  FAX
5900 E. Ben White Blvd MS561           1-512-462-4336  Telephone
Austin, Texas 78741      USA           brett@cayman.amd.com

dik@cwi.nl (Dik T. Winter) (10/13/90)

In article <22587@grebyn.com> ckp@grebyn.UUCP (Checkpoint Technologies) writes:
 > Why not include condition code bits with each register? Store these when
 > the register value is stored, when it pops out the end of a pipeline.
 > Include a flag for NAN (not a new idea, right?), and store this for
 > things like divide-by-0.  Then you can trap when the register is used as
 > a source operand, and you can code explicit tests (and maybe traps) when
 > you want to, on operations that could very well be taking place
 > simultaneously.
Something similar is done in all machines designed by Seymour Cray.  There
is no instruction that generates a result trap; only source traps are
generated.  There are advantages and disadvantages.
Advantages:
1.  It is possible to initialize all variables at a trapping value.  This
    will very quickly eliminate the use of uninitialized values.  (CDC users
    will know this:  LDSET(PRESET=UNDEF), although this would not work
    with integer variables.)
2.  It makes it possible for the compiler to generate spurious results.
    E.g., if the array B has never been initialized, the following loop:
	for I from 1 to N do if A[I] != 0.0 then B[I] = 1.0/A[I]
    can be modified by omitting the test without adverse effects (see the
    C rendering after this list).
    This can be beneficial on vector processors.  In fact the Fortran
    compiler on the NEC SX/2 generates spurious results in some cases;
    alas, they trigger a result trap.
Disadvantage:
1.  Debugging is very problematical.  I know, I have debugged a number of
    programs on the CDC Cyber.  It is very difficult to see where a bad
    result originates.  This is even worse if (as on the Cyber) there are
    multiple functional units so that it is even problematical which unit
    did generate the trap.
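
The loop in advantage 2, rendered in C (illustrative):

	/* guarded form */
	for (i = 0; i < n; i++)
		if (A[i] != 0.0)
			B[i] = 1.0 / A[i];

	/* with source-only traps (and B never initialized), the
	   guard can be dropped and the loop vectorized; the spurious
	   quotients where A[i] == 0.0 are never used as sources, so
	   they never trap */
	for (i = 0; i < n; i++)
		B[i] = 1.0 / A[i];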
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

mash@mips.COM (John Mashey) (10/14/90)

In article <3445@bnr-rsc.UUCP> bcarh185!schow@bnr-rsc.UUCP (Stanley T.H. Chow) writes:
>In article <42062@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>As another example, consider the effects of hoisting a load
>>instruction, which is something you'd like to do in a machine
>>with out-of-order characteristics, especially if it has multiple
>>memory paths.  Here are a few cases of interest:
>>
>>	a) The load is an uncached load to a volatile device register.
>>	Hoisting is of course a no-no, as this will drive OS programmers
>>	nuts, and in fact, must be carefully synchronized with
>>	surrounding operations, or else.
>
>This should not be particularly hard: just use "volatile" in C. (I know,
>the zillions of lines of existing code may break, but as some people say -
>those programs are brain-damaged :-).

>It is also possible for (very clever) hardware to handle it with some
>sort of MMU attribute per page. Several existing processors already have
>this. Speculative loads could proceed blindly until it hits one of these
>stop bits. The affected load would be put "on-hold" pending a "really-do-it"
>or a "forget-it". (So it's hard, but we gotta use up those millions of
>transistors :-)
I think we have some terminology confusion here, at least partly my fault
for my use of the term "issue".  I've observed that people often note that
one cannot speculatively issue stores, but can so issue loads.
I observed that one cannot in general "issue" loads, if you believe
that "issue" actually means:  run it through the cache/MMU system,
and fetch from memory.  I agree with the style that says:
	run it to the point where you realize its unsafe, then hold
	it until you're sure.
>>	if (flag)
>>		x = p->device1;
>>	else
>>		y = p->device2;
>>	Any hardware that speculatively issues EITHER of the assignments
>>	before being sure which one, may well have trouble if the
>>	read actually gets to the memory system.  In fact, I think this
>>	means that any uncached load stops any speculative path,
					   ^^^
>>	because you dare not issue it for the side-effect.

>This is not so.
Sorry, wording again; we agree: I meant what you said ("its path"),
and would have said "all paths" otherwise; "any" was imprecise.

>An uncached load (at worst) stops *its* speculative path. In other words,
>paths without uncached loads can still proceed. E.g.,
>The fetch of "somelocal" can happen before the "if". In general, after
>figuring out side effects and aliasing, the only restriction should be
>that volatile accesses cannot "pass" each other.
>
>Also, as suggested in the paragraphs above, with sufficiently clever
>hardware, one can still "conservatively" speculatively fetch volatile 
>addresses. (Hmm, is that an oxymoron?)
>>
>>	b) Clearly stores can't be speculatively issued.
>>	Even cached loads may take care, any time a load can cause
>>	a fault, such as an MMU miss, or an out-of-range access
>>	(like to location 0 on many machines.) [..]

Again, I don't think we disagree. The only real point was that there
were loads, and then there were loads, and you must have some way to
distinguish the ones that can be safely completed, versus the ones
that cause catastrophe if they get to memory.   In particular, you must
especially be careful if building a speculative-issue machine that is
supposed to be binary-compatible at kernel level with existing machines.
Such may, or may not be, a goal.

thanx for the clarification; some of my language was sloppy.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rbw00@ccc.amdahl.com ( 213 Richard Wilmot) (10/15/90)

In article <22587@grebyn> ckp@grebyn.com says
>#pragma MY_TWO_CENTS

>Forgive my insolence - I'm not an architecture designer, be gentle
>with me, but this just occured to me..

>Why not include condition code bits with each register? Store these when
>the register value is stored, when it pops out the end of a pipeline.
>Include a flag for NAN (not a new idea, right?), and store this for
>things like divide-by-0.  Then you can trap when the register is used as
>a source operand, and you can code explicit tests (and maybe traps) when
>you want to, on operations that could very well be taking place
>simultaneously.

Please NO. Save me from deferred traps. The CDC 6600 and some others
could produce several FP results that were illegal values FOR ANY OTHER
OPERATIONS. So exception traps took place not when/where you produced the
result but later when you tried to use it. Then the game was to guess where
you produced this value.

Well, I finally got used to coding explicit tests (usually not needed)
after every potentially violative FP operation (that might produce +-
infinity or +- indefinite). This is highly wasteful of code space for
something that is most often rare.  I can always check in cases where I need
to know, but please give me the trap when I create the bad result.  PRECISELY.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108

jkenton@pinocchio.encore.com (Jeff Kenton) (10/17/90)

From article <cf2f02Ea02o001@JUTS.ccc.amdahl.com>, by rbw00@ccc.amdahl.com (  213  Richard Wilmot):
> 
> Please NO. Save me from deferred traps. The CDC 6600 and some others
> could produce several FP results that were illegal values FOR ANY OTHER
> OPERATIONS. So exception traps took place not when/where you produced the
> result but later when you tried to use it. Then the game was to guess where
> you produced this value.
> 

What you are describing is fairly close to the idea of NaN (Not a Number)
required by the IEEE spec for floating point calculations.  The operation
which produces a NaN (0/0, infinity - infinity, inf/inf) *will* trap
when it occurs if the invalid-operation trap is enabled; with the trap
disabled, it produces a NaN, which propagates quietly through later
operations (a signaling NaN, by contrast, traps again when it is used).
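
For instance, with the trap disabled, the untrapped case looks like
this from C on an IEEE machine (a minimal sketch):

    #include <stdio.h>

    int main(void)
    {
        volatile double zero = 0.0;
        double q = zero / zero;      /* invalid op: quiet NaN, no trap */
        double y = q + 1.0;          /* the NaN propagates silently    */

        /* A NaN is the only value that compares unequal to itself. */
        if (y != y)
            printf("got a NaN and never trapped\n");
        return 0;
    }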


----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----
-----  jeff kenton:    	consulting at jkenton@pinocchio.encore.com  -----
-----		                 always at (617) 894-4508 	    -----
----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -----

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/18/90)

In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:

| Please NO. Save me from deferred traps. The CDC 6600 and some others
| could produce several FP results that were illegal values FOR ANY OTHER
| OPERATIONS. So exception traps took place not when/where you produced the
| result but later when you tried to use it. Then the game was to guess where
| you produced this value.

  It depends on how the system is designed. If I were designing a CPU
(and I don't), I would allow parallelism by allowing the FPU to have a
delay mode in which an instruction, let's call it DTRAP, sets a flag
such that if a trap would occur, the FPU enters a wait state until the
CPU issues the sync instruction, say WAIT, at which point the trap
occurs.

  This assumes that either you don't care about pinpointing the problem,
or you have some hook in the FPU to report it in a meaningful way.
Obviously you want an ITRAP for immediate trapping in other cases.
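
In C, the scheme might be modeled roughly as below (the sticky flag
and the two functions are invented stand-ins for the DTRAP/WAIT
machinery):

    #include <stdio.h>
    #include <stdlib.h>

    /* One sticky flag models the FPU's deferred-trap state. */
    static int fpu_trap_pending = 0;

    static double fp_divide(double a, double b)
    {
        if (b == 0.0) {             /* a trap would occur here...    */
            fpu_trap_pending = 1;   /* ...but DTRAP mode defers it   */
            return 0.0;             /* junk result, never trusted    */
        }
        return a / b;
    }

    /* WAIT: the synchronization point where the deferred trap fires. */
    static void fpu_wait(void)
    {
        if (fpu_trap_pending) {
            fprintf(stderr, "FP trap (deferred)\n");
            exit(1);
        }
    }

    int main(void)
    {
        double r = fp_divide(1.0, 0.0);  /* issued speculatively     */
        /* ... CPU does other useful work here ... */
        fpu_wait();                      /* trap surfaces here       */
        printf("%g\n", r);               /* not reached in this run  */
        return 0;
    }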
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (10/18/90)

In article <2772@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:
>
>| Please NO. Save me from deferred traps. The CDC 6600 and some others
>| could produce several FP results that were illegal values FOR ANY OTHER
>| OPERATIONS. So exception traps took place not when/where you produced the
>| result but later when you tried to use it. Then the game was to guess where
>| you produced this value.
>
>  It depends on how the system is designed. If I were designing a CPU
>(and I don't), I would allow parallelism by allowing the FPU to have a
>delay mode in which an instruction, let's call it DTRAP, sets a flag
>such that if a trap would occur, the FPU enters a wait state until the
>CPU issues the sync instruction, say WAIT, at which point the trap
>occurs.
>
>  This assumes that either you don't care about pinpointing the problem,
>or you have some hook in the FPU to report it in a meaningful way.
>Obviously you want an ITRAP for immediate trapping in other cases.

There is another way.  Assuming one wants to allow speculative execution
and doesn't want traps gumming up the works, one could have traps happen
at *assignment*.  This means each stage of the pipe (say, the FPU's)
could generate a "Trap Flag".  This flag would be propagated through
each stage, finally causing a trap when you try to catch the value from
the exit end of the pipe.

Possible refinements include not trapping if catching into r0 (or other
"non-real" registers), and making the trap flag include the trap reason
and/or the stage that caused the trap.

It seems to me this solves the problem (of traps gumming up speculative
execution) and is more debuggable than many architectures.
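
A rough C model of the idea (the struct, the stage function, and the
r0 convention here are illustrative only):

    #include <stdio.h>
    #include <stdlib.h>

    /* A value flowing down the pipe carries its trap flag with it. */
    struct pipe_val {
        double value;
        int    trap;    /* refinement: could also carry reason/stage */
    };

    static struct pipe_val stage_divide(struct pipe_val a, struct pipe_val b)
    {
        struct pipe_val r = { 0.0, a.trap || b.trap };
        if (r.trap)
            return r;                /* propagate, don't trap yet */
        if (b.value == 0.0)
            r.trap = 1;              /* set the flag, keep going  */
        else
            r.value = a.value / b.value;
        return r;
    }

    /* The trap fires only at "assignment", when the value is caught
       at the exit end of the pipe -- unless the target is r0. */
    static void catch_into(int reg, double *regfile, struct pipe_val v)
    {
        if (v.trap && reg != 0) {    /* r0 is the "non-real" register */
            fprintf(stderr, "trap at assignment to r%d\n", reg);
            exit(1);
        }
        if (reg != 0)
            regfile[reg] = v.value;
    }

    int main(void)
    {
        double regs[32] = { 0 };
        struct pipe_val a = { 1.0, 0 }, b = { 0.0, 0 };
        struct pipe_val r = stage_divide(a, b);  /* poisons r */
        catch_into(0, regs, r);   /* into r0: no trap (refinement) */
        catch_into(3, regs, r);   /* traps here, at assignment     */
        return 0;
    }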


Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

aglew@crhc.uiuc.edu (Andy Glew) (10/19/90)

>| Please NO. Save me from deferred traps. The CDC 6600 and some others
>| could produce several FP results that were illegal values FOR ANY OTHER
>| OPERATIONS. So exception traps took place not when/where you produced the
>| result but later when you tried to use it. Then the game was to guess where
>| you produced this value.

Why not split the expensive, possibly trap-producing instruction into two
parts: (1) an expensive, deferred-trap-producing instruction that is
percolated upwards and speculatively executed; and (2) an inexpensive
instruction that takes the deferred trap and actually produces an
immediate trap, and is not subject to code motion?
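
In rough C terms (spec_divide and spec_check are hypothetical
stand-ins for the two instructions):

    #include <stdio.h>
    #include <stdlib.h>

    /* Part 1: the expensive operation, safe to hoist.  Instead of
       trapping, it records a "poison" bit alongside the result. */
    struct spec_result { double value; int poisoned; };

    static struct spec_result spec_divide(double a, double b)
    {
        struct spec_result r = { 0.0, 0 };
        if (b == 0.0)
            r.poisoned = 1;          /* defer the trap */
        else
            r.value = a / b;
        return r;
    }

    /* Part 2: the cheap check, left at the original program point;
       it is the instruction that actually takes the trap. */
    static double spec_check(struct spec_result r)
    {
        if (r.poisoned) {
            fprintf(stderr, "deferred FP trap\n");
            exit(1);
        }
        return r.value;
    }

    int main(void)
    {
        /* hoisted above the guarding branch */
        struct spec_result r = spec_divide(1.0, 0.0);
        int divisor_ok = 0;
        if (divisor_ok)
            printf("%g\n", spec_check(r));  /* check stays put */
        return 0;                /* branch not taken: no trap   */
    }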
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

cik@l.cc.purdue.edu (Herman Rubin) (10/19/90)

In article <12976@encore.Encore.COM>, jkenton@pinocchio.encore.com (Jeff Kenton) writes:
> From article <cf2f02Ea02o001@JUTS.ccc.amdahl.com>, by rbw00@ccc.amdahl.com (  213  Richard Wilmot):
 
> > Please NO. Save me from deferred traps. The CDC 6600 and some others
> > could produce several FP results that were illegal values FOR ANY OTHER
> > OPERATIONS. So exception traps took place not when/where you produced the
> > result but later when you tried to use it. Then the game was to guess where
> > you produced this value.
 
 
> What you are describing is fairly close to the idea of NaN (Not a Number)
> required by the IEEE spec for floating point calculations.  The operation
> which produces a NaN ( 0/0, infinity - infinity, inf / inf ) *will* trap
> when it occurs, but the NaNs will trap when they are used (if the trap
> is enabled).

The 6600 was really RISCy, and had no provision whatever for optional
traps.  These conditions were testable, however, and many algorithms 
did such tests.  It was also hoped that the increased exponent range
would decrease the occurrence of these out-of-range conditions.

These results were only illegal as operands to FP operations.  One way
the use-time trap was exploited as a feature was to initialize memory
to negative indefinite (a value not produced by any FP operation) to
detect when a location was read before being written.  BTW, these traps
could be turned off globally.
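
On an IEEE machine, the same trick can be sketched in C (assuming
IEEE doubles; the bit pattern below is a signaling-NaN stand-in for
negative indefinite):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <math.h>

    /* Fill an array with a recognizable "never written" pattern:
       a negative signaling NaN, which no FP operation produces. */
    static void poison(double *a, size_t n)
    {
        uint64_t bits = UINT64_C(0xFFF4000000000000);
        for (size_t i = 0; i < n; i++)
            memcpy(&a[i], &bits, sizeof bits);
    }

    int main(void)
    {
        double a[4];
        poison(a, 4);
        a[0] = 3.14;                 /* the only element written */
        for (int i = 0; i < 4; i++)
            if (isnan(a[i]))
                printf("a[%d] read before being written\n", i);
        return 0;
    }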
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

cik@l.cc.purdue.edu (Herman Rubin) (10/23/90)

In article <3483@bnr-rsc.UUCP>, schow@bcarh185.bnr.ca (Stanley T.H. Chow) writes:
> In article <2772@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >In article <cf2f02Ea02o001@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:
> >
> >| Please NO. Save me from deferred traps. The CDC 6600 and some others
> >| could produce several FP results that were illegal values FOR ANY OTHER
> >| OPERATIONS. So exception traps took place not when/where you produced the
> >| result but later when you tried to use it. Then the game was to guess where
> >| you produced this value.

			....................

> There is another way.  Assuming one wants to allow speculative execution
> and doesn't want traps gumming up the works, one could have traps happen
> at *assignment*.  This means each stage of the pipe (say, the FPU's)
> could generate a "Trap Flag".  This flag would be propagated through
> each stage, finally causing a trap when you try to catch the value from
> the exit end of the pipe.
> 
> Possible refinements include not trapping if catching into r0 (or other
> "non-real" registers), and making the trap flag include the trap reason
> and/or the stage that caused the trap.
> 
> It seems to me this solves the problem (of traps gumming up speculative
> execution) and is more debuggable than many architectures.

We need more, not less, user control.  The reasonable programmer knows what
the program is doing, what odd things should occur, and what apparently odd
things should be ignored.  Not being omniscient, the programmer will miss a
few.  The "optimal" thing to do would be to have these handled efficiently
in accordance with the available information.  To do this efficiently would
require the appropriate hardware for traceback, and a rarely used part of the
instruction to handle the enabling and disabling of traps, what to do if they
occur, etc.  Many architectural problems and considerations are involved here,
but every simple solution proposed runs into problems.

It is quite possible, even, that some traps can be anticipated and thereby
avoided.  I have not seen any such architectural features.  One example
is a buffer read that calls a refill procedure when the buffer is empty.
The reads are
sporadic.  Now an anticipatory procedure would issue a weak interrupt when
the last item is read, but a strong interrupt if an attempt is made to read
when the buffer is empty.
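
A sketch of the buffer case in C (the weak interrupt is modeled as an
early refill after the last item is consumed; the strong interrupt as
the underflow path):

    #include <stdio.h>

    #define BUFSIZE 4

    static int buf[BUFSIZE];
    static int count = 0;

    /* Invoked as soon as the last item is consumed, so the refill
       can overlap with whatever the reader does next. */
    static void refill(void)
    {
        static int next = 0;
        for (count = 0; count < BUFSIZE; count++)
            buf[count] = next++;
    }

    static int buffer_read(void)
    {
        if (count == 0) {
            /* "Strong interrupt": the reader is already stuck. */
            fprintf(stderr, "underflow: refilling the hard way\n");
            refill();
        }
        int item = buf[--count];
        if (count == 0)
            refill();    /* "weak interrupt": anticipate the next read */
        return item;
    }

    int main(void)
    {
        refill();
        for (int i = 0; i < 10; i++)
            printf("%d\n", buffer_read());
        return 0;
    }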

Similar things can occur with arithmetic and logical conditions.  Attempts 
to oversimplify the problem are not likely to be optimal.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (10/24/90)

Well, first off let me say that speculative execution consists of two
components:  getting the instructions started and making sure the
program semantics are preserved.  Only a static mechanism, e.g., a
compiler, can look arbitrarily far ahead for instructions to start;
hence, compiler technology beats hardware on this by a large margin.
However, preserving program semantics can be a dynamic problem:  for
example, suppose that you get a floating-point exception in one of a
bunch of speculatively-executed instructions...  you really need a
dynamic mechanism, e.g. hardware, to deal with this -- as in the
Multiflow machines.  Hence, the basic problem is static (compilers),
but some secondary problems could use dynamic (hardware) support.

In article <3300194@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>Many enhancements to "increase parallelism" cannot be exploited by
>compilers unless P = NP, or unless you are willing to wait {hours,
>days, weeks} for your compiler to finish its job.  So don't just
>relegate EVERYTHING to the compiler!
...
>I believe you are much better off spending the extra real estate to
>implement identical functional units, or add vector hardware, since
>the compiler technology for exploiting these features runs in
>polynomial time.

This sounds reasonable, but isn't really correct.  The trick is to
recognize that the comparison here is between spending lots of compile
time to generate *OPTIMAL* solutions and spending no compile time to
implement no solution (i.e., simply discarding the structure).  The
best compromise is generally to invest enough compile time to get most
of the benefits...  and that usually isn't excessive compile time.  As
for vector, etc., being polynomial time compiler problems, that simply
isn't true for optimal solutions -- but nobody cares because you can
get good solutions quickly enough.

Remember that O() complexity measures don't tell you actual times --
i.e., high O() complexity doesn't necessarily imply long runtime for
real cases.  For example, who cares if compile time "blows up" when
given input with 30-level nested loops?  If it really bothers you in
such a rare case, you just turn off the expensive optimization for
that particular piece of code...  no big deal.

My rule is simply to figure out where each thing should be done
(static vs. dynamic mechanisms) and then to do the best we can in
implementing it that way.  In this approach, I have yet to come across
a compiler problem that defied a reasonably quick, approximate solution.

						-hankd@ecn.purdue.edu