[comp.arch] Register Scoreboarding

brooks@maddog.llnl.gov (05/07/89)

A fellow from SUN has asked "What is so magic about register scoreboarding?"

Register scoreboarding allows you to change the pipeline delays of the
functional units in the processor, and to tolerate stochastic pipeline
delays for memory, without any hassles.  A consequence of this, for
the Cray line, is that the same binaries can be run on the Cray-1,
Cray X-MP, and Cray Y-MP series, even though the pipeline delays differ
across these three machines.  On all of these machines accesses to main
memory are fully pipelined, and do not stall the processor on a "cache
miss" as is common for their little microchip brethren.  Pipelining cache
misses, providing full memory bandwidth which can keep the functional
units of the processor fed, is a trick that the micros have yet to
exploit.  Register scoreboarding is again part of this game, because
memory latency is stochastic.
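
A minimal sketch of that issue rule, in C (my illustration, not from the
posting; the register count, latencies, and three-instruction stream are
invented): each register carries a "ready at" cycle, and issue stalls only
until the operands actually read are ready.  The identical instruction
sequence then runs correctly whether a load takes 2 cycles or 20.

	#include <stdio.h>

	#define NREGS 8
	long ready_at[NREGS];	/* cycle at which each register is ready */

	/* Issue "dst = f(src)" with the given result latency; return the
	   cycle after issue.  The scoreboard check is the one if(). */
	long issue(long cycle, int dst, int src, long latency)
	{
	    if (ready_at[src] > cycle)	/* operand not ready: stall */
	        cycle = ready_at[src];
	    ready_at[dst] = cycle + latency;
	    return cycle + 1;		/* otherwise, one issue per cycle */
	}

	int main(void)
	{
	    long lat;
	    for (lat = 2; lat <= 20; lat += 18) {	/* fast vs. slow memory */
	        long c = 0;
	        int i;
	        for (i = 0; i < NREGS; i++)
	            ready_at[i] = 0;
	        c = issue(c, 1, 0, lat);	/* r1 = load (latency varies)   */
	        c = issue(c, 2, 2, 1);		/* independent op: never stalls */
	        c = issue(c, 3, 1, 1);		/* uses r1: stalls only here    */
	        printf("load latency %2ld: done issuing at cycle %ld\n", lat, c);
	    }
	    return 0;
	}

The "binary" (the three issue() calls) is unchanged across latencies; only
the stall cycles differ, which is the whole point.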

brooks@maddog.llnl.gov, brooks@maddog.uucp

grunwald@flute.cs.uiuc.edu (05/09/89)

Hennessy recently gave a talk here and, in response to a question concerning
either scoreboarding or register windows, posited that object-code
recompilers would be able to solve the same set of problems (binary
compatibility across architectural changes) more cheaply.

In his picture, when you have a new architecture with, say, more registers,
different delay costs or deeper pipelines, you translate your .o files and/or
your final binaries. This is essentially what scoreboarding is doing, albeit
dynamically. There are some limits to this approach, some obvious, some not
so obvious.
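
To make the static half of that concrete, here is a toy retiming pass in C
(the instruction encoding and the pass itself are invented for illustration;
a real translator would first move independent instructions into the gap
rather than just pad with nops):

	#include <stdio.h>

	struct insn { int is_load; int dst; int src; };	/* toy encoding */

	/* Re-emit the stream so that no load's result is read sooner than
	   new_latency instructions after the load.  (The nops could equally
	   well be placed just before the use.) */
	void retime(const struct insn *code, int n, int new_latency)
	{
	    int i, j, gap;
	    for (i = 0; i < n; i++) {
	        printf("insn %d\n", i);
	        if (!code[i].is_load)
	            continue;
	        for (j = i + 1; j < n; j++)	/* first use of the result */
	            if (code[j].src == code[i].dst)
	                break;
	        if (j == n)
	            continue;			/* result never used */
	        gap = j - i - 1;		/* independent insns in between */
	        while (gap++ < new_latency)
	            printf("nop\n");		/* pad the load-use distance */
	    }
	}

	int main(void)
	{
	    struct insn prog[4] = {
	        { 1, 1, 0 },	/* r1 = load   */
	        { 0, 2, 2 },	/* independent */
	        { 0, 3, 1 },	/* uses r1     */
	        { 0, 4, 3 },
	    };
	    retime(prog, 4, 3);	/* retarget to a 3-cycle load latency */
	    return 0;
	}

What no such pass can do is pick new_latency for a delay that varies at run
time, a cache miss being the usual example; that is the case the dynamic
approach still covers.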

Comments?
--
Dirk Grunwald
Univ. of Illinois
grunwald@flute.cs.uiuc.edu

andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) (05/10/89)

On using object-code recompilers to port to new implementations of a
non-scoreboarding architecture:

The notion that the problems addressed by scoreboarding can be resolved
at compile time is intellectually appealing, but not completely true --
some latencies just cannot practically be predicted at compile time.
The usual example is the added latency involved in a data cache miss.

Clamping the whole CPU on cache miss isn't a technique that can survive
into the 1990s.  I'm very curious to see what the non-scoreboard folks
will do.

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

henry@utzoo.uucp (Henry Spencer) (05/12/89)

In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes:
>Clamping the whole CPU on cache miss isn't a technique that can survive
>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>will do.

Probably pretty much what they do now:  let access and execution proceed
in parallel, assuming the data isn't needed right away.  Clamping the whole
CPU didn't survive far into the 80s, never mind the 90s.  Are you assuming
that it's done only by the people who loudly trumpet it as an advantage?
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mash@mips.COM (John Mashey) (05/12/89)

In article <3288@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM (Andrew Klossner;685-2505;61-201;;frip) writes:
>On using object-code recompilers to port to new implementations of a
>non-scoreboarding architecture:
>
>The notion that the problems addressed by scoreboarding can be resolved
>at compile time is intellectually appealing, but not completely true --
>some latencies just cannot practically be predicted at compile time.
>The usual example is the added latency involved in a data cache miss.
>
>Clamping the whole CPU on cache miss isn't a technique that can survive
>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>will do.

First, we need to get some definitions clear, and then some facts,
and then do some quick back-of-the-envelope numbers.

1) Of the currently popular RISCs, scoreboarding could be used, or not,
on any of them. Scoreboarding, or the lack thereof, seldom shows up in
the architecture of these machines. [The MIPS load-delay slot being the one
obvious exception.]

2) Most of the existing RISCs obtain at least some, and sometimes
substantial, concurrency on FP operations.  As far as I can tell, most of
them have architectures that require interlocks or scoreboarding, as the
latencies and issue rates of FP ops are permitted to move around from
implementation to implementation; FP ops tend to want to be multi-cycle,
and tend to be more amenable to extra silicon.

3) I'm not sure about some of the other RISCs, but R3000s only clamp
the main pipeline on a cache miss, not the independent units, like:
	integer mul/div; FP add; FP mul; FP div
all of which will keep cranking away if they're already started.

4) In the analysis below, it is important to note that typical fast RISC
pipelines try to have a very regular flow of data in and out of the registers,
with as few ports as they can get away with.  Note, for example, that if
you're doing out-of-order execution, and 3 operations finish together,
wanting into a 1-write-port register file, it will take 3 cycles for
them to get written...  This is one of the reasons that most of the RISCs
went with separate integer and FP regs, and maybe even 2-write-port FP regs,
even when using 1-write-port integer regs.
More generically: suppose you design a machine so that you can start
operations even while others are still pending.  You'd better make sure that
what you gain at the front end (by not stalling) you don't mostly lose at
the back end (by stalling while waiting for register access).
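
That write-port serialization is easy to see in miniature (the completion
times and the single port below are invented for illustration):

	#include <stdio.h>

	int main(void)
	{
	    long complete[3] = { 10, 10, 10 };	/* all finish on cycle 10 */
	    long port_free = 0;
	    int i;

	    for (i = 0; i < 3; i++) {
	        long w = (complete[i] > port_free) ? complete[i] : port_free;
	        printf("result %d written on cycle %ld\n", i, w);
	        port_free = w + 1;	/* one write port, one write/cycle */
	    }
	    return 0;
	}

The three results retire on cycles 10, 11, and 12: two cycles of the
front-end gain are handed right back at the register file.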

5) Back-of-the-envelope: how much would an R3000 gain by continuing execution
beyond a cache miss? (Assume M/2000-style machine; 20% loads, 10% stores;
about 25% memory system degradation on top of instruction cycles, on
real programs.)  Most of this will concern integer programs.
	a) I-cache-miss: it seems pretty hard to get beyond an I-cache miss;
	we already do the instruction-streaming gimmick to do what's possible
	to cover the latency.
	b) Load-misses (R3000s use write-thru caches, so there aren't really
		store misses).
	using the typical rule-of-thumb:
		20% loads
			70% of 1st load-delay slot fillable* (MIPS)
			30% of 2nd load-delay slot fillable (Stanford)
			10% of 3rd load-delay slot fillable (Stanford)
			i.e., these give you an idea of how far you can get
			before you REALLY need that data, because the
			compilers are trying hard already.
		I think that what this says, is that on the average,
		you get .7 * 1 + .3 * 2 + .1 * 1 = .7 + .6 + .1 = 1.4
		instructions beyond the load before you want that data.
	c) Suppose you get a 90% hit-rate on data (not unreasonable),
	and you make a bunch of favorable assumptions:
	10% miss-rate x 20% loads = 2% : % of instructions that are
	cache-missing loads.
	Assume 1.4 instruction cycles gained by non-stalling.  This gives
	1.4 * 2% = 2.8% that might be gained.  If you consider multi-cycle
	operations that could be fired up, then you might gain a little more.
	[The sketch after item (d) reruns these numbers.]
	As a cross-check, in this machine, data-cache misses have something
	like 12 cycles of latency, then transfer 16 words of data, and NOT
	doing the scoreboarding is like adding 1.4 cycles to this.  If you
	figure (very approximately) that every extra cycle of latency costs
	you 1% of performance, then this gives you 1.4 x 1% = 1.4%,
	which (for this level of enveloping) is not bad.
	d) Now, why are these favorable assumptions?
		The instructions that follow the cache-missing loads might
		be MORE cache-missing loads, and even if you're willing to have
		multiple outstanding requests, you still have to get the data
		into the regs sooner or later, which probably means more write
		ports than you would have had. [mixing integer & FP will help]
		I.e., it is hard to make a generic description of the added
		stalls on the back-end of the process, as it is very dependent
		on the system partitioning, read/write ports, memory
		interface, etc.  If you had a lot of long-cycle-count
		operations, it would help you to start them while waiting
		for a cache-refill [i.e., long FP ops; of course, most of
		the frequent ones you want to be quick anyway...]
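
The envelope from item (c), in executable form (same numbers, nothing new):

	#include <stdio.h>

	int main(void)
	{
	    double loads = 0.20;	/* fraction of instructions */
	    double miss  = 0.10;	/* data-cache miss rate     */
	    double runon = 1.4;		/* cycles gained per miss   */

	    printf("optimistic gain: %.1f%%\n", 100.0 * loads * miss * runon);
	    printf("latency view:    %.1f%%\n", runon * 1.0);	/* 1%/cycle */
	    return 0;
	}

which prints 2.8% and 1.4%, bracketing the bottom line below.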

Bottom line: the part of scoreboarding that lets you continue beyond a
	load-cache-miss gets you 1-3%, in an R3000-like architecture,
	in back-of-the-envelope style, which of course really needs a
	lot of simulation to confirm.   Maybe other people can post some
	real DATA about this, as this was a quick guess for one machine.
	This is not to say scoreboarding is good or bad: it is a reasonable
	way to design things.  Note, of course, that one must be extremely
	careful to think thru any error-handling scenarios.  For example,
	if state is actually committed beyond a load-cache-miss [rather than
	register-relabeling games], what do you do if you get a transient
	bus-error when the loaded-data actually arrives?  You cannot just
	back up the PC and try again, unless you track any side-effects and
	undo them, or skip those instructions, or something.
	Of course, out-of-order initiation of FP ops usually implies the
	possibility of imprecise exceptions, which is OK, but does cost
	you somewhere. Out-of-order initiation of anything implies a little
	trickier exception-handling in general.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)

In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
 
>>Clamping the whole CPU on cache miss isn't a technique that can survive
>>into the 1990s.  I'm very curious to see what the non-scoreboard folks
>>will do.
>
>Probably pretty much what they do now:  let access and execution proceed
>in parallel, assuming the data isn't needed right away.
 
Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
if there's a data cache miss on a load but there are several instructions
behind it which aren't interested in the result. I guess machines with
a common I/D cache can't do anything as the cache will be busy getting the
missed data, so Sparc (current implementations) seems out (except for the
floating point coprocessor queue). I also know that at least the multiply
and divide units in R[23]000 will continue to work, but what about other
independent instructions past the load delay slot ?
 
	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
	neideck@nestvx.dec.com

mpogue@dg.dg.com (Mike Pogue) (05/12/89)

In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>Bottom line: the part of scoreboarding that lets you continue beyond a
>	load-cache-miss gets you 1-3%, in an R3000-like architecture,

  John, 
	I think you have missed the point here.  The performance improvement
due to register scoreboarding is only of minor interest.  The real point
from an architectural point of view is that all binaries continue to run
in a predictable way, even when implementation details change.

Mike Pogue
Data General Corp.

These are my opinions, of course....     

rpw3@amdcad.AMD.COM (Rob Warnock) (05/12/89)

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com writes:
+---------------
| In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
| >>Clamping the whole CPU on cache miss isn't a technique that can survive...
| >>  I'm very curious to see what the non-scoreboard folks will do.
| >Probably pretty much what they do now:  let access and execution proceed
| >in parallel, assuming the data isn't needed right away.
| Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
| if there's a data cache miss on a load but there are several instructions
| behind it which aren't interested in the result.
+---------------

Assuming the natural split-I/D memory system (split caches, in this case),
and burst-mode instruction memory/cache, an Am29000 will continue to execute
instructions until either (a) the result of the load is needed, or (b) the
pipeline stalls waiting to start another "primary access" [since there's
only one address bus], which could be due either to a jump which misses
in the Branch Target Cache or another load/store which wants to start.

[While in theory, with really good load scheduling, you could thus avoid
the stall entirely, in practice you typically run on for only a few more
cycles before stalling, since in "general-purpose" code it's hard to schedule
a load far enough ahead to absorb a cache miss.]


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

tim@crackle.amd.com (Tim Olson) (05/12/89)

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
| Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
| if there's a data cache miss on a load but there are several instructions
| behind it which aren't interested in the result. I guess machines with
| a common I/D cache can't do anything as the cache will be busy getting the
| missed data, so Sparc (current implementations) seems out (except for the
| floating point coprocessor queue). I also know that at least the multiply
| and divide units in R[23]000 will continue to work, but what about other
| independent instructions past the load delay slot ?

Since loads are fully interlocked on the Am29000, the processor will
continue to execute instructions until

	1) The results of the missed load are required
	2) The channel is required for another load or store
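
A toy rendering of those two rules in C (my reading of them; the 12-cycle
miss and the four-instruction stream are invented):

	#include <stdio.h>

	#define MISS_LATENCY 12

	int main(void)
	{
	    int pending_reg = 1;	/* register the missed load targets */
	    int free_at = MISS_LATENCY;	/* data back / channel free at 12   */
	    /* each row: { register read, is it a load or store? } */
	    int prog[4][2] = { {2,0}, {3,0}, {1,0}, {4,1} };
	    int cycle = 0, i;

	    for (i = 0; i < 4; i++, cycle++) {
	        if (prog[i][0] == pending_reg || prog[i][1]) {
	            if (cycle < free_at) {	/* rule 1 or rule 2: stall */
	                printf("insn %d stalls from cycle %d to %d\n",
	                       i, cycle, free_at);
	                cycle = free_at;
	            }
	            pending_reg = 0;	/* miss has resolved */
	        }
	        printf("insn %d executes at cycle %d\n", i, cycle);
	    }
	    return 0;
	}

The two independent instructions run under the miss; the first use of the
loaded register (or the next memory operation, whichever comes first)
takes the stall.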




	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

marc@oahu.cs.ucla.edu (Marc Tremblay) (05/13/89)

In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>	b) Load-misses (R3000s use write-thru caches, so there aren't really
>		store misses).
Since the R3000 uses write-thru caches and 
since the depth of the write buffer is 4, 
I assume that "stores" stall the processor only when more 
than 4 consecutive writes occur or whenever the buffer overflows. 
Saving the register file on a context switch could certainly
generate stalls if it is done sequentially (i.e., without interleaving
register stores with non-store instructions).
From your comment I suppose that those situations do not occur very often?

					Marc Tremblay
					marc@CS.UCLA.EDU
					Computer Science Department, UCLA

les@unicads.UUCP (Les Milash) (05/13/89)

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
>>>Clamping the whole CPU on cache miss [sucks]
>>
>>Probably pretty much what they do now:  let access and execution proceed
>>in parallel, assuming the data isn't needed right away.
> 
>Ok, now I'm curious. What DOES a R[23]000, an AMD29000 or Sparc really do
>if there's a data cache miss on a load but there are several instructions
>behind it ....

i'm curious too, but about scoreboarding. 
will somebody please point me at some article that shows what kind of
logic implements scoreboarding?  i have questions like
*	can existing scoreboards handle chained dependencies?
*	does an instruction (like R1 OP R2, where R1 ain't ready yet, but where
	the next instruction could use the same alu and is ready-to-go) 
	allocate an alu?  or do ya queue the decoded instruction and wait
	for all of {r1 OP R2} to be available simultaneously?
*	an assignment to R1 or R2 must therefore be held off till all prior
	uses of the R[12] contents have either been latched into the alu or
	have completed
*	a bad scenario:
	Rg = some constant
	[lots of time]
	mem = R1	(perhaps a "many"-cycle thing)
	r2 = R1 op Rg
	r3 = R2 op Rg	(perhaps this is "too hairy")
	r4 = R3 op Rg
	r5 = R4 op Rg
	r6 = R5 op Rg
	r7 = R6 op Rg	(perhaps generate a "hostile user" trap :-)
	so seems like if you had finite (not a dataflow machine) scoreboarding 
	you'd need to teach your compiler how to not exceed it, or maybe
	you could quit issuing instructions if things got "too hairy", and
	expect that that'd be infrequent.

i kind of see scoreboarding as "idiotproofing" (where a MIPSCo compiler is
an example of a non-idiot), in that machines that need the compiler to fill
the load delay slot with an independent instruction can run (wrong)
programs that compute with numbers that aren't fetched yet.

i can't myself imagine what kind of logic could be built to handle all the
pathological cases i could think up, unless you can depend on the compiler
to not generate too bad a case.  i can imagine that some finite logic
could help in the "M% of cases that get you the easiest N%", and give up
otherwise by quitting issuing more instructions.
although Mr. Mashey had an argument that you wouldn't get that much (except
of course for being able to run R2000 code on the GaAs R7000)

that kind of stuff led me to my attraction with transputers.  i think of 'em 
as the ultimate scoreboarding; but it's at the system level, not in the 
processor chip.  so i realize that doesn't help us run many mips out of slow 
dram.  but the whole program, in all its millions of bytes of glory, is one 
big dependency graph, and the processor assures you that it will be executed 
in a "correctness-preserving" order, even if multiple processors are executing 
on your program.

mash@mips.COM (John Mashey) (05/13/89)

In article <23884@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>	b) Load-misses (R3000s use write-thru caches, so there aren't really
>>		store misses).
>Since the R3000 uses write-thru caches and 
>since the depth of the write buffer is 4, 
>I assume that "stores" stall the processor only when more 
>than 4 consecutive writes occur or whenever the buffer overflows. 
>Saving the register file on a context switch could certainly
>generate stalls if it is done sequentially (i.e., without interleaving
>register stores with non-store instructions).
>From your comment I suppose that those situations do not occur very often?

Stores stall the processor whenever the write-buffer is full.
It actually takes a bit longer than 4 consecutive stores, in the simple
case, as (for example, on an M/2000) writes are retired to memory at
1 word / 2 cycles.  You actually get something like 6 consecutive
writes before going into a stall-every-other-cycle state.

The register-save sequence typically has just a few stalls in it, as there
are other instructions interleaved, and it first saves the registers
that C won't save, leaving those that C would save for later.
I think the FP-regs-save sequence stalls a little more, as there's less
to do in the middle of it, but then it happens less often also.

The issue is statistically nonexistent for function calls, as relatively
few (dynamic) function calls save 6 or more registers.

Block-copies do their best to interleave the loop overhead and reads and
writes, so they don't see it either.
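
The retirement arithmetic is easy to check with a toy model (4-deep buffer,
1 store/cycle in, 1 word/2 cycles out, as described above; aligning retires
to even cycles is my simplification):

	#include <stdio.h>

	int main(void)
	{
	    int depth = 0, store;
	    long cycle = 0;

	    for (store = 0; store < 10; store++) {
	        while (depth >= 4) {	/* buffer full: stall */
	            printf("store %d stalls at cycle %ld\n", store, cycle);
	            cycle++;
	            if (cycle % 2 == 0)
	                depth--;	/* retire 1 word / 2 cycles */
	        }
	        depth++;		/* store enters the buffer */
	        cycle++;
	        if (cycle % 2 == 0 && depth > 0)
	            depth--;
	    }
	    return 0;
	}

Seven back-to-back stores get in before the first stall in this model
(close to the "something like 6" above), and after that it settles into
exactly the stall-every-other-cycle state described.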
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@mcdurb.Urbana.Gould.COM (05/13/89)

>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>	b) Load-misses (R3000s use write-thru caches, so there aren't really
>>		store misses).
>Since the R3000 uses write-thru caches and 
>since the depth of the write buffer is 4, 
>I assume that "stores" stall the processor only when more 
>than 4 consecutive writes occur or whenever the buffer overflows. 
>Saving the register file on a context switch could certainly
>generate stalls if it is done sequentially (i.e., without interleaving
>register stores with non-store instructions).
>From your comment I suppose that those situations do not occur very often?
>
>					Marc Tremblay
>					marc@CS.UCLA.EDU
>					Computer Science Department, UCLA

What about block copies (like, copying I/O data from kernel to user)?
(SUNOS 4 mapped files reduce this a bit)

Sometimes there's no substitute for having a good, interleaved,
memory system (which I believe MIPS does(?))

aglew@mcdurb.Urbana.Gould.COM (05/13/89)

>i'm curious too, but about scoreboarding. 
>will somebody please point me at some article that shows what kind of
>logic implements scoreboarding?  i have questoins like

Mark Johnson pointed out an Intel ISSCC (I always get those initials wrong)
paper on circuit implementations of scoreboarding for the '960.

Here are a few references at a slightly higher level of abstraction:

Keller, "Look-Ahead Processors", Computing Surveys Vol 7 No 4 December 1975,
	pp 177-195 (good bibliography).
Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units",
	IBM Journal Resarch and DEvelopment, 11, 1 (Jan 1967) pp 25-33
Hwu, "Exploiting Instruction Concurrency to Acheive High Performance
	in a Single-chip Microprocessor", PhD thesis, Report No. UCB/CSD 
	88/398, University of California, Berkeley, January 1988
	(sinilar papers in COMP ARCH 13 and IEEE Trans Comp)

The above references are for varieties of "Tomasulo" scheduling, which is a bit
more extreme than scoreboarding in, say, CDC or Cray processors, or the 88K.

I can't find any "regular" scoreboarding references at hand, but
Siewiorek, Bell, and Newell, _Computer Structures: Principles and Examples_
contains papers on the CDC 6600 and CRAY-1.

The distinction between "regular" scoreboarding and "Tomasulo" scoreboarding
is important; roughly described, dumb interlocking halts the entire processor
if a later stage's result isn't ready (at some fixed, designated time)
even if the next instruction doesn't need the output;
regular scoreboarding only halts an instruction if its operands aren't
ready, but doesn't necessarily provide execute past; and Tomasulo scheduling
allows execution of subsequent instructions.
    That's really rough; and I have also learned that "Tomasulo" scheduling
means different things to different people. I.e., to some people it means
Tomasulo's implementation with reservation stations and a common data bus,
but to me it means, basically, a limited form of dataflow computation
in a small window.
    Plus, of course, you don't need to implement a scheduling technique
everywhere. I would call the AMD29000's load scheduling a limited form
of scoreboarding, but in the rest of the pipeline they don't bother.
    And, of course, nobody's pure - below, when I describe "regular"
scoreboarding, I do not mean the scoreboarding on the 88K. The 88K is a bit
more aggressive than that, but not quite as aggressive as Tomasulo.


>*	can existing scoreboards handle chained dependencies?
Tomasulo: yes, limited only by resource constraints (the number of reservation
	stations)
Regular scoreboarding: exactly what do you mean? A->B->C type dependencies?
        Tomasulo would dispatch all instructions to wait for their result;
        regular scoreboarding would halt at the second until the first is
	ready, then halt at the third until the second is ready.
    Or do you mean vector chaining? See how Cray does things. The only reason
        for not applying scoreboarding to vectors is that there are a *lot*
	of registers => a lot of scheduling hardware. But, things like
	N. Jouppi's pseudo-vector operations for TITAN can help.


>*	does an instruction (like R1 OP R2, where R1 ain't ready yet, but where
>	the next instruction could use the same alu and is ready-to-go) 
>	allocate an alu?  or do ya queue the decoded instruction and wait
>	for all of {r1 OP R2} to be available simultaneously?

In Tomasulo scheduling, you queue the instruction in a reservation station.
Typically, you also snarf the inputs as they become ready, so as not to
prevent other instructions that may write to those from dispatching.
(I use a notation like O->I dependencies satisfied, with execute past;
O->O dependencies satisfied, in the extreme with execute past, in less
extreme cases without)
	Of course, if ALUs are cheap enough then your ALU input latches 
become your reservation station. That's OK for integer units, but floating
point units and multipliers still are complicated enough to warrant multiple
reservation stations.
	BTW, reservation stations do not necessarily imply the serial
dependencies of queues.
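
The snarfing mechanism is small enough to sketch in C (a rough illustration
of the idea, not the 360/91 implementation; the tags and values are
invented):

	#include <stdio.h>

	struct rs {			/* one reservation station */
	    int  busy;
	    int  qj, qk;		/* tags of pending operands (0 = ready) */
	    long vj, vk;		/* operand values once captured */
	};

	/* A completed result is broadcast with its tag; every station
	   waiting on that tag captures ("snarfs") the value off the
	   common data bus. */
	void cdb_broadcast(struct rs *st, int n, int tag, long value)
	{
	    int i;
	    for (i = 0; i < n; i++) {
	        if (!st[i].busy)
	            continue;
	        if (st[i].qj == tag) { st[i].vj = value; st[i].qj = 0; }
	        if (st[i].qk == tag) { st[i].vk = value; st[i].qk = 0; }
	    }
	}

	int main(void)
	{
	    struct rs st[2] = {
	        { 1, 7, 0, 0, 5 },	/* waiting on tag 7; other operand = 5 */
	        { 1, 7, 7, 0, 0 },	/* both operands waiting on tag 7 */
	    };
	    cdb_broadcast(st, 2, 7, 42);
	    printf("station 0: %ld op %ld\n", st[0].vj, st[0].vk);
	    printf("station 1: %ld op %ld\n", st[1].vj, st[1].vk);
	    return 0;
	}

Once a station's tags are all clear it can fire, independent of program
order; the tag match, not the register, is what carries the dependency.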

>*	an assignment to R[12] must therefore be held off till all of the
>	R[12]s contents are either in the alu or in yet
>*	a bad scenario:
>	Rg = some constant
>	[lots of time]
>	mem = R1	(perhaps a "many"-cycle thing)
>	r2 = R1 op Rg
>	r3 = R2 op Rg	(perhaps this is "too hairy")
>	r4 = R3 op Rg
>	r5 = R4 op Rg
>	r6 = R5 op Rg
>	r7 = R6 op Rg	(perhaps generate a "hostile user" trap :-)
>	so seems like if you had finite (not a dataflow machine) scoreboarding 
>	you'd need to teach your compiler how to not exceed it, or maybe
>	you could quit issuing instructions if things got "too hairy", and
>	expect that that'd be infrequent.

You can't dispatch an instruction unless it has a reservation station ready
-- that's your throttling.  It doesn't happen too often (see Hwu). 

There is still benefit from teaching your compiler about your machine.

I think of Tomasulo scheduling as the route to realistic dataflow machines.
The bottleneck in dataflow machines has always been the routing of results
to inputs; with Tomasulo we are learning how to do that easily and efficiently
in VLSI, and then, hopefully, it will scale (note that I do not consider
the common data bus a defining feature).

For this I had better sign myself (I better get used to being a student 
again)

Andy Glew  
IMPACT - Illinois Microprocessor Project using Advanced Compiler Technology
"The Hwuligans"
aglew@urbana.mcd.mot.com

Disclaimer: neither my advisor nor my employer share the opinions above.

mash@mips.COM (John Mashey) (05/14/89)

In article <170@dg.dg.com> mpogue@dg.UUCP (Mike Pogue) writes:
>In article <19463@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>
>>Bottom line: the part of scoreboarding that lets you continue beyond a
>>	load-cache-miss gets you 1-3%, in an R3000-like architecture,
>
>  John, 
>	I think you have missed the point here.  The performance improvement
>due to register scoreboarding is only of minor interest.  The real point
>from an architectural point of view is that all binaries continue to run
>in a predictable way, even when implementation details change.

I MUST NOT BE COMMUNICATING RIGHT, OR ELSE PEOPLE AREN'T HEARING
WHAT I'M SAYING, OR THE NET IS LOSING POSTINGS:
My posting: <19162@winchester.mips.COM> included:
-In particular, Motorola & co are persistent in claiming that the world
-will fall apart for MIPS if the timings of the floating-point operations
-change, despite the fact that it has clearly been stated many times
-that we have complete interlocking on ALL of the multi-cycle operations.
-Really, the only things that don't have interlocking are loads and
-equivalents (i.e., move-from-coprocessor), and they all have a 1-cycle
-delay that is predictable to the compilers.  The (Without) in
-Microprocessor (Without) Interlocking Pipeline Stages, which may have
-been appropriate for the Stanford MIPS, is pretty much irrelevant
-when it comes to MIPSco MIPS.
-As I've said here before, if we ended up with loads that had another
-cycle of latency, we;d build a machine with an interlock on the extra
-cycle.  If we decided to put in load interlocks, that would
-be upward-compatible, although we'd likely compile 3rd-party executables
-with R3000-style forever. (Of course, if we did add load interlocks at
-some point, and if there got to be more of those machines around, at some
-point maybe we'd start advising peopel to compile for that, and then do
-a reverse-translate on R3000-machines!)
-If the timings of floating-point operations
-are different (and they are) in forthcoming products, the existing object
-code works fine.  However, even with completely interlocked and/or
-scoreboarded code, you STILL want the compilers to be as aggressive
-as possible.  Fortunately, the way most of these things work, if you
-try to optimize for the version with the longest latencies, it usually
-works pretty well for ones with shorter latencies as well.  To see this,
-suppose you had a 5-cycle FP multiply, and so you'd been generating code
-that tried to issue 4 more instructions before using the result of the
-multiply.  IF the multiply expanded to 10 cycles, the compiler folks
-would try to work harder and find more things to do while the multiply
-were running, which wouldn't usually hurt the machine with the 5-cycle
-multiply.  It's just a question of the number of stall cycles, and
-it's obvious that it almost always pays to spread the computation of a
-multi-cycle result, and the use of that result as far apart as possible.
-
-This, of course, is not remotely a new issue: any of the long-lived
-computer product lines has faced this, especially those that
-cover a range of implementation technologies, such as VAXen or S/360s.
-The solutions are the same, except that the simplicity of RISC-style
-instructions makes it marginally easier to manipulate object code.
-Our experience with these methods tends to make us more willing to
-consider object code translation as one more trick to use when it makes
-sense, and it's really not that weird once you get used to it.

So now, I'll try again.  Here are some assertions that have appeared:

ASSERTION 1: scoreboarding lets you modify latencies for
different operations while using the same object code.

ASSERTION 2: scoreboarding is the ONLY way to do this; if a
fully-interlocked machine doesn't use scoreboarding, somehow, it is deemed
impossible for the object of ASSERTION 1 to be accomplished.

ASSERTION 3: well, if interlocking works after all, then scoreboarding
is better for performance reasons.
	3A: in supercomputers
	3B: in current microprocessors
------
ASSERTION 1 is clearly true.
ASSERTION 2 is silly; there are too many machines implemented without
scoreboards, but with interlocks, that let you modify the latencies.  I'll say,
one more time, there is exactly one kind of user-visible latency in a MIPS R3000
that is not interlocked, and that's the {load, move-from-coprocessor},
and it's one cycle that needs to be covered by the compiler.  The
likely-to-be-variable-and/or-long latencies [FP, int mul/div] are
all fully-interlocked, even though the compilers work hard to do their
best to schedule the pipelines well.  If we did an implementation where
our simulations claimed that out-of-order instruction initiation was a
sensible overall design choice, then we might use scoreboarding as an
implementation choice.  My back-of-the-envelope analysis showed why we
weren't yet overly excited about that.
It is CRAZY to build a machine with multiple overlapped functional units,
and NOT do pipeline-scheduling compilers [which, after all, were present
in early CDC compilers, for example], whether one uses scoreboarding
(to permit out-of-order initiation) or interlocking (which expects
in-order initiation, but freezes the instruction-issue unit, not necessarily
the other functional units, until the needed result is ready).
The amount of code in MIPS compilers to put a nop after a load if
no other instruction can be found is trivial [when I looked last,
I found 3 lines of code that were doing that] compared to the rest of
the reorganizations.  People who worry about this being a cause of bugs
or complexity in the compilers (compared with everything else)
should NEVER, EVER fly on a Boeing 747: after all, going from San Francisco
to London, you might fall out of your seat and break your leg due to a
defective seat-belt buckle :-)

ASSERTION 3
	3A: supercomputers
		Probably true
	3B: current microprocessors
		Seems unlikely until they start having memory systems
		like 3A.

I'm out of time & getting the SPEC bench-a-thon together is going to
occupy most of my time for the next couple weeks, so maybe somebody
else can continue this.  Specifically, maybe brooks@maddog.llnl.gov
(or anybody else really familiar with supercomputer architecture)
could describe the memory systems of such things.  There are some
fairly sensible reasons why the answers on 3A & 3B might be
opposite....

Finally, maybe somebody from 88K-land could describe how far into
out-of-order execution the 88K goes, i.e., assuming no scoreboard
block,
	1) how many instructions can be issued beyond a load that
	cache-misses, or tlb-misses, or both?
	2) how many instructions beyond a stalled-FP-multiply
	(for example) can you execute?
(I haven't seen anything that says definitively what the current 88K's do.)

Like both 88K and MIPS, SPARC is defined to allow different-latency
FP implementations, and in fact, 3 different ones are already extant.
Perhaps the SPARC guys would care to join the fun and talk about
differences in latencies, overlap, etc.  [If you haven't noticed it,
Sun-4s recently got the FPU2, which raised FP performance in the same
systems.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (05/15/89)

In article <28200311@mcdurb> aglew@mcdurb.Urbana.Gould.COM writes:

>What about block copies (like, copying I/O data from kernel to user)?
>(SUNOS 4 mapped files reduce this a bit)
>
>Sometimes there's no substitute for having a good, interleaved,
>memory system (which I believe MIPS does(?))

The older machines had various flavors of interleaving.
The M/2000 uses 2-way interleaving per memory board to sustain
read traffic of 1 word/cycle, or write traffic of 1 word/2 cycles.
Bcopy works fine; there's enough reading to do to not have an
issue with write stalls.  The only case that write-stalls much is
bzero, because there isn't much else to do; the only answer to that is
a more expensive memory system, which of course is possible.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/16/89)

In article <8905120628.AA08299@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
>In article 10198 henry@utzoo.uucp (Henry Spencer) writes:
> 
> 
>Ok, now I'm curious. What DOES a ... or Sparc really do
>if there's a data cache miss on a load but there are several
>instructions ..

A good instruction scheduler places the "don't cares" after the load
so the load miss is simply irrelevant.

I can't comment on how well the current production SPARC compilers do
at this; but the Cydra 5 compiler did quite well at the equivalent
(keeping the memory pipe busy).  Since the Cydra 5 had a minimum read
latency of 17 cycles, this indicates (to me at least :>) that keeping
current SPARCs from stalling should be achievable.

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/16/89)

In article <19661@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>ASSERTION 3: well, if interlocking works after all, then scoreboarding
>is better for performance reasons.
>	3A: in supercomputers
>	3B: in current microprocessors
>------ 

.... 

>ASSERTION 3
>	3A: supercomputers
>		Probably true

???

I can't think of a supercomputer which _does_ use scoreboarding, at
least not as I understand the term. Supercomputers tend not to have
data caches... instead they have very fast, high-bandwidth memory
systems (using many banks of memory) and rely on the fact that real
scientific programs tend to have well-behaved, in some sense (i.e.,
sort of vectorizable), key loops. The fact that these are typically
FORTRAN machines alters the program statistics somewhat. As has been
demonstrated by many vendors, lots of registers, software and
sometimes hardware pipelines, loop transformations (percolation
scheduling, etc.) and other unnatural acts are key to getting good
performance. Note that these remarks apply to machines like the
CDC6600, the various Crays, misc. Japanese vector machines, the Cydra
5 and the Multiflow machines (which is why I said "sort of
vectorizable"). 

These machines typically have interlocks; but nothing like the
scoreboard scheme of the 88K (unless my memory is very leaky this
week). 

>	3B: current microprocessors
>		Seems unlikelym until they start having memory systems
>		like 3A.

Scoreboarding probably works; but there seems to be a certain lack of
evidence that it is necessary. Seems overly complex to me .... but
what do I know... I studied math and grew up working in Kalman
filtering applications ... :>

>
>
>Like both 88K and MIPS, SPARC is defined to allow different-latency
>FP implementations, and in fact, 3 different ones are already extant.
>Perhaps the SPARC guys would care to join the fun and talk about
>differences in latencies, overlap, etc.  [If you haven't noticed it,
>SUn-4s recent got the FPU2 that raised FP performance in the same
>systems.]

I don't know what to say... other than that it works just fine. All
the binaries I tested (and it was more than a few) worked up and down
the line. It is true that it is possible to tickle the compiler into
generating code which is better for one chip or another, and this is
likely to continue. There will always be a compatible mode (which most
people will use exclusively) with good performance for the
most popular implementations, and different implementations will have
some special code ordering / other implementor neat stuff ... 

As John pointed out in an earlier posting, SUN has chosen to give
implementors more leeway (the N-design teams notion), and folks at
Prisma probably will have lots of neat stories to tell around Jan 1990
(4nsec SPARC ... design goal of 100Mflops) about how well SPARC really
scales, and how the cleverly picked interlocks were just right (or
caused them endless grief :>).


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

andrew@frip.WV.TEK.COM (Andrew Klossner) (05/17/89)

> Finally, maybe somebody from 88K-land could describe how far into
> out-of-order execution the 88K goes, i.e., assuming no scoreboard
> block,
> 	1) how many instructions can be issued beyond a load that
> 	cache-misses, or tlb-misses, or both?
> 	2) how many instructions beyond a stalled-FP-multiply
> 	(for example) can you execute?

[I'm a user, not a designer, of the 88k architecture.]

The instruction fetch unit, the data fetch/store unit, and the two FPU
pipelines (one multiply, one add/subtract/convert/divide) operate
fairly independently.  There is no inherent reason why a stall by one
would force the others to stall, although, as John pointed out, if you
stall on instruction fetch, the other units are going to starve pretty
quickly.

1) Caching and MMU translation are done external to the CPU, in the
CMMU (cache/MMU), so cache- and TLB-misses are all the same to the CPU;
they just mean that the CMMU will take a little longer to finish the
load.  If you don't use the load target and you don't issue another
load or store, you can continue executing forever.  On our system, the
code- and data-CMMUs talk to the same memory bus ("M bus" in Motorola
parlance), so if you take a code-cache miss during a load, the code
CMMU will have to wait for the bus.

2) Again, if you don't use the FP-multiply target and you don't issue
another FP-multiply, you can continue executing forever.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/17/89)

In article <104999@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

>I can't think of a supercomputer which _does_ use scoreboarding, at
>least not as I understand the term. Supercomputers tend not to have

I don't know much about the Japanese machines, but all the CDC and Cray
machines I am familiar with do register scoreboarding.  They do not do
the full "Tomasulo" scoreboarding, in that instruction issue stalls if
registers are busy.  

Note that what scoreboarding buys you is the ability to continue
issuing instructions after a memory operation without knowing how long
a particular memory access will take.  This is very important in a machine
which has 8 processors, 32 memory ports, 256 memory banks, and memory
latencies of at least 4-5 clock periods, and sometimes longer.
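
A toy model of why that latency is unknowable in advance (bank count,
latency, and recovery time are invented; real machines are messier):

	#include <stdio.h>

	#define NBANKS   8
	#define LATENCY  5	/* cycles from request to data */
	#define RECOVERY 4	/* cycles a bank stays busy    */

	long busy_until[NBANKS];

	/* Start an access to addr at cycle; return when the data is ready. */
	long mem_access(long cycle, long addr)
	{
	    int b = addr % NBANKS;	/* word-interleaved banks */
	    if (busy_until[b] > cycle)	/* bank conflict: wait */
	        cycle = busy_until[b];
	    busy_until[b] = cycle + RECOVERY;
	    return cycle + LATENCY;
	}

	int main(void)
	{
	    long a;
	    for (a = 0; a < 4; a++)	/* stride 1: banks rotate */
	        printf("stride-1 access %ld ready at %ld\n",
	               a, mem_access(a, a));
	    for (a = 0; a < 4; a++)	/* stride NBANKS: one bank */
	        printf("stride-8 access %ld ready at %ld\n",
	               a, mem_access(a + 4, a * NBANKS));
	    return 0;
	}

Identical loads see different latencies depending on what else has hit the
bank lately; a scoreboard lets issue continue without the hardware (or the
compiler) having to know in advance which case it will be.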

Some architectures permit a certain number of instruction issues to continue
if registers are busy, and this is a sort of local dataflow, as someone
mentioned previously.  I don't think any of the current crop of machines do
that, although I assume that the IBM 360/91 did since that was the machine
that Tomasulo wrote his paper about (it is reprinted in Siewiorek, Bell, and
Newell's "Computer Structures" that probably most people have).

Specifically, current machines do not seem to have "reservation stations" which
permit instruction issue to continue (on the 360/91 it was floating point
add and multiply only, I believe) if the register is busy.  Anyway, with
instruction scheduling in the compilers (for the last 15 years ...) 
the claim is that this sort of multiple outstanding instruction issue
capability buys you very little over the current technique (stall if busy).
I believe that the level of scoreboarding used plus instruction scheduling
is believed to be "locally optimal" for a pipelined machine with segmented
functional units.  Reservation stations may increase the parallelism
slightly at some cost in functional unit latency.

The main point is that if you want to have multiple outstanding loads,
indeterminate memory delays due to bank busy conditions, multiple
outstanding instructions (all or almost all the ALU operations on these
machines are fully segmented- typically at one per clock cycle, though
not always with divide etc.), and, fairly long memory latencies 
(e.g. 5-13 clock periods) scoreboarding is a nice way to handle it.

>These machines typically have interlocks; but nothing like the
>scoreboard scheme of the 88K (unless my memory is very leaky this
>week). 

The first use of the term "scoreboard" that I know of was Thornton's
1964 paper on the CDC 6600, also in Siewiorek, Bell, and Newell.

>Scoreboarding probably works; but there seems to be a certain lack of
>evidence that it is necessary. Seems overly complex to me .... but
>what do I know... I studied math and grew up working in Kalman
>filtering applications ... :>

I am not sure what is "overly complex" about it.  If you have one path to
memory on a one CPU system, it isn't necessary.  If you have multiple paths
on a multiple CPU system, it turns out to be just what you need.  If you
are concerned about increasing the critical path in instruction issue
by one level, you are probably right, though clever hiding of these things
may occur.  Anyway, as always, the question is: do you get out of it
what you pay for?  The fact that it has been used for 25 years on machines with
multiple paths between the CPU and memory should say something.




  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

kds@blabla.intel.com (Ken Shoemaker) (05/18/89)

Perhaps I'm missing something here, but I'd like to make a few observations:

1) a large register array seems to be useful as a user-managed data cache
2) caches, in general, don't freeze up instruction execution while doing
	replaces (or, at least, they don't need to)
3) register scoreboarding provides for very simple support for cache (i.e.,
	register) reloads by an autonomous unit (kicked off by the
	processor, but not requiring exclusive use of the resource until
	the operation is complete).

This requires that the "register reloader" has a write port to the registers
aside from the processor write port, otherwise you have a resource
constraint.  I don't think this approach is appropriate to the MIPS machines 
because of their bus organization, i.e., they have a resource constraint 
in the use of the external data bus.  But to paraphrase John's comments,
such a machine, with the register reloader autonomous functional unit,
splits the operation of loading the registers into two functional units
whereas most current machines use a single functional unit (the execution
unit).  If you have two such units, it makes no sense to keep the execution 
unit frozen until you actually have the registers reloaded, much like it
doesn't make sense to keep the floating point unit frozen while doing
integer memory loads.

It would also have consequences for code reorganization.  Procedures, or
maybe even whole programs (because, like Henry Spencer says, compilers are
real smart these days), would have a preamble which would load the entire
complement of variables into the registers.  The concept of "load delay
slots" really becomes a don't care, because the registers are loaded long
before they are used.

This is really just a smart data prefetch algorithm driven by software
(imagine that).  The same kind of thing can be done to force replaces in
the external cache so that by the time you get around to using a piece
of data, it will be a cache hit and not require the latency time to
system memory.  Or course, this assumes that your register space isn't
sufficiently large to hold all the variables that are going to be used.

I won't even worry about volatile variables.  They are such infrequent
cases as to be don't-cares.  For the time being, assume that there are no
problems with multi-processor data consistency between the registers (nee
user-managed data cache) of different processors or I/O devices.

I'm sorry if this is obvious.  It certainly seems so to me, so there must be
something I am not seeing.  I'd appreciate if someone could straighten me
out.  But as I am leaving for two months at the end of the week, and as we
don't keep news around on the system that long, I'd appreciate a note in my
mail instead.  Thanks!
----------
I've decided to take George Bush's advice and watch his press conferences
	with the sound turned down...			-- Ian Shoales
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|pur-ee|hacgate|oliveb}!intelca!mipos3!kds

cik@l.cc.purdue.edu (Herman Rubin) (05/18/89)

In article <25484@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 
> In article <104999@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:

>			.........................
 
> The main point is that if you want to have multiple outstanding loads,
> indeterminate memory delays due to bank busy conditions, multiple
> outstanding instructions (all or almost all the ALU operations on these
> machines are fully segmented- typically at one per clock cycle, though
> not always with divide etc.), and, fairly long memory latencies 
> (e.g. 5-13 clock periods) scoreboarding is a nice way to handle it.

Scoreboarding and related items can be important even if load time is not
the bottleneck.  Much mathematical code can use this, as operations need
the result of previous operations.  This is especially the case if operations
do not take constant amounts of time, or if branches occur.  It is sometimes
even possible to arrange operations to take into account unusual timings,
letting scoreboarding keep things efficient.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
