[comp.arch] Horizontal Pipelining -- a pair

aglew@ccvaxa.UUCP (11/21/87)

..> Superlinear speedups

> >      Just for drill, consider the fact that a 2-CPU S/370-168-MP ran a
> > throughput of 2.1 to 2.25 over a 1-CPU system, depending on just what we
> > gave it for workload.
>      If I understand correctly, you are saying that going from 1 to 2 CPUs
> _more than doubled_ the throughput of the system? This is counter intuitive.
> If you could expand on it, I would appreciate it.
>      For instance, what are we talking about as a measure of throughput, and
> was the job mix similar?

Many people and companies are finding superlinear speedups - N processors
more than N times better than a single processor - especially on job mixes,
but also in single programs, and by many different performance indices,
including the most important one: elapsed time.

The reason is basically reduced context switching - in ways that you can't
necessarily get by applying tricks such as interrupt stacking on a single
processor.

Now, it should be obvious how multiple processors reduce context switching
wrt. a single processor when it's running a timeshared system, but how does
this improve the performance of a single program?

Well, programs typically have a lot of hidden, internal context. Consider
a program that has four sections, A1, A2, B1, B2, where both A1 and B1 have
to run before either A2 or B2. But suppose that A1 and A2 share a lot of
context - e.g. they manipulate a huge array - and B1 and B2 share a lot of
context, but there is little context shared between the Ai and Bi families.
E.g. A1 does enough processing on the huge array it shares with A2 that it
will completely fill the cache, flushing out any data that B1 might have
loaded, and so on.
	Now say that the dependency from B1 to A2 (and from A1 to B2) is
just a small amount of data.

On a single processor, the time for this program to run is
	P(A1) + P(B1) + P(A2) + P(B2) + 2*C(AB)
where P(jobstep) = processing time, and C(AB) = time to change context
between A and B (you can analyze this in more ways, but that's the basic
idea).

On a multiprocessor, the total run time is
	P(A1) +..+ P(B2) + 2*M(AB)
but the elapsed time is
	max(P(A1),P(B1)) + max(P(A2),P(B2)) + M(AB)
where M(AB) is the cost to send a message between the two processors
containing the information from B1 that A2 depends on, and vice versa.
If C(AB) is large, it is quite easy for the total elapsed time on the
multiprocessor to be less than half that on the uniprocessor.
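
To make the arithmetic concrete, here is a throwaway back-of-the-envelope
program (every P, C, and M value in it is an assumption picked for
illustration, not a measurement). With a context-change cost comparable to
a job step, the 2-CPU elapsed time comes out well under half the
uniprocessor figure:

/* superlinear speedup back-of-the-envelope; all numbers are assumed */
#include <stdio.h>

#define MAX(a,b)  ((a) > (b) ? (a) : (b))

int main(void)
{
    double pA1 = 100, pA2 = 100, pB1 = 100, pB2 = 100;  /* P(jobstep) */
    double cAB = 80;   /* C(AB): refill caches, TLB, etc. on one CPU */
    double mAB = 10;   /* M(AB): small cross-CPU message */

    double uni = pA1 + pB1 + pA2 + pB2 + 2*cAB;
    double mp  = MAX(pA1, pB1) + MAX(pA2, pB2) + mAB;

    printf("uniprocessor elapsed = %g\n", uni);     /* 560 */
    printf("2-CPU elapsed        = %g\n", mp);      /* 210 */
    printf("speedup              = %g\n", uni/mp);  /* about 2.7 */
    return 0;
}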

The key point here appears to be context. Here we are comparing N processors
with N*M memory to 1 processor with M memory, not N processors with small
memories M/N. Or actually not memories, but "context" - caches can play the
same role for shared-memory machines, as can private register sets.
	The small-memory multiprocessor can still win out in certain
cases, but in fewer of them, I suspect, than the large-memory machine.

So, not only has superlinear speedup been observed in practice, but a
theoretical understanding of it is possible too.


Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

aglew@ccvaxa.UUCP (11/28/87)

..> John Mashey talking about barrel architectures
Just making sure - barrel architectures are systems that multiplex
each pipeline stage between different threads of execution?
I assume this in my response.
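
Concretely, the toy picture I have in mind is the following (this is purely
my assumed model, not anything John described): successive cycles - and
hence successive pipeline stages - belong to successive hardware thread
contexts.

/* toy "barrel": one issue per cycle, successive cycles handed to
   successive hardware threads, so at any instant each pipeline stage
   is working for a different thread; parameters are assumed */
#include <stdio.h>

#define NTHREADS 4
#define DEPTH    4

int main(void)
{
    int cycle, stage;
    for (cycle = DEPTH; cycle < DEPTH + 8; cycle++) {
        printf("cycle %2d:", cycle);
        /* the instruction now in stage s was issued s cycles ago,
           i.e. it belongs to thread (cycle - s) mod NTHREADS */
        for (stage = 0; stage < DEPTH; stage++)
            printf("  stage%d=T%d", stage, (cycle - stage) % NTHREADS);
        printf("\n");
    }
    return 0;
}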

>Assume we're using split I & D caches. Assume that the cache line
>is N words long, filled 1 word/cycle after a latency of L cycles.
>One would expect that efficient cache designs have L <= N.
I expect this is quite basic, but how do you show this for barrel
architectures?  (The demonstrations I've seen have required regular
architectures.)

>When filling an I-cache miss, you can do L more barrel slots,
>then you must stall for N slots (or equivalent), because it doesn't
>make sense to have the I-cache run faster than the chip (if it did,
>you would run the chip faster). 
Why not have the I-cache run faster than the chip? I-caches are more regular
structures than the cpu, and are probably that much easier to make run
faster.
    Also, why stall for N slots? There are several schemes to deliver data
from a partially filled cache line as soon as it is available. 
    Finally, why not continue on a completely separate thread while
the cache is filling for the thread that caused the cache miss?
Barrel need not imply round robin.
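
For example, here is a sketch of the kind of thing I mean by the last two
points (everything in it - per-word valid bits, which word missed, the fill
rate - is assumed for illustration, and it ignores the question of who owns
the I-cache port while the fill is in progress):

/* the line fills one word per cycle behind per-word valid bits, so the
   thread that missed can resume as soon as *its* word arrives, and the
   other threads keep taking issue slots in the meantime */
#include <stdio.h>

#define NTHREADS 4
#define LINE     8              /* N: words per line (assumed) */

int main(void)
{
    int valid[LINE] = {0};      /* per-word valid bits for the filling line */
    int fill_word = 0;          /* next word to arrive, one per cycle */
    int want = 3;               /* the word thread 0 actually missed on */
    int cycle, next = 0, tried;

    for (cycle = 0; cycle < 10; cycle++) {
        if (fill_word < LINE)
            valid[fill_word++] = 1;   /* one more word of the fill lands */

        /* give the issue slot to the next thread that can use it;
           thread 0 may not issue until its missing word is valid */
        for (tried = 0; tried < NTHREADS; tried++) {
            int i = (next + tried) % NTHREADS;
            if (i == 0 && !valid[want])
                continue;             /* still waiting on word 3 */
            printf("cycle %d: thread %d issues%s\n", cycle, i,
                   fill_word < LINE ? " (fill still in progress)" : "");
            next = (i + 1) % NTHREADS;
            break;
        }
    }
    return 0;
}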

I think that I've missed part of John's point here.


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
aglew@mycroft.gould.com    ihnp4!uiucdcs!ccvaxa!aglew    aglew@gswd-vms.arpa
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

mash@mips.UUCP (11/30/87)

In article <28200073@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> John Mashey talking about barrel architectures

>>Assume we're using split I & D caches. Assume that the cache line
>>is N words long, filled 1 word/cycle after a latency of L cycles.
>>One would expect that efficient cache designs have L <= N.
>I expect this is quite basic, but how do you show this for barrel ...

This wasn't intended as a proof or design, just background for a
first-order back-of-the-envelope analysis.

>>When filling an I-cache miss, you can do L more barrel slots,
>>then you must stall for N slots (or equivalent), because it doesn't
>>make sense to have the I-cache run faster than the chip (if it did,
>>you would run the chip faster). 
>Why not have the I-cache run faster than the chip? I-caches are more regular
>structures than the cpu, and are probably that much easier to make run
>faster.
The original assumption, perhaps not stated strongly enough, is that
the I-cache was built from ordinary SRAMs (to ride the SRAM cost curve).
(There are other ways to do this; however, this is The Right Way :-) - see below.)

>    Also, why stall for N slots? There are several schemes to deliver data
>from a partially filled cache line as soon as it is available. 
There are, of course, all sorts of schemes.  All I meant was that if you want
to fill N words of I-cache, 1/cycle, you will chew up N cycles where either:
	a) No process will be making progress, because all of the bandwidth
	will be eaten up.
	OR
	b) At best, the process that caused the I-cache miss gets to
	progress, up to N cycles (but this looks suspiciously like a single
	pipelined processor).
However you cut it, given the original assumptions, I think you end up
eating N cycles of I-cache bandwidth.
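
To put rough numbers on that, a back-of-the-envelope sketch, taking case (a)
above and treating all N fill cycles as lost; the miss rate and line size
are assumptions chosen only to show the shape of the result:

/* case (a): every I-cache miss eats N barrel slots of fill bandwidth,
   during which nobody makes progress; all numbers are assumed */
#include <stdio.h>

int main(void)
{
    double miss_rate = 0.02;   /* I-cache misses per issued instruction */
    int    N = 8;              /* words per line = fill cycles */
    int    L = 4;              /* fill latency; only decides when the
                                  stall starts, not how long it lasts */

    double slots = 1.0 + miss_rate * N;   /* barrel slots per instruction */

    printf("N=%d, L=%d, miss rate=%.2f -> %.2f slots/instruction, "
           "%.0f%% of slots spent on fills\n",
           N, L, miss_rate, slots, 100.0 * miss_rate * N / slots);
    return 0;
}
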
>    Finally, why not continue on a completely separate thread while
>the cache is filling for the thread that caused the cache miss?
>Barrel need not imply round robin.
Again, the assumption was that each access to write a word into the cache
blocked either all threads, or at least, all threads except the one that
caused the fill.

re: The Right Way (with SRAMs)
I'm no expert in this either, but there are a lot of people here who are
world-class experts in this, and maybe one will correct me if the
following summary of what they've said is too wrong:

1) CMOS VLSI CPU and SRAM technologies are very closely coupled:
	a) SRAMs are what VLSI folks often use to debug new processes.
	b) SRAMs are usually a VLSI-process generation ahead of CPUs.

2) One approach is to drive the CPU design entirely by your model of
projected SRAM performance curves (sort of like surfboarding):
	a) Aim a CPU design to come out about when the SRAMs have debugged the
	technology you want to use.  The first CPUs work with some expensive
	SRAMs obtained in small quantities.
	b) By the time the CPU is yielding reasonably, the SRAMs you need are
	getting to be reasonably available.
	c) By the time you want to ship lots, the SRAM prices have started
	dropping reasonably.
MEANWHILE:
	d) The SRAM guys are working on their next generation, and
	e) You're working on your next generation.  Goto a)

Note that although the CPU is more irregular:
- Some critical things, like register files, are pretty similar to SRAMs.
- The CPU is all there on one chip, so it doesn't pay chip-crossing
time penalties.  The cycle through the SRAMs does cost you those delays,
hence in some sense the SRAMs need to be faster just to stay even with the CPUs.

This strategy assumes that one can track the SRAM performance curves,
and generally get better cost/performance by using standard SRAMs,
whose cost rapidly drops, and whose technology is constantly being
pushed by multiple vendors.  It also assumes that the highest-performance
systems of the future will want large caches.  Of course, this
strategy remains to be verified, although the early evidence seems in
favor of it.  Of delivered RISC machines, the ones that use standard
SRAMs {MIPS, SPARC} outperform those that have special-purpose
cache-mmu chips {Clipper}.  The AMD29000 (no special RAM chip designs)
and Motorola 78000 {cache/mmu chips} will add a few more data points.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086