[comp.arch] Performance

mash@mips.COM (John Mashey) (05/02/90)

In article <412@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>In article <38348@mips.mips.COM> mash@mips.COM (John Mashey) writes:

>SPEC benchmarks are more realistic programs than Dhrystone.  How can you
>explain the unexpectedly low SPECmark and SPECthruput of the DECsystem 5810?

It's all in the memory system design, and for any serious info, somebody from
DEC would have to answer it. It is quite easy to lose a lot of performance
when matching an existing bus/memory system environment, but for the specific
details, somebody from DEC should post.  Whatever's going on, the SPEC numbers
are probably reasonable indicators.

>Do you mean that the memory system of 5810 was designed in this manner?
>Then, how about other members in the DECsystem 5800 series?
I think so, but again, somebody from DEC...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015 or 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kahn@batcomputer.tn.cornell.edu (Shahin Kahn) (05/06/90)

In article <38303@mips.mips.COM> mash@mips.COM (John Mashey) writes:
<In general, with any of these higher-performance machines, one can NEVER
<assume that 2 machines with the same clock rate will even exhibit the same
<uniprocessor performance on single-user benchmarks, simply because
<memory systems can differ greatly:

And I would add that for multi-user benchmarks the difference in
performance is probably even more likely, and greater.
Speaking of which, what is the story with SPEC-throughput??
I had never heard of it until a few articles ago.
I hope they are not running just two copies of a benchmark.
They should run N=1,..,M,..,M+4 copies, where M is the
number of jobs at which paging starts.  One needs to look
at the slopes before and after paging (and hopefully you get
two straight lines without much of a knee).
You might be surprised at what you see!

<	cache sizes, types, and hierarchy
<	memory latency
<	refill size
<	depth and nature of write-buffers
<	parity versus ECC, sometimes

Well said.  I kept these lines so they get read again!
Aren't these the parts of a system that cost the most?
What is your estimate of the cost split among various components
of a 'system'?

<Typical (but not always) rules of thumb are:
<	a) A single processor within a multiprocessor is usually slower than
<	an equivalent thing designed as a uniprocessor.

Why?  I'd have guessed it to be the opposite since a multi-processor-capable
machine with only a single processor in it may have access to a
better or more available memory system.  No?  And multiple processors may not
all want the memory system at the same time.

<	b) It is often HARDER work to make a big-machine run as fast as
<	a workstation, or even better, an embedded-control design using the
<	same chips.   Why should this be?
<		1) ECC is often slower than parity.
<		2) Sometimes you can easily build memory expandability
<		up to X at a given speed, but if you want to get to 2X,
<		you end up doing a different design that may cost you,
<		because busses get longer, etc.

Yes, cost does not grow linearly with performance (and on top of that,
price does not grow linearly with cost!  (Or does it?!))

Could you expand on point 1 a little?  I don't know the performance
aspects of ECC and parity.
Actually, if you could clarify this point (b), it would be nice.
'Harder', or more expensive?
'Big-machine'?
Thanks.

<-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Shahin.

mash@mips.COM (John Mashey) (05/07/90)

In article <10213@batcomputer.tn.cornell.edu> kahn@tcgould.tn.cornell.edu (Shahin Kahn) writes:
>Speaking of which, what is the story with SPEC-throughput??
>I hope they are not running just two copies of a benchmark.
>They should run N=1,..,M,..,M+4 copies, where M is the
>number of jobs at which paging starts.  One needs to look
>at the slopes before and after paging (and hopefully you get
>two straight lines without much of a knee).
>You might be surprised at what you see!
SPECthruputs aren't that: you run 2xN processes (where N = # of CPUs),
and people generally run them in enough memory to avoid paging.
They're intended to measure, at least somewhat, the overhead of having
multiple CPUs & some context-switching.  I.e., they are still mostly
CPU-bound benchmarks, on purpose.  I/O related benchmarks are being
worked on...
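
For the curious, here is a minimal sketch of what such a throughput run
looks like: fork 2*N copies of a CPU-bound benchmark and time the whole
batch.  This is NOT the real SPEC harness; "./benchmark" is a stand-in.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int ncpus  = (argc > 1) ? atoi(argv[1]) : 1;
    int copies = 2 * ncpus;               /* SPECthruput style: 2xN  */
    time_t start = time(NULL);

    for (int i = 0; i < copies; i++) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); exit(1); }
        if (pid == 0) {
            execlp("./benchmark", "benchmark", (char *)NULL);
            _exit(127);                   /* exec failed             */
        }
    }
    for (int i = 0; i < copies; i++)
        wait(NULL);                       /* all copies must finish  */

    printf("%d copies in %ld seconds\n",
           copies, (long)(time(NULL) - start));
    return 0;
}
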
>
><	cache sizes, types, and hierarchy
><	memory latency
><	refill size
><	depth and nature of write-buffers
><	parity versus ECC, sometimes
>
>Well said.  I kept these lines so they get read again!
>Aren't these the parts of a system that cost the most?
>What is your estimate of the cost split among various components
>of a 'system'?
See Hennessy & Patterson's chapters 1 & 2, but basically, if you can
use a microprocessor, it's almost "free", except in smaller systems.
More money in SRAM, DRAM, disk [or CRT+power supply+keyboard for workstations].
In bigger systems, spend money on busses & I/O adaptors.

><Typical (but not always) rules of thumb are:
><	a) A single processor within a multiprocessor is usually slower than
><	an equivalent thing designed as a uniprocessor.
>
>Why?  I'd have guessed it to be the opposite since a multi-processor-capable
>machine with only a single processor in it may have access to a
>better or more available memory system.  No?  And multiple processors may not
>all want the memory system at the same time.
Like I said, not always.  However, consider the fastest machine you can
build from the fastest micro you can buy at a given time.
There is generally 1 bus from CPU to memory, and I/O either:
	a) Goes into separate RAM, with the CPU doing the copies.
	b) Shares the CPU<->memory bus, although possibly thru I/O
	adaptors that aggregate requests into bigger blocks.
	c) Has a separate bus to memory, with multi-ported memories.
(Mainframes and supercomputers are more complicated, but the above pretty
well covers everything from micros to superminis.)

Now, given a microprocessor running at its fastest clock rate, you can
build a uniprocessor that:
	a) Could use buffered write-thru if it makes sense for the given CPU.
	For some parts of the design space, this might be more efficient,
		if the cache design is appropriate, and
		if the memory system can retire writes fast enough.
	(More efficient because write-back caches often require an extra
	cycle on some writes to check tags before doing the writes;
	see the sketch after this list.)
	Some uniprocessors may prefer write-back caches, of course.
	b) Can use most of the bus bandwidth for CPU<->memory transfers,
	and if convenient, even do low-latency, short transfers on the bus
	(as in the write-thru cache case above).  There is fairly minimal
	overhead for bus arbitration.
	c) Can use, if convenient, non-snooping caches.
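
To put rough numbers on the write-policy point in (a), a toy model in C;
every rate below is invented for illustration:

/* Write-path stalls only: a write-back cache pays a tag check before
 * some writes; a buffered write-through cache pays only when its
 * write buffer is full.  All rates are invented. */
#include <stdio.h>

int main(void)
{
    double writes   = 0.15;  /* writes per instruction               */
    double wb_extra = 1.0;   /* extra tag-check cycle per write      */
    double wt_full  = 0.05;  /* fraction of writes finding the       */
    double wt_stall = 3.0;   /* write buffer full, and the stall cost */

    printf("write-back   : +%.3f CPI\n", writes * wb_extra);
    printf("write-through: +%.3f CPI\n", writes * wt_full * wt_stall);
    return 0;   /* with these rates, buffered write-thru wins; if the
                   memory system can't retire writes fast enough,
                   wt_full rises and the answer flips */
}
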
A shared-memory MP, built with the same chip, at same clock:
	a) Will almost always use write-back, to avoid burning up the bus
	with writes.
	b) Probably needs more complex bus arbitration, and quite often
	needs to use longer cache blocks to preserve bus bandwidth,
	which cost latency and, in some cases, may reduce performance.
	(In a given design, if the only variable is cache line size,
	performance is usually an upside-down U-shaped curve, and if you
	must use a bigger line than optimal, you may lose a little
	performance; see the sketch after this list.)
	c) Will almost certainly have snooping caches to preserve the
	OS people's sanity.
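
Why the upside-down U?  A toy average-memory-access-time model makes it
visible; all constants below are invented (real miss-rate curves come
from address traces):

/* Cache line size tradeoff: miss rate falls with bigger lines
 * (spatial locality) but rises again (pollution), while the miss
 * penalty grows with line size.  Constants are illustrative. */
#include <stdio.h>

int main(void)
{
    double hit     = 1.0;    /* cycles for a cache hit          */
    double latency = 10.0;   /* cycles to start a refill        */
    double bw      = 4.0;    /* bytes transferred per cycle     */

    for (int line = 4; line <= 128; line *= 2) {
        double miss_rate = 0.32 / line + 0.0001 * line;
        double penalty   = latency + line / bw;
        printf("line %3d bytes: AMAT = %.2f cycles\n",
               line, hit + miss_rate * penalty);
    }
    return 0;   /* AMAT bottoms out at a middle line size: forced
                   to a bigger line, you climb back up the curve */
}
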

Although not necessarily the case, very often all of the above turn into
an extra cycle or two of latency on a cache miss, some extra stalls on
writes, some extra stalls due to snooping, etc.  This is either because:
	a) The uniprocessor had very tight timing, into a memory board that
	was already using page-mode DRAM (or equivalent), and was already
	interleaved with as many banks as could fit onto the board, to give
	peak bandwidth to match the CPU+cache system.  If this system is
	just making it at N cycles of read latency, for example, it is
	quite possible that the equivalent MP will need at least N+1: the
	timing goes over the edge because there are more gate delays
	somewhere.
	b) Maybe it is possible to do it, but saving the extra cycle(s)
	of latency requires the use of exceptionally fast (read:
	expensive or hard-to-get or single-sourced) components, and people
	decide that extra latency is better than cost or bravery.
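
To put a number on what "an extra cycle or two" costs, a back-of-the-
envelope CPI calculation (all figures below are invented for illustration):

/* What one extra cycle of miss penalty costs.  Illustrative numbers. */
#include <stdio.h>

int main(void)
{
    double base_cpi = 1.5;   /* CPI with a perfect memory system */
    double misses = 0.04;    /* cache misses per instruction     */

    for (int penalty = 6; penalty <= 7; penalty++)
        printf("miss penalty %d cycles -> CPI %.2f\n",
               penalty, base_cpi + misses * penalty);
    return 0;   /* N=6 -> 1.74, N+1=7 -> 1.78: a few percent slower,
                   on every program, forever */
}
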

><	b) It is often HARDER work to make a big-machine run as fast as
><	a workstation, or even better, an embedded-control design using the
><	same chips.   Why should this be?
><		1) ECC is often slower than parity.
><		2) Sometimes you can easily build memory expandability
><		up to X at a given speed, but if you want to get to 2X,
><		you end up doing a different design that may cost you,
><		because busses get longer, etc.

>Yes, cost does not grow linearly with performance (and on top of that,
>price does not grow linearly with cost!  (Or does it?!))

>Could you expand on point 1 a little?  I don't know the performance
>aspects of ECC and parity.  

1) On reads, ECC checking is sometimes in series with the memory access,
if its time can't be overlapped with the access; overlapping is sometimes
very hard, since it requires a CPU that can handle a late error
indication well.
On writes, in some cases ECC memory may require extra cycles,
as it may do a read-modify-write for any write that is less than
the size of the data on which ECC is computed.
(Of course, this is one reason to go to cache/write buffer designs
that lower the frequency of partial-word stores, for example).
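
To make the read-modify-write concrete, a sketch in C, assuming check bits
are computed over 64-bit words (the width is illustrative, and
compute_ecc() below is a trivial stand-in for real SECDED logic):

#include <stdint.h>
#include <stdio.h>

static uint8_t compute_ecc(uint64_t w)        /* stand-in for SECDED  */
{
    uint8_t e = 0;
    for (int i = 0; i < 8; i++)
        e ^= (uint8_t)(w >> (8 * i));
    return e;
}

/* Storing one byte into ECC memory: the check bits cover the whole
 * word, so the hardware must read, merge, recompute, and write.
 * Byte-parity memory would just write the byte and its parity bit. */
static void ecc_store_byte(uint64_t *mem, uint8_t *ecc, int i, uint8_t b)
{
    uint64_t w = mem[i / 8];                  /* 1. read the full word   */
    int shift = (i % 8) * 8;
    w = (w & ~(0xFFULL << shift))             /* 2. merge in the byte    */
      | ((uint64_t)b << shift);
    ecc[i / 8] = compute_ecc(w);              /* 3. recompute check bits */
    mem[i / 8] = w;                           /* 4. write the full word  */
}

int main(void)
{
    uint64_t mem[1] = { 0 };
    uint8_t  ecc[1] = { 0 };
    ecc_store_byte(mem, ecc, 3, 0xAB);        /* four steps, not one     */
    printf("word=%016llx ecc=%02x\n",
           (unsigned long long)mem[0], ecc[0]);
    return 0;
}
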

>Actually, if you could clarify this point (b), it would be nice.
Well, more memory, sooner or later, either means bigger boards, or
more boards, or both, and one way or another, the "wires" from
memory to CPU get longer.  This is bad, for two reasons I can explain
cogently, and one that some circuit people should expand on:
	a) One nanosecond == 1 foot (c/o Grace Hopper :-).  Nanoseconds
	count, even in microprocessor-based systems.
	b) The more the wires run around, the more chance for skew,
	where the bits don't all arrive at the same time, quite.
	c) Electrical issues: this is where I get lost: as a software
	guy, I like machines to be digital, i.e., 1's and 0's, nice
	square waveforms :-)  Our circuit folks show me plots of
	the waveforms on the faster busses.  If there's any trace of
	square-looking waveforms, I can't see it.....

Anyway, the bottom line is that expandability often costs you money,
or performance, or both, in ways that simply don't show up in
straightline CPU benchmarks.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015 or 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

phil@mips.COM (Phil Arellano) (05/08/90)

In article <38550@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Well, more memory, sooner or later, either means bigger boards, or
>more boards, or both, and one way or another, the "wires" from
>memory to CPU get longer.  This is bad, for two reasons I can explain
>cogently, and one that some circuit people should expand on:
>	a) One nanosecond == 1 foot (c/o Grace Hopper :-).  Nanoseconds
>	count, even in microprocessor-based systems.

A typical unloaded bus runs much closer to 2 nanoseconds per foot.  After
loading (installing more boards), the time of flight can increase to double
that or more.  Also, the additional loading slows the signal edge rates,
which can further reduce the maximum clock frequency.
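
To make that concrete, a quick calculation (the bus length and clock rate
are invented for illustration; the ns/ft figures are the ones above):

/* Flight time on a backplane run, at the propagation figures above:
 * ~2 ns/ft unloaded, roughly double that once boards are installed.
 * Length and clock rate are illustrative, not measurements. */
#include <stdio.h>

int main(void)
{
    double feet = 1.5;                /* a plausible backplane run */
    double unloaded = 2.0 * feet;     /* ns of flight, unloaded    */
    double loaded   = 4.0 * feet;     /* ns of flight, loaded      */
    double cycle    = 40.0;           /* ns: a 25 MHz bus clock    */

    printf("unloaded: %.1f ns = %.0f%% of a cycle\n",
           unloaded, 100.0 * unloaded / cycle);
    printf("loaded:   %.1f ns = %.0f%% of a cycle\n",
           loaded, 100.0 * loaded / cycle);
    return 0;
}
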

>	b) The more the wires run around, the more chance for skew,
>	where the bits don't all arrive at the same time, quite.

In larger systems, critical signals must be replicated to reduce loading.
This typically means buffering them, which adds components with their
associated delays and skews.

>	c) Electrical issues: this is where I get lost: as a software
>	guy, I like machines to be digital, i.e., 1's and 0's, nice
>	square waveforms :-)  Our circuit folks show me plots of
>	the waveforms on the faster busses.  If there's any trace of
>	square-looking waveforms, I can't see it.....

High-speed digital is just a manifestation of VERY high-speed analog.  Fourier
teaches us that a square wave is a weighted sum of all odd harmonics of its
fundamental frequency.  This means that the bandwidth for any digital signal
must be many times its fundamental frequency.  The end result is that as you
push up the frequency, you must either:

	Shrink the system's physical size so you don't have to worry about
	transmission line effects.

		   or

	Begin treating signal lines as transmission lines, paying close attention
	to characteristic impedance, loaded impedance, line termination, and
	crosstalk.  These are things you have to do when a signal line's
	electrical length is longer than the rise/fall time of the signal driving
	the line.  For a signal with 1 nanosecond transition times, lines longer
	than about FIVE inches should be handled as transmission lines.
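
The arithmetic behind that five-inch figure, plus the bandwidth point from
the Fourier argument, using the usual rules of thumb (the 0.1 ns/inch trace
delay and the 0.35/t-rise bandwidth estimate are rough approximations):

/* Rule of thumb 1: treat a line as a transmission line once its
 * one-way delay exceeds about half the driver's rise time.
 * Rule of thumb 2: the useful bandwidth of an edge is ~0.35/t_rise.
 * The 0.1 ns/inch trace delay is an assumed typical value. */
#include <stdio.h>

int main(void)
{
    double tpd    = 0.1;              /* ns per inch, assumed trace  */
    double rise[] = { 1.0, 0.15 };    /* ns: 1 ns, and a 150 ps edge */

    for (int i = 0; i < 2; i++) {
        double crit_in = rise[i] / (2.0 * tpd);   /* inches */
        double bw_mhz  = 0.35 / rise[i] * 1000.0; /* MHz    */
        printf("t_rise %4.0f ps: terminate lines > %.2f in; "
               "edge bandwidth ~%.0f MHz\n",
               rise[i] * 1000.0, crit_in, bw_mhz);
    }
    return 0;   /* 1 ns -> about 5 inches, matching the figure above */
}
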

>Anyway, the bottom line is that expandability often costs your money,
>or performance, or both, in ways that simply don't show up in
>straightline CPU benchmarks.

	Ditto.

	phil
-- 
UUCP: {ames,decwrl}!mips!phil  -OR-  phil@mips.com
USPS: MIPS Computer Systems, 930 Arques, Sunnyvale, CA 94086, (408) 524-8258

mo@messy.bellcore.com (Michael O'Dell) (05/08/90)

Hear, hear!  Nice summary!  I can only add that if you think
1 ns risetimes are dicey, you should see what happens with 150 ps risetimes!

	Cheers!
	-Mike