[comp.arch] RISC vs CISC simple load benchmark; amazing! Not really

mash@mips.COM (John Mashey) (06/12/90)

In article <8019@mirsa.inria.fr> jlf@mirsa.inria.fr (Jean-Louis Faraut) writes:
>Therefore, we are looking at new RISC technology, and the performance
>figures announced by the big CPU manufacturers leave me wondering :-?
>I tried it with different commands to test CPU, I/O, etc., but for the
>sake of brevity, I'll only present here the results obtained from a
>slightly modified version of the famous Bill Joy test program, where
>RISCs are supposed to be better than CISCs.
>
>Here is my version of Joy's program:
>========================================
>#!/bin/sh
>echo 99k65536vvvvvp8op | dc 
>========================================

Well, there is an unfortunate bug in this benchmark,
in that the behavior of this program in no way resembles most code
typically run on general-purpose UNIX systems, and you absolutely do NOT
want to use it to help choose computers unless your workload happens to
be multiple-precision arithmetic doing lots of multiplies and divides.

Specifically, most RISC designers, after studying many programs, decided
that integer multiply (and especially divide) were used less frequently
than many other operations, and there is substantial data that backs this
up from many vendors.  RISC designers, depending on the benchmarks used,
and amount of silicon available, allocated various amounts of silicon to
support these operations, from zero up. The SPARC designers included
a Multiply-Step, and no Divide-Step (i.e., divides are done by a fair-sized
hunk of code); HP PA included M-S and D-S; MIPS & 88K included both
integer mult & divide in hardware, etc.  However, for example, a typical
integer divide on a MIPS takes about 35 cycles.... and probably about
the same on a typical CISC.
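
To make "a fair-sized hunk of code" concrete, here is a rough sketch of a
restoring shift-and-subtract divide, the sort of loop a chip without a
hardware divider falls back on (an illustration only, assuming 32-bit
unsigned ints; not any vendor's actual routine):
========================================
/* Rough sketch only: restoring shift-and-subtract unsigned divide,
 * assuming 32-bit unsigned ints; not taken from any vendor's library.
 * 'den' must be nonzero.  One iteration per result bit, which is why
 * a software divide costs on the order of 32+ cycles at best. */
unsigned int
soft_udiv(unsigned int num, unsigned int den)
{
        unsigned int quot = 0, rem = 0;
        int i;

        for (i = 31; i >= 0; i--) {
                rem = (rem << 1) | ((num >> i) & 1); /* bring down next bit */
                if (rem >= den) {                    /* trial subtract */
                        rem -= den;
                        quot |= 1u << i;
                }
        }
        return quot;
}
========================================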

IF YOU WANT TO PROVE THAT A RISC IS NOT VERY MUCH BETTER THAN A CISC,
OR EVEN WORSE, AT CPU PERFORMANCE:

Use a program consisting of integer divides of 2 variables that the
optimizers can't get rid of (a minimal sketch follows after this list).
	a) This will show a MIPS or 88K at their least advantage.
	b) This will prove that a SPARC is the slowest thing in existence
	(well, almost).  While we (MIPS) thought divide was good to have,
	I must defend the SPARCers as not being irrational in leaving it
	out, given the constraints of the first implementation.
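
For concreteness, here is a minimal sketch of such a divide-bound loop
(my own illustration, not code from the original posting; the constants
are arbitrary).  The volatile qualifier keeps the optimizer from folding
the divide into a constant or hoisting it out of the loop:
========================================
/* Minimal divide-bound microbenchmark; constants are arbitrary. */
#include <stdio.h>

int
main(void)
{
        volatile long a = 1234567L, b = 1234L;
        long sum = 0;
        long i;

        for (i = 0; i < 1000000L; i++)
                sum += a / b;           /* one integer divide per pass */

        printf("%ld\n", sum);           /* use the result */
        return 0;
}
========================================
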
Attached to the end of this is a brief sample of the MIPS prof/pixstats,
which shows that on a MIPS-based machine (i.e., including DEC 5810),
multiply/divide accounts for 1/3 of the cycles.  This is NOT as bad
as the even more classic "2^4096" dc benchmark, but it's still very high.
(I tried email to jlf, but it bounced for some reason)
jlf: if you want some solid CPU (single-user) data on realistic
benchmarks, call the MIPS Paris office (33-1-42-04-0311) and ask for
"SPEC Data Helps Normalize Vendor Mips-Ratings for Sensible Comparison
OR
Your Mileage May Vary, But If Your Car Were A Computer, It Would Vary More"
Issue 2.0, May 1990.  This is a giant analysis of all the published SPECmark
data, and includes plenty of RISCs & CISCs.....  It doesn't tell you about
multi-user stuff...

PROF & PIXSTATS: stop here if you don't want gory details.
Profile listing generated Mon Jun 11 19:49:24 1990 with:
   prof -pixie dc 
*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *

18830106 cycles **INSTRUCTION CYCLES, NO CACHE MISSES, STALLS**

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

  15732582   83.55  83.55      53695     37 div (dc.c)
   1359698    7.22  90.77       5938     36 mult (dc.c)
    342478    1.82  92.59       3142     32 getdec (dc.c)
    307083    1.63  94.22         32     20 seekc (dc.c)
    294895    1.57  95.79       3597     43 add (dc.c)
**ALL REMAINING FUNCTIONS USED LESS THAN 1% OF THE INSTRUCTION CYCLES**
*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *

procedure (file)                           line bytes     cycles      %  cum %

div (dc.c)                                  665   144    3260586  17.32  17.32
div (dc.c)                                  657   124    2781234  14.77  32.09
div (dc.c)                                  671    92    1911378  10.15  42.24
div (dc.c)                                  655    72    1648096   8.75  50.99
div (dc.c)                                  664    52    1124340   5.97  56.96
div (dc.c)                                  672    40     805894   4.28  61.24
div (dc.c)                                  658    40     739898   3.93  65.17
div (dc.c)                                  656    16     412024   2.19  67.36
div (dc.c)                                  667    12     337302   1.79  69.15
div (dc.c)                                  659   116     245129   1.30  70.45
mult (dc.c)                                1097   100     233586   1.24  71.69
**ALL REMAINING STATEMENTS USED LESS THAN 1% OF THE INSTRUCTION CYCLES**
pixstats dc:
  27600801 (1.466) cycles (1.1s @ 25.0MHz) **INSTR CYCLES, INCL STALLS**
  18830106 (1.000) instructions ** INSTRUCTIONS **
   6807625 (0.362) loads
   1994843 (0.106) stores
   8802468 (0.467) loads+stores
   8802468 (0.467) data bus use
   8770695 (0.466) multiply/divide interlock cycles (12/35 cycles)
**EXCEPTIONALLY HIGH: VERY FEW REAL PROGRAMS LOOK THIS WAY: 1/3 OF THE
	CYCLES ARE SPENT WAITING FOR MUL OR DIV TO COMPLETE!**

0.448 load nops per load
**ALSO VERY HIGH: more typical is .25-.30 lnops/load**

opcode distribution:
      lw    6350450   33.72%
    lnop    3051497   16.21%
      sw    1686506    8.96%
    bnop    1237018    6.57%
     bne    1162741    6.17%
    addu     893980    4.75%
   addiu     799847    4.25%
    beqz     719618    3.82%
      lb     449888    2.39%
      li     326381    1.73%
      sb     307835    1.63%
    sltu     303554    1.61%
    subu     256783    1.36%
    mflo     238028    1.26%
     div     238028    1.26%	**YEP, RIGHT UP THERE**
   bcond     139925    0.74%
    bgez     129581    0.69%
    mfhi     114251    0.61%
   multu     114251    0.61%	**AND THERE'S THE MULTIPLY**
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ian@sibyl.eleceng.ua.OZ (Ian Dall) (06/13/90)

In article <39319@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <8019@mirsa.inria.fr> jlf@mirsa.inria.fr (Jean-Louis Faraut) writes:
>>Here is my version of Joy's program:
>>========================================
>>#!/bin/sh
>>echo 99k65536vvvvvp8op | dc 
>>========================================
>
>Well, there is an unfortunate bug in this benchmark,
>in that the behavior of this program in no way resembles most code
>typically run on general-purpose UNIX systems, and you absolutely do NOT
>want to use it to help choose computers unless your workload happens to
>be multiple-precision arithmetic doing lots of multiplies and divides.

True, and undoubtedly why the SPARC fared so badly.  What I think is
more interesting is that the 5810 slows down faster than the VAX with
an increasing number of processes.

(columns show increasing numbers of concurrent processes; the leftmost
is the 1-process case, the rightmost the 32-process case)

VAX 8650 (8 mips CISC)     Ultrix 1.2 Joy
                user  0.02  0.02  0.05  0.11   0.21   0.48

DEC 5810 (18.7 mips RISC/MIPS)     Ultrix 3.0 Joy
                user  0.01  0.01  0.08  0.10   0.11   0.46

user/VAX              2.00  2.00  0.63  1.10   1.91   1.04 average=1.45

The 1 process case is a reasonable approximation to the ratio of the
rated speeds (especially given the resolution of the measurement). The
32 process case shows the 5810 in a poor light.  Does this have
anything to do with the frequency of multiply instructions or is it a
feature of frequent context switches? Cache flushes maybe?
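
For reference, here is a sketch of how such a multi-process run can be
driven (my own illustration, not the script actually used for the numbers
above): fork N copies of the dc pipeline and time the whole batch, e.g.
"time ./loadtest 32" (the name is arbitrary).
========================================
/* Sketch of a multi-process load driver, not the actual test used above:
 * fork N copies of the dc pipeline and wait for all of them to finish. */
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int n = (argc > 1) ? atoi(argv[1]) : 1;  /* concurrent copies */
        int i;

        for (i = 0; i < n; i++) {
                if (fork() == 0) {
                        execl("/bin/sh", "sh", "-c",
                              "echo 99k65536vvvvvp8op | dc > /dev/null",
                              (char *) 0);
                        _exit(1);                /* exec failed */
                }
        }
        while (wait(NULL) > 0)                   /* reap all children */
                ;
        return 0;
}
========================================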

>Specifically, most RISC designers, after studying many programs, decided
>that integer multiply (and especially divide) were used less frequently
>than many other operations, and there is substantial data that backs this
>up from many vendors.

I can't help thinking that average speed (over an instruction mix) is,
(like most statistics) an inadequate measure. The trouble is, if you have
a multiply intensive application, it is a pain if it runs dramatically
slower than you would expect for a machine of that class. In a sense, one
would like to know the worst case "speed" as well as the "average" speed
of a machine (lots of hand waving here).

-- 
Ian Dall     life (n). A sexually transmitted disease which afflicts
                       some people more severely than others.       

rcpieter@tuegate.tue.nl (Tiggr) (06/13/90)

In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
>[...] What I think is
>more interesting is that the 5810 slows down faster than the Vax with
>increasing number of processes.

Note that the VAX 8650 is running Ultrix 1.2 (gosh!) and the DEC 5810
Ultrix 3.0.  I thought it was well known that every release of Ultrix
is slower and larger than the previous one.

Tiggr
--
Don't believe me, I'm only a student.

mash@mips.COM (John Mashey) (06/15/90)

In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:

>I can't help thinking that average speed (over an instruction mix) is,
>(like most statistics) an inadequate measure. The trouble is, if you have
>a multiply intensive application, it is a pain if it runs dramatically
>slower than you would expect for a machine of that class. In a sense, one
>would like to know the worst case "speed" as well as the "average" speed
>of a machine (lots of hand waving here).

Really, what you want is enough data points that you think you know 
not only some measure of centrality but some measure of variation,
but those are not enough.  You always really want enough benchmarks to
see the patterns of difference: this is why SPEC has always insisted
on making ALL of the benchmark numbers available, because quite
different patterns can be found.

The worst case performance is not all that interesting: for two cached
machines with different cache organization, you can usually "prove"
different ratios of relative performance by careful selection of the
most relevant cache-busting code.

For instance, the "compress" program is often a good example of something
that will drag most machines down to DRAM speed.
Another good one, on a direct-mapped, virtual-cache machine, is
to copy, 1 byte at a time, between two areas that collide in the cache.
This causes every single byte read to:
	writeback the (dirty) cache line to memory
	read the new cache line
and then each byte write:
	flushes the (clean) cache line
	reads the new cache line
	writes the byte into that cache line
(i.e., if you want to artificially show off a SPARC 490 at its worst,
you can probably prove it's slower than a 68020 with such a benchmark).
Of course, any given machine can be done in this way.
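
For the curious, here is a sketch of such a collision loop (the sizes and
names are illustrative assumptions, not any particular machine's):
========================================
/* Sketch of the byte-copy cache-buster described above.  CACHE_SIZE is
 * an assumption; set it to the (direct-mapped) cache size of the target.
 * src and dst are exactly CACHE_SIZE apart, so on a direct-mapped cache
 * every read and write maps to the same line and evicts the other's. */
#include <stdlib.h>
#include <string.h>

#define CACHE_SIZE      (128 * 1024)    /* assumed cache size */
#define NBYTES          (64 * 1024)     /* amount copied per pass */
#define PASSES          1000

int
main(void)
{
        char *base = malloc(CACHE_SIZE + NBYTES);
        volatile char *src, *dst;
        long i, pass;

        if (base == NULL)
                return 1;
        memset(base, 1, CACHE_SIZE + NBYTES);
        src = base;
        dst = base + CACHE_SIZE;        /* collides with src line-for-line */

        for (pass = 0; pass < PASSES; pass++)
                for (i = 0; i < NBYTES; i++)
                        dst[i] = src[i];

        return 0;
}
========================================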

What is useful is to have some cases that show performance in different
usage patterns: the mean & std deviation, or mean and min alone, just don't
tell you much about what's happening.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

dik@cwi.nl (Dik T. Winter) (06/15/90)

In article <39397@mips.mips.COM> mash@mips.COM (John Mashey) writes:
 > In article <675@sibyl.eleceng.ua.OZ> ian@sibyl.OZ (Ian Dall) writes:
 > 
 > >I can't help thinking that average speed (over an instruction mix) is,
 > >(like most statistics) an inadequate measure. The trouble is, if you have
 > >a multiply intensive application, it is a pain if it runs dramatically
 > >slower than you would expect for a machine of that class. In a sense, one
 > >would like to know the worst case "speed" as well as the "average" speed
 > >of a machine (lots of hand waving here).
 > 
 > Really, what you want is enough data points that you think you know 
 > not only some measure of centrality but some measure of variation,
 > but those are not enough.  You always really want enough benchmarks to
 > see the patterns of difference: this is why SPEC has always insisted
 > on making ALL of the benchmark numbers available, because quite
 > different patterns can be found.
 > 
I do not think so.  In my opinion you also want to know the worst case:
not only the performance in the worst case, but also what the worst case is.
Every machine has a kind of program on which it does not perform as
expected.  Well-known cases are integer multiply and cache problems (as
outlined by John Mashey).  I think that the interested customer needs to
know the cases where this happens.  The SPEC figures already go in that
direction, as outlined above, but more might be done in this respect.
The alternative is customers who are disgusted by the performance of their
special application (like Robert Silverman with his integer arithmetic
programs some time ago).

What I would like is for SPEC to also require some statement mentioning
fields where a machine would fall below expectations with respect to the
SPEC figures.  I already posted about the SPARC; I have not had enough
time on MIPS processors to get at their weaknesses (but I understand you
might run into problems with the cache); and so on.  All machines I have
used up to now have their strong points and their weak points.  Of course,
the trade rags contain the strong points only, but as SPEC is trying to
give some objective information on processors, in my opinion a mention of
weak points should be given as well.

Of course, this is a very rough idea (and might become controversial),
but some thought in this direction is very desirable.  And it might
help the decision process about the machine.

How many customers know how to benchmark a machine?  Let alone how to
benchmark a machine for their particular situation?  Of course the
manufacturers make use of this (always lay the blame on the
manufacturer :-)).  As a rough estimate, the numbers for the first and
second case above are (I may be 100% off) respectively: zero and zero.
But I hope those numbers will grow.  (As a parallel example: how many
people know how to calculate the speedup on a multi-processor
system?)

Currently you read the promotional material and the SPEC figures (if
available), try your own favourite programs, and decide, taking into
account the computing environment, which might be completely irrelevant.
For larger machines you may simulate the workload you expect (which is
probably wrong, or, alternatively, outdated by the time you have the new
system installed).  (And in this context, "you" is a committee, each
member with his own preferences and so on.)

Alas, these days the only way to assess the viability of a machine is
to install it, let it run a few years, and then say that it was the
proper decision (or not, of course).  (Unless it appears after
two months that the machine is not used; in that case you can make an
early decision, but of course that is too late.)

Do I sound pessimistic?  Yes.
Am I pessimistic?  No.
How come?  In my opinion the SPEC figures are a giant step in the right
direction, and I hope that in time they may become really useful.
Still, the biggest problem is of course that decisions are made by management,
and not by the users (this is not directed at dcab :-)).

Oh well, these were some rambling thoughts...
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

casey@gauss.llnl.gov (Casey Leedom) (06/16/90)

| From: mash@mips.COM (John Mashey)
| 
| The worst case performance is not all that interesting: for two cached
| machines with different cache organization, you can usually "prove"
| different ratios of relative performance by careful selection of the
| most relevant cache-busting code.
| 
| [A good example] on a direct-mapped, virtual cache machine, is
| to copy, 1 byte at a time, between two areas that collide in the cache.
| 
| (i.e., if you want to artificially show off a SPARC 490 at its worst,
| you can probably prove it's slower than a 68020 with such a benchmark).
| Of course, any given machine can be done in this way.

  While I agree with you that one can always come up with cache-busting code,
I think you picked a particularly bad example, because the cache design in
question is just brain dead.  If you design a direct-mapped cache you should
have at least two buckets for each cache line.  Doing things linearly to two
different arrays is so common that you're bound to run into the problem you
mention.

	(As proof, I was called in on a problem with a Sun 3/280 that was
	bought for image processing.  Part of their processing involved,
	essentially, copying a 1/4Mb array 30 times a second.  The group had
	justified buying the 280 on the grounds that a 180 just wouldn't be
	fast enough.  Imagine their horror when they ran their code on their
	brand new 280 and found that it ran 3 times slower than on a 180!
	The problem turned out to be that the two arrays were an exact
	multiple of 64Kb away from each other -- the size of the 280's cache.
	Eventually I was able to bring the 280 up to the speed of a 180 by
	offsetting the arrays by 24 bytes from 64Kb.  (There were actually
	a bunch of fast nodal points, but you get the idea.))

  A cache shouldn't break on common operations.
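
A sketch of the kind of fix described above (the sizes are illustrative
only, not the actual image-processing code): pad the layout so the two
buffers no longer sit an exact multiple of the cache size apart.
========================================
/* Illustrative sketch of the fix; the sizes below only show the idea.
 * A small pad between the buffers keeps them from being an exact
 * multiple of the 64Kb cache size apart, so they no longer collide in
 * a direct-mapped cache. */
#define CACHE_SIZE      (64 * 1024)     /* Sun 3/280 cache */
#define FRAME           (256 * 1024)    /* ~1/4 Mb buffer */
#define PAD             24              /* offset that avoided collisions */

static struct {
        char src[FRAME];
        char pad[PAD];          /* breaks the exact 64Kb stride to dst */
        char dst[FRAME];
} f;

void
copy_frame(void)
{
        long i;

        for (i = 0; i < FRAME; i++)
                f.dst[i] = f.src[i];
}
========================================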

Casey

mash@mips.COM (John Mashey) (06/18/90)

In article <137441@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>| Of course, any given machine can be done in this way.
>>
>>  While I agree with you that one can always come up with cache-busting code,
>>I think you picked a particularly bad example, because the cache design in
>>question is just brain dead.  If you design a direct-mapped cache you should
>>have at least two buckets for each cache line.

>Excuse me, but isn't what you are describing a two-way set-associative cache?

>Furthermore, I believe that John's point was that you can take *any* caching
>scheme and break it.  Suppose that we had a cache like you described.  It

Yes, this was not picking on Sun (all of MIPS' caches are direct-mapped, too,
except the RC6280's second-level cache, although we do tend to use
physical-tagged caches most of the time for other reasons, not relevant
to the single-threaded discussion of arrays & such).

>Finally, you may or may not be correct in your assessment that direct-mapped
>caching schemes are "brain dead" but you ought to consider what is involved
>in making a cache N-way set-associative.  It takes up a lot of real estate
>and if N is even slightly large, it can add extra gate delays in the
>critical path of looking for a hit (last I looked, fan-in on AND gates wasn't
>that large, but maybe that's changed).

Yes.  Look at Steve Przybylski's thesis and/or the Morgan Kaufmann book;
also, there have been several papers analyzing direct-mapped caches
versus set-associative ones in various design environments, and finding
that they are very competitive in those environments where
set-associativity is not already strongly implied for other reasons.

Set-associativity is always nice when you can get it, and sometimes
it's the right thing, sometimes not.
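
To make the trade-off concrete, here is a toy software model of the two
lookup paths (an illustration only, not any real hardware): a direct-mapped
lookup does a single tag compare, while a 2-way lookup of the same capacity
does two compares plus a select, which is where the extra logic and the
extra delay in the hit path come from.
========================================
/* Toy model of the lookup paths only (no fills or replacement); sizes
 * and names are arbitrary.  Same total capacity in both organizations. */
#define NLINES          1024            /* total cache lines */
#define LINE_SHIFT      5               /* 32-byte lines */

struct line {
        unsigned long   tag;
        int             valid;
};

static struct line direct[NLINES];              /* direct-mapped */
static struct line twoway[NLINES / 2][2];       /* 2-way set-associative */

int
hit_direct(unsigned long addr)
{
        unsigned long idx = (addr >> LINE_SHIFT) % NLINES;
        unsigned long tag = addr >> LINE_SHIFT;

        /* one index, one tag compare */
        return direct[idx].valid && direct[idx].tag == tag;
}

int
hit_twoway(unsigned long addr)
{
        unsigned long set = (addr >> LINE_SHIFT) % (NLINES / 2);
        unsigned long tag = addr >> LINE_SHIFT;

        /* two tag compares, plus logic to pick the matching way */
        return (twoway[set][0].valid && twoway[set][0].tag == tag) ||
               (twoway[set][1].valid && twoway[set][1].tag == tag);
}
========================================
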
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086