[comp.sys.nsc.32k] Performance of the 532

grenley@nsc.nsc.com (George Grenley) (05/07/87)

Well, discussion on the '532 seems to have died down a bit, so I guess
it's time to stir things up.  The 532 will run at 30 MHz (maybe faster),
and many instructions execute in two clocks.  This led some eager
magazine types to claim "15 MIP Performance".  I guess in the wake of
the hoopla about AMD's new chip, it's understandable the editors got
a little excited.
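
The "15 MIP" headline is presumably just the clock rate divided by the
best-case clocks per instruction.  A minimal sketch of that arithmetic in
Python (the function name is illustrative, not from any NSC material):

```python
def peak_mips(clock_mhz: float, clocks_per_instruction: float) -> float:
    """Peak rate if every instruction took the same number of clocks."""
    return clock_mhz / clocks_per_instruction


# 30 MHz clock, two clocks per instruction (both figures from the post).
print(peak_mips(30.0, 2.0))
```

This is an upper bound only: it ignores slower instructions, cache misses,
and memory wait states.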

So, here are some simulated facts:  Our design team has done simulations
of the chip's performance, both with ideal 0 wait state memory and with
"real world" typical VME bus memory.  We ran some unix utilities, including
our own compilers, etc.  I will divulge a few of these numbers now, and more
later.  (If I don't get burned for this):

Grep ran at 8.4 mips from 0 ws memory, 7.9 from VME.  Grep was one of the
best.  One of the worst was our assembler, it hit 5 mips from 0 ws, and
4.5 mips from VME.  On the average (these two plus several other CPU 
intensive programs) the '532 hit 6.1 mips from 0 ws, 5.3 mips from VME.
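
One way to read these figures is as average clocks per instruction at the
30 MHz clock, plus the relative cost of going from 0 ws memory to VME.  The
sketch below uses only the numbers posted above; the function names are
illustrative.

```python
CLOCK_MHZ = 30.0  # clock rate quoted in the post

# (0 wait-state MIPS, VME MIPS) as posted
results = {
    "grep": (8.4, 7.9),
    "assembler": (5.0, 4.5),
    "average": (6.1, 5.3),
}


def clocks_per_instruction(mips: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Average clocks spent per instruction at the given MIPS rating."""
    return clock_mhz / mips


def vme_slowdown_percent(zero_ws: float, vme: float) -> float:
    """Throughput lost when moving from 0 ws memory to VME memory."""
    return 100.0 * (zero_ws - vme) / zero_ws


for name, (zero_ws, vme) in results.items():
    print(f"{name:10s} CPI {clocks_per_instruction(zero_ws):.2f} -> "
          f"{clocks_per_instruction(vme):.2f}, "
          f"VME costs {vme_slowdown_percent(zero_ws, vme):.1f}%")
```

On these numbers grep averages roughly 3.6 clocks per instruction and the
assembler 6, against the two-clock best case.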

So, here's the deal.  I invite Mot, Intel, and other interested parties
to work with me in defining some sort of realistic benchmark, which we'll
run (in public).  I expect to have system level hardware late this year,
so if we get started now, we'll have very interesting Xmas presents...

May the best CPU win!  (Not that having the best CPU is a requirement for
having the most design wins - just ask Intel).

Regards,

George Grenley
Manager, '532 systems development (or something like that)

Disclaimer:  I work for NSC. I used to work for Intel, selling 8086s.
Before that, I sold 68000s for Mostek.  I would rather drive steam trains.

mash@mips.UUCP (John Mashey) (05/07/87)

In article <4294@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>
>So, here are some simulated facts:  Our design team has done simulations
>of the chip's performance, both with ideal 0 wait state memory and with
>"real world" typical VME bus memory.  We ran some unix utilities, including
>our own compilers, etc.  I will divulge a few of these numbers now, and more
>later.  (If I don't get burned for this):

>Grep ran at 8.4 mips from 0 ws memory, 7.9 from VME.  Grep was one of the
>best.  One of the worst was our assembler, it hit 5 mips from 0 ws, and
>4.5 mips from VME.  On the average (these two plus several other CPU 
>intensive programs) the '532 hit 6.1 mips from 0 ws, 5.3 mips from VME.

Could you say a little more on the configurations:
	cache size, nature [write-back or write-thru]
	if write-thru, did you use write buffers, and if so, how deep.
	exactly what the assumptions were on the VME memories

It would also be interesting [although I realize this might be
sensitive info] to get more info on the simulations, to be able to
make a read on the accuracy of the simulations:

	instruction cycles
	TLB-miss cycles
	cache-miss cycles
	[if present] write-buffer stall & write/read interlock cycles

>So, here's the deal.  I invite Mot, Intel, and other interested parties
>to work with me in defining some sort of realistic benchmark, which we'll
>run (in public).  I expect to have system level hardware late this year,
>so if we get started now, we'll have very interesting Xmas presents...

I think that's a great idea and am delighted that somebody has suggested it.
Presumably there will be 68030s benchmarkable in hardware by then,
and certainly 386s, Clippers, and WE32200s.  As a first suggestion,
I'd observe that there are at least the following classes of realistic
benchmarks:
	1) Large FORTRAN / C floating-point ones [and there are many of these
	that are widely available].  One probably needs at least 5-10 of these
	to cover the different sorts of things that people do.
	2) Large integer benchmarks: this is the real tough category:
	most of the larger, realistic ones tend to be proprietary codes,
	or else things where the code [like for assemblers, compilers, etc]
	inherently differs among systems.  This also needs 5-10 of them,
	and could at least include a few of the larger UNIX utilities,
	although most of them fit into reasonable-sized caches, and hence
	don't stress things the way larger applications do.
	3) Multi-user and/or systems benchmarks, using UNIX.  Run shell
	scripts, etc.  I'd think there should at least be a few of these.
One might want to focus on 1&2, if only to avoid the arguments on 3
regarding different peripheral choices, operating system tuning, etc,
unless the shootout is intended as an OS shootout also.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lm@cottage.WISC.EDU (Larry McVoy) (05/08/87)

In article <374@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Could you say a little more on the configurations:
>	cache size, nature [write-back or write-thru]
>	if write-thru, did you use write buffers, and if so, how deep.
>	exactly what the assumptions were on the VME memories

If they are following their data book, then

    Icache is 512 bytes, 32 lines, direct mapped.
    Dcache is 1024 bytes, 64 lines, 2 way set associative, write through.
    TLB is 64 entries, fully associative.
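
The line size and set count follow from those figures.  A small sketch,
assuming power-of-two sizes (the helper name and dictionary keys are made up
for illustration):

```python
def cache_geometry(size_bytes: int, lines: int, ways: int) -> dict:
    """Derive line size, set count, and address-bit split for a cache."""
    line_bytes = size_bytes // lines
    sets = lines // ways
    return {
        "line_bytes": line_bytes,
        "sets": sets,
        # For powers of two, bit_length() - 1 is log2.
        "offset_bits": line_bytes.bit_length() - 1,
        "index_bits": sets.bit_length() - 1,
    }


icache = cache_geometry(512, 32, ways=1)   # direct mapped
dcache = cache_geometry(1024, 64, ways=2)  # 2 way set associative
print(icache)
print(dcache)
```

Both caches come out with 16-byte lines and 32 sets; the data cache is twice
the size only because each set holds two lines.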

>It would also be interesting [although I realize this might be
>sensitive info] to get more info on the simulations, to be able to
>make a read on the accuracy of the simulations:
>
>	instruction cycles
>	TLB-miss cycles
>	cache-miss cycles
>	[if present] write-buffer stall & write/read interlock cycles

I'd like to see this too.

Also, I'm a little surprised at the figures.  I just spent a bit of time
going over the data book and I would have expected better numbers.  
Closer to 8-10 MIPS...  Bummer....

---
Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

"What a wonderful world it is that has girls in it!"  -L.L.

grenley@nsc.nsc.com (George Grenley) (05/08/87)

My 532 performance posting has generated some response.  Good!  Herewith,
more details:  For those of you who wish to run the grep benchmark yourself,
the test was to search for the string "int" in file chroot.c in Unix V source.
I realize this means you need the source....

In article <374@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <4294@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>> [deleted lengthy reference to 32532 performance - see previous posting]
>Could you say a little more on the configurations:
>	cache size, nature [write-back or write-thru]
>	if write-thru, did you use write buffers, and if so, how deep.
>	exactly what the assumptions were on the VME memories

Okay, good questions.  Here are some answers:

The 532 itself has a 512 byte (16 byte line size) instruction cache, and
a 2 way set associative data cache of 1024 bytes - see our published 
advance data sheet for details.  I'm sure our NSC sales people would be
happy to hear from you...

One purpose of building the simulator was to evaluate EXTERNAL cache designs.
Because of real-estate restrictions, we settled on a direct-mapped cache of
64 Kbytes.  Both internal and external caches are write through.

We use a write buffer of depth 8.  Our analysis shows the probability of
this filling w/ typical VME memories is < 1%.

For VME memory performance, we used published DRAM specs from several vendors
such as Clearpoint, Microproject, etc.  We assumed high availability of
the memories (and VMEbus) - i.e., no heavy DMA or multiprocessing.

>It would also be interesting [although I realize this might be
>sensitive info] to get more info on the simulations, to be able to
>make a read on the accuracy of the simulations:
>
>	instruction cycles
>	TLB-miss cycles
>	cache-miss cycles
>	[if present] write-buffer stall & write/read interlock cycles

You are right on two counts:  It would be interesting, and it is also
sensitive.  Such data would give our competitors too good an idea how
good our cache(s) really are, so I shall refrain from any detailed 
discussion of this for a while.  

I will say this:  I am a hardware designer, not a CPU architect or a
software guy.  With the rapid rise in CPU clock rate (30 meg for the
'532) and memory demands (2 clock cycle = 66 nsec GROSS), internal
caches are becoming a necessity.  I would not want to have to design
the cache for the 532 if it did not already have one - it would be
expensive to do without seriously compromising performance.

But, just to tease people a bit, the combined internal and external
caches used in the above simulations have an overall read hit rate of
better than 93%.  As a result, the system is relatively insensitive
to main memory performance.
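
A rough way to see why a high hit rate blunts main memory speed is the usual
average-access-time formula.  The sketch below treats both caches as a single
level, which is a simplification, and the hit and miss costs in clocks are
hypothetical values chosen for illustration, not NSC figures.  Only the
30 MHz clock and the 93% hit rate come from the post.

```python
CLOCK_NS = 1000.0 / 30.0  # 30 MHz clock -> about 33.3 ns per clock
HIT_RATE = 0.93           # "better than 93%" combined read hit rate


def average_read_ns(hit_clocks: float, miss_clocks: float,
                    hit_rate: float = HIT_RATE) -> float:
    """Average read time, treating both caches as one level."""
    avg_clocks = hit_rate * hit_clocks + (1.0 - hit_rate) * miss_clocks
    return avg_clocks * CLOCK_NS


# Hypothetical costs: 2-clock hit, main-memory reads of 10 or 20 clocks.
fast = average_read_ns(hit_clocks=2, miss_clocks=10)
slow = average_read_ns(hit_clocks=2, miss_clocks=20)
print(f"{fast:.1f} ns vs {slow:.1f} ns")
```

Under these assumed costs, doubling the main-memory read time raises the
average read time by a bit over a quarter rather than doubling it.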

[deleted, req to the world to get together for a lil old-fashioned horse-race]

>I think that's a great idea and am delighted that somebody has suggested it.
>Presumably there will be 68030s benchmarkable in hardware by then,
>and certainly 386s, Clippers, and WE32200s.  As a first suggestion,
>I'd observe that there are at least the following classes of realistic
>benchmarks:
>	1) Large FORTRAN / C floating-point ones [and there are many of these
>	that are widely available].  One probably needs at least 5-10 of these
>	to cover the different sorts of things that people do.

My understanding is that the LINPACK benchmark has become the standard for
the number crunching guys.  We used it at Jetas Technology when I worked
there as our standard of reference for numeric performance.  It seems
to well-represent the kind of array-oriented math typical of FORTRAN.
(good old fortran - only HLL I was ever really good at 8-))

>	2) Large integer benchmarks: this is the real tough category:
>	most of the larger, realistic ones tend to be proprietary codes,
>	or else things where the code [like for assemblers, compilers, etc]
>	inherently differs among systems.  This also needs 5-10 of them,
>	and could at least include a few of the larger UNIX utilities,
>	although most of them fit into reasonable-sized caches, and hence
>	don't stress things the way larger applications do.

How about, instead, compiles?  They are usually CPU intense (unless you
have a REALLY terrible disk system), and reflect the general non-numeric
work-load typical of most cpus.  I recall reading analyses of instruction
mixes from many different non-numeric applications that show they don't
vary much.  Since compiling Unix source files is a "portable test", it
might be suitable...

>	3) Multi-user and/or systems benchmarks, using UNIX.  Run shell
>	scripts, etc.  I'd think there should at least be a few of these.

>One might want to focus on 1&2, if only to avoid the arguments on 3
>regarding different peripheral choices, operating system tuning, etc,
>unless the shootout is intended as an OS shootout also.

Personally, I see item 3 as being at least as important as 1 and 2,
from a practical point of view.  Overall system performance is 
ultimately the only thing that matters.  Whether system A is faster
than system B because the OS is a better port, or the compiler
optimizes better, or the I/O subsystem is really hot, is of interest
to us as system designers, only insofar as it helps us design better
systems.  This is the primary reason why I'm talking about the 532,
to help other hardware hacks do a good job designing it in.  In
my experience all CPUs are the same speed when they're sitting on
your desk... 8-)

All for now.

Regards,
George Grenley
usual disclaimer - either I'm lying, or not...

grenley@nsc.nsc.com (George Grenley) (05/08/87)

In the interests of brevity,
I have deleted a lot from Larry's posting, much of it comments by John 
Mashey.  If you haven't been following the discussion, do go back and
catch up.

In article <3552@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>    Icache is 512 bytes, 32 lines, direct mapped.
>    Dcache is 1024 bytes, 64 lines, 2 way set associative, write through.
>    TLB is 64 entries, fully associative.

Correct.  However, there is also an external cache, described by me in a
previous posting.

>Also, I'm a little surprised at the figures.  I just spent a bit of time
>going over the data book and I would have expected better numbers.  
>Closer to 8-10 MIPS...  Bummer....

If you can get 10 mips we may have a job for you...
Seriously, our simulation was based on a real-world Unix environment, with
context switches, etc. - Difficult to model by hand.  I am sure that
well-written code without a lot of system overhead and cold-cache problems
will hit 10 mips.