[comp.arch] Error in Posting of SPEC numbers on IBM systems

sritacco@hpdmd48.HP.COM (Steve Ritacco) (02/22/90)

Ok, let's talk some architecture stuff.

Why is it that the RIOS has a bigger data cache than instruction cache?
This defies conventional wisdom.  Data caches are less effective than
instruction caches and are usually made small because their hit ratio
doesn't increase with size as rapidly as an instruction cache's.  If I
had to guess what is going on, I would guess that access to the I-cache
is very wide, to support super-scalar issue, so they crammed all they
could on the CPU chip.  This seems to be pretty effective.  IBM has
shown that the super-scalar architecture works, which up to this point
I wasn't convinced of.  The benefits are tangible.  A 20 MHz CPU with
8K I-cache and 32K D-cache SPECmarked at 22-something.  That is quite
impressive.  The R3000, which to date seemed the most efficient
CPU/system implementation, has been displaced for the moment.
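For what it's worth, the "efficiency" being gestured at here reduces to a
back-of-the-envelope ratio.  A sketch using only the numbers quoted above
(figures for any other machine would have to be filled in; none are
supplied here):

```python
# Crude "architectural efficiency" metric implied by the post:
# SPECmark delivered per MHz of clock.  The 22-ish SPECmark at
# 20 MHz comes from the post itself; nothing else is assumed.

def specmark_per_mhz(specmark, mhz):
    """Benchmark result normalized by clock rate."""
    return specmark / mhz

print(round(specmark_per_mhz(22.0, 20.0), 2))  # 1.1
```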

I don't doubt that the R4000 will out-perform RIOS, but there is one
thing I wonder about.  Will the R4000 out-perform RIOS by brute force,
that is, high integration and very high clock speeds, or will it beat
it by providing greater architectural efficiency?  The only SPARC
implementations that beat mips have a major clock speed difference (not
to mention higher implementation cost).  One other concern is the hype
associated with super-scalar.  In the EE Times article someone from
mips stated that the R4000 was going to be super-scalar.  That's the
first time I had heard that.  Made me wonder if it is true, or just an
attempt to ride the hype wave.  If better performance can be had with
a simpler design (non-super-scalar) due to less complexity, why not
tell the world?  That is what the whole RISC-CISC thing was all about
in the first place.  The America chip set seems quite complex to me!
How does the complexity of a design like it compare with CISC
complexity?  Is it more manageable for some reason? ...

On a side note, does IBM think workstation boxes need to be ugly to
be impressive, or was their industrial design team out to lunch?


------------------------------------
These comments are only my own, and do not reflect the views of my
employer ...
____________________________________

aglew@oberon.csg.uiuc.edu (Andy Glew) (02/23/90)

>From: sritacco@hpdmd48.HP.COM (Steve Ritacco)
>
>Why is it that the RIOS has a bigger data cache than instruction cache?
>This defies conventional wisdom.  Data caches are less effective than
>instruction caches and are usually made small because their hit ratio
>doesn't increase with size as rapidly as an instruction cache's.  If I
>had to guess what is going on, I would guess that access to the I-cache
>is very wide, to support super-scalar issue, so they crammed all they
>could on the CPU chip.  This seems to be pretty effective.  IBM has
>shown that the super-scalar architecture works, which up to this point
>I wasn't convinced of.  The benefits are tangible.  A 20 MHz CPU with
>8K I-cache and 32K D-cache SPECmarked at 22-something.  That is quite
>impressive.  The R3000, which to date seemed the most efficient
>CPU/system implementation, has been displaced for the moment.

Huh?!?

I-caches get better hit rates than D-caches, but you quickly reach a
point of diminishing returns - your I-cache hit rate is so good that
improving it doesn't make much difference to your performance.
Moreover, if your memory system cycles at a rate comparable to your
processor's, but just has a large latency, then you can stream
instructions out of memory, except at branches.
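The diminishing-returns point can be made concrete with a toy
average-access-time model (all hit rates and penalties below are
invented for illustration, not measurements of any machine):

```python
# Toy model: average cycles per memory access for a cache.
# hit_time and miss_penalty are in cycles; numbers are illustrative only.

def effective_access_cycles(hit_rate, hit_time=1, miss_penalty=20):
    """Hit cost plus the expected miss penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Pushing an already-good I-cache from 98% to 99.5% saves little...
icache_gain = effective_access_cycles(0.98) - effective_access_cycles(0.995)
# ...while improving a weaker D-cache from 90% to 95% saves far more.
dcache_gain = effective_access_cycles(0.90) - effective_access_cycles(0.95)

print(round(icache_gain, 2))  # 0.3
print(round(dcache_gain, 2))  # 1.0
```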

I just went to a talk by Mudge of Michigan, whose group is building a
GaAs MIPS 6000, where the speaker had to justify an I-cache bigger than
the D-cache.  In this case, they wanted a direct-mapped, virtually
indexed primary D-cache, so the D-cache size was limited by the
architecture's page size, to avoid synonym problems.  Since you don't
worry about synonyms in the I-cache, the I-cache could be made larger.
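The page-size limit follows from a simple rule: a virtually indexed
cache stays alias-free when its index bits fit within the page offset,
i.e. when size <= page size x associativity.  A sketch, assuming a 4 KB
page (a common choice; the actual page size of the machine in question
isn't given here):

```python
# Largest virtually-indexed cache that cannot exhibit synonyms:
# the set index must be drawn entirely from the page-offset bits,
# so capacity is bounded by page_size * associativity.

def max_alias_free_cache_bytes(page_size, associativity):
    return page_size * associativity

page = 4096  # assume a 4 KB page for illustration
print(max_alias_free_cache_bytes(page, 1))  # direct-mapped: 4096
print(max_alias_free_cache_bytes(page, 4))  # 4-way: 16384
```

Raising associativity is thus one way to grow a virtual cache past the
page size without aliasing; the other is to grow the page.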

--
Andy Glew, aglew@uiuc.edu

mash@mips.COM (John Mashey) (02/24/90)

In article <14900004@hpdmd48.HP.COM> sritacco@hpdmd48.HP.COM (Steve Ritacco) writes:
>Ok, let's talk some architecture stuff.

>Why is it that the RIOS has a bigger data cache than instruction cache?
>This defies conventional wisdom.  Data caches are less effective than
>instruction caches and are usually made small because their hit ratio
>doesn't increase with size as rapidly as an instruction cache's.  If I
>had to guess what is going on, I would guess that access to the I-cache
>is very wide, to support super-scalar issue, so they crammed all they
>could on the CPU chip.  This seems to be pretty effective.  IBM has
>shown that the super-scalar architecture works, which up to this point
>I wasn't convinced of.  The benefits are tangible.  A 20 MHz CPU with
>8K I-cache and 32K D-cache SPECmarked at 22-something.  That is quite
>impressive.  The R3000, which to date seemed the most efficient
>CPU/system implementation, has been displaced for the moment.

1) Super-scalar works, at least for getting at more of the low-level
parallelism of FP code.  This is clearly shown by the IBM systems.
2) It's not clear that it works for integer code, or that their
specific case does.  This might be:
	a) Compilers will get better (likely) for integer.
	b) Compilers will get better, a lot (unlikely) for integer.
It is always possible that there's a whole lot of mileage to be gained,
but past experience says to doubt it; it's not as if this compiler
technology is a raw new technology: IBM has been doing excellent
optimization for a long time.  Certainly, our experience has been that
most of the micro-level scheduling improvements over the last few years
have been more in the FP area than in the integer area.  Anyway, I'd
counsel keeping an open mind, but I'd also advise not just believing,
when some IBM marketing guy says "Oh, we haven't really taken advantage
of that," that it's going to get magically better.  On the other hand,
if one of their good technical folks like Marty Hopkins says there's a
big jump coming, then one should pay serious attention.

3) In general, this does raise an interesting issue, which is comparing
cache sizes.  Some people build smaller, special-purpose,
N-way-set-associative caches; some people build various-sized,
direct-mapped caches from standard SRAM.  Both ways are legitimate, and
there are various tradeoffs in terms of power, space, and cost.  One
thing to be careful of is claiming that something did it with a small
cache, because one would also want to know how much a special-purpose
cache chip costs.....
I don't claim to be unbiased: I like using standard SRAMs, because they
always get cheap, and so far I think that chip costs argue in my favor,
but there are legitimate reasons for doing it the other way, too.
Of course, we're not likely to know for sure the cost of the IBM chips,
so it's not so easy to compare.
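The two cache styles being traded off can be illustrated with a toy LRU
model (the access pattern is contrived to provoke conflict misses; it is
not data from any real machine):

```python
# Toy set-associative cache with LRU replacement, counting misses.
# Addresses are block numbers; a block maps to set (addr % num_sets).

def count_misses(addresses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        s = sets[addr % num_sets]
        if addr in s:
            s.remove(addr)   # hit: refresh LRU position
            s.append(addr)
        else:
            misses += 1
            if len(s) >= ways:
                s.pop(0)     # evict least-recently-used block
            s.append(addr)
    return misses

# Two blocks that share a set in a direct-mapped cache of 8 sets,
# touched alternately: they thrash, while a 2-way cache of the same
# total capacity (4 sets) holds both after the initial misses.
pattern = [0, 8, 0, 8, 0, 8]
print(count_misses(pattern, num_sets=8, ways=1))  # 6
print(count_misses(pattern, num_sets=4, ways=2))  # 2
```

Of course this captures only the conflict-miss side; the large,
cheap-SRAM direct-mapped cache wins back ground by simply being bigger.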

Note, just in case anyone is misled by the following, there is no
announced product from MIPS called an R4000....

>I don't doubt that the R4000 will out-perform RIOS, but there is one
>thing I wonder about.  Will the R4000 out-perform RIOS by brute force,
>that is, high integration and very high clock speeds, or will it beat
>it by providing greater architectural efficiency?  The only SPARC
>implementations that beat mips have a major clock speed difference (not
>to mention higher implementation cost).  One other concern is the hype
>associated with super-scalar.  In the EE Times article someone from
>mips stated that the R4000 was going to be super-scalar.  That's the

As far as I know, no one from MIPS who actually knows has ever publicly
said that it would be super-scalar (or that it wouldn't).  What we generally
say is: there are various flavors of multiple-issue machines: superscalar,
superpipelined, and VLIW (or maybe, short VLIW, which is what I'd call
the i860 and DN10000, sort of), and that anybody who wants to be competitive
in the current round of chips needs to do one of these, and that of course
we've been working on this for several years, and are familiar with the
variations, and are doing one, or some combination, but that we
explicitly refuse to disclose which flavor we're using in the
R?000.  At some panel session within the last year or so, I commented
that architectural simulation is a necessity, because this area was far beyond
human intuition.  I said, for instance, that we'd
burned huge numbers of cycles simulating the effects of being able to
do various pairs of instructions simultaneously, and comparing results,
and that we'd been thinking about "supersonic" pipelines for years.
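The pair-issue simulation described can be sketched with a toy
dual-issue model (the issue rules here are hypothetical, not MIPS's
actual ones): pair two adjacent instructions unless the second depends
on the first or both need the single memory port.

```python
# Toy dual-issue cycle counter.  Each instruction is a tuple
# (dest, sources, is_mem).  Two adjacent instructions issue together
# when the second doesn't read the first's destination and they don't
# both need the (single, assumed) load/store unit.

def dual_issue_cycles(instrs):
    cycles = 0
    i = 0
    while i < len(instrs):
        cycles += 1
        if i + 1 < len(instrs):
            dest1, _, mem1 = instrs[i]
            _, srcs2, mem2 = instrs[i + 1]
            if dest1 not in srcs2 and not (mem1 and mem2):
                i += 2          # the pair issues in one cycle
                continue
        i += 1                  # otherwise issue singly
    return cycles

# Four independent ops pair into two cycles...
prog = [("r1", (), False), ("r2", (), False),
        ("r3", (), False), ("r4", (), False)]
print(dual_issue_cycles(prog))   # 2
# ...but a dependence chain serializes completely.
chain = [("r1", (), False), ("r2", ("r1",), False), ("r3", ("r2",), False)]
print(dual_issue_cycles(chain))  # 3
```

Running models like this over real instruction traces, rather than
trusting intuition, is exactly the point being made above.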

	Actually, what I think has happened is that super-scalar has become
	like RISC.  I.e., for a while, lots of people got convinced that
	if something had register windows, that was RISC.  Right now,
	almost anything with an aggressive pipeline gets called super-scalar,
	because for most people, the implementation nuances are irrelevant.

>first time I had heard that.  Made me wonder if it is true, or just an
>attempt to ride the hype wave.  If better performance can be had with
>a simpler design (non-super-scalar) due to less complexity, why not tell

As usual, I recommend Hennessy's article in the September 89 UNIX Review;
it also contains some references to other articles with good studies of
super-scalar issues.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (02/24/90)

In article <AGLEW.90Feb22110902@oberon.csg.uiuc.edu> aglew@oberon.csg.uiuc.edu (Andy Glew) writes:
>I-caches get better hit rates than D-caches, but you quickly reach a
>point of diminishing returns - your I-cache hit rate is so good that
>improving it doesn't make much difference to your performance.
>Moreover, if your memory system cycles at a rate comparable to your
>processor's, but just has a large latency, then you can stream
>instructions out of memory, except at branches.

Note, too, that it depends on the kind of programs you're emphasizing.
Systems that run lots of users, executing the kernel a lot, or running
big DBMSs, have different characteristics than ones aimed more at
technical number-crunching, especially of the linear-algebra type.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mitch@oakhill.UUCP (Mitch Alsup) (02/27/90)

# Donning flame retardant suit

In article <36426@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Note, just in case anyone is misled by the following, there is no
>announced product from MIPS called an R4000....

Aye, Captain!

>As far as I know, no one from MIPS, who knows, has ever publicly said
>that it would be super-scalar (or that it wouldn't). .....
>............................... and that anybody who wants to be competitive
>in the current round of chips needs to do one of these, ...

Does this imply what it appears to imply?

>............ I said, for instance, that we'd
>burned huge numbers of cycles simulating the effects of being able to
 ^^^^^^^^^^^^^^^^^^^
Presumably all of these CPU cycles are limited in performance by the
capability of the machines to process Double Precision Floating Point
Numbers?  :-)

>do various pairs of instructions simultaneously, and comparing results,
            ^^^^^ only pairs????? c.f. IBM 6000
>and that we'd been thinking about "supersonic" pipelines for years.
                                   ^^^^^^^^^^^^
Do you require afterburners to achieve the "supersonic" transition? ;-)
Or, have you kept the Reynolds Number low?

# Removing flame retardant suit
>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Mitch Alsup     M88000 Design Group
Mike  Shebanow  M88000 Design Group
DISCLAIMER:     <We speak for ourselves only>