[comp.arch] Why is SPARC so slow?

baskett@baskett (12/10/87)

I have been asking myself the question, why is SPARC so slow?
I've been sparked by John Mashey's fascinating "Performance Brief"
and by continuing reports from our customers that our own 4D/70
12.5 MHz MIPS based workstations outperform Sun-4's on their CPU
intensive applications including image rendering and mechanical
design and analysis in a manner consistent with the benchmarks
reported in the Performance Brief.

SPARC is not slow compared to traditional microprocessors, granted.
But as a Risc microprocessor it seems to have some problems, at least
in the first two implementations.  Below are my observations so far on
why the Fujitsu version of SPARC is slow compared to the MIPS Risc
microprocessor.  At least some of the problems of the Fujitsu version
(the one in the Sun-4) are also present in the Cypress version,
according to the preliminary data sheets.  These problems don't
necessarily mean that the SPARC architecture has problems but I'd be
reluctant to accept SPARC as the basis for an Application Binary 
Interface standard until I saw some evidence that high performance
implementations of SPARC are possible.

Loads and stores are slow.  Loads on both implementations take two
cycles and stores take 3 cycles for 32-bit words compared to one cycle
for each on a MIPS R2000.  There are several interrelated reasons for
this situation.  Briefly, they are lack of a separate address adder,
lack of split instruction and data caches, and inability to cycle the
address and data bus twice per main clock cycle.  Details follow.

Lack of a separate address adder for loads and stores.  The R2000 can
start the address generation for a load or a store in the second stage
of the pipeline because the register access is fast and an address adder
is present.  Thus the load or store can "execute" in stage 3 of the
pipeline, just like the rest of the instructions.  On SPARCs (so far)
address generation appears to use the regular ALU in the third stage of
the pipeline and then begin the actual cache access in the fourth stage.
For a load, you then need an extra stage to get the data back.

Lack of split instruction and data caches.  Because both SPARCs have a
single cache rather than the separate instruction and data caches of
the R2000, the extra pipeline stage needed to get the data back for a
load can't be used to fetch an instruction anyway.  For a store the
relevant cache line is read on the fourth cycle and updated and written
back on the fifth cycle.  So there are two cycles that can't be used
to fetch instructions, bringing the total cost of a store to three cycles.

Inability to cycle the address and data bus twice per main clock cycle.
The SPARC chips aren't double cycling the address and data bus so that
both loads and stores mean that you can't fetch instructions.  The R2000
also has a single address bus and a single data bus but it can use them
twice per cycle.  This means you can then split your cache into an
instruction cache and a data cache and make use of the extra bandwidth
by fetching an instruction every cycle in spite of loads and stores.
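
To put a rough number on the bandwidth at stake (my arithmetic, not a
figure from either data sheet): with a 32-bit data bus at, say, 16 MHz,
double cycling allows up to two 4-byte transfers per cycle, or roughly 128
Mbytes/sec of cache traffic instead of 64.  That factor of two is what
pays for fetching an instruction every cycle even during loads and stores.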

However, if register windows eliminated enough loads and stores, these
two SPARC implementations might represent reasonable engineering design
decisions.  Both benchmarks and careful studies of code sequences
indicate that the load and store savings are not that great, generally
less than five percent.  We can also ask if the overhead of register
windows leaves enough time in the second stage of the pipe to do an
address add assuming we could fit such an adder into the implementation.
(Windowed registers take up a lot of space.)

Branches are slow.  Since taken branches need only one delay slot
there must be an address adder for the program counter.  But with a
single cache you have to decide early what the next instruction address
is.  Both SPARC chips always decide that a branch will be taken so there
is an additional cycle penalty when the condition isn't satisfied and you
have to junk the instruction you fetched and fetch the right one.  On
the R2000, the instruction address comes out in the second half of the
cycle on the double-cycled address bus so you have time to check the
condition in the first half of the cycle and put out the right target
address every time.  The separate instruction and data cache only run
at single cycle rates but they run a half cycle out of phase with each
other so it all works out.  (Pretty slick, don't you think?)  The first
delay slot can be used by a useful instruction a majority of the time
on both architectures so they are even there.  However, the SPARC
architecture requires that conditional branches be based on a value in a
condition code register rather than the value in a regular register, as
in the MIPS architecture.  Honest people can (and do) disagree about
which approach is better.  But the compiler studies I have seen indicate
that, on the average, you need an extra instruction for setting the
condition code a noticeable fraction of the time.  So my guesstimate is
that the average conditional branch on a SPARC is 2.5 cycles and on an
R2000 is 1.5 cycles.  (Further study is needed here.)

Floating point is very slow.  Here we only know about the Fujitsu
version of the architecture.  The Cypress version is likely to be
better since the Weitek parts that the Fujitsu version uses are rather
old designs (WTL 1164 and WTL 1165).  Weitek's more recent designs are
faster and so we presume the Cypress version will be better, too.
Nevertheless, here are the numbers (from the data sheets).  I use cycle
counts just to keep it simple.

                   Fujitsu SPARC      MIPS R2000
                      SP    DP         SP    DP

    add/subtract       9    11          2     2

    multiply           9    12          4     5

    divide             34   65         12    19

These are the total latency times from start to finish for both
systems.  Both systems can execute other integer operations in parallel
with floating point operations after the floating point operations are
launched.  However the launch cost on SPARC is two cycles while it is
one cycle on the R2000.  The launch time is included in the above table.
Both systems appear able to do simultaneous multiplies and adds with no
pipelining.
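
One place these latencies are fully exposed is a reduction loop, where
every add depends on the previous one.  Here is a sketch in C (my example,
not from either data sheet):

    /* Single precision dot product.  The multiplies can overlap other
       work, but each add needs the previous sum, so the full SP add
       latency (9 cycles on the Fujitsu SPARC, 2 on the R2000, from the
       table above) is paid on every trip around the loop. */
    float dot(const float *x, const float *y, int n)
    {
        float sum = 0.0f;
        int i;
        for (i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }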

If we summarize these cycles per instruction by looking at a conservative
estimate of instruction frequencies we get the following results, first
for integer programs and then for single precision floating point programs.

                     SPARC      MIPS     frequency
                     cycles     cycles   (percent)

    loads              2          1         20
    stores             3          1         10
    branches           2.5        1.5       15
    most other         1          1         55
    rare other        >1         >1         ~0

    average            1.63       1.08      ratio = 1.51

                     SPARC      MIPS     frequency
                     cycles     cycles   (percent)

    loads              2          1         20
    stores             3          1         10
    branches           2.5        1.5       15
    most other         1          1         45
    sp fp other        9          2         10

    average            2.43       1.18      ratio = 2.06

These ratios are also consistent with the benchmark results in the
Performance Brief.
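
For anyone who wants to check or vary the assumptions, here is a small C
program that just redoes the bookkeeping above.  The per-class cycle
counts and frequencies are the ones assumed in the tables; the code itself
adds nothing new.

    #include <stdio.h>

    /* Weighted-average cycles per instruction, recomputed from the
       assumed per-class cycle counts and instruction frequencies. */
    struct mix { const char *name; double sparc, mips, freq; };

    static void average(const char *label, const struct mix *m, int n)
    {
        double s = 0.0, r = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            s += m[i].sparc * m[i].freq;
            r += m[i].mips  * m[i].freq;
        }
        printf("%s: SPARC %.2f  MIPS %.2f  ratio %.2f\n", label, s, r, s / r);
    }

    int main(void)
    {
        static const struct mix integer[] = {
            { "loads",      2.0, 1.0, 0.20 },
            { "stores",     3.0, 1.0, 0.10 },
            { "branches",   2.5, 1.5, 0.15 },
            { "most other", 1.0, 1.0, 0.55 },
        };
        static const struct mix fp[] = {
            { "loads",       2.0, 1.0, 0.20 },
            { "stores",      3.0, 1.0, 0.10 },
            { "branches",    2.5, 1.5, 0.15 },
            { "most other",  1.0, 1.0, 0.45 },
            { "sp fp other", 9.0, 2.0, 0.10 },
        };
        average("integer", integer, 4);
        average("sp fp  ", fp, 5);
        return 0;
    }

Compiled and run, it reproduces the averages and ratios shown in the tables.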

Since MIPS and Sun seem to be producing these systems with similar
technologies at similar clock rates at similar times in history, these
differences in the cycle counts for our most favorite and popular
instructions seem to go a long way toward explaining why SPARC is so
slow.

Forest Baskett
Silicon Graphics Computer Systems

bcase@apple.UUCP (Brian Case) (12/11/87)

In article <8809@sgi.SGI.COM> baskett@baskett writes:
    [A lot of well-considered stuff about why the current and soon-to-be
     SPARC machines are/will be "so slow."]

Forest, I agree completely with your reasoning on most points:  slow
loads and stores, slow branches, and (intertwined with the previous)
only one bus cycled only once per cycle.

>necessarily mean that the SPARC architecture has problems but I'd be
>reluctant to accept SPARC as the basis for an Application Binary 
>Interface standard until I saw some evidence that high performance
>implementations of SPARC are possible.

I also agree that a standardization of this kind is not the right idea.
But I believe it is possible to have a high-performance implementation
of the SPARC.  By high performance, I mean close enough to others in its
class so as to make the difference not worth too much worry.  Without large,
on-chip caches, processors in the class about which we are speaking need
chip-boundary bandwidth commensurate with on-chip data and instruction
consumption rates.  The lack of such bandwidth is, in my opinion, the main
failing of the SPARC implementation.  Notice that the Cypress version will
be no better, if not worse (the floating-point bus is gone!).

With regards to:

>The R2000
>also has a single address bus and a single data bus but it can use them
>twice per cycle.  This means you can then split your cache into an
>instruction cache and a data cache and make use of the extra bandwidth
>by fetching an instruction every cycle in spite of loads and stores.
...
>The separate instruction and data cache only run
>at single cycle rates but they run a half cycle out of phase with each
>other so it all works out.  (Pretty slick, don't you think?)

Yes, I do think it is pretty slick, but I also think this is a liability
at clock speeds higher than 16 Mhz (and maybe even at 16MHz).  I am sure,
though, that MIPS has a plan to fix this problem.  It sure seems like the
way to go at 8 Mhz.  Preventing bus crashes (i.e. meeting real-world
timing constraints) can be a problem.

And:

>Since MIPS and Sun seem to be producing these systems with similar
>technologies at similar clock rates at similar times in history, these
>differences in the cycle counts for our most favorite and popular
>instructions seem to go a long way toward explaining why SPARC is so
>slow.
>Forest Baskett

Thanks again for the analysis.  However, I have one last point of contention.
SUN is not MIPS in many respects, not the least of which is dedication to
working with fabs and process technologies.  SUN's business seems to be
standards.  In light of their constraints, I applaud their success in
squeezing so much on a lowly gate array.  I am sure one of their chief
concerns was future ECL implementation.  Sure the SPARC processor core
(the stuff that actually does the work, minus register file) is virtually
the same as anyone else's in function and in size (at least I think this
is true), and with that in mind, the MIPS R2000, the Am29000, or whatever
are all equally scalable (the other components on chips are largely
implementations of integrated system, not architectural, functions; the
Branch Target Cache of the 29000 is NOT an architectural feature).  But
by choosing register windows (which lets them vary the number of registers,
in window increments, for a given implementation) and a very simple
definition otherwise, SUN simply did the best they could to make future
implementation easy.  However, I am a little dismayed (but happy for SUN)
at the incredible backing SPARC is getting in the world of huge, influential
conglomerates.  I think the standardization of UNIX is good, but the
standardization of processors is BAD.  We should have a way to achieve
processor independence without necessarily transporting source code (and
in fact, I have an idea for this, but can't share it).  We must not bet our
future on a given processor!  Comments?

pf@diab.UUCP (Per Fogelstrom) (12/11/87)

Well, history repeats itself once again.  A new RISC chip is launched and people's
expectations reach new "high scores".  A few years ago there was another RISC
chip set brought to the market, called the Clipper.  This processor's performance
was claimed to sweep all competitors off the scene, and it was often compared to
the DEC 8x00 computers.  For this chip set the picture has cleared now.  The
performance range is not much more than can be achieved with a 16-20 Mhz 68020.
The most I have seen of the 33Mhz versions is one running at room temperature.
Intergraph is one of the companies still using the Clipper (they recently
bought the rights to the chip set from NS/Fairchild).  From what I recall they
threw out the NS32032 for the Clipper.  Well, they could have had 2-3 times the
Clipper performance with the NS32532 today.  And they called the buy a bargain!
It's not surprising that the MIPS 2000 gives the most power/Mhz; the architecture
has evolved over many years, without hard pressure from marketing such as
"We must have it NOW!!!".  (John Mashey maybe has another opinion, only my guess.)

SO: Why is everybody so surprised????!

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (12/11/87)

In article <6964@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:

| conglomerates.  I think the standardization of UNIX is good, but the
| standardization of processors is BAD.  We should have a way to achieve
| processor independence without necessarily transporting source code (and
| in fact, I have an idea for this, but can't share it).  We must not bet our
| future on a given processor!  Comments?

The concept of portable object code is not new... the "UCSD Pascal
P-System" allowed compilers to generate pseudo code from a number of
languages, and port the Pcode. Later some Pcode compilers were developed
to give the speed of compiled code without passing source code around.
Then there was a peephole optimizer for Pcode, and, as I recall, there
was a compatible Ada compiler.  I used it, but the name of the vendor
escapes me, hopefully forever.

Hope your idea for portable code can do better.

-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

baskett@baskett (12/12/87)

In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> ...
> >The separate instruction and data cache only run
> >at single cycle rates but they run a half cycle out of phase with each
> >other so it all works out.  (Pretty slick, don't you think?)
> 
> Yes, I do think it is pretty slick, but I also think this is a liability
> at clock speeds higher than 16 Mhz (and maybe even at 16MHz).  I am sure,
> though, that MIPS has a plan to fix this problem.  It sure seems like the
> way to go at 8 Mhz.  Preventing bus crashes (i.e. meeting real-world
> timing constraints) can be a problem.

The 16 MHz MIPS parts we have work fine.  If it becomes a problem, the fix
is pretty obvious, too.

> I am sure one of their chief concerns was future ECL implementation.

I have an ECL implementation of an experimental Risc processor (board)
in my office.  My experience with the team that designed and built it
(a great group of people at DEC's Western Research Lab, by the way)
tells me that the MIPS architecture is more suitable for ECL implementation
than the SPARC architecture.  (see next comment)

> by choosing register windows (which lets them vary the number of registers,
> in window increments, for a given implementation) and a very simple
> definition otherwise, SUN simply did the best they could to make future
> implementation easy.

It may have been the best they could do but it looks like a mistake to me.
In higher performance technologies the speed of register access becomes
more and more critical so about the only thing you can do with register
windows is to scale them down.  And as the number of windows goes down,
the small gain that you might have had goes away and procedure call
overhead goes up.  Attacking the procedure call overhead problem at
compile time rather than at run time is a more scalable approach.

Forest Baskett
Silicon Graphics Computer Systems

dennisr@ncr-sd.SanDiego.NCR.COM (Dennis Russell) (12/12/87)

In article <8809@sgi.SGI.COM> baskett@baskett writes:
>
>
>I have been asking myself the question, why is SPARC so slow?
>.......
>Loads and stores are slow.  Loads on both implementations take two
>cycles and stores take 3 cycles for 32-bit words compared to one cycle
>for each on a MIPS R2000.  There are several interrelated reasons for
>this situation.  Briefly, they are lack of a separate address adder,
>lack of split instruction and data caches, and inability to cycle the
>address and data bus twice per main clock cycle.  Details follow.
>
>Lack of a separate address adder for loads and stores.  The R2000 can
>start the address generation for a load or a store in the second stage
>of the pipeline because the register access is fast and an address adder
>is present.  Thus the load or store can "execute" in stage 3 of the
>pipeline, just like the rest of the instructions.  On SPARCs (so far)
>address generation appears to use the regular ALU in the third stage of
>the pipeline and then begin the actual cache access in the fourth stage.
>For a load, you then need an extra stage to get the data back.
>
The block diagram in the data sheet of the Fujitsu SPARC shows an Address
Generation Unit that is separate from the Arithmetic and Logic Unit.  Both
branch target addresses and load/store addresses are calculated in the AGU.

Further on in the data sheet the four stage pipeline is described:  Fetch,
Decode, Execute, and Write.  It is stated explicitly that "Memory addresses
are evaluated for loads, stores, and control transfers" in the Decode
stage.

It can be concluded that the Fujitsu SPARC does indeed have a separate
address adder and that load/store addresses are generated in the second
stage (Decode) of the pipeline.

The R2000 has a five stage pipeline:  Fetch, Decode, Execute, Memory
Access, Write Back.  Memory address generation occurs in the third stage
(Execute) and the load/store "executes" in the fourth stage (Memory
Access).

The reason for the 2 cycle load in the Fujitsu SPARC is the multiplexing of
the external address and data busses between instructions and memory data.
A SPARC load requires 1 cycle of the external busses so that instruction
fetching stalls for this 1 cycle.

>Lack of split instruction and data caches.  Because both SPARCs have a
>single cache rather than the separate instruction and data caches of
>the R2000, the extra pipeline stage needed to get the data back for a
>load can't be used to fetch an instruction anyway.  For a store the
>relevant cache line is read on the fourth cycle and updated and written
>back on the fifth cycle.  So there are two cycles that can't be used
>to fetch instructions, bringing the total cost of a store to three cycles.
>
SPARC supports base register plus index register memory addressing.  During
the first half of the Decode stage the base and index registers are
accessed.  During the second half they are added together to form the
virtual memory address.  Since the register file in the Fujitsu SPARC has
only 2 ports, store data cannot be accessed from the register file until
the third stage (Execute).  Thus, on a store the address goes out during
the third stage (Execute) and the data during the fourth stage (Write).
Since stores use the external busses for two consecutive cycles during
which time fetching of instructions is suspended, the execution time
for stores is 3 cycles.

>Inability to cycle the address and data bus twice per main clock cycle.
>The SPARC chips aren't double cycling the address and data bus so that
>both loads and stores mean that you can't fetch instructions.  The R2000
>also has a single address bus and a single data bus but it can use them
>twice per cycle.  This means you can then split your cache into an
>instruction cache and a data cache and make use of the extra bandwidth
>by fetching an instruction every cycle in spite of loads and stores.
>
This is indeed true.  The price the R2000 pays for this is a complex
clocking scheme whereby a 4 phase input clock at double frequency is
required in order to control the double cycle external busses.

Since at 16.7 MHz the R2000's I/O interface runs at 33.3MHz it remains to
be seen whether the H/W architecture of the R2000 is scaleable - can it be
carried to 25-30MHz where the bus must run at 50-60MHz ?

>Branches are slow.  Since taken branches need only one delay slot
>there must be an address adder for the program counter.  But with a
>single cache you have to decide early what the next instruction address
>is.  Both SPARC chips always decide that a branch will be taken so there
>is an additional cycle penalty when the condition isn't satisfied and you
>have to junk the instruction you fetched and fetch the right one.  On
>
I think there might be some confusion here on the operation of the Annul
Bit during conditional branches.  It is my understanding that when this bit
is 0 then the delay instruction (the instruction following the branch) is
executed whether the branch is taken or not.  When this bit is 1 then the
delay instruction is executed only if the branch is taken - if the branch
is not taken then the delay instruction which is already in the pipeline is
aborted.

Therefore, with the Annul Bit equal to 0 branches execute in 1 cycle
whether the branch is taken or not.  With the Annul Bit at 1 a taken branch
executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the
branch and 1 cycle for the aborted delay instruction.

The advantage of the Annul Bit is in conditional branches that terminate
loops.  With the Annul Bit at 1, an instruction from the loop body can be placed
in the delay slot.  This instruction is executed when the branch is taken (the
loop continues) and is not executed when you fall through the loop.

-- 
Dennis Russell                               |      NCR Corp., M/S 4720
phone:    619-485-3214                       |      16550 W. Bernardo Dr.     
UUCP:  ...{ihnp4|pyramid}!ncr-sd!dennisr     |      San Diego, CA 92128       

elh@mips.UUCP (Ed Hudson) (12/13/87)

In article <1941@ncr-sd.SanDiego.NCR.COM>
dennisr@ncr-sd.SanDiego.NCR.COM writes:

>In article <8809@sgi.SGI.COM> baskett@baskett writes:
>>both loads and stores mean that you can't fetch instructions.  The R2000
>>also has a single address bus and a single data bus but it can use them
>>twice per cycle.  This means you can then split your cache into an
>>instruction cache and a data cache and make use of the extra bandwidth
>>by fetching an instruction every cycle in spite of loads and stores.
>>

<dennisr@ncr-sd.SanDiego.NCR.COM writes:>
>This is indeed true.  The price the R2000 pays for this is a complex
>clocking scheme whereby a 4 phase input clock at double frequency is
>required in order to control the double cycle external busses.

	the cost of the 'complex' interface is a few dollars for
	a tapped delay line.  pretty cheap for a scheme that allows
	sufficient control of the processors' io timings well enough
	to double the available pin bandwidth with moderately cheap,
	fast srams.  further, in chip and package design, it's the io
	transitions themselves that are expensive; the rate of the
	transitions is secondary.

>
>Since at 16.7 MHz the R2000's I/O interface runs at 33.3MHz it remains to
>be seen whether the H/W architecture of the R2000 is scaleable - can it be
>carried to 25-30MHz where the bus must run at 50-60MHz ?
>
	20ns ttl-io srams today support transaction rates of 50mhz, and 15ns
	rams hit 67mhz.  although a full cpu subsystem is a little more
	demanding, this is indicative of what is technologically possible.
	the current r2000 interface was the result of careful optimization
	between the cpu subsystem (ie, 16-64k cache rams, AS ttl), 1985 cpu
	process and packaging technology.  i expect that future mips implement-
	ations will also be similar optimizations of the then available
	technologies.

-- 
-Ed Hudson		DISCLAIMER: I speak only for myself.
elh@mips.com  or  {ames,decwrl,prls}!mips!elh
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
(408) 720-1700

jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (12/13/87)

In article <1111@mips.UUCP> elh@mips.UUCP (Ed Hudson) writes:
><dennisr@ncr-sd.SanDiego.NCR.COM writes:>
>>This is indeed true.  The price the R2000 pays for this is a complex
>>clocking scheme whereby a 4 phase input clock at double frequency is
>>required in order to control the double cycle external busses.

>	the cost of the 'complex' interface is a few dollars for
>	a tapped delay line.  pretty cheap for a scheme that allows
>	sufficient control of the processors' io timings well enough
>	to double the available pin bandwidth with moderately cheap,
>	fast srams.  further, in chip and package design, it's the io
>	transitions themselves that are expensive; the rate of the
>	transitions is secondary.

>>Since at 16.7 MHz the R2000's I/O interface runs at 33.3MHz it remains to
>>be seen whether the H/W architecture of the R2000 is scaleable - can it be
>>carried to 25-30MHz where the bus must run at 50-60MHz ?

>	20ns ttl-io srams today support transaction rates of 50mhz, and 15ns
>	rams hit 67mhz.  although a full cpu subsystem is a little more
>	demanding, this is indicative of what is technologically possible.
>	the current r2000 interface was the result of careful optimization
>	between the cpu subsystem (ie, 16-64k cache rams, AS ttl), 1985 cpu
>	process and packaging technology.  i expect that future mips implement-
>	ations will also be similar optimizations of the then available
>	technologies.

The real problem is the fact that your chip edge is clocked at twice your
instruction frequency.  Running a higher-speed clock than the instruction rate
is fine, and makes internal design much easier.  However, packaging
technology will be your limiting factor for some time to come, not really
ram speed per se.  For the large number of pins required, it is hard to find
packages certified at that speed.

Given current technology, r2000 could probably be scaled to about 20
MHz.  However, custom RISC designs in CMOS are now reaching 40 MHz, which
would be impossible with the double-clocked interface currently on the
r2000.  Perhaps the interface could be removed, given enough pins, but
that gets you back into the packaging limits.

One of the prime considerations in state-of-the-art RISC design today
HAS to be chip-edge bandwidth, how to improve it and how to minimize
its usage (conserve it).  Even going to off-chip cache is expensive
at these speeds.

I suspect you'll be seeing interesting attempts in this area soon.

     //	Randell Jesup			Lunge Software Development
    //	Dedicated Amiga Programmer	13 Frear Ave, Troy, NY 12180
 \\//	lunge!jesup@beowulf.UUCP	(518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup)

mash@mips.UUCP (John Mashey) (12/14/87)

In article <1941@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM (0000-Dennis Russell) writes:
>In article <8809@sgi.SGI.COM> baskett@baskett writes:
......
>>Branches are slow.  Since taken branches need only one delay slot
>>there must be an address adder for the program counter.  But with a
>>single cache you have to decide early what the next instruction address
>>is.  Both SPARC chips always decide that a branch will be taken so there
>>is an additional cycle penalty when the condition isn't satisfied and you
>>have to junk the instruction you fetched and fetch the right one.  On
>>
>I think there might be some confusion here on the operation of the Annul
>Bit during conditional branches.  It is my understanding that when this bit
>is 0 then the delay instruction (the instruction following the branch) is
>executed whether the branch is taken or not.  When this bit is 1 then the
>delay instruction is executed only if the branch is taken - if the branch
>is not taken then the delay instruction which is already in the pipeline is
>aborted.
>
>Therefore, with the Annul Bit equal to 0 branches execute in 1 cycle
>whether the branch is taken or not.  With the Annul Bit at 1 a taken branch
>executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the
>branch and 1 cycle for the aborted delay instruction.

Forest and Dennis are talking about different things.
See the Fujitsu SPARC datasheet, and Namjoo & Agrawal, "Preserve high speed in
CPU-to-cache transfers", Electronic Design, August 20, 1987, 91-96.
These are consistent in saying:
Fujitsu: "In performing delayed control transfer, the MB86900 processor always
fetches the next instruction following a control transfer.  Then the processor
either executes this instruction or annuls it....This enables the pipeline to
advance while the control target instruction is being fetched...By assuming
a conditional branch to be taken, the processor minimizes pipeline interlock
by providing one cycle execution for taken branches, or two cycle execution
for untaken branches."

Namjoo,Agrawal: "In this pipeline, the fetch address for instruction n is
generated during the decoding stage of instruction n-2.  Since all
branch instructions are delayed by one cycle, all relative branch instructions
take one cycle if the branch condition is true because the target instruction
is fetched before the condition codes are ready.  If, after condition codes
are evaluated, it was determined that the branch was not taken, the processor
ignores the target instruction and continues to fetch the next instruction
in the sequence."

Thus, given instructions:
1: conditional branch
2: branch delay slot
3: after branch delay slot
N: target of branch

Taken branch:
1, 2*, N   (*= might be annulled)
Untaken branch:
1, 2*, N**, 3  (** = ignored)

The implication is that the CPU doesn't quite know the condition-code
result in time, and thus has to guess.  I can't tell from the Cypress
datasheet whether or not they do the same thing.  [Does anybody know who can say?]

Given that one has decided to take some hit, this is probably the right way,
in that taken conditional branches are on the order of 15% of instructions
and untaken ones are on the order of 5% (on our machines), although this
does vary: 1/3 of the programs we looked at had more untaken than taken
branches.  [I think Earl Killian posted this data a while back].
Thus, the SPARC branch design has (in terms of +=good, -=bad):
	+ annul bit
	+ ability to set condition codes on ALU ops
	- extra cycle for untaken conditional branch
	- condition-code based branch, i.e., often requires compare for
	 eq, neq, etc that could actually be done as 1-cycle cmp-branches
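
As a back-of-the-envelope check on how much the untaken-branch penalty by
itself costs (my arithmetic, using the rough 15%/5% mix above and ignoring
the separate compare-instruction cost):

    #include <stdio.h>

    /* Average cost of a SPARC-style assume-taken conditional branch:
       1 cycle if taken, 2 if untaken (the extra I-fetch).  The
       frequencies are the rough ones quoted above. */
    int main(void)
    {
        double taken = 0.15, untaken = 0.05;  /* fraction of all instructions */
        double avg = (taken * 1.0 + untaken * 2.0) / (taken + untaken);
        printf("average conditional branch: %.2f cycles\n", avg);
        printf("extra cycles per 100 instructions: %.0f\n", untaken * 100.0);
        return 0;
    }

which works out to 1.25 cycles per conditional branch, or 5 extra cycles
per 100 instructions, before counting any extra cmp instructions.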

Also, in looking at SPARC assembly code, one notes that cmp's are usually
moved away from the conditional branches, so that perhaps these CPUs,
or later ones, will take advantage of cases where the condition code setting
is early enough to avoid the extra I-fetch.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@apple.UUCP (Brian Case) (12/15/87)

In article <8885@sgi.SGI.COM> baskett@baskett writes:
>In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
>> ...
>> >The separate instruction and data cache only run
>> >at single cycle rates but they run a half cycle out of phase with each
>> >other so it all works out.  (Pretty slick, don't you think?)
>> 
>> Yes, I do think it is pretty slick, but I also think this is a liability
>> at clock speeds higher than 16 Mhz (and maybe even at 16MHz).  I am sure,
>> though, that MIPS has a plan to fix this problem.  It sure seems like the
>> way to go at 8 Mhz.  Preventing bus crashes (i.e. meeting real-world
>> timing constraints) can be problem.
>
>The 16 MHz MIPS parts we have work fine.  If it becomes a problem, the fix
>is pretty obvious, too.

Oh, I am sure they work great.  I didn't mean that they would be flaky or
intermittent or something, just that the system design is trickier.

>> I am sure one of their chief concerns was future ECL implementation.
>I have an ECL implementation of an experimental Risc processor (board)

[Yes, that's a good machine!  I hear it is the "DEC Dorado."]

>in my office.  My experience with the team that designed and built it
>(a great group of people at DEC's Western Research Lab, by the way)
>tells me that the MIPS architecture is more suitable for ECL implementation
>than the SPARC architecture.  (see next comment)
>
>> by choosing register windows (which lets them vary the number of registers,
>> in window increments, for a given implementation) and a very simple
>> definition otherwise, SUN simply did the best they could to make future
>> implementation easy.
>
>It may have been the best they could do but it looks like a mistake to me.

Well, notice that it was *I* who said that they were doing "the best they
could."  Please don't take my word as the official SUN position!  Seldom
does anyone really do "the best they could."  One man's mistake is another
man's stroke of genius.

>In higher performance technologies the speed of register access becomes
>more and more critical so about the only thing you can do with register
>windows is to scale them down.

Yes, in the first ECL single-chip implementation.  Then, as the technology
gets denser, you can scale them back up to the desired level.  I was not
talking about discrete ECL implementation; I should have made that clear.
You may think that even single-chip ECL implementations suffer with large
register files, but I don't believe so (but I'm still youngish and naive).

>And as the number of windows goes down,
>the small gain that you might have had goes away and procedure call
>overhead goes up.  Attacking the procedure call overhead problem at
>compile time rather than at run time is a more scalable approach.

Well, I understand what you are saying: "the available density of the
technology is irrelevant, to a degree, with a smallish [my opinion],
fixed-size register file."  On the other hand, *by definition,* the SUN
approach is more scalable since there is at least some opportunity for
scaling; a fixed-size register file cannot, by definition, be scaled.
(Or, have I missed something?  Sorry if so.)

1) Notice that if SUN decides to dump the overlapping register window
approach, they can!  They can treat one procedure context as the only
context available and use a procedure calling mechanism like MIPS. 
Compatibility can be maintained by having the old instructions trap and
do the right thing.  This will allow them to implement a register file
the same size as the MIPS register file.  Presumably, we'll be at such
processing speeds then that old binaries, which use the old procedure
calling mechanism, will run fast enough, even with the trap overhead.
(The idea here makes sense, but I'm not sure I'm communicating it well.)

2) Didn't David Wall do research on register allocation at link time
that showed that lots of registers are better?  Admittedly, his approach
needed a large pool of registers, like in the Am29000, not the overlapping
register windows of the SPARC (couldn't resist!  :-).  Do you now think
that the MIPS 32-entry file is as good as the 64-entry file on the
experimental machine to which you refer?  I'm genuinely curious here,
not asking a rhetorical question.  I was under the impression that
register allocation at link time was sorta "the wave of the future"
(I hate that expression); if so, wouldn't 32 be too small?

3) You have to remember that it will be necessary to have at least some
TLB-type or other cache-type function finish in one machine cycle.  True,
the array technology used for TLBs can be denser, and therefore a little
faster, than multi-ported register file array technology.  However, if you
can get your TLB array access and compare in one cycle, why do you think
that you can't get your register-file-array access and address compute
(be it add, or whatever) in one cycle?  What was the cycle-limiting
factor in the experimental machine that you have in your office?

Thanks in advance.

mash@mips.UUCP (John Mashey) (12/16/87)

In article <344@ma.diab.UUCP> pf@ma.UUCP (Per Fogelstrom) writes:
....
>It's not surprising that the MIPS 2000 gives the most power/Mhz; the architecture
>has evolved over many years, without hard pressure from marketing such as
>"We must have it NOW!!!".  (John Mashey maybe has another opinion, only my guess.)

Be serious! In a startup, the marketing pressure is for "we must have it
yesterday!"  As is well-documented in papers and presentations, the
architecture owes a lot to earlier work, like the IBM 801 and Stanford MIPS,
but almost all of the basic architecture work for the existing R2000/R2010
was done in about 6-8 months, starting in Nov-Dec 1984.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hansen@mips.UUCP (Craig Hansen) (12/16/87)

In article <140@imagine.PAWL.RPI.EDU>, jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) writes:
> The real problem is the fact that your chip edge is clocked at twice your
> instruction frequency.  Running a higher-speed clock than the instruction rate
> is fine, and makes internal design much easier.  However, packaging
> technology will be your limiting factor for some time to come, not really
> ram speed per se.  For the large number of pins required, it is hard to find
> packages certified at that speed.

For speeds well above 40 MHz in CMOS technology, our studies suggest that this
will not be a limiting factor at all, and that multiplexed busses can work as
fast as non-multiplexed buses at least up to that clock rate.  The real enemy
to high clock rates is clock skew, and controlling that skew is the reason why
we use a tapped-delay-line clocking system, and phase-locked-loop technology.
The tapped-delay lines also allow the timing of the chip to be adjusted to
accommodate variations in the timing specifications of SRAM chips; the
phase-locked-loop technology allows the CPU and FPU to have matched timings, no
matter how the CMOS processing causes circuit speed variations.

In our earlier days, some other company that shall remain nameless (but
recently had their "ship" repossessed) made some wild claims (that we heard
repeatedly but always "second"-hand) that the MIPS chip was no faster (when
used at half speed) than theirs (when used at full speed), but was actually
clocked twice as fast as we said it was.  After all, we have double-frequency
clock inputs.  Well, OK.  But the other company's chip had double-frequency
clock inputs, too, and when you compared the two chips at their specified clock
rates, ours ran more than twice as fast (benchmark-wise), and theirs had double
the input clock rate.  Talk about double-speak!

The multiplexed buses, far from being a limiting factor, are an important
reason why the MIPS chip is "fast" and the SPARC chip is "slow." By putting all
the important cache interface logic on the CPU chip, the critical paths in the
cache are entirely set by the speed of static RAM chips, without interference
by tag comparison logic, and parity generation and checking logic. Because
SRAMs are used as technology drivers for new CMOS and BiCMOS technologies, MIPS
can be assured of a good supply of highly aggressive SRAMs that will work with
the MIPS part.

The real problem with the other RISC designs is that they require specialized
cache RAMs or external tag comparators.  SRAM vendors can do good business
selling RAMs with internal tag comparison logic that try to cover up the faults
of the RISC processor designers, yet are based on the previous generation
technology - to be blunt, specialized cache RAMs are a winner for the RAM
vendor (who gets to sell a proprietary part on old technology for a high price)
but a loser for the RAM vendee (who pays too much per bit for slower RAM).
...and the specialized cache chips from the RISC vendors have been even slower,
smaller and more expensive per bit than standard SRAMs.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (12/16/87)

In article <6993@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>On the other hand, *by definition,* the SUN
>approach is more scalable since there is at least some opportunity for
>scaling; a fixed-size register file cannot, by definition, be scaled.

There is an optimum size for a windowed register-set. The optimum may be
hard to locate precisely, but it has to exist. For example, there are
programs which don't overflow the window set of current chips. Building
chips with more registers cannot speed those programs up.

A more interesting question is, just what makes something scalable ?  Sun's
answer seems mostly to be "complexity" - they tried to minimize the chip
design time so that implementation can track the implementation technology.
But, of course, eventually, someone will build a complex implementation,
because they had all those gates left over ... and so it goes.

Hewlett-Packard just cancelled the project that was building a Spectrum out
of ECL.  Reportedly they were $20M in.  Does anyone know why?  Does this mean
anything for the ECL SPARC, or for scalability?
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

ian@esl.UUCP (Ian Kaplan) (12/16/87)

In article <344@ma.diab.UUCP> pf@ma.UUCP (Per Fogelstrom) writes:
>Well, history repeats itself once again.  A new RISC chip is launched and people's
>expectations reach new "high scores".  A few years ago there was another RISC
>chip set brought to the market, called the Clipper.  [ deleted text ]
>For this chip set the picture has cleared now.  The performance
>range is not much more than can be achieved with a 16-20 Mhz 68020.
[ deleted text ]
>Well, they could have had 2-3 times the
>Clipper performance with the NS32532 today.  And they called the buy a bargain!


   I think that the discussion on the SPARC vs. the MIPS R2000 centers
   around why the SPARC is not faster than it is - specifically, why it is
   not as fast as the MIPS processor.  What seems to have been missed here,
   if I properly understand Mr. Fogelstrom's article, is that the SPARC is
   quite fast.  The lab I work in has a Sun 4/280 and I can tell you that
   it smokes.  It may be 20% slower than the MIPS processor, but it is by
   no means a failure.  The SPARC is much faster than the Motorola 68020 
   and, I would bet, the National processors.  How the MIPS and SPARC scale 
   remains to be seen.  You should remember that neither Sun nor MIPS will
   keep their hardware architecture static.

   I have greatly enjoyed the discussion of SPARC vs MIPS architecture.
   This sort of interchange makes comp.arch worth reading.  A happy
   holiday season to you all,

           Ian L. Kaplan
           ESL, Advanced Technology Systems
           M/S 302
           495 Java Dr.
           P.O. Box 3510
           Sunnyvale, CA 94088-3510

           decvax!decwrl!borealis!\
                   sdcsvax!seismo!- ames!esl!ian
                    ucbcad!ucbvax!/     /
                          ihnp4!lll-lcc!

garner@gaas.Sun.COM (Robert Garner) (12/16/87)

The expositions on comp.arch about SPARC and the gate array implementation 
are interesting.  Some of the inaccuracies have been addressed 
but others remain unanswered.   Mashey's recent article <1115@winchester.UUCP>
did clear up the confusion surrounding the implementation of conditional
branches that was incorrectly portrayed by Forest Baskett <8809@sgi.SGI.COM>
and Dennis Russell <1941@ncr-sd.SanDiego.NCR.COM>.  Brian Case has taken
a fairly impartial look at the architecture in <6964@apple.UUCP>
and <6993@apple.UUCP>.

Baskett's message was refreshing in that he accurately differentiated
between implementation and architecture.  (Quite unlike previous
criticisms, such as from the so-called "MIPS Performance Brief.")
  
However, Baskett's article continues to incorrectly portray the integer
performance of Sun-4/200 workstations and SPARC in general.
Sun's data on MIPS performance implies that the Sun-4/200
has approximately the same INTEGER performance as the M/1000.
This fact is frequently ignored since the Sun-4/200 floating-point
performance is generally (but not always) less than the M/1000.
Baskett correctly deduces that this is due to the use of the Weitek
1164/54 floating-point chips, which are slow compared to MIPS' custom FPU.   

The Fujitsu gate arrays plus the Weitek chips were a reasonable vehicle 
for a SYSTEMS company like Sun to prove and quickly bring to market an OPEN,
RISC-based workstation/server plus a wide range of application SOFTWARE.
Sun, unlike MIPS, is not organized around the task of designing 
and fine tuning custom-designed ICs.  It has even taken MIPS, 
whose lifeblood depends on a fast processor, more time than expected
to deliver parts at speed (15-16 MHz).  Now that SPARC is
established, Sun is working closely with semiconductor companies
themselves.  This work includes improved floating-point implementations.

Forest concluded his article by saying:

> Since MIPS and Sun seem to be producing these systems with similar
> technologies at similar clock rates at similar times in history, these
> differences in the cycle counts for our most favorite and popular
> instructions seem to go a long way toward explaining why SPARC is so slow.

This hand waving is too fast!  A standard, off-the-shelf gate array is 
NOT in the same league as a custom CMOS design.  Indeed, that a gate 
array has the same integer performance as a tuned, full-custom, 
"similar technology" implementation is an indication of the strength
of the architecture!
 

Forest attempted to deduce the gate-array CPI value for integer 
and floating-point programs.  From this analysis, he concluded:

> These ratios [based on CPIs] are also consistent with the benchmark 
> results in the Performance Brief. 

Yes, floating-point suffers because of the Weitek chips.
And yes, MIPS' "Performance Brief" attempts to stigmatize SPARC 
by dwelling on this:  its benchmark suite and MIPS-rate calculations
are conveniently based almost entirely on floating-point programs!

But no, one can not accurately judge different processors
by comparing their implementation-dependent "cycles per instruction" 
(CPI) values.  Performance also depends on the number of instructions (N) 
issued by a compiler.  For example, MIPS's delayed load does not affect
their CPI but increases their N when NOPs are required, whereas 
SPARC's interlocked load decreases N but counts against its CPI.  
SPARC's register windows and correspondingly fewer loads and stores
also decrease its N relative to MIPS.  By avoiding a more detailed
analysis that includes N (via simulations), one ignores the state
of the compilers and associated optimizations (via SPARC's annul bit, 
for instance.)  In general, there is always room for improvement in
compiler generated code.

The Sun-4/200, for LARGE C integer programs, runs at about 1.65 CPI.
This includes 15% loads and 5% stores AND the miss cost associated
with the 128K-byte cache and the large, asynchronous main memory.
(Baskett's calculation assumed MIPS' distribution, 20% loads and 10% stores, 
which is not applicable to SPARC.  Since cache effects can dominate 
performance, I suspect that the M/1000 large-C-program CPI
could be near 1.6 if its cache/memory is taken into account.)
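
(For calibration: plugging that 15%/5% load/store mix into Baskett's own
per-class cycle counts (his counts, my arithmetic, with branches held at
15%) gives 2(.15) + 3(.05) + 2.5(.15) + 1(.65) = 1.48 before any miss cost,
so a measured 1.65 CPI leaves roughly 0.17 cycles per instruction for cache
misses and memory-system effects.)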

As processor cycle time shrinks, the CPI for CPUs of all types increases 
because the miss cost rises.  This is because main memory access
times are not scaling as rapidly as processor cycle times.  
This negative effect on CPIs must be offset by improvements 
in CPU pipelines and is even more pronounced in low-CPI
machines.  SPARC implementations are balanced in a way that achieves
shorter cycle times, does not cause an increase in CPI, and carefully
considers chip-edge bandwidth issues.  SPARC implementations include
single-cycle loads and single-cycle untaken branches.

Of course, the most error-free measure of performance is wall clock time.
Until there are more results of some large integer programs running
both on the Sun-4 and the M/1000, speculation can be unproductive.


Now, what about register windows?  In Baskett's second article
<8885@sgi.SGI.COM>, he writes:

> It may have been the best they could do but it looks like a mistake to me.
> In higher performance technologies the speed of register access becomes
> more and more critical so about the only thing you can do with register
> windows is to scale them down.  And as the number of windows goes down,
> the small gain that you might have had goes away and procedure call
> overhead goes up.  Attacking the procedure call overhead problem at
> compile time rather than at run time is a more scalable approach.

Two points:

(1)  It is hard to visualize the future difference between implementing 
1K-bit vs. 4K-bit register files (i.e., 32 registers versus 128 registers).  
Memories can turn out larger and faster than intuition indicates.

(2)  SPARC does NOT PRECLUDE interprocedural register allocation (IRA)
optimizations and thus ALLOWS for "attacking the procedure call 
overhead problem at compile time rather than at run time."
SPARC has two mechanisms to reduce load/store traffic:  
register windows and IRA!   

In SPARC, the procedure call and return instructions are different 
from the ones that increment and decrement the window pointer.  
(SPARC's "save" and "restore" instructions decrement and increment 
the window pointer.  They also perform an "add", which usually adjusts 
the stack pointer.  The pc-relative "call" and register-indirect
"jump-and-link" do NOT effect the window pointer.)

A minimum SPARC implementation could have 40 registers:  8 ins, 
8 locals, 8 outs, 8 globals, and 8 local registers for the trap handler.
Such an implementation is not precluded by the architecture, but
would probably imply IRA-type optimizations.  It would function
as if there were no windows, although window-based code would
properly execute, albeit inefficiently. 

Register windows have several advantages over a fixed set of registers,
besides reducing the number of loads and stores by about 30%:
They work well in LISP (incremental compilation) and object-oriented
environments (type-specific procedure linking) where IRA is impractical.
They can also be used in specialized controller applications that
require extremely fast context switching:  a pair of windows (32 registers)
can be allowed per context.
--------------------------------
Robert Garner
Sun Microsystems             

P.S.  There will be two sessions devoted to SPARC at the IEEE Spring Compcon:
One session will cover the architecture, compilers, and the SunOS port
and the other will cover the Fujitsu, Cypress, and BIT implementations.

DISCLAIMER:  I speak for myself only and do not represent the views 
of Sun Microsystems, or any other company.

mash@mips.UUCP (John Mashey) (12/17/87)

In article <538@esl.UUCP> ian@esl.UUCP (Ian Kaplan) writes:
...
>   I think that the discussion on the SPARC vs. the MIPS R2000 centers
>   around why the SPARC is not faster than it is - specifically, why it is
>   not as fast as the MIPS processor.  What seems to have been missed here,
>   if I properly understand Mr. Fogelstrom's article, is that the SPARC is
>   quite fast.  The lab I work in has a Sun 4/280 and I can tell you that
>   it smokes.  It may be 20% slower than the MIPS processor, but it is by
>   no means a failure.  The SPARC is much faster than the Motorola 68020 
>   and, I would bet, the National processors....

I don't think that anyone is arguing that SPARC is slower than a 68K,
or a failure. (It's not, and it isn't.)
What is going on is some serious architectural debate,
(which is what this newsgroup is for!).  Of course, some of this has
been stirred up by people finally being able to get some actual data,
and then trying to understand what's going on.  Note that with one
exception [Dave Hough's careful posting a while back of various FP
benchmarks], the only performance appraisals that have been offered by Sun
to the general public [to my knowledge, I'll be glad to hear of others] are:

1) The original set of benchmarks in the Sun-4 introduction publicity.
(Dhrystone, Stanford, Linpack SP, Linpack DP, and Spice).

2) Introduction materials, brochures, and advertisements:

The original announcement described the Sun-4 as ``10 mips'',
``ten times faster than a VAX 11/780'' [1], and
``in the same performance class as the VAX 8800''. [2]

"Relative to other manufacturer's high-end offerings,
the Sun-4/200 excels in floating-point performance.
In fact, the Sun-4/200 will execute floating-point-intensive applications
faster than the VAX 8800 superminicomputer." [3]

"Our new Sun-4/260, the first born of our brand new family of supercomputing
workstations and servers.  In computer-ese, it delivers the performance
of 10 MIPS.  For the sake of comparison, that's as much horsepower
as a minicomputer like the DEC VAX 8800." [4]

"SPARC is an open architecture, available today,
that Sun uses to implement the best price/performance system available, [5]
reach as low as $4000 per million instructions per second." [6]

"SPARC is the first [7] RISC architecture to incorporate the features found
in supercomputers such as Cray systems.
Single-cycle access to a large cache memory, [7a]
a large register file, [7b]
and pipelining, [7c]
features pioneered on supercomputers, are part of SPARC.
Register-to-register and load/store design, [7d] along with fixed-format [7e]
instructions and more concurrency [7f] in our architecture,
propel Sun's RISC machines to unmatched performance." [8]

"SPARC is an open, scalable architecture.
It is the most scalable [9] RISC architecture available today."

Now, of the numbered assertions, some are amenable to quantitative analysis,
by gathering data carefully and publishing it.  If properly done,
and if enough real benchmarks can be obtained, people can reach their own
conclusions about whether or not they believe these assertions.

[1], [2], [3], [4] (except horsepower is a little fuzzy: does it include
floating point? does it include multi-user performance?), [8] are
statements about performance, and should be testable.
[5] and [6] are a little more slippery, as one needs to compute performance
first, and then get costs for comparable configurations.
[7] can be analyzed by comparing with all other RISC architectures that
shipped before SPARC.
[9] is hard to evaluate, since scalability is not easily measured.

Anyway, when Forest asks "why is SPARC slow", I think what he means
is that neither the published data nor customer benchmarking seems to
justify the conclusions reached above, and from the outside,
those are the hypotheses that the rest of us get to deal with.

Ian: since you have a 4/280, perhaps you might offer some benchmarks,
which would add to our knowledge of a (controversial) topic.
In particular, it would be wonderful if you've got any large,
actual integer applications, especially if they can be made public-domain
so that they can be run anywhere. [floating-point ones are fine, too,
but there already exist lots of those, whereas there's a sad lack of
integer ones.]
Unfortunately, saying that a machine "smokes" doesn't help as much!

Also, it would help to specify which Motorola implementation it was much
faster than: a recent Computerworld article fell into the trap of
saying the Sun-4 was exceeding expectations, because it looked 3-5X faster
(than the Sun-3), more even than the claimed 2.5X.  If you consider
2-mips 3/100s and 4-mips 3/200s, you can see what happened.
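
(That is, against a 2-mips 3/100 a nominal 10-mips Sun-4 looks 5X faster,
while against a 4-mips 3/200 it looks 2.5X faster, which is the claimed
ratio; the "3-5X" depends mostly on which Sun-3 you compare it with.)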

To summarize: statements about performance are either completely meaningless,
or they're actually supposed to tell something about how computers
behave.  If they're the latter, you should be able to test them.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mike@ivory.SanDiego.NCR.COM (Michael Lodman) (12/17/87)

In article <36626@sun.uucp> garner@sun.UUCP (Robert Garner) writes:
>Sun's data on MIPS performance implies that the Sun-4/200
>has approximately the same INTEGER performance as the M/1000.

The data I've seen in no way backs up this statement.  From integer
benchmarks I've run on the Sun-4 and a 12 MHz MIPS M/800, the MIPS
is, conservatively, about 20%-30% faster.  I haven't yet run
any floating-point benchmarks.

-- 
Michael Lodman  (619) 485-3335
Advanced Development NCR Corporation E&M San Diego
mike.lodman@ivory.SanDiego.NCR.COM 
{sdcsvax,cbatt,dcdwest,nosc.ARPA,ihnp4}!ncr-sd!ivory!mike

When you die, if you've been very, very good, you'll go to ... Montana.

henry@utzoo.uucp (Henry Spencer) (12/17/87)

> Also, in looking at SPARC assembly code, one notes that cmp's are usually
> moved away from the conditional branches, so that perhaps these CPUs,
> or later ones, will take advantage of cases where the condition code setting
> is early enough to avoid the extra I-fetch.

AT&T's CRISP machine in fact takes this to its logical (?) extreme:  it
basically has one condition-code bit, and if you can manage to set that
slightly ahead of time, then the execution time for an in-cache branch
is *zero*.  (The actual story is a bit more complicated, but that's the
general idea, as I recall it from the paper in Sigarch 14.)
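
As a rough source-level illustration of the idea (invented C, not anything
from AT&T, Sun, or MIPS): if the value that decides a branch is produced well
before it is tested, a compiler for a condition-code machine can emit the cmp
correspondingly early, and a machine like CRISP can fold the branch away.

    /* Invented example: the branch-deciding value is computed first,
     * independent work follows, and only then is it tested.  A scheduler
     * can emit the cmp (or set CRISP's single condition bit) early and
     * keep the pipeline busy until the branch needs the result. */
    long sum_with_penalty(const long *v, int n, long limit)
    {
        long total = 0;
        int i;

        for (i = 0; i < n; i++) {
            int over = (v[i] > limit);   /* condition computed early         */
            long scaled = v[i] * 2 + 1;  /* independent work fills the gap   */

            total += scaled;
            if (over)                    /* branch tests a settled condition */
                total -= limit;
        }
        return total;
    }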
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

ian@esl.UUCP (Ian Kaplan) (12/18/87)

In article <1156@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Ian: since you have a 4/280, perhaps you might offer some benchmarks,
>which would add to our knowledge of a (controversial) topic.
>In particular, it would be wonderful if you've got any large,
>actual integer applications, especially if they can be made public-domain
>so that they can be run anywhere. [floating-point ones are fine, too,
>but there already exist lots of those, whereas there's a sad lack of
>integer ones.]
>Unfortunately, saying that a machine "smokes" doesn't help as much!
>
>To summarize: statements about performance are either completely meaningless,
>or they're actually supposed to tell something about how computers
>behave.  If they're the latter, you should be able to test them.
>-- 

   We had a chance to use a Sun-4 for several months before we actually
   purchased the machine.  We ran the standard benchmarks (e.g., Dhrystone
   and Whetstone) and several of our in-house applications.  Our results
   show that the Sun-4 is about 8 VAX 11/780 MIPS on some benchmarks and
   as high as 10 VAX MIPS on others.
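
   For readers unused to the "VAX MIPS" convention: such figures are usually
   obtained by dividing a Dhrystone score by the VAX-11/780's conventional
   rating of roughly 1757 Dhrystones per second, the 780 being the nominal
   1-MIPS machine.  A trivial sketch, with a made-up score:

       /* Made-up numbers, only to show the arithmetic behind "VAX MIPS". */
       #include <stdio.h>

       int main(void)
       {
           double vax_780_rate = 1757.0;   /* conventional 11/780 Dhrystones/sec */
           double measured     = 14000.0;  /* hypothetical Dhrystone result      */

           printf("approx. %.1f VAX MIPS\n", measured / vax_780_rate);
           return 0;
       }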

   Our VLSI group has been running their design software and they have
   found that the Sun-4 is almost three times faster than a Sun 3/260,
   or around 10 MIPS, for one of their design packages, which uses
   primarily integer arithmetic.  On one of my group's graphics
   applications, which does 3-D rotation and is fairly floating-point
   intensive, the Sun-4 is over 4 times the speed of a Sun 3/180 with
   a floating-point accelerator board.  The VLSI group has not run
   HSPICE on the Sun-4 yet, but if and when they do, I will post the
   results.  Unfortunately all the code for the applications I have
   mentioned is proprietary and cannot be distributed.
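
   To give a feel for why that rotation step leans so hard on floating point,
   here is a generic sketch of such an inner loop (not our proprietary code):
   each point costs nine multiplies and six adds.

       /* Generic 3-D rotation kernel: apply a 3x3 matrix r to n points. */
       void rotate_points(double r[3][3], double (*pts)[3], int n)
       {
           int i;

           for (i = 0; i < n; i++) {
               double x = pts[i][0], y = pts[i][1], z = pts[i][2];

               pts[i][0] = r[0][0]*x + r[0][1]*y + r[0][2]*z;
               pts[i][1] = r[1][0]*x + r[1][1]*y + r[1][2]*z;
               pts[i][2] = r[2][0]*x + r[2][1]*y + r[2][2]*z;
           }
       }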

>
>Also, it would help to specify which Motorola implementation it was much
>faster than: a recent Computerworld article fell into the trap of
>saying the Sun-4 was exceeding expectations, because it looked 3-5X faster
>(than the Sun-3), more even than the claimed 2.5X.  If you consider
>2-mips 3/100s and 4-mips 3/200s, you can see what happened.
>

  Well, I guess that I fell into the same "trap" as Computerworld.  My
  remarks about SPARC vs. the 680x0 are relative to the Sun-3 computer
  systems.  My understanding is that the Sun-3 family does a fairly good
  job of utilizing the 68020.

  I will state once again that I have really enjoyed the discussion of
  SPARC vs. MIPS.  This discussion has been an example of comp.arch at its
  best.  There is no question that solid benchmark data is needed to
  evaluate various architectural approaches.  However, there are many
  factors that go into making a successful commercial computer system.  One
  of these is system price.  I do not think that Sun is trying to build
  the fastest computer in its class, but I do think that they are trying
  to build the computer with the best system price per MIPS.

           Ian L. Kaplan
           ESL, Advanced Technology Systems
           M/S 302
           495 Java Dr.
           P.O. Box 3510
           Sunnyvale, CA 94088-3510

                    decvax!decwrl!\
                   sdcsvax!seismo!- ames!esl!ian
                    ucbcad!ucbvax!/     /
                          ihnp4!lll-lcc!
  

irf@kuling.UUCP (Stellan Bergman) (12/18/87)

How does the SPARC architecture differ from HP's HPA/RISC?  The latter has
been around for quite a while now.  I am curious to know, since we plan
to move to something faster soon.

Bo Thide', Swedish Institute of Space Physics.  UUCP: ...enea!kuling!irfu!bt

jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (12/18/87)

<John Mashey posts a long article about Sun and MIPS performance, including
at the end unix benchmarks for grep, diff, yacc, etc.>

John, one thing was left unmentioned: the unix benchmarks you give are
VERY dependent on the file-system implementation and, even more so, on
the disks used and how fragmented they are.  I'm sure I could cause a
factor-of-two decrease in performance by fragmenting your disks or making you
use slower ones.  Any benchmarks that rely on file-system access should either
use the same disks, preferably the exact same ones so the fragmentation is
the same (I know, impractical), or at least try to quantify these differences
and state exactly what the hardware is: average seek times, how fragmented, etc.
Better yet, don't rely on OS-dependent, and certainly not file-system-dependent,
benchmarks.
If you are testing the performance of a specific system configuration (CPU,
memory, disk, OS, etc.), fine, do so.  If you are addressing the
performance of the CPU/FPU/cache (which seems to be what is discussed in this
group), don't use those benchmarks, or at least try to factor out the
peripheral/OS/whatever differences.
Just trying to adhere to your philosophy of real data wherever possible. :-)

     //	Randell Jesup			Lunge Software Development
    //	Dedicated Amiga Programmer	13 Frear Ave, Troy, NY 12180
 \\//	lunge!jesup@beowulf.UUCP	(518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup)

mash@mips.UUCP (John Mashey) (12/20/87)

In article <167@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
><John Mashey posts a long article about Sun and MIPS performance, including
>at the end unix benchmarks for grep, diff, yacc, etc.>

>John, one thing was left unmentioned: the unix benchmarks you give are
>VERY dependent on the file-system implementation and, even more so, on
>the disks used and how fragmented they are.  I'm sure I could cause a
>factor-of-two decrease in performance by fragmenting your disks or making you
>use slower ones.  Any benchmarks that rely on file-system access should either
>use the same disks, preferably the exact same ones so the fragmentation is
>the same (I know, impractical), or at least try to quantify these differences
>and state exactly what the hardware is: average seek times, how fragmented, etc.
>Better yet, don't rely on OS-dependent, and certainly not file-system-dependent,
>benchmarks.
>If you are testing the performance of a specific system configuration (CPU,
>memory, disk, OS, etc.), fine, do so.  If you are addressing the
>performance of the CPU/FPU/cache (which seems to be what is discussed in this
>group), don't use those benchmarks, or at least try to factor out the
>peripheral/OS/whatever differences.
>Just trying to adhere to your philosophy of real data wherever possible. :-)

Most are perfectly valid points.  Fortunately, I followed all of the
rules.  Unfortunately, I didn't replicate the context, so anybody who
hadn't seen the earlier posting, or hadn't gotten a copy of the Brief,
might be misled.

The benchmarks posted were an update to the last Performance Brief,
posted here, and something we distribute to anyone who asks.  It spells
out in EXCRUCIATING detail how things are measured, what the configurations
were, why we measured what we measured, what we think each benchmark
measures, etc, etc.

The UNIX benchmarks listed have almost nothing to do with the file-system
and OS issues listed.  They give user CPU times.  Now, on a cached machine,
user CPU times are not necessarily independent of kernel CPU times,
and kernel CPU times are certainly not independent of disks.
However, the system CPU times on these range from 1-20% of the user
CPU (specifically: yacc: 1%, diff: 5%, nroff: 15%, grep: 20%), for
a mean of about 10%, so any system interference is fairly small.
As a guess, the influence of disk fragmentation on the numbers
presented is at worst in the 1% range, which is probably below the noise level
of UNIX timing.
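
For anyone who wants to reproduce that kind of user/system split, here is one
simple way to report a child command's user and system CPU time separately on
a modern Unix (the grep command below is only a placeholder, not one of the
benchmarks):

    /* Report user vs. system CPU time for a child command via times(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/times.h>

    int main(void)
    {
        struct tms before, after;
        long hz = sysconf(_SC_CLK_TCK);     /* clock ticks per second */
        double user, sys;

        times(&before);
        system("grep -c include /usr/include/stdio.h > /dev/null");
        times(&after);

        user = (double)(after.tms_cutime - before.tms_cutime) / hz;
        sys  = (double)(after.tms_cstime - before.tms_cstime) / hz;
        printf("user %.2fs  system %.2fs  (%.0f%% system)\n",
               user, sys, (user + sys) > 0 ? 100.0 * sys / (user + sys) : 0.0);
        return 0;
    }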

To predict UNIX command performance (which is highly relevant to some
people), it is difficult to find benchmarks that people understand,
can duplicate, and can relate to real performance, but which are
not OS-dependent and do no I/O.  We picked ones that did a fair amount
of user-level computing, compared to kernel computing, to do our best
to be clear about what was, and was not, being measured.
(OS performance is a whole separate area, quite important also.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

peter@sugar.UUCP (Peter da Silva) (12/22/87)

Ah, benchmarks.

Here's something to consider when you start believing benchmarks.

The Commodore-Amiga can approach the equivalent of 10 MIPS for certain
logical operations on arrays, if one codes a tight loop using the Blitter
(a special-purpose processor that does BitBlt operations).

The only non-graphics use I have heard of that comes anywhere near this
is a LIFE program that runs 20 generations per second on a 320-by-200 field.
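
To see where an "equivalent MIPS" figure like that comes from (sizes and
names below are invented): the loop is what a CPU would have to execute,
instruction by instruction, to make one pass over a bit plane, while the
Blitter does the same word-wide logical operations without fetching any
instructions at all, so counting the instructions it replaces produces a
very large number.

    /* CPU-loop equivalent of one blitter pass over a 320x200 bit plane.
     * Each iteration is work the Blitter does with no instruction fetches. */
    #include <stdint.h>

    #define PLANE_WORDS ((320 * 200) / 16)   /* 16-bit words in one bit plane */

    void or_planes(uint16_t *dst, const uint16_t *src)
    {
        int i;

        for (i = 0; i < PLANE_WORDS; i++)
            dst[i] |= src[i];   /* on a CPU: load, load, or, store, loop test */
    }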

Of course the machine as a whole doesn't do anywhere near even a single
MIP for general purpose work. It's just a 68000, after all.

So think of that next time you believe benchmarks.
-- 
-- Peter da Silva  `-_-'  ...!hoptoad!academ!uhnix1!sugar!peter
-- Disclaimer: These U aren't mere opinions... these are *values*.