[comp.arch] 55 MIPS & 66 MIPS

schow@bcarh61.bnr.ca (Stanley T.H. Chow) (11/13/89)

It seems to me the MIPS 55 MIPS (@ 60 MHz?) ECL system (chip set?)
is the "classical" approach for RISC designs to get higher
throughput. They do it by upping the clock rate.

Intel has gone the SuperScalar route. Their i960CA is said to be
66 MIPS @ 33 MHz. They have put the cleverness into multiple
execution units.

Here is the $64,000 question:

   Which part is easier to integrate into a real system? 

Please note that we have concrete real examples here. Theoretical
discussion is nice, but real data-points are more interesting.

Other interesting questions:

   Which system has a larger "domain" over which it actually
   achieves quoted figures?

   What other systems/chips/... are claiming over 50 MIPS?

   How do these systems compare in terms of cost (design and per unit)?



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh61
Me? Represent other people? Don't make them laugh so hard.

hawkes@mips.COM (John Hawkes) (11/14/89)

In article <1358@bnr-rsc.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley T.H. Chow) writes:
>
>It seems to me the MIPS 55 MIPS (@ 60 MHz?) ECL system (chip set?)
>is the "classical" approach for RISC designs to get higher
>throughput. They do it by upping the clock rate.
>
>Intel has gone the SuperScalar route. Their i960CA is said to be
>66 MIPS @ 33 MHz. They have put the cleverness into multiple
>execution units.

Once again, let's not confuse apples and oranges.  Using the MIPS performance
benchmark suite, the MIPS R6000-based *system* achieves 55 Vax-MIPS at 67-MHz.
Since it's not a superscalar design, the system executes 67 million
*instructions* per second at 67-MHz.  The ECL chipset is not the
limiting factor at this clock rate.

The i960 *chip* executes a theoretical max of 66 million *instructions*
per second at 33-MHz -- two per cycle.  I haven't heard Intel make any
claims about how fast a Unix *system* would execute real applications.
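The native-MIPS arithmetic underlying both figures can be made explicit.  A sketch of my own, not from either vendor; the function name and CPI values are illustrative:

```python
# Native MIPS as both posts compute it: clock rate divided by average
# cycles per instruction (CPI).  "VAX-MIPS" would further rescale by
# benchmark-measured work per instruction, which this sketch ignores.

def native_mips(clock_mhz, cpi):
    return clock_mhz / cpi

r6000  = native_mips(67, 1.0)   # one instruction per cycle -> 67
i960ca = native_mips(33, 0.5)   # two per cycle, best case  -> 66
```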

The Atlantic Research Corporation, an independent group, has done some
comparisons between the MIPS R3000 (25-MHz) and a 20-MHz 80960 executing Ada
programs (the "Common Avionics Processor Ada Benchmark Suite"), and they
discovered that the R3000 was usually more than twice as fast on hand-coded
programs, and overall was more than five times faster on compiled programs.

>Here is the $64,000 question:
>
>   Which part is easier to integrate into a real system? 

What kind of "real system"?  The R6000 is designed to be the heart of large,
general-purpose compute and/or file system server.  I don't think the same is
true of the i960.

-- 

John Hawkes
{ames,decwrl}!mips!hawkes  OR  hawkes@mips.com

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/15/89)

In article <1358@bnr-rsc.UUCP>, schow@bcarh61.bnr.ca (Stanley T.H. Chow) writes:

|  Other interesting questions:
|  
|     Which system has a larger "domain" over which it actually
|     achieves quoted figures?
|  
|     What other systems/chips/... are claiming over 50 MIPS?
|  
|     How do these systems compare in terms of cost (design and per unit)?

  Question one is the kicker. I don't care (as a user/buyer) how many
mips a CPU can perform, just how fast my stuff runs. For some programs
which don't overlap f.p. with other CPU work, the Intel will not deliver
full potential. For others which do, particularly if the non-f.p. ops
are the kind which seem to require more than one RISC op to perform but
might be a single op in CISC, I would expect the Intel to look very good.

  Actually this gets beyond RISC/CISC discussion back to the "fast
serial vs. parallel" track, since the Intel gets the rating by putting
execution units in parallel. This implies that there are big losses of
performance if the compiler doesn't keep the mix right, etc.
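The instruction-mix sensitivity can be put in rough numbers.  A toy model of my own, assuming instructions either dual-issue in perfect pairs or issue alone:

```python
# Toy model (mine, not from the thread) of how superscalar throughput
# degrades with instruction mix: if a fraction p of instructions can be
# dual-issued in pairs, each of those costs half a cycle and the rest
# cost a full cycle, so effective IPC = 1 / (1 - p/2).

def effective_ipc(pair_fraction):
    return 1.0 / (1.0 - pair_fraction / 2.0)

effective_ipc(1.0)   # perfect mix: 2 instructions per clock
effective_ipc(0.0)   # no pairing: back to 1 per clock
```

Halving the pairable fraction from 1.0 to 0.5 already drops the machine from 2.0 to about 1.33 instructions per clock, which is the "big losses" effect described above.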
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

scarter@gryphon.COM (Scott Carter) (11/16/89)

In article <31329@winchester.mips.COM> hawkes@mips.COM (John Hawkes) writes:
>In article <1358@bnr-rsc.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley T.H. Chow) writes:
>>
>>It seems to me the MIPS 55 MIPS (@ 60 MHz?) ECL system (chip set?)
>>is the "classical" approach for RISC designs to get higher
>>throughput. They do it by upping the clock rate.
>>
>>Intel has gone the SuperScalar route. Their i960CA is said to be
>>66 MIPS @ 33 MHz. They have put the cleverness into multiple
>>execution units.
>
>Once again, let's not confuse apples and oranges.  Using the MIPS performance
>benchmark suite, the MIPS R6000-based *system* achieves 55 Vax-MIPS at 67-MHz.
>Since it's not a superscalar design, the system executes 67 million
>*instructions* per second at 67-MHz.  The ECL chipset is not the
>limiting factor at this clock rate.
>
>The i960 *chip* executes a theoretical max of 66 million *instructions*
>per second at 33-MHz -- two per cycle.  I haven't heard Intel make any
>claims about how fast a Unix *system* would execute real applications.

Note that the above statement applies to the i960_CA_, whereas the quote below 
applies to the i960[KA,KB,MC,XA].  Also, note that at 67 MHz the R6000 can in
theory be executing two integer instructions (it still has the asynch mult/div
unit, no?) as well as I would guess two FP instructions.  However, it can only
ISSUE one instruction per cycle.  The 960 CA can issue three instructions per
cycle to the chosen three of four execute units.  I believe Intel has figures
showing that on the average they could in fact issue two instructions per clock
_average_ [over what program set?], hence the 960CA can legitimately be called
66 Native MIPS average with 99 Native MIPS peak.  How this will work out in
"reality" who knows?  I'm looking forward to Specmarks for a 960CA Real
System!

>The Atlantic Research Corporation, an independent group, has done some
>comparisons between the MIPS R3000 (25-MHz) and a 20-MHz 80960 executing Ada
>programs (the "Common Avionics Processor Ada Benchmark Suite"), and they
>discovered that the R3000 was usually more than twice as fast on hand-coded
>programs, and overall was more than five times faster on compiled programs.
>
This comparison was to the 960_XA_, which was crippled by the register port
design needed to get the windows on the chip.  Steve McGeady posted here
a while ago on why Intel made the choices they did - the above comparison
says essentially nothing about how a 960CA, with relatively few register
file / bypass conflicts, would fare.  The JIAWG benchmarks are pretty 
silly anyway.
>John Hawkes
>{ames,decwrl}!mips!hawkes  OR  hawkes@mips.com

preston@titan.rice.edu (Preston Briggs) (11/17/89)

In article <22303@gryphon.COM> scarter@gryphon.COM (Scott Carter) writes:

>ISSUE one instruction per cycle.  The 960 CA can issue three instructions per
>cycle to the chosen three of four execute units.  I believe Intel has figures
>showing that on the average they could in fact issue two instructions per clock
>_average_ [over what program set?], hence the 960CA can legitimately be called
>66 Native MIPS average with 99 Native MIPS peak.  

I think that's too optimistic.
We've played some with an i860 on an evaluation board.
The supplied compilers didn't attempt to issue more than 
1 instruction/cycle (out of a max of three).  

On a simple matrix multiply (single precision fp), 

  multiplying 2 100x100 matrices took .52 seconds (3.8 MFlops)
  multiplying 2 400x400 matrices took 86 seconds  (1.5 MFlops)

versus a peak of 66 MFlops.  The poor performance on the larger
size shows the effect of the small on-chip data cache.

Using the VAST front-end, with hand coded vector primitives
gives about 8.5 MFlops.

Reworking by hand, being especially careful of the cache,
gives about 26.5 MFlops, for either size.
(This can be improved, but I think only slightly).
This is fairly hot, though still not 66 MFlops.
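The "reworking by hand, being especially careful of the cache" is presumably loop tiling.  A generic sketch of the technique under my own assumptions (the function name and tile size are illustrative, not Briggs's actual code; on a real i860 the tile would be tuned to the on-chip data cache):

```python
def matmul_tiled(A, B, n, tile=4):
    """n x n product C = A*B, walked in tile x tile blocks so each block
    of A and B stays cache-resident while it is reused."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # multiply one tile of A by one tile of B, accumulating
                # into the corresponding tile of C
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the traversal order changes, which is why the tiled version can hold its MFlops rate "for either size" while the naive version falls off a cliff once the matrices outgrow the cache.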

The challenge is getting compilers to take advantage of
tiny caches and long pipelines and multi-instruction issue,
as the comparison quoted below suggests:

>>discovered that the R3000 was usually more than twice as fast on hand-coded
>>programs, and overall was more than five times faster on compiled programs.

Sounds like the MIPS compilers are more mature.  Certainly it's an
easier target.

Preston Briggs

tim@electron.amd.com (Tim Olson) (11/20/89)

In article <22303@gryphon.COM> scarter@gryphon.COM (Scott Carter) writes:
| The 960 CA can issue three instructions per
| cycle to the chosen three of four execute units.  I believe Intel has figures
| showing that on the average they could in fact issue two instructions per clock
| _average_ [over what program set?], hence the 960CA can legitimately be called
| 66 Native MIPS average with 99 Native MIPS peak.

The i960CA decoder can dispatch up to 3 instructions per cycle.
However, the decoder looks at 4 instructions at a time, and it appears
that the decoder cannot be loaded with the next set of 4 instructions
until the current set of instructions have all been dispatched.
Therefore, the "99 Native MIPS peak" can only be attained for one
clock cycle; the left-over instruction in the decoder would be
dispatched by itself in the next clock cycle.  In reality, it is "66
Native MIPS peak".
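Olson's arithmetic can be checked with a toy model.  This is my own sketch of the dispatch behavior he describes (best case, no hazards, refill only when the window is drained), not anything from Intel documentation:

```python
# Toy model of the i960CA decoder as described above: a 4-instruction
# window, up to 3 dispatched per cycle, and (per Olson's observation)
# the window is reloaded only after all 4 have been dispatched.

def sustained_ipc(n_instructions, window=4, max_dispatch=3):
    cycles = 0
    remaining = n_instructions
    in_window = 0
    while remaining > 0 or in_window > 0:
        if in_window == 0:                 # refill only once fully drained
            in_window = min(window, remaining)
            remaining -= in_window
        in_window -= min(max_dispatch, in_window)
        cycles += 1
    return n_instructions / cycles

sustained_ipc(400)   # 2.0 -> 2 x 33 MHz = 66 native MIPS sustained
sustained_ipc(3)     # 3.0 -> the 99-MIPS rate, for a single cycle
```

Each window of 4 costs two cycles (3 then 1), so the sustained rate converges on 2 instructions per clock, exactly the "66 Native MIPS peak" conclusion.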

| How this will work out in
| "reality" who knows?  I'm looking forward to Specmarks for a 960CA Real
| System!

Since the 960CA is targeted for embedded control applications, and has
no MMU nor floating-point, I don't think you will ever see Specmarks
for it.  However, Intel released performance numbers for it at the
i960CA announcement.  The numbers were for a 33 MHz i960CA running
with 64KB of 15ns SRAM and 1MB of 4-cycle initial access, 3-cycle
subsequent access DRAM.  The SRAM was used for instruction memory and
the DRAM was used for data memory.  The benchmarks run were Dhrystone
1.1, Buffer Copy, "Travelling Salesman" solution by simulated
annealing, Pi (compute pi to 500 places), quicksort, bubblesort,
integer matrix multiply, CCITT image compression, and Bezier curve
calculation.

Intel compared its i960CA board running this benchmark suite with a
68030 (20MHz), an i960KA(20MHz), and an Am29000(16MHz) board.
However, the board they used to benchmark the Am29000 was not designed
for performance; rather, it was designed to test the functionality of
ADAPT (Advanced Development and Prototyping Tool) hardware debuggers.
To provide a more fair comparison, I requested the benchmark sources
from Intel, to run on a 30MHz Am29000 board (manufactured by YARC
Systems).  This board uses 2-way interleaved, 100ns DRAM memory for
instructions and 35ns SRAM for data.

I received sources for the non-proprietary benchmarks, compiled them
with the current version of the MetaWare HighC29k compiler, and ran
them on the YARC card.  Here are the final results:


     Absolute Performance

      benchmark     68030     960KA    Am29000    960CA
                    20MHz     20MHz     30MHz     33MHz
quicksort (ms)        286       135        51        50
bubblesort (ms)       291       180        65        85
pi-500 (ms)          6999      3510      1398      1624
anneal (ms)         37210     20910      8119      8388
matmult (us)       186552     74873     49135     26898
dhrystone 1.1        5484     14196     44876     41600


     Performance Relative to 68030 Board

      benchmark     68030     960KA    Am29000    960CA
                    20MHz     20MHz     30MHz     33MHz
quicksort            1.00      2.12      5.61      5.72
bubblesort           1.00      1.62      4.48      3.42
pi-500               1.00      1.99      5.01      4.31
anneal               1.00      1.78      4.58      4.44
matmult              1.00      2.49      3.80      6.94
dhrystone 1.1        1.00      2.59      8.18      7.59
-------------------------------------------------------
      geom mean      1.00      2.07      5.11      5.20



     Performance Normalized to 20MHz, Relative to 68030 Board

      benchmark     68030     960KA    Am29000    960CA
quicksort            1.00      2.12      3.74      3.47
bubblesort           1.00      1.62      2.98      2.07
pi-500               1.00      1.99      3.34      2.61
anneal               1.00      1.78      3.06      2.69
matmult              1.00      2.49      2.53      4.20
dhrystone 1.1        1.00      2.59      5.46      4.60
-------------------------------------------------------
      geom mean      1.00      2.07      3.41      3.15
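The "geom mean" rows above are geometric means of the per-benchmark ratios, the conventional way to average relative performance since the result does not depend on which machine is chosen as the baseline.  A quick check against one column (my own snippet; requires Python 3.8+ for `math.prod`):

```python
# Geometric mean of performance ratios, as in the "geom mean" rows.
# The list is the 960CA column of the "Relative to 68030 Board" table.

from math import prod

def geometric_mean(ratios):
    return prod(ratios) ** (1.0 / len(ratios))

i960ca = [5.72, 3.42, 4.31, 4.44, 6.94, 7.59]
round(geometric_mean(i960ca), 2)   # ~5.20, matching the table
```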


Thus, it would appear that the "66 Native MIPS", 33MHz i960CA
is about the same performance as the 20 Native MIPS (18
VAX-equivalent MIPS), 30MHz Am29000.




	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

scarter@gryphon.COM (Scott Carter) (11/21/89)

In article <3024@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <22303@gryphon.COM> scarter@gryphon.COM (Scott Carter) writes:
>
>>ISSUE one instruction per cycle.  The 960 CA can issue three instructions per
>>cycle to the chosen three of four execute units.  I believe Intel has figures
>>showing that on the average they could in fact issue two instructions per clock
>>_average_ [over what program set?], hence the 960CA can legitimately be called
>>66 Native MIPS average with 99 Native MIPS peak.  
>
>I think that's too optimistic.
>We've played some with an i860 on an evaluation board.
>The supplied compilers didn't attempt to issue more than 
>1 instruction/cycle (out of a max of three).  
>
>On a simple matrix multiply (single precision fp), 
>
>  multiplying 2 100x100 matrices took .52 seconds (3.8 MFlops)
>  multiplying 2 400x400 matrices took 86 seconds  (1.5 MFlops)
>
>versus a peak of 66 MFlops.  The poor performance on the larger
>size shows the effect of the small on-chip data cache.
>
>Using the VAST front-end, with hand coded vector primitives
>gives about 8.5 MFlops.
>
>Reworking by hand, being especially careful of the cache,
>gives about 26.5 MFlops, for either size.
>(This can be improved, but I think only slightly).
>This is fairly hot, though still not 66 MFlops.
>
>The challenge is getting compilers to take advantage of
>tiny caches and long pipelines and multi-instruction issue,
>as discovered below
>
>>>discovered that the R3000 was usually more than twice as fast on hand-coded
>>>programs, and overall was more than five times faster on compiled programs.
>
>Sounds like the MIPS compilers are more mature.  Certainly it's an
>easier target.
>
>Preston Briggs

[How did I wind up writing something which could be interpreted as
defending the 960?]

1) Thanks for the _Data_ on the 860.  It's on the order of what I would have
guessed - nice to have it confirmed by someone with actual knowledge :).

2) I'm not sure that any meaningful extrapolation can be made from the 860 to
the 960CA, given that their instruction parallelism mechanisms are utterly
different.  Comparison to something like the Super Titan (on integer codes)
would be rather more appropriate.

3) Agreed that comparisons on Real Programs (tm) [or at least Real Benchmarks
(tm?)] are the only thing to go from.  I merely pointed out that for Intel to
claim 66 Native Mips is not a priori any more illegitimate than most other
vendors' native MIPS claims.  Kudos to Mips for trying not to mention anything
other than Real Program numbers.

4) I would disagree about the Mips _Ada_ compiler being better than the
Intel/Biin 960 Ada compiler (agree wholeheartedly on C/Pascal/FORTRAN).  We
found that the performance ratio between the R3000 and the 960XA was much
wider on our own [somewhat larger than JIAWG] benchmarks in C, Pascal, and
FORTRAN than in Ada, either JIAWG or some other internal benchmarks.

5) Based on the code generated for the 960XA for the JIAWG benchmarks, I have
to say I can't believe in two instructions per clock for the 960CA on this
set (this is a GUESS only - any data I might have cannot be posted), but I
do think the 960 CA might well do twice as many useful instructions per clock ON
THIS BENCHMARK SET as an R3000, given what their Ada compilers generated.
Your mileage will undoubtedly vary.

6) If we need to express our religious loyalty, mine is with the R3000.

Scott Carter

mcg@mipon2.intel.com (Steven McGeady) (11/28/89)

In article <1358@bnr-rsc.UUCP>, schow@bcarh61.bnr.ca (Stanley T.H. Chow)
writes:

> Intel has gone the SuperScalar route. Their i960CA is said to be
> 66 MIPS @ 33 MHz. They have put the cleverness into multiple
> execution units.
> 
> Here is the $64,000 question:
> 
>    Which part is easier to integrate into a real system? 
> 
> Please note that we have concrete real examples here. Theoretical
> discussion is nice, but real data-points are more interesting.

Here is a "real data-point".  Heurikon Corp. (Madison,WI) is now selling
960CA boards with on-board SCSI, Ethernet (82596), 4Mb DRAM (near-zero
wait-state, i.e. 1-0-0-0 read, 0 ws write), multiple serial lines, VME
bus interface, VSB bus, and more for $2995 in quantity 100.  All this
fits on a standard (small, not Sun-sized) VME board.

> Other interesting question:
> 
>    Which system has a larger "domain" over which it actually
>    achieves quoted figures?

The 960CA is an embedded controller.  It contains 4-channel DMA,
dynamic, per-region bus sizing, sophisticated interrupt control, etc.
I would suspect that it would perform admirably in most embedded
applications.  The MIPS R6000 is a *system*.  It runs UNIX very well,
apparently.  The 960CA does not now and will never run UNIX, as it
lacks a memory management unit.

S. McGeady
Intel Corp.

mcg@ishark.Berkeley.EDU (Steven McGeady) (11/28/89)

In article <31329@winchester.mips.COM>, hawkes@mips.COM (John Hawkes) writes:

> The Atlantic Research Corporation, an independent group, has done some
> comparisons between the MIPS R3000 (25-MHz) and a 20-MHz 80960 executing Ada
> programs (the "Common Avionics Processor Ada Benchmark Suite"), and they
> discovered that the R3000 was usually more than twice as fast on hand-coded
> programs, and overall was more than five times faster on compiled programs.

The 20MHz 960 referred to here is the Military 80960MC part, *not* the
960CA. The 960MC hit silicon in 1985 and has not been upgraded since
then. ARC did not measure the 960CA, even though that would have been
a more representative measurement.  The part measured was running in a
PC/AT plug-in board.  The MIPS system it is being compared to is a full
system with a significantly-sized off-chip cache.

The 960CA would perform approximately 2x *faster* than the MIPS R3000
on the handcoded versions of the benchmarks.  For compiled code, if
the code were written in C, we would also perform approximately 2x
faster.  The code in question was compiled with a beta-release Ada
compiler available last spring.  Mr. Hawkes is doing the expected
in attempting to show MIPS' processor in the best light, but not in
Mr. Mashey's spirit of "full disclosure".  If people are more
interested in these tests, I will see how much information JIAWG will
allow to be released, and release it here.
 
S. McGeady
Intel Corp.

mcg@mipon2.intel.com (Steven McGeady) (11/28/89)

In article <28107@amdcad.AMD.COM>, tim@electron.amd.com (Tim Olson) writes:
> 
> In article <22303@gryphon.COM> scarter@gryphon.COM (Scott Carter) writes:
> | The 960 CA can issue three instructions per
> | cycle to the chosen three of four execute units.  I believe Intel has
> | figures showing that on the average they could in fact issue two
> | instructions per clock _average_ [over what program set?], hence the
> | 960CA can legitimately be called 66 Native MIPS average with 99 Native
> | MIPS peak.
> 
> The i960CA decoder can dispatch up to 3 instructions per cycle.
> However, the decoder looks at 4 instructions at a time, and it appears
> that the decoder cannot be loaded with the next set of 4 instructions
> until the current set of instructions have all been dispatched.

This is not correct.  The instruction decoder contains a rolling quad-word
window into which instructions are loaded (potentially) every cycle.
The reason that we do not claim 99 MIPS (none of our advertising claims
this number, to the best of my knowledge - those who have heard me speak
hear me say jokingly that we run at 99 MIPS for "one whole cycle") -
is that for three instructions to be dispatched, one must be a branch.
A branch requires that a non-next line of instructions from the i-cache
be loaded, and this is not accomplished at the full rate.

> Intel compared its i960CA board running this benchmark suite with a
> 68030 (20MHz), an i960KA(20MHz), and an Am29000(16MHz) board.
> However, the board they used to benchmark the Am29000 was not designed
> for performance; rather, it was designed to test the functionality of
> ADAPT (Advanced Development and Prototyping Tool) hardware debuggers.

This is an interesting piece of history re-invention.  Step Engineering,
the current manufacturer of the STEB board,  received the design of the
board from AMD (the board has an AMD copyright on it).  Apparently, the
board was designed this way because it is impossible to build a 29K
system using normal DRAMs and achieve better performance.  We attempted
to put faster RAMs in the STEB board, and to increase the clock speed to
20MHz, and neither worked.  We chose the STEB board not because it was
slow (even we didn't expect it to be so slow) but because it is the only
available board with a prototyping area on which we could add an SBX
connector to interface the graphics cards on which we displayed the
benchmark results.

> To provide a more fair comparison, I requested the benchmark sources
> from Intel, to run on a 30MHz Am29000 board (manufactured by YARC
> Systems).  This board uses 2-way interleaved, 100ns DRAM memory for
> instructions and 35ns SRAM for data.

This board contains separate Instruction and Data memory (using the
29k's Harvard bus), each of which is interleaved (according to published
data I've been able to find on the board).  The 30MHz 29k's are apparently
hand-sorted - we know of no volume shipments of these parts.
This board is in no way comparable in cost, parts-count, interface
complexity, or usability to the 960CA board that was used.

> I received sources for the non-proprietary benchmarks, compiled them
> with the current version of the MetaWare HighC29k compiler, and ran
> them on the YARC card.  Here are the final results:
> 
> [tables showing the 29k approximately at par with 960CA] 

We supplied Mr. Olson with the sources to these benchmarks, as an effort
to bring an end to the warring that has been going on over benchmarking.
In exchange for freely supplying these, Mr. Olson agreed that we would
be given the resulting source code back, along with a copy of the compiler
that produced it, prior to publication of the results.  Mr. Olson has
chosen to ignore those commitments and publish numbers without noting
what compiler was used, and without providing us (or anyone else - we also
supplied the benchmarks to Michael Sleator of Microprocessor Report)
with the ability to check their validity.

It should be noted that the 960CA benchmarks were compiled with the
current GNU GCC compiler, which does *no* instruction scheduling, and thus
fails to take advantage of the multiple-instruction issue capability of
the 960CA.  We have been working on an instruction-scheduling compiler,
but it is not available for release at this time.

The lesson that this has served to teach me, who argued with our marketing
department that we should release these benchmarks to AMD under the noted
restrictions, is that we were foolish to trust AMD's word regarding feedback
of the results from the benchmarks.  Thus, I place no trust in these
numbers presented as representing any kind of objective reality.
Furthermore, I have learned my lesson with regard to cooperating.

The benchmark wars will now most certainly be taken out of the hands of
technologists and be placed back in the hands of marketing departments.

I will reiterate here my advice to customers attempting to determine the
relative speed of the two processors:  run your own benchmarks on a board
with a memory system relevant to the design you plan to build.  The Yarc
board's memory design is an example of the most-expensive memory system
design that one can attach to the 29k - it bears no resemblance to what
can be expected with a combined I&D DRAM memory system, which is where
the only true comparison lies.  In short, don't believe AMD's benchmark
numbers, and don't believe ours.  Don't believe simulators, because AMD's
is well known for overstating performance.  Believe your own benchmarks.
And note that the STEB board is much closer to most embedded designs
than the Yarc board, and that the 960 is much more usable in the average
design than the 29k.

S. McGeady
Intel Corp.

mcg@mipon2.intel.com (Steven McGeady) (11/28/89)

In article <22514@gryphon.COM>, scarter@gryphon.COM (Scott Carter) writes:
>
> 2) I'm not sure that any meaningful extrapolation can be made from the 860 to
> the 960CA, given that their instruction parallelism mechanisms are utterly
> different.  Comparison to something like the Super Titan (on integer codes)
> would be rather more appropriate.

No meaningful comparison is useful here.  The 860 is a floating-point near-VLIW
processor, the 960 is an integer superscalar embedded processor.  The 860
achieves parallelism between floating-point and integer operations using
parallel pipelines, the 960 achieves parallelism between integer and memory
operations by using parallel instruction dispatch.

> claim 66 Native Mips is not a priori any more illegitimate than most other
> vendors' native MIPS claims.

In technical forums, I have always been careful to distinguish the cases where
the 960CA could be expected to run at this rate.

> 4) I would disagree about the Mips _Ada_ compiler being better than the
> Intel/Biin 960 Ada compiler (agree wholeheartedly on C/Pascal/FORTRAN). 

While the original MIPS/Verdix Ada compiler was not up to snuff with their
C technology, it was still reasonably good.  MIPS has released new numbers
(the ones that Mr. Hawkes referred to) based on a new release of their
compiler.

> We found that the performance ratio between the R3000 and the 960XA was much
> wider on our own [somewhat larger than JIAWG] benchmarks in C, Pascal, and
> FORTRAN than in Ada, either JIAWG or some other internal benchmarks.

As I mentioned in a previous article, this ignores the following facts:

	1) the 960MC/960XA is the original silicon generation of the 960
	   architecture, and is wholly unrelated to the 960CA -- you can
	   expect us to apply the CA's superscalar techniques to other levels
	   of the architecture, but we're not yet saying when;

	2) the benchmarks were run on systems that are in no way comparable:
	   a PC plug-in board (or possibly the execrable Multibus-I EXV board,
	   or the 16MHz BiiN systems), versus the MIPS systems with large
	   caches.

	3) The current compiler does not attempt any CA parallel-dispatch
	   optimizations.  The 960CA was released with working silicon, but
	   unfortunately, the compilers are a little behind.

> 5) Based on the code generated for the 960XA for the JIAWG benchmarks, I have
> to say I can't believe in two instructions per clock for the 960CA on this
> set (this is a GUESS only - any data I might have cannot be posted),

As stated in other articles, I would be astonished if you got a sustained rate
of two instructions per clock over the balance of a large benchmark.
Parallel instruction dispatch is much more complicated than this - the idea is
to reduce the overall latency of instructions.  I have noted several times that
we expect that parallel instruction dispatch will allow us to bring our
cycles per instruction down to very close to 1 instruction per clock in this
generation of chips, which is substantially better than most other architectures
when you consider that 960 code is 20-30% denser than comparable RISCs.
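The trade being described can be framed with the standard performance identity (my framing, not McGeady's; the numbers are purely illustrative):

```python
# Execution time = instruction count * CPI / clock rate, so a 20-30%
# denser encoding (fewer instructions for the same work) can offset a
# rival's lower CPI.  Illustrative figures only.

def run_time_s(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz

baseline = run_time_s(1_000_000, 1.0, 33e6)   # sparse encoding, CPI 1.0
denser   = run_time_s(  750_000, 1.2, 33e6)   # 25% fewer instructions
denser < baseline                              # density wins in this case
```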

> 6) If we need to express our religious loyalty, mine is with the R3000.

No surprise here - I'll leave my loyalty as an exercise to the reader.

S. McGeady
Intel Corp.

rogerk@mips.COM (Roger B.A. Klorese) (11/28/89)

In article <5277@omepd.UUCP> mcg@mipon2.intel.com (Steven McGeady) writes:
>we also
>supplied the benchmarks to Michael Sleator of Microprocessor Report

Michael Slater of Microprocessor Report is not Michael Sleator of
Stardent.
-- 
ROGER B.A. KLORESE      MIPS Computer Systems, Inc.      phone: +1 408 720-2939
928 E. Arques Ave.  Sunnyvale, CA  94086                        rogerk@mips.COM
{ames,decwrl,pyramid}!mips!rogerk
"I want to live where it's always Saturday."  -- Guadalcanal Diary

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (11/29/89)

In article <28107@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>
>To provide a more fair comparison, I requested the benchmark sources
>from Intel, to run on a 30MHz Am29000 board (manufactured by YARC
>Systems).  This board uses 2-way interleaved, 100ns DRAM memory for
>instructions and 35ns SRAM for data.
>
>	-- Tim Olson
>	Advanced Micro Devices
>	(tim@amd.com)

While your comparative results are very interesting, and useful as a point
of reference, I must throw in some words of caution.  That is, the individual
results are highly dependent upon the machine code generating efficiency of
the various compilers used.  In order to achieve real useful relative
comparisons of performance we must somehow demonstrate that the compilers
generate reasonably optimal (or 'typical', or equally degraded :-))
machine code for each benchmark.  I know for example the 68030 @ 25 MHz 
coupled with 'the right stuff' (hardware and compiler) can achieve roughly
10K (V1.1) Dhrystones/sec (as compared to the 5.5K result posted).

Al Aburto
aburto@marlin.nosc.mil

tim@nucleus.amd.com (Tim Olson) (11/30/89)

In article <1256@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
| While your comparative results are very interesting, and useful as a point
| of reference, I must throw in some words of caution.  That is, the individual
| results are highly dependent upon the machine code generating efficiency of
| the various compilers used.

Yes, your point is well taken.  However, it is usually quite hard to
isolate these kinds of things when benchmarking real systems; about all
you can say is that System X, using compiler Y, achieved these results on
benchmark Z.

| I know for example the 68030 @ 25 MHz 
| coupled with 'the right stuff' (hardware and compiler) can achieve roughly
| 10K (V1.1) Dhrystones/sec (as compared to the 5.5K result posted).

Right.  This is why I requested the benchmarks from Intel in the first
place.  They also showed the Am29000 running much slower than it
could, so I re-ran the benchmarks on a different board to present
corrected results.  I reported these, along with Intel's original
i960KA, i960CA, and 68030 results, but I didn't feel that it was my
place to also present new numbers for the 68030 -- I invite Motorola
to do so.

Our benchmark philosophy has been to include results for common
systems (e.g. VAX 11/780 running 4.3bsd, and Sun 3/60), for two reasons:

	1) The results can be easily verified by 3rd parties

	2) More people have direct experience with these systems and
	   have a good feel for their performance levels.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mash@mips.COM (John Mashey) (12/01/89)

In article <5275@omepd.UUCP> mcg@ishark.Berkeley.EDU (Steven McGeady) writes:
>
>In article <31329@winchester.mips.COM>, hawkes@mips.COM (John Hawkes) writes:
>
>> The Atlantic Research Corporation, an independent group, has done some
>> comparisons between the MIPS R3000 (25-MHz) and a 20-MHz 80960 executing Ada
>> programs (the "Common Avionics Processor Ada Benchmark Suite"), and they
>> discovered that the R3000 was usually more than twice as fast on hand-coded
>> programs, and overall was more than five times faster on compiled programs.

>The 20MHz 960 referred to here is the Military 80960MC part, *not* the
>960CA. The 960MC hit silicon in 1985 and has not been upgraded since
>then. ARC did not measure the 960CA, even though that would have been
>a more representative measurement.  The part measured was running in a
>PC/AT plug-in board.  The MIPS system it is being compared to is a full
>system with a significantly-sized off-chip cache.
>
>The 960CA would perform approximately 2x *faster* than the MIPS R3000
>on the handcoded versions of the benchmarks.  For compiled code, if
>the code were written in C, we would also perform approximately 2x
>faster.  The code in question was compiled with a beta-release Ada
>compiler available last spring.  Mr. Hawkes is doing the expected
>in attempting to show MIPS' processor in the best light, but not in
>Mr. Mashey's spirit of "full disclosure".  If people are more
>interested in these tests, I will see how much information JIAWG will
>allow to be released, and release it here.

Well, I'm not sure any of this means a whole lot.  The tests were
done in April, so 960CA's weren't available, and of course MCs and CAs
are extremely different (people sometimes get confused by the variations
in Intel nomenclature of versions :-).  John cited one of the few results known
to be available, as such results are not the easiest things to come by.
Also, in the spirit of "full disclosure", note that they may have later
run it on a 25MHz R3000, but the report I saw used a 16.7MHz R2000 in
an M/120.  The 960 was an EVA-960KB board, including 1MB of DRAM,
and 64KB of 35ns SRAM.  "All benchmarks were run out of SRAM".
Note, of course, that for almost any chip, a single-board-computer
design is FASTER than a larger/expandable design with multiple boards,
because you usually can build a tighter memory system.
Thus, the fact that something is a plugin board to a PC is irrelevant:
the bigger system has to bear performance burdens that a plugin board does not,
and having everything in SRAM is likely to be faster than having SRAM cache
in front of DRAM.
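The SRAM-vs-cached-DRAM point can be sketched numerically (a hypothetical illustration: only the 35ns SRAM figure comes from the text above; the 95% hit rate and 200ns DRAM miss cost are made-up assumptions, not measurements of either board):

```python
# Average memory access time is a weighted mix of hit and miss costs.
# All numbers here are illustrative assumptions, not measurements.
def avg_access_ns(hit_rate, t_hit_ns, t_miss_ns):
    return hit_rate * t_hit_ns + (1.0 - hit_rate) * t_miss_ns

all_sram = avg_access_ns(1.0, 35.0, 35.0)    # everything in 35ns SRAM
cached   = avg_access_ns(0.95, 35.0, 200.0)  # SRAM cache over ~200ns DRAM
print(all_sram, cached)                      # 35.0 vs 43.25
```

Even with a quite good hit rate, a cache in front of slow DRAM loses to running flat-out of SRAM, which is why the plugin-board configuration bears no handicap.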

The benchmarks I saw included some floating-point, which I suspect would
not have pleased the 960CA.....

Of course, it is hard to say much about any of this, as the ones I saw were all
small benchmarks anyway, hence taken with a grain of salt.

This certainly relates to the discussion at Microprocessor Forum
regarding the difficulties of benchmarking embedded systems,
i.e., doing UNIX ones is bad enough, but the embedded world is really
a zoo by comparison!  I certainly expect the 960CA to be noticeably faster.
I also note that even when people are trying hard to do sensible and fair
benchmarking, it's easy to say things that are subject to argument.
(At the M-F, there was an interesting sequence between the i960 & AMD 29K
crews that illustrated this, especially with regard to evaluation boards.)
Anyway, peace.

Finally, it probably doesn't matter a whole bunch, at least in terms of
what fired all of this 960 vs R3000 stuff up in the first place.
As **most** people interested in this area know, the i960 and MIPS were chosen
last summer as the 2 architectures of choice by the JIAWG committee,
(after a "spirited" competition I believe).  For various reasons having
nothing to do with computer architecture, these two choices have tended
to ripple through the defense community as 32-bit military RISC standards,
which is why, of course, the JIAWG battle was pretty hard-fought in the
first place.

I'll soon post an analysis of a related effort [the SAE committee's
recommendations], as an example relevant to the difficulty of embedded
evaluations, and also relevant to one of my favored topics:
interpretation and (mis)interpretation of data.  In preparation for
that you might want to read the Suntech Journal, Autumn issue, page ST8,
"SPARC Scores in DARPA/SAE Architecture Test", which reminded me of the SAE.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (12/01/89)

This note:
	1) Analyzes the Society of Automotive Engineers (SAE)'s final report
"FINAL REPORT, 32 BIT COMMERCIAL ISA TASK GROUP, AS-5, SAE" .....
which came out in September or October, I think(?).
	2) Discusses an article representing the results of that report.

The objective was: "the 32 Bit Commercial ISA Task Group was established
to evaluate suitability of existing commercial architectures for use as
general purpose processors in avionic and other embedded applications"

The approach was to request applications from any vendor who wanted to
propose things, and they got AMD29K, Intergraph Clipper, MIPS R3000,
NS32000, Sun SPARC, and Zilog Z80000.  "A set of criteria were established
and relative weights set."  This was split into:
	60%: functionality of the instruction sets (general)
	20%: capabilities of the current implementation
	20%: performance

What this means is that there were a bunch of criteria, with points assigned
by discussion of the committee, i.e., there could be 10 points for some
section, and chips might be given anywhere from 2 to 8 points,
then normalized to the maximum found, that is, the one with 8 would get
1 point, and the one with 2 would get 2/8 = .25.  Totals were:

"Results:
		29000	R3000	32532	SPARC
General		42.88	40.12	42.56	43.40
Current		10.89	13.52	13.65	13.86
Performance	 4.90	14.50	10.92	16.00
Total:		58.67	68.14	67.14	73.26

Observations:
The most significant point of the results is the very small spread of the
point values."
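The normalize-to-the-maximum scheme described above can be sketched in a few lines (a sketch only; "chip_a"/"chip_b" and the 8-versus-2 points are the hypothetical 10-point example from the text, not actual committee data):

```python
# Each criterion's raw committee points are normalized to the best score
# awarded for that criterion, so the top chip gets 1.0 and the rest get
# a fraction of it; category totals are then sums of these fractions.
def normalize(raw):
    best = max(raw.values())
    return {chip: pts / best for chip, pts in raw.items()}

# The example from the text: out of 10 possible points, one chip was
# given 8 and another 2.
print(normalize({"chip_a": 8, "chip_b": 2}))
# chip_a -> 1.0, chip_b -> 2/8 = 0.25
```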

They go on to note that AMD didn't have an Ada compiler
available at the time, and so got zapped on performance.  They also note
that they scaled up the scores for MIPS and SPARC because faster
chips became available than what had been benchmarked.  They noted the
difficulty of establishing objective criteria, saying:
	"To this end, four meetings and the intervening months were devoted
	to establishing the criteria against which the ISAs would be
	evaluated.  As in any other venture, if we were to start over, we
	would probably produce a somewhat different set of criteria, with
	results that might be more valuable in their ability to differentiate
	between the ISAs....It was also noted that when actual evaluation
	was started, the meaning of several of the criteria were obscure
	and had to be clarified.
Conclusions:
Since these ISAs, and their implementations, are competing in the market
place, it is not surprising that none of the ISAs were exceptionally better or
worse than any of the others...Due to there not being a typical application,
it is not possible to make a definitive general recommendation.  In general,
any of the ISAs will serve well.  Given a specific application, with its
own priorities and constraints, one of the implementations will probably
serve that purpose better than another."

*************************
Thus, the outcome of the study, clearly stated, was:
	a) It's hard to create objective criteria.
	b) They cannot make any definitive recommendations of one over another.
*************************

The next section gives the various details of rating points, for the
first two categories.
These were done by consensus scoring of features.  For example,
"Support for cache coherency
	AMD 29000	2
	MIPS R3000	5
	National 32000	2
	Sun SPARC	8"
(There are pages of such things; some of the numbers make sense, some
are inexplicable to me, but that's OK. This particular one is somewhat
inexplicable... Some of the ratings directly contradict the findings
of people like JMI, whose C Executive runs on many micros, and which
MEASURED things like interrupt-handling and context switching,
rather than consensus-estimating them.)
Under "Current implementations", there were good things like:
"How many compatible performance variations are available?
	AMD29000	1
	MIPS R3000	3
	National 32000	5
	Sun SPARC	5"
(Interesting: it doesn't matter whether an implementation covers
a wide range of performance, what counts is the number of different ones.
Note that the .4 difference (5/5 - 3/5) accounts for more than the full
difference in the final ratings for this section.....)

Finally, we come to the benchmark section, which contains additional
ratings of the type above, plus one section for actual benchmarks.
Sun SPARC is given 50 points (24.5 mips), and the R3000 39.1 (19.15 mips).

I deleted the NS32532 column for space reasons, and added the data column
at the right (which is the Ada compiler with -O, whose results were
available in May 1989 and posted shortly thereafter (I think) on the JIAWG
bboard by the TI folks).

	The benchmarks total 2200 Lines Of Code (Ada),
and are a mixture of integer and floating point, as follows:

bin_clst	binning & clustering: 135 LOC, integer
boomult	multiplies boolean matrices together, 102 LOC
des1	encryption, 346 LOC
dig_fil	64-bit FFT, 647 LOC
eightqueens	integer, 98 LOC
finite2	char->float conversions, 165 LOC
flmult	float matrix multiplication, 106 LOC
inmult	integer matrix mult, 81 LOC
kalman	flt/integer, matrices, 324 LOC
shell	shell sort, 52 LOC, integer
substrsrch	substring text search, 103 LOC

Now, here is the data presented in the report, plus my addition of
the last column:

                   VAX 11/780   VAX 11/785   R3000          SPARC    R3000 -O
                   DEC          DEC          MIPS Inc.      SUN      MIPS
                                             25 MHZ         25MHz    25MHz
Times in milliseconds, followed by results in MIPS,
normalized to VAX 11/780 = 1 (Note 3)

bin_clst	        0.51        0.48          0.05        0.08     0.04
boomult               981         658           246          49.99    29
des1                  160         111                        13.33
dig_fil            111000        2830            70         106.66    55
eightqueens            30          21             1.58        1.65     1.29
finite2                12           9             0.70        0.71      .60
flmult                765         429            81          65       24
inmult                789         495           104                   53
kalman                480         330            57          51.66    27
shell                   5           3.1           0.48        0.47      .31
substrsrch             12           9             0.65        0.55      .35


bin_clst                1.00        1.06         10.20        6.38    20.00
boomult                 1.00        1.49          3.99       19.62    33.80
des1                    1.00        1.44                     12.00
dig_fil (note 3)        0.03        1.00         40.43       26.53    51.5
eightqueens             1.00        1.43         18.99       18.18    23.25
finite2                 1.00        1.33         17.14       16.90    20.00
flmult                  1.00        1.78          9.44       11.77    31.87
inmult                  1.00        1.59          7.59                14.89
kalman                  1.00        1.45          8.42        9.29    17.78
shell                   1.00        1.61         10.42       10.64    16.13
substrsrch              1.00        1.33         18.46       21.82    34.28

Average                 0.91        1.41         14.51       15.31    26.35

Average for 33MHz R-3000 and 40MHz SPARC	 19.15       25.16

Note 3) dig-fil results are normalize (sic) to VAX 11/785 results.
Data sources:
VAX results provided by JIAWG/WPAFB
R3000 results provided by TI
SPARC results provided by Sun
-------------------------------------------------------
-------------------------------------------------------

Now, here's a good exercise for the reader: what do you believe from
the data above?  What conclusions can you draw, and why?
What problems might there be?

1. The benchmarks are very short: remember the times are in milliseconds,
that is, numbers as low as 40 microseconds are listed.
	=> benchmarks should be longer

2. There are holes in the data.  The des1 entry for MIPS is missing
(there was an obscure bug in the Ada front-end at that point).
The inmult benchmark for Sun was missing, for reasons I don't know.
It is very difficult to compute averages when data are missing, because
some benchmarks are tougher than others, and if your best or worst benchmark
gets left out, it can affect the results. (This is why it's so nice to
have the SPEC benchmarks: it was always a pain getting a complete set of
numbers for the MIPS Performance Brief.)
	=> delete the rows that have missing data.

3. The average is an arithmetic average, NOT a geometric mean.
(Geometric mean is a better measure for analyzing ratios.)
	=> use geometric mean for averaging ratios.
Also, one of the datapoints is normalized differently (to a 785).

4. If you compute the Geometric means, having deleted the two rows that are
missing data, you get: MIPS: 12.63, SPARC: 14.36, MIPS (opt): 25.8.

5. Just scaling up clock rates is meaningless; computers don't work
that way, because the memory system matters too.  But suppose you give SPARC
a 40MHz clock rate anyway: that gets its geometric mean to 14.36x40/25 = 22.98,
i.e., still not as fast as the optimized MIPS numbers at 25MHz....
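Points 3-5 can be reproduced directly from the chart above (a sketch: the ratios are copied from the table, with the des1 and inmult rows deleted as per point 2):

```python
import math

# Ratios to the VAX 11/780, from the chart above, for the nine
# benchmarks with complete data (des1 and inmult rows deleted).
mips     = [10.20, 3.99, 40.43, 18.99, 17.14, 9.44, 8.42, 10.42, 18.46]
sparc    = [6.38, 19.62, 26.53, 18.18, 16.90, 11.77, 9.29, 10.64, 21.82]
mips_opt = [20.00, 33.80, 51.5, 23.25, 20.00, 31.87, 17.78, 16.13, 34.28]

def geomean(ratios):
    # exp of the mean of the logs: the right average for ratios, since a
    # benchmark that runs 2x faster and one 2x slower then cancel out.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(geomean(mips), geomean(sparc), geomean(mips_opt))
# roughly 12.63, 14.36, and 25.8 -- the figures in point 4

# Point 5's naive clock scaling for a 40MHz SPARC:
print(geomean(sparc) * 40 / 25)   # roughly 22.98, below the 25.8 above
```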

6. Of course, the variance of all this data is pretty high: with 9 data
points used, the 95% confidence intervals for the three are:
	MIPS:	[7, 23.5]
	SPARC: [10.6, 20.7]
	MIPS -O: [18.9, 36.4]
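Intervals like the first two can be reproduced (a sketch, assuming an ordinary two-sided t interval on the nine arithmetic ratios, with critical value 2.306 for 8 degrees of freedom; the report's exact method is not stated):

```python
import math
from statistics import mean, stdev

# Ratios to the VAX 11/780 for the nine complete benchmarks, from the chart.
mips  = [10.20, 3.99, 40.43, 18.99, 17.14, 9.44, 8.42, 10.42, 18.46]
sparc = [6.38, 19.62, 26.53, 18.18, 16.90, 11.77, 9.29, 10.64, 21.82]

def ci95(ratios, t_crit=2.306):   # t critical value for n-1 = 8 dof
    m = mean(ratios)              # stdev() below is the sample std dev
    half = t_crit * stdev(ratios) / math.sqrt(len(ratios))
    return (m - half, m + half)

print(ci95(mips))    # roughly (7.0, 23.5)
print(ci95(sparc))   # roughly (10.6, 20.7)
```

The intervals overlap heavily, which is the whole point: with nine small benchmarks, the MIPS/SPARC difference is inside the noise.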

Anyway, this is why the committee carefully said that the overall data
didn't mean very much.  Of course, the committee report came out AFTER
the JIAWG decision was made [i.e., it was irrelevant to that],
and this report explicitly did NOT recommend anything as the architecture
for military projects.

Lessons:
1) It's hard to evaluate things on paper.  I think the committee tried
hard, in a really difficult job, but it's real hard...

2) It's always a good idea to look behind the summaries a bit.

3) It's important to understand the difference between numbers that
mean something, and numbers that don't.  The committee did understand
that there was insufficient difference to prove anything.

Now, everyone interprets data a bit differently.  Just for fun, let's look
at how Frank Yien and Scott Thorpe of Sun interpreted this, in
SunTech Journal, Autumn 1989, page ST8, in the article called:
	"SPARC Scores In DARPA/SAE Architecture Test"

(THERE'S BEEN PLENTY OF DATA; NOW WE GET SOME "MARKETING" ANALYSIS;
QUIT NOW IF YOU DON'T LIKE THAT STUFF. I INCLUDE THIS BECAUSE I'VE
ALREADY GOTTEN QUESTIONS FROM PEOPLE ABOUT IT, AND THE ARTICLE HAS
APPARENTLY BEEN GIVEN TO PEOPLE ABROAD TO PROVE THAT SPARC WAS SOMEHOW
A U.S. RECOMMENDED STANDARD....)

The article leads off with:

"In a recent comparison of leading 32-bit architectures by DARPA (the Defense
Advanced Research Projects Agency), the SPARC architecture was ranked
as the top processor architecture for use in military projects."
	Well, it had the highest numbers, but they weren't significant,
	and the committee said so.
	Of course, it didn't matter much anyway, because the key
	decisions were being made somewhere else, and the choices elsewhere
	[MIPS & Intel] reflected what the large contractors decided in
	doing serious evaluations.

"Finally, SPARC won the benchmark category, without using the most powerful
SPARC implementations available from SPARC manufacturers today.  The 80-MHz
ECL SPARC implementation was not used in these comparisons;"
	Of course it wasn't; the embedded avionics market is not excited by 
	ECL, and Sun didn't have an ECL system for them to benchmark anyway.
	So what does ECL SPARC have to do with it?
"instead, the 40-MHz CMOS SPARC implementation was benchmarked and still
won easily, since the others have only 33-Mhz chips."
	They didn't benchmark a 40-MHz implementation, they benchmarked
	a 25MHz one and then multiplied by 40/25.  Note that no 40MHz SPARC
	SYSTEM has yet been announced, much less delivered.
	It didn't win easily, it won barely @ 25MHz, and if they had reported
	the correspondingly-optimized MIPS numbers, a 40MHz SPARC (not yet
	delivered in system) is seen from the chart above
	to be SLOWER than a 25MHz R3000 [slower on the average, and slower on
	8 out of the 11 benchmarks, the only exceptions being
	eightqueens, finite2, and shell, hardly the larger/more realistic
	tests].
"note that military benchmarks are very demanding and closely resemble
compute-intensive engineering/simulation environments."
	Military benchmarks can be demanding all right, but:
	some of these are very small benchmarks.  Some of these benchmarks
	are realistic, and some are pretty small; none have any real-time
	component that I could see.  If you believe there's a correlation
	between these benchmarks and engineering ones, that's good,
	because MIPS is faster.  If you don't believe there's much
	correlation, that's fine too....

"SPARC is winning the technology battle: It is the frequency leader in both
CMOS and ECL technologies and ranks first in independent tests.  SPARC
hardware and software vendors are well positioned for the future."

	Well, each to their own.... Note that the real war for the 32-bit
	RISC embedded defense standard seems to have 2 winners, and SPARC
	wasn't one of them.... It's possible that some people missed this,
	although it sure made the defense magazines...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

khb@chiba.Sun.COM (Keith Bierman - SPD Advanced Languages) (12/01/89)

I must confess to not having followed the SAE stuff closely (I must
have spent a good minute or so glancing at the suntech blurb until now :>).
Thanks for bringing this article to my attention. When I get back to
work (taking next week off :> :>) perhaps I can coax someone to loan
me their copy of the report itself.

The Right Honorable J.M. sez:

>This note:
>1) Analyzes the Society of Automotive Engineers (SAE)'s final report
>"FINAL REPORT, 32 BIT COMMERCIAL ISA TASK GROUP, AS-5, SAE" .....

>The objective was: "the 32 Bit Commercial ISA Task Group was established
>to evaluate suitability of existing commercial architectures for use as
>general purpose processors in avionic and other embedded
>applications"

Does anyone happen to know how/when/why SAE became the arbiter of avionic
applications ?

>The approach was to request applications from any vendor who wanted to
>propose things, and they got AMD29K, Intergraph Clipper, MIPS R3000,
>NS32000, Sun SPARC, and Zilog Z80000.  "A set of criteria were
> established and relative weights set."  This was split into:
>	60%: functionality of the instruction sets (general)
>	20%: capabilities of the current implementation
>	20%: performance
        ^^^

So while we performance weenies can carp about their benchmarking
techniques and/or data reduction techniques (What no SVD, covariance and
sensitivity analyses ? :>) the results of the benchmark section would
appear to be of much less entertainment value than their evaluation of
the functionality of the instruction sets (I thought all the machines
were Turing complete .... so this must have been the area the group
spent the most time locked in discussion).


>"Results:
>         		29000	R3000	32532	SPARC
>General		42.88	40.12	42.56	43.40
        ----------------------------------------------

>The most significant point of the results is the very small spread of the
>point values."

Agreed, the difference between the high (SPARC) and the low (MIPS) is
only 7.6%. But as this section is pure gedanken study, the rationale
employed is probably of great interest to this group. If someone has
the report handy and is a good typist, posting it would be a service
(more entertaining than yet another posting of xstones, or 8queens
:>). 

Since this was a gedanken study there is NO measurement error, or other
source of statistical noise, so there is little justification for using
the same statistical tools we use for benchmarking activities.

It is not surprising that this section was counted more heavily than
the other two, as the folks who build missiles, planes, and spacecraft
are more concerned about long term issues than the hot chip of the week
(galileo, for instance, is 1802 based ... and it is possibly the most
complex deep space probe yet flown).


>(There are pages of such things; some of the numbers make sense, some
>are inexplicable to me, but that's OK. This particular one is somewhat
>inexplicable... Some of the ratings directly contradict the findings
>of people like JMI, whose C Executive runs on many micros, and which
>MEASURED things like interrupt-handling and context switching,
^^^^^^^^

Ah, "data". If you can measure it with a stop watch, it is part of
"performance" or "current implementation". Without pondering their
report long and deep I can't begin to second-guess them; but mixing in
a gedanken study whose intent is to crystal-ball gaze makes sense: it
takes a lot of years to develop an embedded system, so one tries very,
very hard to pick the technology that will be ripe when you are ...
typically 5 to 10 years down the road (at least where I came from).

>Under "Current implementations", there were good things like:
>"How many compatible performance variations are available?
	AMD29000	1
	MIPS R3000	3
	National 32000	5
	Sun SPARC	5"
>(Interesting: it doesn't matter whether an implementation covers
>a wide range of performance, what counts is the number of different
>ones.

yep. If I want to guess what will really be on the shelf in 10 years,
I want the one with the most suppliers ... one is likely to have stuck
around. 1 is a really bad number (if for no other reason, it usually
adds several inches of paper to the documents which must be approved
for your project to fly).... there are of course other reasons.

>I deleted the NS32532 column for space reasons, and added the data column
>at the right (which was the Ada compiler, -O, and whose results were
>available May 1989 and posted shortly thereafter (I think) on the JIAWG
>bboard by the TI folks.

One presumes they left out optimized results for some bizarre reason
of their own. I have played with both the Verdix and Telesoft SPARC
Ada compilers and both come with optimizers.

....perf figures and analysis
....

>  Of course, the committee report came out AFTER
>the JIAWG decision was made [i.e., it was irrelevant to that],
>and this report explicitly did NOT recommend anything as the architecture
>for military projects.

There are non-military government embedded projects. There are
non-JIAWG projects (at least there were when I lived in that universe). As I
recall it is rare for such committees to ever come out and say BUY IBM
or anything like that :> They come up with fancy numeric ranking
schemes to shield themselves from anything that tacky. Also a lot of
such projects end up with close figures .... most readers (and writers
in that world) rely on the ranking (at least they used to). 

>Lessons:
>1) It's hard to evaluate things on paper.  I think the committee tried
>hard, in a really difficult job, but it's real hard...

But real necessary. Long term projects require long term thinking.

>2) It's always a good idea to look behind the summaries a bit.

And at the background of the organization(s) involved, past
recommendations, projects which relied on or ignored the
recommendations (implicit as well as explicit) and how they turned out
(including funding battles, etc.)

>3) It's important to understand the difference between numbers that
>mean something, and numbers that don't.  The committee did understand
>that there was insufficient difference to prove anything.

Perhaps. I don't know the SAE (other than as the folks who along with
God and Honda tell me what oil to put into my motorcycles). One must
know how to read behind the words.

>Now, everyone interprets data a bit differently,  Just for fun, let's look
>at how Frank Yien and Scott Thorpe of Sun interpreted this, in
>SunTech Journal, Autumn 1989, page ST8, in the article called:
>	"SPARC Scores In DARPA/SAE Architecture Test"

>(THERE'S BEEN PLENTY OF DATA; NOW WE GET SOME "MARKETING" ANALYSIS;

Data ? 

The SAE chose a 60-40 split of "that which cannot be measured with a
stopwatch or ruler" vs. "lets take our places and do the 100-nanosec
dash". Taking the resulting numbers and renormalizing, computing means,
etc. isn't data. It's analysis. Since it's not being done to
engineer a product, or to elucidate the logic of the report, it's all
been "marketing" analysis. Very interesting and entertaining mind you,
but this warning is a bit late. It is a very nice rhetorical move though.


>The article leads off with:

>"In a recent comparison of leading 32-bit architectures by DARPA (the Defense
>Advanced Research Projects Agency), the SPARC architecture was ranked
>as the top processor architecture for use in military projects."
>	Well, it had the highest numbers, but they weren't significant,
>	and the committee said so.

How many reports of this nature say otherwise ? As far as I know, it's
a standard disclaimer like "your mileage may vary".

>	Of course, it didn't matter much anyway, because the key
>	decisions were being made somewhere else, and the choices elsewhere
>	[MIPS & Intel] reflected what the large contractors decided in
>	doing serious evaluations.

Perhaps, perhaps not....

Excerpted from a press release dated Nov 27

 SPEC To Develop Chip Set

SPEC has been contracted by NASA to develop a high-performance GaAs
RISC processor to demonstrate the inherent speed and radiation-hardness
advantages of GaAs.  Multiple GaAs SPARC processors will be included in
a demonstration board that SPEC is building to look at GaAs
capabilities.  The board will include four to eight GaAs SPARC
processors,  GaAs array communications coprocessors and GaAs floating
point coprocessors.

Under the agreement with SPEC, Sun can license SPEC's GaAs-based SPARC
design with the option to have it manufactured by one of the six
semiconductor vendors now manufacturing SPARC microprocessors.  Initial
samples of the GaAs SPARC processor will be available late in 1990.

Note that: this isn't the benchmarking group that ee-times, sun, mips,
hp, et al. set up. Instead it is Systems & Processes Engineering
Corporation (SPEC), which provides systems engineering services and
manufactured products to the aerospace industry, international and
U.S. commercial business, and to government agencies.  Located in
Austin, Tex., SPEC is a privately owned company. This brings up a
question ... did the SPEC benchmarking group do a name search ?


The array communications coprocessor is a GaAs implementation of a
proprietary SPEC inter-processor communications architecture.  The
coprocessor provides a tightly coupled message/data passing interface
between processors in a multi-processor computer system.  The floating
point coprocessor supports 32-bit and 64-bit operations in a highly
pipelined mode with a peak throughput of one floating point operation
per cycle.  These three components make a complete chip set that SPEC
will incorporate into board-level products for the commercial
marketplace.

"These are the building blocks necessary to build the high-performance
systems of the future.  Single- and multiple-processor GaAs
workstations will form the high end of performance in the 1990s.  Other
technologies will not be able to approach the performance of GaAs,"
said SPEC President Randolph E. Noster.

According to SPEC's chief scientist, Dr. Gary B. McMillian, the GaAs
SPARC processor and coprocessors are being designed to operate at 200
MHz, with performance at 800 to 1600 MIPS in a four- to eight-processor
implementation.  SPEC plans to use a VME/FutureBus implementation,
which will provide enough bus bandwidth to support the multiple,
high-speed processors.


>"Finally, SPARC won the benchmark category, without using the most
>powerful >SPARC implementations available from SPARC manufacturers
>today.  The 80-MHz >ECL SPARC implementation was not used in these
>comparisons;" 
>	Of course it wasn't; the embedded avionics market is not
>	excited by ECL, and Sun didn't have an ECL system for them to
>	benchmark anyway. 

Not necessarily true. Gould/Encore, Elxsi and others have participated
in the big iron/high powered embedded system marketplace. And, as the
clipping I included above notes, GaAs is of more than passing
interest (rad hard, widely used in some battlefield stuff) in some
circles. Late breaking events in what used to be the USSR may make
some of the high performance/rad hard research less important ... or
perhaps it will free up resources to do more serious space
research.... but as late as Monday some folks still thought they had
funding for such projects.

While I am certainly NOT prepared to say sun has an ECL machine, I
find it interesting that John is so positive that we don't. Some folks use
the old algorithm "announce, take orders, design, ship, test" but it
is increasingly dangerous to rely on it. I'd be willing to bet that,
for instance, IBM will be able to ship its RIOS box close to whatever
date IBM sez it can after announcement. 

..... misc hype from the suntech article <omitted>

	Well, each to their own.... Note that the real war for the 32-bit
	RISC embedded defense standard seems to have 2 winners...

Sometimes battles last longer than one round. 


Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

And in this case, my boss probably thinks I'm home asleep, not wasting
valuable computer cycles on the net. I may not even be speaking for
me.... I meant to go home hours ago.

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

        Nor should there be.
	 --khb


henry@utzoo.uucp (Henry Spencer) (12/05/89)

In article <128680@sun.Eng.Sun.COM> khb@chiba.Sun.COM (Keith Bierman - SPD Advanced Languages) writes:
>(galileo, for instance, is 1802 based ... and it is possibly the most
>complex deep space probe yet flown).

It's also a twenty-year-old design built ten years ago.  Galileo has waited
a *long* time to fly, due to an excruciating series of problems with launch
vehicles and upper stages.  (In some ways this is a good thing, because a
major design defect in Galileo's thrusters was discovered less than a year
ago...!)  It is definitely the most complex deep-space mission yet flown,
but is not representative of technology that would be used today.
-- 
Mars can wait:  we've barely   |     Henry Spencer at U of Toronto Zoology
started exploring the Moon.    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

brooks@maddog.llnl.gov (Eugene Brooks) (12/05/89)

In article <1989Dec4.171505.22203@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>It's also a twenty-year-old design built ten years ago.  Galileo has waited
>a *long* time to fly, due to an excruciating series of problems with launch
>vehicles and upper stages.  (In some ways this is a good thing, because a
>major design defect in Galileo's thrusters was discovered less than a year
>ago...!)  It is definitely the most complex deep-space mission yet flown,
>but is not representative of technology that would be used today.
You mean not representative of technology that would be used in a
mission designed today, built 10 years from now, and flown 20 years from now.
Technology changes, but the way such missions are arranged and delayed does
not...  Please note the lack of a smilie...

brooks@maddog.llnl.gov, brooks@maddog.uucp

cooper@hpsrad.enet.dec.com (g.d.cooper in the shadowlands) (12/06/89)

In article <40547@lll-winken.LLNL.GOV>, brooks@maddog.llnl.gov (Eugene Brooks) writes...

>In article <1989Dec4.171505.22203@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>It's also a twenty-year-old design built ten years ago.  Galileo has waited
>>a *long* time to fly, due to an excruciating series of problems with launch
>>vehicles and upper stages.  (In some ways this is a good thing, because a
>>major design defect in Galileo's thrusters was discovered less than a year
>>ago...!)  It is definitely the most complex deep-space mission yet flown,
>>but is not representative of technology that would be used today.

This reminds me of a similar anecdote about the Apollo program, if I
remember correctly.  All of the logic was RTL, and they couldn't fit
sufficient gates into the design, which is why Armstrong had to
pilot the lunar lander manually.  The original idea was for a
computer-controlled landing.

>You mean not representative of technology that would be used in a
>mission designed today, built 10 years from now, and flown 20 years from now.
>Technology changes, but the way such missions are arranged and delayed does
>not...  Please note the lack of a smilie...

And they could have used TTL by the time Apollo was ready to go, but it
would have required a total redesign of all of the electronics and n
billion $s.

As a side note, I believe that NASA was the last large scale user of
RTL components.

			 Can you say archaic,

				shades
============================================================================
| He paid too high a price for living | Geoffrey D. Cooper                 | 
| too long with a single dream.....   | cooper@hpsrad.enet.dec.com	   |
|-------------------------------------| business (508) 467-3678            |
| decwrl!hpsrad.enet.dec.com!cooper   | home (617) 925-1099                |
============================================================================
Note: I'm a consultant.  My opinions are *MY* opinions.

khb@chiba.Sun.COM (chiba) (12/12/89)

In article <603@ryn.esg.dec.com> cooper@hpsrad.enet.dec.com (g.d.cooper in the shadowlands) writes:

>>> misc comments on the state of NASA tech, from misc. folks

>This reminds me of a similar anecdote about the Apollo program, if I
>remember correctly.  All of the logic was RTL and they couldn't fit...
>
....

>
>And they could have used TTL by the time Apollo was ready to go but it
>would have required a total redesign of all of the electronics and n
>billion $s.
>
>As a side note, I believe that NASA was the last large scale user of
>RTL components.
>			 Can you say archaic,
>

Also quite reliable. In the lab we can use all sorts of new toys. By
First Customer Shipment one expects the worst defects to be known and
fixed. A couple of years later the vendor makes a new widget and the
old one goes to the new guy/gal on the block. If there is an odd
failure mode that takes 5 years to show up, it's not a problem.

It takes a considerable amount of calendar time to make it out to Deep
Space (say Jupiter ... which is closer than lots of other nice places
to visit). VGR has done real well; a few hw failures, but nothing
which caused the mission to fail. 

It can be argued that it would be better to build cheaper spacecraft
quicker, and launch lots. But unless and until we can field waves of
spacecraft, each one has to be near perfect, or the whole project is a
total loss. This implies a much more conservative set of
design/management rules.

The robotic arm of NASA (JPL, et al) has done a really fine job. My
remark about 1802's was not a joke; it was certainly NOT intended as
criticism.

I want the latest technology (bugs and all) on my desk. I want
something reliable for my file server. I want something safe in my
motorcycle (enough risks as it is) and I most certainly want something
really safe (and therefore probably old) in anything flying in deep
space. 

cheers

Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"