brandis@inf.ethz.ch (Marc Brandis) (11/06/90)
Does anybody have some benchmark results for the i960CA, especially as compared to other processors? I would be very interested in the integer SPECmarks (as the CA does not have an FPU), the Dhrystone rating, Hennessy-Benchmark results or similar stuff.

Moreover, does anybody know how well the i960CA puts its multiple execution units into use? Is the Intel rating of 66 MIPS at 33 MHz realistic for non-toy programs?

Thanks for any results or pointers!

(* I speak only for myself.
   Marc-Michael Brandis, Institut fuer Computersysteme, CH-8092 Zurich
   email: brandis@inf.ethz.ch   brandis@iis.ethz.ch
*)
rfg@NCD.COM (Ron Guilmette) (11/07/90)
In article <13912@neptune.inf.ethz.ch> brandis@inf.ethz.ch (Marc Brandis) writes:
>Does anybody have some benchmark results for the i960CA, especially as
>compared to other processors? I would be very interested in the integer
>SPECmarks (as the CA does not have an FPU), the Dhrystone rating,
>Hennessy-Benchmark results or similar stuff.

I seriously doubt that you will ever see SPEC numbers for the i960. You see, the programs in the SPEC suite tend to be large general-purpose programs. I don't recall what all of them are, but I know for sure that one of them is GCC (the GNU C compiler). At least that program (and probably many of the others in the SPEC suite) ASSUMES that you have something kinda like UNIX running on the system under test. To the best of my knowledge, UNIX hasn't been ported to the 960, nor, I believe, is it likely to be anytime soon. After all, the 960 is for *embedded* applications, right?

>Moreover, does anybody know how well the i960CA puts its multiple
>execution units into use? Is the Intel rating of 66 MIPS at 33 MHz
>realistic for non-toy programs?

I believe that Intel states plainly that the 66 MIPS figure is peak (but I'm not 100% sure). Perhaps it was 99 MIPS peak for the CA because of the possibility that up to three instructions could be in three different functional units at one time. But you can't sustain that for any more than (perhaps) one cycle, because this (rare?) case only happens when three out of a group of four instructions meet certain criteria. And then (I believe) you get to spend one cycle (or more) executing just the one remaining instruction out of that same group of four.

Keep in mind that this all applies only to the CA anyway (and possibly some of the recently announced new family members). Other family members do not dispatch multiple instructions to multiple functional units in the same cycle.

Regarding the general question of how well compilers (e.g. for C and FORTRAN) schedule instructions to make use of the multiple functional units in the CA, well... this will (of course) depend on the compiler in question. For the GNU C compiler the answer (for the moment) is that instruction scheduling doesn't happen at all. The chips (or rather the instructions) just fall where they may. That should change when the long-rumored GCC Version 2 appears, but that is not generally available today.

With respect to other compilers, I have no specific information. I can say, however, that the folks at Intel are no dummies, and that they certainly realize that instruction scheduling is a very significant issue for i960 compilers. I don't think it would be surprising (to anyone) if we all found out (later on) that they were looking into the question of how to make their chips look better (performance-wise) via compiler technology.

-- 
// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.
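To make the scheduling point concrete, here is a tiny C fragment. It is an illustration only, not actual output of gcc960 or any other i960 compiler: in source order the add uses the value loaded on the previous line, so a pipeline with no instruction scheduling may stall waiting for the load, while a scheduler can move the independent index update and loop test into that gap.

    /* Illustration only: not compiler output.  The add depends on the
     * load directly above it; the index update and the loop-exit test
     * are independent work a scheduler could place between them. */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        int i;

        for (i = 0; i < n; i++) {
            int x = a[i];   /* load                                   */
            sum += x;       /* dependent on the load                  */
        }                   /* i++ and the compare/branch are         */
                            /* independent of the load result         */
        return sum;
    }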
brandis@inf.ethz.ch (Marc Brandis) (11/08/90)
In article <2464@lupine.NCD.COM> rfg@NCD.COM (Ron Guilmette) writes:
>I believe that Intel states plainly that the 66 MIPS figure is peak
>(but I'm not 100% sure). Perhaps it was 99 MIPS peak for the CA
>because of the possibility that up to three instructions could be
>in three different functional units at one time. But you can't sustain
>that for any more than (perhaps) one cycle, because this (rare?)
>case only happens when three out of a group of four instructions
>meet certain criteria. And then (I believe) you get to spend one
>cycle (or more) executing just the one remaining instruction out
>of that same group of four.

Intel states a sustained performance of 66 MIPS (Intel 80960CA User's Manual, A-2) and a peak performance of 99 MIPS. The instruction decoder is able to decode and issue one instruction for each unit each cycle. The three execution units are the arithmetic and logical operations unit (REG), the control instructions unit (CTRL) and the memory access unit (MEM). There is no restriction that after a cycle in which multiple instructions have been issued, only one instruction may be issued in the next cycle. As I said, the CPU can decode and issue three instructions each cycle, given the right mix of instructions and - of course - no dependencies between these instructions. The instructions to be executed in parallel have to occur in a certain order (REG-MEM-CTRL), but a compiler can easily reorder them for this, as there can be no structural dependencies between them anyway.

Note that when the Intel documentation says the scheduler looks at four instructions at once and is able to fetch four instructions at once, it does not imply that the instructions to be executed in parallel have to be in the same quadword. The scheduler has a "rolling quad-word instruction window" (Intel 80960CA User's Manual, B-4), and after scheduling instructions from it, it considers the next four unexecuted instructions (B-7). However, the manual does not say how new instructions are inserted into the rolling instruction window, so I do not know how the ordering is handled after some instructions from the window have been executed and new ones have been added.

I do not agree with the argument that multiple-instruction execution is possible only in very rare cases. Consider the following mix of instructions:

    Control:               14%
    Arithmetic, logical:   39%
    Data transfer:         26%

The data is taken from Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" (DLX Instruction Set Measurements, average column, page C-5). Note that these figures cover only 79% of all instructions; the rest are floating-point or rarely executed instructions.

Let us make the following assumptions:

  - The DLX instruction set is similar enough to the i960 that this average distribution will also be found on the i960.
  - In the absence of floating-point instructions, the above ratios between control, arithmetic and data transfer instructions also hold for the 100% case.

If both assumptions hold (which I think is reasonable), we have about 50% arithmetic and logical instructions, about 32% data transfer instructions and 18% control instructions. In the absence of data dependencies, you would expect the arithmetic and logical unit to become the bottleneck, while the control and memory instructions can easily be scheduled in parallel with the arithmetic ones. This would result in a sustained rate of 66 MIPS.
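A minimal C sketch of the kind of loop this grouping applies to; the unit assignments in the comments are only an assumption about how the work splits across the REG, MEM and CTRL classes, not a claim about how any particular compiler actually pairs the instructions:

    /* Sketch only: the unit split in the comments is illustrative. */
    long checksum(const long *p, long n)
    {
        long sum = 0;

        while (n-- > 0) {       /* REG: decrement, feeds the loop test  */
            sum += *p++;        /* MEM: load through p (and advance it) */
                                /* REG: add the loaded word into sum    */
        }                       /* CTRL: conditional branch to the top  */
        return sum;
    }

Within one iteration the add still depends on the load, so to issue a REG, a MEM and a CTRL instruction in the same cycle the compiler has to pick independent ones, for example by overlapping this iteration's add with the next iteration's load and branch. That is exactly the kind of reordering an instruction scheduler performs.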
However, I have to admit that looking only at the average distribution of instructions does not give the whole picture. For example, the distribution in one benchmark for DLX (US Steel) looks like 23% control instructions, 49% arithmetic operations and 10% data transfers. Scaled to 100%, you get about 28% control, 59% arithmetic and 13% data transfer. Again, the arithmetic unit would be the bottleneck, but as it would have to execute more than half of the instructions, you cannot achieve two instructions per cycle. I am not sure what all this looks like at the statement or loop level. Of course, these predictions of performance are only valid if the ratio of the different kinds of instructions does not vary too much between different program parts. And one should not forget that there may also be structural dependencies between the instructions that the compiler cannot remove and that cause the instruction scheduler to stall.

It would be interesting to see whether there are other flaws in the design of the i960CA that reduce the achievable performance. I could imagine that the small instruction cache, the missing data cache and the small external bus may be bottlenecks. (I understand that the i960CA is designed for embedded applications, but the programs used in embedded applications are not so different from programs in other environments, considering both size and locality patterns.)

>With respect to other compilers, I have no specific information. I can
>say, however, that the folks at Intel are no dummies, and that they
>certainly realize that instruction scheduling is a very significant
>issue for i960 compilers. I don't think it would be surprising
>(to anyone) if we all found out (later on) that they were looking
>into the question of how to make their chips look better
>(performance-wise) via compiler technology.

I do not think there is anything wrong with trying to get the best performance out of a CPU by writing a sophisticated compiler, as long as the optimizations are valuable for a large number of programs. Implementing optimizations that help only the Dhrystone benchmark is not the way to go, of course, but I have no problem with a sophisticated instruction scheduler in a compiler.

Marc-Michael Brandis
Institut fuer Computersysteme, ETH-Zentrum
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch
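As a quick arithmetic check of the scaled figures quoted above, the short program below renormalises both the average DLX mix and the US Steel mix so that the three covered instruction classes sum to 100%. It is only a sanity check of the numbers, not part of any benchmark.

    #include <stdio.h>

    /* Rescale the quoted DLX instruction-mix figures so that control,
     * arithmetic/logical and data-transfer instructions sum to 100%,
     * i.e. assuming the uncovered (mostly floating-point) fraction is
     * absent from an integer-only i960CA workload. */
    static void scale(const char *name, double ctrl, double alu, double mem)
    {
        double covered = ctrl + alu + mem;

        printf("%-10s control %3.0f%%  arithmetic %3.0f%%  data transfer %3.0f%%\n",
               name,
               100.0 * ctrl / covered,
               100.0 * alu  / covered,
               100.0 * mem  / covered);
    }

    int main(void)
    {
        scale("average",  14.0, 39.0, 26.0);  /* about 18%, 49%, 33% */
        scale("US Steel", 23.0, 49.0, 10.0);  /* about 28%, 60%, 12% */
        return 0;
    }

The rounding differs slightly from the figures in the article, but the conclusion is the same: the arithmetic unit carries half or more of the instructions.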
chris@gestetner.oz (Chris Mendes) (11/23/90)
In article <13912@neptune.inf.ethz.ch>, brandis@inf.ethz.ch (Marc Brandis) writes:
> Does anybody have some benchmark results for the i960CA, especially as
> compared to other processors? I would be very interested in the integer
> SPECmarks (as the CA does not have an FPU), the Dhrystone rating,
> Hennessy-Benchmark results or similar stuff.
>
> Moreover, does anybody know how well the i960CA puts its multiple
> execution units into use? Is the Intel rating of 66 MIPS at 33 MHz
> realistic for non-toy programs?
>
> Thanks for any results or pointers!
>
> (* I speak only for myself.
> Marc-Michael Brandis, Institut fuer Computersysteme, CH-8092 Zurich
> email: brandis@inf.ethz.ch brandis@iis.ethz.ch
> *)

This article is in response to the request put forward in Article 386 for i960 benchmarks.

Benchmark results for the EVCA i960 evaluation board:
=====================================================

1. Dhrystone.

     i960CA (code and data in 0-wait-state SRAM)   28000 Dhrystones/sec
     i960CA (code in DRAM, data in SRAM)           15150 Dhrystones/sec
     SUN 3/60                                       4360 Dhrystones/sec
     SUN 3/50                                       3300 Dhrystones/sec

2. Mandelbrot.

   This showed that the i960CA with software floating-point routines was between 2 and 3 times faster than a SUN 3/60 with a floating-point coprocessor. Not that startling.

3. Bitmap.

   A comparison was made between our proprietary bitmap handler software on a SUN 3/60 and on the i960CA. The 960 was 8-10 times faster than the SUN. (This test basically involves a number of basic trapezoid, ellipse and line drawing functions written in C; the Bresenham algorithm was the basis for the drawing routines.) (NB: This is probably the most relevant benchmark for our company.)

Hope this sheds some light on the benchmark question. All tests were compiled with the GCC960 compiler at maximum optimisation. Obviously, some software will respond better to rescheduling than other software, so how much code actually executes in parallel depends very much on the application. Some code should run as Intel predicts; in general, however, that is not a sustained throughput level.

The floating-point library is not as fast as we had hoped, nor is it squeaky clean. The software was overhauled by US Software after it failed one of our tests (Paranoia), and fixes were incorporated in version 1.2 of the Intel/GNU tools release, although there are still a few flaws.

Clearly, the chip flies, and we believe that we have made the right decision as far as our commitment to it goes.

---------------------------------------------------------------------------
 Christopher M. Mendes       chris@gestetner.oz
 Ph: +61-2-9750546           Fx: +61-2-9750448
---------------------------------------------------------------------------
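The bitmap test above is built on Bresenham-style drawing loops. For reference, here is a generic, textbook integer Bresenham line routine in C; it is only a sketch of the kind of inner loop such a test exercises, not the proprietary bitmap handler that was actually benchmarked, and set_pixel() is a placeholder for whatever frame-buffer access the real code uses.

    /* Generic integer-only Bresenham line; set_pixel() is a placeholder. */
    extern void set_pixel(int x, int y);

    void draw_line(int x0, int y0, int x1, int y1)
    {
        int dx = (x1 > x0) ? (x1 - x0) : (x0 - x1);
        int dy = (y1 > y0) ? (y1 - y0) : (y0 - y1);
        int sx = (x0 < x1) ? 1 : -1;
        int sy = (y0 < y1) ? 1 : -1;
        int err = dx - dy;

        for (;;) {
            set_pixel(x0, y0);
            if (x0 == x1 && y0 == y1)
                break;
            {
                int e2 = 2 * err;
                if (e2 > -dy) { err -= dy; x0 += sx; }
                if (e2 <  dx) { err += dx; y0 += sy; }
            }
        }
    }

Loops like this are pure integer work (compares, adds and pixel stores), which is why the i960CA's missing FPU is no handicap in this particular test.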