brandis@inf.ethz.ch (Marc Brandis) (11/06/90)
Does anybody have some benchmark results for the i960CA, especially as compared to other processors? I would be very interested in the integer SPECmarks (as the CA does not have an FPU), the Dhrystone rating, Hennessy-Benchmark results or similar stuff.

Moreover, does anybody know how well the i960CA puts its multiple execution units into use? Is the Intel rating of 66 MIPS at 33 MHz realistic for non-toy programs?

Thanks for any results or pointers!

(* I speak only for myself.
   Marc-Michael Brandis, Institut fuer Computersysteme, CH-8092 Zurich
   email: brandis@inf.ethz.ch   brandis@iis.ethz.ch
*)
rfg@NCD.COM (Ron Guilmette) (11/07/90)
In article <13912@neptune.inf.ethz.ch> brandis@inf.ethz.ch (Marc Brandis) writes:
>Does anybody have some benchmark results for the i960CA, especially as
>compared to other processors? I would be very interested in the integer
>SPECmarks (as the CA does not have an FPU), the Dhrystone rating,
>Hennessy-Benchmark results or similar stuff.

I seriously doubt that you will ever see SPEC numbers for the i960. You see, the programs in the SPEC suite tend to be large general-purpose programs. I don't recall what all of them are, but I know for sure that one of them is GCC (the GNU C compiler). At least that program (and probably many of the others in the SPEC suite) ASSUMES that you have something kinda like UNIX running on the system under test. To the best of my knowledge, UNIX hasn't been ported to the 960, nor, I believe, is it likely to be anytime soon. After all, the 960 is for *embedded* applications, right?

>Moreover, does anybody know how well the i960CA puts its multiple
>execution units into use? Is the Intel rating of 66 MIPS at 33 MHz
>realistic for non-toy programs?

I believe that Intel states plainly that the 66 MIPS figure is peak (but I'm not 100% sure). Perhaps it was 99 MIPS peak for the CA because of the possibility that up to three instructions could be in three different functional units at one time. But you can't sustain that for any more than (perhaps) one cycle, because this (rare?) case only happens when three out of a group of four instructions meet certain criteria. And then (I believe) you get to spend one cycle (or more) executing just the one remaining instruction out of that same group of four.

Keep in mind that this all applies only to the CA anyway (and possibly some of the recently announced new family members). Other family members do not dispatch multiple instructions to multiple functional units in the same cycle.

Regarding the general question of how well compilers (e.g. for C and FORTRAN) schedule instructions to make use of the multiple functional units in the CA, well... this will (of course) depend on the compiler in question. For the GNU C compiler the answer (for the moment) is that instruction scheduling doesn't happen at all. The chips (or rather the instructions) just fall where they may. That should change when the long-rumored GCC Version 2 appears, but that is not generally available today.

With respect to other compilers, I have no specific information. I can say, however, that the folks at Intel are no dummies, and that they certainly realize that instruction scheduling is a very significant issue for i960 compilers. I don't think it would be surprising (to anyone) if we all found out (later on) that they were looking into the question of how to make their chips look better (performance-wise) via compiler technology.

-- 
// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.
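To make the scheduling point concrete, here is a tiny C fragment. It is an illustration only, not actual output of gcc960 or any other i960 compiler: in source order the add uses the value loaded on the previous line, so a pipeline with no instruction scheduling may stall waiting for the load, while a scheduler can move the independent index update and loop test into that gap.

    /* Illustration only: not compiler output.  The add depends on the
     * load directly above it; the index update and the loop-exit test
     * are independent work a scheduler could place between them. */
    int sum_array(const int *a, int n)
    {
        int sum = 0;
        int i;

        for (i = 0; i < n; i++) {
            int x = a[i];   /* load                                   */
            sum += x;       /* dependent on the load                  */
        }                   /* i++ and the compare/branch are         */
                            /* independent of the load result         */
        return sum;
    }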
brandis@inf.ethz.ch (Marc Brandis) (11/08/90)
In article <2464@lupine.NCD.COM> rfg@NCD.COM (Ron Guilmette) writes:
>I believe that Intel states plainly that the 66 MIPS figure is peak
>(but I'm not 100% sure). Perhaps it was 99 MIPS peak for the CA
>because of the possibility that up to three instructions could be
>in three different functional units at one time. But you can't sustain
>that for any more than (perhaps) one cycle, because this (rare?)
>case only happens when three out of a group of four instructions
>meet certain criteria. And then (I believe) you get to spend one
>cycle (or more) executing just the one remaining instruction out
>of that same group of four.

Intel states a sustained performance of 66 MIPS (Intel 80960CA User's Manual, A-2) and a peak performance of 99 MIPS. The instruction decoder is able to decode and issue one instruction for each unit each cycle. The three execution units are the arithmetic and logical operations unit (REG), the control instructions unit (CTRL) and the memory access unit (MEM). There is no restriction that after a cycle in which multiple instructions have been issued, only one instruction may be issued in the next cycle. As I said, the CPU can decode and issue three instructions each cycle, given the right mix of instructions and - of course - no dependencies between these instructions. The instructions to be executed in parallel have to occur in a certain order (REG-MEM-CTRL), but a compiler can easily reorder them for this, as there can be no structural dependencies between them anyway.

Note that when the Intel documentation says the scheduler looks at four instructions at once and is able to fetch four instructions at once, it does not imply that the instructions to be executed in parallel have to be in the same quadword. The scheduler has a "rolling quad-word instruction window" (Intel 80960CA User's Manual, B-4), and after scheduling instructions from it, it considers the next four unexecuted instructions (B-7). However, the manual does not say how new instructions are inserted into the rolling instruction window, so I do not know how the ordering is handled after some instructions from the window have been executed and new ones have been added.

I do not agree with the argument that multiple-instruction execution is possible only in very rare cases. Consider the following mix of instructions:

    Control:               14%
    Arithmetic, logical:   39%
    Data transfer:         26%

The data is taken from Hennessy & Patterson, "Computer Architecture: A Quantitative Approach" (DLX Instruction Set Measurements, average column, page C-5). Note that these figures cover only 79% of all instructions; the rest are floating-point or rarely executed instructions.

Let us make the following assumptions:

  - The DLX instruction set is similar enough to the i960 that this average distribution will also be found on the i960.
  - In the absence of floating-point instructions, the above ratios between control, arithmetic and data transfer instructions also hold for the 100% case.

If both assumptions hold (which I think is reasonable), we have about 50% arithmetic and logical instructions, about 32% data transfer instructions and 18% control instructions. In the absence of data dependencies, you would expect the arithmetic and logical unit to become the bottleneck, while the control and memory instructions can easily be scheduled in parallel with the arithmetic ones. This would result in a sustained rate of 66 MIPS.
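A minimal C sketch of the kind of loop this grouping applies to; the unit assignments in the comments are only an assumption about how the work splits across the REG, MEM and CTRL classes, not a claim about how any particular compiler actually pairs the instructions:

    /* Sketch only: the unit split in the comments is illustrative. */
    long checksum(const long *p, long n)
    {
        long sum = 0;

        while (n-- > 0) {       /* REG: decrement, feeds the loop test  */
            sum += *p++;        /* MEM: load through p (and advance it) */
                                /* REG: add the loaded word into sum    */
        }                       /* CTRL: conditional branch to the top  */
        return sum;
    }

Within one iteration the add still depends on the load, so to issue a REG, a MEM and a CTRL instruction in the same cycle the compiler has to pick independent ones, for example by overlapping this iteration's add with the next iteration's load and branch. That is exactly the kind of reordering an instruction scheduler performs.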
However, I have to admit that looking only at the average distribution of instructions does not give the whole picture. For example, the distribution in one benchmark for DLX (US Steel) looks like 23% control instructions, 49% arithmetic operations and 10% data transfers. Scaled to 100%, you get about 28% control, 59% arithmetic and 13% data transfer. Again, the arithmetic unit would be the bottleneck, but as it would have to execute more than half of the instructions, you cannot achieve two instructions per cycle. I am not sure what all this looks like at the statement or loop level. Of course, these predictions of performance are only valid if the ratio of the different kinds of instructions does not vary too much between different program parts. And one should not forget that there may also be structural dependencies between the instructions that the compiler cannot remove and that cause the instruction scheduler to stall.

It would be interesting to see whether there are other flaws in the design of the i960CA that reduce the achievable performance. I could imagine that the small instruction cache, the missing data cache and the small external bus may be bottlenecks. (I understand that the i960CA is designed for embedded applications, but the programs used in embedded applications are not so different from programs in other environments, considering both size and locality patterns.)

>With respect to other compilers, I have no specific information. I can
>say, however, that the folks at Intel are no dummies, and that they
>certainly realize that instruction scheduling is a very significant
>issue for i960 compilers. I don't think it would be surprising
>(to anyone) if we all found out (later on) that they were looking
>into the question of how to make their chips look better
>(performance-wise) via compiler technology.

I do not think there is anything wrong with trying to get the best performance out of a CPU by writing a sophisticated compiler, as long as the optimizations are valuable for a large number of programs. Implementing optimizations that help only the Dhrystone benchmark is not the way to go, of course, but I have no problem with a sophisticated instruction scheduler in a compiler.

Marc-Michael Brandis
Institut fuer Computersysteme, ETH-Zentrum
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch
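As a quick arithmetic check of the scaled figures quoted above, the short program below renormalises both the average DLX mix and the US Steel mix so that the three covered instruction classes sum to 100%. It is only a sanity check of the numbers, not part of any benchmark.

    #include <stdio.h>

    /* Rescale the quoted DLX instruction-mix figures so that control,
     * arithmetic/logical and data-transfer instructions sum to 100%,
     * i.e. assuming the uncovered (mostly floating-point) fraction is
     * absent from an integer-only i960CA workload. */
    static void scale(const char *name, double ctrl, double alu, double mem)
    {
        double covered = ctrl + alu + mem;

        printf("%-10s control %3.0f%%  arithmetic %3.0f%%  data transfer %3.0f%%\n",
               name,
               100.0 * ctrl / covered,
               100.0 * alu  / covered,
               100.0 * mem  / covered);
    }

    int main(void)
    {
        scale("average",  14.0, 39.0, 26.0);  /* about 18%, 49%, 33% */
        scale("US Steel", 23.0, 49.0, 10.0);  /* about 28%, 60%, 12% */
        return 0;
    }

The rounding differs slightly from the figures in the article, but the conclusion is the same: the arithmetic unit carries half or more of the instructions.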
chris@gestetner.oz (Chris Mendes) (11/23/90)
In article <13912@neptune.inf.ethz.ch>, brandis@inf.ethz.ch (Marc Brandis) writes:
> Does anybody have some benchmark results for the i960CA, especially as
> compared to other processors? I would be very interested in the integer
> SPECmarks (as the CA does not have an FPU), the Dhrystone rating,
> Hennessy-Benchmark results or similar stuff.
>
> Moreover, does anybody know how well the i960CA puts its multiple
> execution units into use? Is the Intel rating of 66 MIPS at 33 MHz
> realistic for non-toy programs?
>
> Thanks for any results or pointers!
>
> (* I speak only for myself.
> Marc-Michael Brandis, Institut fuer Computersysteme, CH-8092 Zurich
> email: brandis@inf.ethz.ch brandis@iis.ethz.ch
> *)

This article is in response to the request put forward in Article 386 for i960 benchmarks.

Benchmark results for the EVCA i960 evaluation board:
=====================================================

1. Dhrystone.

     i960CA (code and data in 0-wait-state SRAM)   28000 Dhrystones/sec
     i960CA (code in DRAM, data in SRAM)           15150 Dhrystones/sec
     SUN 3/60                                       4360 Dhrystones/sec
     SUN 3/50                                       3300 Dhrystones/sec

2. Mandelbrot.

   This showed that the i960CA with software floating-point routines was between 2 and 3 times faster than a SUN 3/60 with a floating-point coprocessor. Not that startling.

3. Bitmap.

   A comparison was made between our proprietary bitmap handler software on a SUN 3/60 and on the i960CA. The 960 was 8-10 times faster than the SUN. (This test basically involves a number of basic trapezoid, ellipse and line drawing functions written in C; the Bresenham algorithm was the basis for the drawing routines.) (NB: This is probably the most relevant benchmark for our company.)

Hope this sheds some light on the benchmark question. All tests were compiled with the GCC960 compiler at maximum optimisation. Obviously, some software will respond better to rescheduling than other software, so how much code actually executes in parallel depends very much on the application. Some code should run as Intel predicts; in general, however, that is not a sustained throughput level.

The floating-point library is not as fast as we had hoped, nor is it squeaky clean. The software was overhauled by US Software after it failed one of our tests (Paranoia), and fixes were incorporated in version 1.2 of the Intel/GNU tools release, although there are still a few flaws.

Clearly, the chip flies, and we believe that we have made the right decision as far as our commitment to it goes.

---------------------------------------------------------------------------
 Christopher M. Mendes       chris@gestetner.oz
 Ph: +61-2-9750546           Fx: +61-2-9750448
---------------------------------------------------------------------------
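The bitmap test above is built on Bresenham-style drawing loops. For reference, here is a generic, textbook integer Bresenham line routine in C; it is only a sketch of the kind of inner loop such a test exercises, not the proprietary bitmap handler that was actually benchmarked, and set_pixel() is a placeholder for whatever frame-buffer access the real code uses.

    /* Generic integer-only Bresenham line; set_pixel() is a placeholder. */
    extern void set_pixel(int x, int y);

    void draw_line(int x0, int y0, int x1, int y1)
    {
        int dx = (x1 > x0) ? (x1 - x0) : (x0 - x1);
        int dy = (y1 > y0) ? (y1 - y0) : (y0 - y1);
        int sx = (x0 < x1) ? 1 : -1;
        int sy = (y0 < y1) ? 1 : -1;
        int err = dx - dy;

        for (;;) {
            set_pixel(x0, y0);
            if (x0 == x1 && y0 == y1)
                break;
            {
                int e2 = 2 * err;
                if (e2 > -dy) { err -= dy; x0 += sx; }
                if (e2 <  dx) { err += dx; y0 += sy; }
            }
        }
    }

Loops like this are pure integer work (compares, adds and pixel stores), which is why the i960CA's missing FPU is no handicap in this particular test.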