josh@polaris.UUCP (06/14/87)
In article <202@celerity.UUCP> ps@celerity.UUCP (Pat Shanahan) writes:
>
>Defining a realistic benchmark set seems a good objective. I do think that
>it is important to realize that performance is a multi-dimensional quantity.
>The only situation in which it is realistic to say "machine X is twice as
>fast as machine Y" is if X and Y have identical architectures. Any benchmark
>that produces a single number is likely to be misleading, unless it is
>understood to be very specialized.
>

Even when the machines have the same architecture, performance is not a
one-number affair. Consider the case of two implementations of the IBM 370
architecture, the 3033 and the 3090. Both are pipelined; indeed, the 3090
is a *VERY* close relative of the 3033. In his description of the 3090
(IBM Systems Journal, Vol. 25, No. 1, 1986, pp. 4-19), Stu Tucker comments:

    "A study of workloads indicated that in certain applications
    decimal instructions were heavily used. Thus the I element was
    changed to prefetch decimal operands and overlap decimal execution.

    ...

    Half-word instructions on the 3033 execute in two cycles, one to
    propagate the sign into the left-half 16 bits and a second to
    execute as if the instruction were a full-word operation. These
    instructions have been improved on the IBM 3090 to allow one cycle
    execution.

    ...

    the only branches left that need to be guessed are
    branch-on-condition-code instructions that actually do have a
    condition-code setting operation ahead of them in the pipeline.
    For them, a decode history-table scheme is used, in which a table
    keeps the history of branches.

    ...

    Additional improvements were made to the multiply and
    floating-point add operations. All fixed- and floating-point
    multiply instructions were built using a half cycle
    (9.25-nanosecond clock) and a highly parallel carry-save adder
    design. This makes possible a MULTIPLY LONG product generation in
    three cycles, not including the final add and normalization cycles.
    Multiply by zero is also detected and treated as a trivial fast
    case."

The 3090 has a 64K store-in (AKA write-back) cache with 128 byte lines
(AKA blocks) while the 3033 has a 32K store-through cache with 64 byte
lines.

Thus two machines with exactly the same architecture and very similar
*organizations* will have different performance ratios on different
workloads. The immediate predecessors of the 3090 series, the 308x
machines, differ even more from the 3090, so that a 3090 that is 1.7-1.9
times as fast as a 308x on a "commercial workload" can be up to 3 times
as fast as the 308x on a "scientific workload". True, quite a bit of the
difference is in the floating point, but some of it is in improved branch
handling as well.

Sorry to ramble on so long just to say that Pat is right *even if* the
machines have the same architecture, but that's what I work on,
understanding the performance issues for a particular architecture. I'll
let you guess which one :-) :-)

Of course, any opinions expressed or errors are mine, not my employer's.
-- 
Josh Knight, IBM T.J. Watson Research
josh@ibm.com, josh@yktvmh.bitnet, ...!philabs!polaris!josh
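
As a toy illustration of the argument above (not a model of either real
machine), the sketch below shows how one change in instruction timing gives
different speed ratios on different instruction mixes. The only figure taken
from the quoted description is the half-word cost of two cycles versus one.
The cost of "other" instructions, both instruction mixes, and all names are
assumptions made up for the example, and clock rate is ignored entirely.

```python
# Cycles per instruction class for two hypothetical implementations of the
# same architecture. Only the half-word figures (2 cycles vs. 1 cycle) come
# from the quoted text; the "other" cost is an invented placeholder.
CYCLES = {
    "old": {"halfword": 2, "other": 2},
    "new": {"halfword": 1, "other": 2},
}

# Invented instruction mixes: fraction of executed instructions per class.
HALFWORD_HEAVY = {"halfword": 0.5, "other": 0.5}
HALFWORD_FREE = {"halfword": 0.0, "other": 1.0}


def cycles_per_instruction(machine, mix):
    """Average cycles per instruction for a machine on a given mix."""
    return sum(fraction * CYCLES[machine][cls] for cls, fraction in mix.items())


def speedup(mix):
    """Ratio of old to new cycle counts on the same mix (clock rate ignored)."""
    return cycles_per_instruction("old", mix) / cycles_per_instruction("new", mix)


if __name__ == "__main__":
    for name, mix in (("half-word heavy", HALFWORD_HEAVY),
                      ("half-word free", HALFWORD_FREE)):
        print(f"{name}: new machine is {speedup(mix):.2f}x the old one")
```

With these made-up numbers the ratio is about 1.33 on the half-word-heavy
mix and exactly 1.00 on the mix with no half-word instructions, so no single
number describes how much faster the "new" machine is.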