[comp.arch] Benchmarking the 532 ...

josh@polaris.UUCP (06/14/87)

In article <202@celerity.UUCP> ps@celerity.UUCP (Pat Shanahan) writes:
>
>Defining a realistic benchmark set seems a good objective. I do think that
>it is important to realize that performance is a multi-dimensional quantity.
>The only situation in which it is realistic to say "machine X is twice as
>fast as machine Y" is if X and Y have identical architectures. Any benchmark
>that produces a single number is likely to be misleading, unless it is
>understood to be very specialized.
>

Even when the machines have the same architecture, performance is not
a one-number affair.  Consider the case of two implementations of the
IBM 370 architecture, the 3033 and the 3090.  Both are pipelined; indeed,
the 3090 is a *VERY* close relative of the 3033.  In his description
of the 3090 (IBM Systems Journal, Vol. 25, No. 1, 1986, pp. 4-19), Stu
Tucker comments:

   "A study of workloads indicated that in certain applications
    decimal instructions were heavily used.  Thus the I element was
    changed to prefetch decimal operands and overlap decimal execution.
     ...
    Half-word instructions on the 3033 execute in two cycles, one
    to propagate the sign into the left-half 16 bits and a second to
    execute as if the instruction were a full-word operation.  These
    instructions have been improved on the IBM 3090 to allow one cycle
    execution.
     ...
    the only branches left that need to be guessed are branch-on-
    condition-code instructions that actually do have a condition-
    code setting operation ahead of them in the pipeline.  For them,
    a decode history-table scheme is used, in which a table keeps
    the history of branches.
     ...
    Additional improvements were made to the multiply and floating-
    point add operations.  All fixed- and floating-point multiply
    instructions were built using a half cycle (9.25-nanosecond
    clock) and a highly parallel carry-save adder design.  This
    makes possible a MULTIPLY LONG product generation in three
    cycles, not including the final add and normalization cycles.
    Multiply by zero is also detected and treated as a trivial fast
    case."
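
To make the half-word remark concrete, here is a minimal sketch of the
two steps Tucker describes for the 3033: first propagate the sign of the
16-bit operand into the upper 16 bits, then execute as a full-word
operation.  The function names are mine and purely illustrative; this
models the arithmetic only, not the hardware or its timing.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as unsigned 32-bit values


def sign_extend_halfword(half):
    """Step 1: propagate bit 15 of a 16-bit operand into the upper 16 bits."""
    half &= 0xFFFF
    if half & 0x8000:          # sign bit set, so fill the left half with ones
        return half | 0xFFFF0000
    return half


def add_fullword(a, b):
    """Step 2: an ordinary 32-bit add, wrapping on overflow."""
    return (a + b) & MASK32


def add_halfword(register, half):
    """Half-word add modelled as the two steps above (hypothetical helper)."""
    return add_fullword(register, sign_extend_halfword(half))


if __name__ == "__main__":
    # 0xFFFF is -1 as a half-word, so 5 + (-1) gives 4.
    print(add_halfword(5, 0xFFFF))
```

A machine that folds the sign propagation into the same cycle as the
add produces the same result in one step, which is the improvement the
quote attributes to the 3090.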

The 3090 has a 64K store-in (AKA write-back) cache with 128 byte
lines (AKA blocks) while the 3033 has a 32K store-through cache with
64 byte lines. Thus two machines with exactly the same architecture
and very similar *organizations* will have different performance
ratios on different workloads.  The immediate predecessors of the
3090 series, the 308x machines, differ even more from the 3090,
so that a 3090 that is 1.7-1.9 times as fast as a 308x on a
"commercial workload" can be up to 3 times as fast as the 308x on
a "scientific workload".  True, quite a bit of the difference is
in the floating point, but some of it is in improved branch handling
as well.
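
A small sketch of why the ratio moves with the workload.  Every number
below is invented for illustration (these are NOT measured 3033, 308x,
or 3090 figures), and the model assumes both machines run at the same
cycle time and that average cycles per instruction is just a weighted
sum over instruction classes, which ignores cache and pipeline effects.

```python
# Hypothetical cycles per instruction, by instruction class.
CYCLES = {
    "old": {"integer": 1.0, "halfword": 2.0, "branch": 3.0, "float": 6.0},
    "new": {"integer": 1.0, "halfword": 1.0, "branch": 2.0, "float": 2.0},
}

# Hypothetical instruction mixes; each set of fractions sums to 1.0.
MIXES = {
    "commercial": {"integer": 0.60, "halfword": 0.15, "branch": 0.24, "float": 0.01},
    "scientific": {"integer": 0.35, "halfword": 0.05, "branch": 0.15, "float": 0.45},
}


def average_cycles(machine, mix):
    """Weighted average cycles per instruction for one machine on one mix."""
    return sum(fraction * CYCLES[machine][cls] for cls, fraction in mix.items())


def speedup(mix):
    """How many times faster "new" is than "old" on the given mix."""
    return average_cycles("old", mix) / average_cycles("new", mix)


if __name__ == "__main__":
    for name, mix in MIXES.items():
        print(f"{name}: {speedup(mix):.2f}x")
```

With these made-up inputs the same pair of machines comes out about
1.3 times apart on the commercial mix and about 2.3 times apart on the
scientific one, so neither figure alone describes the pair.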

Sorry to ramble on so long just to say that Pat is right *even if*
the machines have the same architecture, but that's what I work on,
understanding the performance issues for a particular architecture.
I'll let you guess which one :-) :-)

Of course, any opinions expressed or errors are mine, not my employer's.

-- 

	Josh Knight, IBM T.J. Watson Research
 josh@ibm.com, josh@yktvmh.bitnet,  ...!philabs!polaris!josh