[net.micro] IBM PC XT/370 benchmark results

gnu@sun.uucp (John Gilmore) (12/09/83)
This paper was made by a friend of mine at a company which has been
improving their APL interpreter (written in 370 assembler) for about 15
years.  When faced with making a micro product, their approach was to write
a 360 emulator for the 8088.  This produced a very slow but totally
compatible APL system, since it was running not only the same source code
but the same object code as the mainframe system.  They figured something
like the XT/370 was coming along, and their gamble paid off.

The author states: Please note well that the benchmarks are rather sloppy
(like no base reg in branches and stuff), so they have to be taken w/ a
grain of salt.


PERFORMANCE OF IBM PC XT/370

Robert Bernecky
I.P. Sharp Associates Limited
2 First Canadian Place, Suite 1900
Toronto, Canada M5X 1E3
(416) 364-5361
1983-10-26

The following are the results of a number of informal benchmarks
performed on an IBM PC XT/370 system at IBM Toronto, on October 26,
1983. IBM claimed that the system, based on 2 remasked Motorola 68000
processors, and a remasked Intel 8087 coprocessor, was rated at .1 MIP.
This figure appears to be conservative for an APL system, in which the
common instructions are L, LA, BC, A, TM, CLI, ST, and a few others.
All benchmarks were performed using machine code entered by hand, and
executed using fingers and a wristwatch with sweep-second LCDs. For
comparison purposes, the IPS and MIP rates for X360, the I.P. SHARP 370
emulator, are also given in the table at the end.

Benchmark 1: This was intended to time a primitive loop, incrementing a
counter, and branching back to the increment:

	 SR    R1,R1
L0       LA    R1,1(,R1)
	 B     L0

Iterations per second (IPS): 230601
MIP rate: .46

(Note that the inner loop is two instructions, hence the MIP rate for
these two common instructions is about .46 MIP, quite a respectable
figure.) This figure should be taken with a bit of salt, since the LA
and the B were performed without use of index registers, and the B was
performed with no base register.
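
The MIP figure above follows directly from the iteration count: iterations per second times instructions in the loop body, divided by one million. A minimal sketch of that arithmetic (the function and variable names are illustrative, not from the original report):

```python
def mip_rate(iterations_per_second, instructions_per_iteration):
    """Millions of instructions per second for a timed loop."""
    return iterations_per_second * instructions_per_iteration / 1_000_000

# Benchmark 1: the loop body is LA followed by B, so 2 instructions.
bench1_mips = mip_rate(230601, 2)
print(f"Benchmark 1: {bench1_mips:.3f} MIP")  # prints "Benchmark 1: 0.461 MIP"
```

The same function applies to the later benchmarks once the number of instructions in each loop body is counted.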

Benchmark 2: This is the same as benchmark 1, except that the LA is
replaced by a register-to-register add:

	 LA    R0,1
	 SR    R1,R1
L0       AR    R1,R0
	 B     L0

IPS: 247031
MIP rate: .494

The slightly increased MIP rate here is probably due in part to the AR
taking two bytes instead of 4 bytes, and in part to LA having to add 3
fields instead of two, plus having to decode the result register.

Benchmark 3: This is the same as 2, except that one operand for the add
came from mainstore instead of a general purpose register:

	 SR    R1,R1
L0       A     R1,=F'1'
	 B     L0

IPS: 156424
MIP Rate: .313

The decreased MIP rate here shows the effect of having to wait for 4
bytes to be fetched from mainstore.
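
As a rough illustration of the size of that effect, the per-iteration times of benchmarks 2 and 3 can be differenced. This sketch assumes the B costs the same in both loops, so that the whole difference is the extra cost of A over AR; attributing all of it to the storage fetch is an assumption, and the names are illustrative:

```python
AR_LOOP_IPS = 247031  # benchmark 2: AR + B
A_LOOP_IPS = 156424   # benchmark 3: A + B

def loop_time_us(iterations_per_second):
    """Time for one pass through the loop, in microseconds."""
    return 1_000_000 / iterations_per_second

# Extra time per iteration when the operand comes from mainstore.
extra_us = loop_time_us(A_LOOP_IPS) - loop_time_us(AR_LOOP_IPS)
print(f"A costs about {extra_us:.2f} microseconds more than AR")
```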

Benchmark 4: This is the same as 3, except for the replacement of the
integer add by a double-precision floating point add, to time simple
floating point arithmetic:

	 SDR   D0,D0
	 LD    D4,=X'4110000000000001'
L0       ADR   D0,D4
	 B     L0

IPS: 83018
MIP rate: .16

Note the significant decrease in speed compared to fixed-point
arithmetic.  However, the relative speed of the instruction is quite
good: only about 3 branch times, which is quite favorable compared with
mainframe timings (the Amdahl 470/V8 takes about 2 branch times for an
ADR).
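
A rough cross-check using only the IPS figures from benchmarks 2 and 4. It compares whole loop times (ADR plus B against AR plus B); reading the "branch times" figure this way is an assumption, not something stated in the report, and the names are illustrative:

```python
AR_LOOP_IPS = 247031   # benchmark 2: AR + B
ADR_LOOP_IPS = 83018   # benchmark 4: ADR + B

# How many fixed-point loop times fit in one floating-point loop time.
loop_time_ratio = AR_LOOP_IPS / ADR_LOOP_IPS
print(f"Floating-point loop takes {loop_time_ratio:.2f}x the fixed-point loop")
```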

Benchmark 5: This is similar to 4, except that an explicit LA was
performed for counting purposes:

	 SDR   D0,D0
	 LD    D4,=X'4110000000000001'
	 SR    R1,R1
L0       LA    R1,1(,R1)
	 ADR   D0,D4
	 B     L0

IPS: 71250
MIP rate: .213

Benchmark 6: This was intended to time the speed of byte fetch from
mainstore:

	 SR    R1,R1
L0       LA    R1,1(,R1)
	 CLI   1(R3),X'01'
	 B     L0

IPS: 137918
MIP rate: .414

Although mainstore speed does appear to degrade performance, it doesn't
appear to be too bad.

Benchmark 7: This benchmark was intended to time the speed of character
move. 256 bytes of data were moved. In actual tests, 3 types of moves
were performed: on-boundary, off-boundary, and smeared (overlapped)
moves. All 3 moves performed about the same, within experimental
tolerances, so I am led to believe that the system has no special cases
for MVC. There is some room for improvement of performance here, should
IBM wish to do so.

	 SR    R1,R1
L0       LA    R1,1(,R1)
	 MVC   1(256,R3),1(R4)
	 B     L0

IPS: 2185
MIP rate: .006

This appears to be quite slow, but there was a fair amount of data
sloshing going on. This corresponds to a raw data rate of 559360
bytes/second. The bus data rate will presumably be twice that figure,
which comes close to the theoretical PC bus bandwidth of 1.2
megabytes/second. This suggests that MVC is bus-limited, and that if
IBM were to introduce a 16-bit bus, the speed of MVC could be doubled.
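
The data-rate arithmetic above, written out as a small sketch. It assumes each moved byte crosses the bus twice (one read, one write) and takes the quoted 1.2 megabytes/second as 1,200,000 bytes/second; the names are illustrative:

```python
MVC_IPS = 2185                    # benchmark 7 iterations per second
BYTES_PER_MVC = 256               # bytes moved by each MVC
QUOTED_BUS_BANDWIDTH = 1_200_000  # bytes/second, assuming decimal megabytes

move_rate = MVC_IPS * BYTES_PER_MVC   # bytes moved per second
bus_traffic = 2 * move_rate           # one read plus one write per byte
bus_utilization = bus_traffic / QUOTED_BUS_BANDWIDTH

print(move_rate)    # prints 559360
print(bus_traffic)  # prints 1118720
print(f"{bus_utilization:.0%} of the quoted bus bandwidth")
```

The LA and B in the loop are ignored here; they account for a few microseconds out of a loop time of roughly 458 microseconds.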

Benchmark 8: This was intended to time the Store Clock instruction, and
to see if the implementation of STCK was honest: STCK is defined to
never return the same value twice:

	 SR    R1,R1
L0       LA    R1,1(,R1)
	 STCK  X
	 STCK  Y
	 B     L0

IPS: 39.84
MIP rate: .00016

STCK appears to be implemented honestly, although the timer resolution
being what it is, STCK may take a while to complete, while it waits for
the clock to tick. Maybe IBM will think about introducing a somewhat
faster clock into the beast. The timer resolution appears to be very
nearly twice the PC clock rate of 18.2 Hz, so they may be playing games
with that clock. To be fair, the only time that STCK has to slow down
is when two of them are issued back to back, as in this benchmark,
and the chances of this happening in a real application are negligible.
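
The timing figures for this benchmark can be restated as a small sketch. It treats the LA and B as free, so the per-STCK time is an upper bound, and it simply compares the measured loop rate with the 18.2 Hz PC clock rate; the names are illustrative:

```python
STCK_LOOP_IPS = 39.84     # benchmark 8 iterations per second
STCKS_PER_ITERATION = 2   # STCK X and STCK Y
PC_TICK_HZ = 18.2         # PC clock rate quoted in the text

stck_per_second = STCK_LOOP_IPS * STCKS_PER_ITERATION
ms_per_stck = 1000 / stck_per_second
iterations_per_pc_tick = STCK_LOOP_IPS / PC_TICK_HZ

print(f"{stck_per_second:.2f} STCKs per second")
print(f"at most about {ms_per_stck:.2f} ms per STCK")
print(f"{iterations_per_pc_tick:.2f} loop iterations per PC clock tick")
```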

SUMMARY:

The system appears to be fairly well balanced, with simple floating and
fixed instructions executing in similar timeframes. The processor
appears to run somewhere between the speed of a 360/30 and a 370/145,
ignoring the limited channel bandwidth available on the PC. It should
make a very nice host for SHARP APL.

APPENDIX:

BENCHMARK    IPS (XT/370)  IPS (X360)   PERFORMANCE RATIO

1            230601        5321         43.3
2            247031        4935         50
3            156424        4116         38
4            83018         1538         54
5            71250         1355         53
6            137918        3199         43
7            2185          750          2.91
8            39.84         (stck not implemented on X360)

Note in the above that the speed of MVC is not substantially better on
XT/370 than it is on X360. This is because, for relatively long moves,
both systems are limited by the speed of the PC bus. Since MVC
typically accounts for about 5% of the processor time in a large APL
system, this is relatively unimportant. It would show up in a benchmark
of APL on both systems if reshape or similar functions were executed on
very large arguments. It doesn't point out a failing of XT/370 so much
as it points out the good performance of X360, once it gets going on a
vector operation.
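
The performance ratios in the appendix table can be rechecked from the two IPS columns. A minimal sketch, with the table values typed in by hand and benchmark 8 left out because X360 has no STCK figure; the names are illustrative:

```python
# benchmark number: (IPS on XT/370, IPS on X360, ratio as printed in the table)
RESULTS = {
    1: (230601, 5321, 43.3),
    2: (247031, 4935, 50),
    3: (156424, 4116, 38),
    4: (83018, 1538, 54),
    5: (71250, 1355, 53),
    6: (137918, 3199, 43),
    7: (2185, 750, 2.91),
}

# Recompute each ratio as XT/370 IPS divided by X360 IPS.
ratios = {n: xt / x360 for n, (xt, x360, _) in RESULTS.items()}

for n, (xt, x360, printed) in RESULTS.items():
    print(f"{n}: computed {ratios[n]:.2f}, table {printed}")
```

The computed values agree with the printed ones to within rounding.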