gnu@sun.uucp (John Gilmore) (12/09/83)
This paper was written by a friend of mine at a company which has been improving their APL interpreter (written in 370 assembler) for about 15 years. When faced with making a micro product, their approach was to write a 360 emulator for the 8088. This produced a very slow but totally compatible APL system, since it was running not only the same source code but the same object code as the mainframe system. They figured something like the XT/370 was coming along, and their gamble paid off.

The author states:

    Please note well that the benchmarks are rather sloppy (like no base reg in branches and stuff), so they have to be taken w/ a grain of salt.

PERFORMANCE OF IBM PC XT/370

Robert Bernecky
I.P. Sharp Associates Limited
2 First Canadian Place, Suite 1900
Toronto, Canada M5X 1E3
(416) 364-5361

1983-10-26

The following are the results of a number of informal benchmarks performed on an IBM PC XT/370 system at IBM Toronto, on October 26, 1983. IBM claimed that the system, based on 2 remasked Motorola 68000 processors, and a remasked Intel 8087 coprocessor, was rated at .1 MIP. This figure appears to be conservative for an APL system, in which the common instructions are L, LA, BC, A, TM, CLI, ST, and a few others.

All benchmarks were performed using machine code entered by hand, and executed using fingers and a wristwatch with sweep-second LCDs. For comparison purposes, the IPS and MIP rates for X360, the I.P. SHARP 370 emulator, are also given in the table at the end.

Benchmark 1:

This was intended to time a primitive loop, incrementing a counter, and branching back to the increment:

            SR    R1,R1
    L0      LA    R1,1(,R1)
            B     L0

Iterations per second (IPS): 230601
MIP rate: .46

(Note that the inner loop is two instructions, hence the MIP rate for these two common instructions is about .46 MIP, quite a respectable figure.) This figure should be taken with a bit of salt, since the LA and the B were performed without use of index registers, and the B was performed with no base register.
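
The MIP rates quoted in these benchmarks follow from multiplying the iterations per second by the number of instructions in the loop body. A minimal Python sketch of that arithmetic, using the Benchmark 1 figures; it is not part of the original paper, and the function name `mip_rate` is made up for illustration:

```python
def mip_rate(ips, instructions_per_iteration):
    """Convert loop iterations per second into millions of instructions per second."""
    return ips * instructions_per_iteration / 1_000_000


# Benchmark 1: the timed loop is LA followed by B, so two instructions per iteration.
bench1 = mip_rate(230601, 2)
print(round(bench1, 2))  # 0.46
```

The same function applies to the later benchmarks once the loop length is counted (for example, three instructions per iteration where an LA is added for counting).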

Benchmark 2:

This is the same as benchmark 1, except that the LA is replaced by a register-to-register add:

            LA    R0,1
            SR    R1,R1
    L0      AR    R1,R0
            B     L0

IPS: 247031
MIP rate: .494

The slightly increased MIP rate here is probably due in part to the AR taking two bytes instead of 4 bytes, and in part to LA having to add 3 fields instead of two, plus having to decode the result register.

Benchmark 3:

This is the same as 2, except that one operand for the add came from mainstore instead of a general purpose register:

            SR    R1,R1
    L0      A     R1,=F'1'
            B     L0

IPS: 156424
MIP rate: .313

The decreased MIP rate here shows the effect of having to wait for 4 bytes to be fetched from mainstore.

Benchmark 4:

This is the same as 3, except for the replacement of the integer add by a double-precision floating point add, to time simple floating point arithmetic:

            SDR   D0,D0
            LD    D4,=X'4110000000000001'
    L0      ADR   D0,D4
            B     L0

IPS: 83018
MIP rate: .16

Note the significant decrease in speed compared to fixed-point arithmetic. However, the relative speed of the instruction is quite good: Only about 3 branch times, which is quite favorable compared with mainframe timings (The Amdahl 470/V8 takes about 2 branch times for an ADR).

Benchmark 5:

This is similar to 4, except that an explicit LA was performed for counting purposes:

            SDR   D0,D0
            LD    D4,=X'4110000000000001'
            SR    R1,R1
    L0      LA    R1,1(,R1)
            ADR   D0,D4
            B     L0

IPS: 71250
MIP rate: .213

Benchmark 6:

This was intended to time the speed of byte fetch from mainstore:

            SR    R1,R1
    L0      LA    R1,1(,R1)
            CLI   1(R3),X'01'
            B     L0

IPS: 137918
MIP rate: .414

Although mainstore speed does appear to degrade performance, it doesn't appear to be too bad.

Benchmark 7:

This benchmark was intended to time the speed of character move. 256 bytes of data were moved. In actual tests, 3 types of moves were performed: on-boundary, off-boundary, and smeared (overlapped) moves. All 3 moves performed about the same, within experimental tolerances, so I am led to believe that the system has no special cases for MVC.
There is some room for improvement of performance here, should IBM wish to do so.

            SR    R1,R1
    L0      LA    R1,1(,R1)
            MVC   1(256,R3),1(R4)
            B     L0

IPS: 2185
MIP rate: .006

This appears to be quite slow, but there was a fair amount of data sloshing going on. This corresponds to a raw data rate of 559360 bytes/second. The bus data rate will presumably be twice that figure, which comes close to the theoretical PC bus bandwidth of 1.2 megabytes/second. This suggests that MVC is bus-limited, and that if IBM were to introduce a 16-bit bus, the speed of MVC could be doubled.

Benchmark 8:

This was intended to time the Store Clock instruction, and to see if the implementation of STCK was honest (STCK is defined to never return the same value twice):

            SR    R1,R1
    L0      LA    R1,1(,R1)
            STCK  X
            STCK  Y
            B     L0

IPS: 39.84
MIP rate: .00016

STCK appears to be implemented honestly, although the timer resolution being what it is, STCK may take a while to complete, while it waits for the clock to tick. Maybe IBM will think about introducing a somewhat faster clock into the beast. The timer resolution appears to be very nearly twice the PC clock rate of 18.2 Hz, so they may be playing games with that clock. To be fair, the only time that STCK has to slow down is when two of them are issued back to back, as in this benchmark, and the chances of this happening in a real application are negligible.

SUMMARY:

The system appears to be fairly well balanced, with simple floating and fixed instructions executing in similar timeframes. The processor appears to run somewhere between the speed of a 360/30 and a 370/145, ignoring the limited channel bandwidth available on the PC. It should make a very nice host for SHARP APL.
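
The Benchmark 7 bandwidth estimate can be rechecked with a few lines of arithmetic. The following Python sketch is not part of the original paper. The function name `mvc_rates` is made up for illustration, and it assumes that "1.2 megabytes/second" means 1,200,000 bytes/second and that each byte moved crosses the bus twice (one fetch, one store), as the text presumes:

```python
# Assumption: the quoted "1.2 megabytes/second" is read as 1,200,000 bytes/second.
PC_BUS_BYTES_PER_SEC = 1_200_000


def mvc_rates(ips, bytes_per_move):
    """Return (raw data rate, presumed bus data rate) in bytes per second."""
    raw = ips * bytes_per_move  # bytes delivered to the destination field
    bus = 2 * raw               # assumption: each byte is fetched once and stored once
    return raw, bus


raw, bus = mvc_rates(2185, 256)
print(raw)                                   # 559360
print(bus)                                   # 1118720
print(round(bus / PC_BUS_BYTES_PER_SEC, 2))  # 0.93
```

Under those assumptions the move loop uses a bit over nine tenths of the quoted bus bandwidth, which is consistent with the conclusion that MVC is bus-limited.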

APPENDIX:

    BENCHMARK   IPS (XT/370)   IPS (X360)   PERFORMANCE RATIO
        1          230601         5321            43.3
        2          247031         4935            50
        3          156424         4116            38
        4           83018         1538            54
        5           71250         1355            53
        6          137918         3199            43
        7            2185          750             2.91
        8              39.84   (stck not implemented on X360)

Note in the above that the speed of MVC is not substantially better on XT/370 than it is on X360. This is because, for relatively long moves, both systems are limited by the speed of the PC bus. Since MVC typically accounts for about 5% of the processor time in a large APL system, this is relatively unimportant. It would show up in a benchmark of APL on both systems if reshape or similar functions were executed on very large arguments. It doesn't point out a failing of XT/370 so much as it points out the good performance of X360, once it gets going on a vector operation.
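
The performance ratios in the appendix are the XT/370 IPS divided by the X360 IPS. A short Python sketch that recomputes them from the tabulated figures; it is not part of the original paper, and the names `results` and `performance_ratio` are made up for illustration. Benchmark 8 is left out because STCK was not implemented on X360.

```python
# Benchmark number -> (IPS on XT/370, IPS on X360), copied from the appendix table.
results = {
    1: (230601, 5321),
    2: (247031, 4935),
    3: (156424, 4116),
    4: (83018, 1538),
    5: (71250, 1355),
    6: (137918, 3199),
    7: (2185, 750),
}


def performance_ratio(xt370_ips, x360_ips):
    """How many times faster the XT/370 ran the loop than the X360 emulator."""
    return xt370_ips / x360_ips


ratios = {n: performance_ratio(*pair) for n, pair in results.items()}
for n, ratio in ratios.items():
    print(n, round(ratio, 2))
```

The MVC loop (benchmark 7) stands out at under 3 times, against roughly 40 to 50 times for the other loops, which is the point made in the note above.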