[comp.arch] 68030 Performance

curry@nsc.UUCP (03/23/87)
The following is a letter written by one of our architects to Electronic
Design.  I thought that it would be interesting reading on the net and
would be good for creating some discussion.  


		MC68030 COMPARISON WITH MC68020

I read with interest the "Assessing MC68030 and MC68882 Performance" letter
from Roy Druian, Product Marketing Manager of Motorola, as well as Dave 
Bursky's "32-Bit Microprocessors 1987 Technology Forecast" in the Electronic 
Design magazine of Jan. 8, 1987.  Mr. Druian's letter tried to demonstrate
that Motorola's 68030 at 20MHz has twice the performance of 68020 at 16.67MHz.
In Mr. Bursky's survey, Motorola's 68030 is presented as having 3 times the 
performance of 68020.

Using Motorola's own performance data for 68020 and the data available for 
68030 (see bibliography below), I calculated the 68030 performance improvement 
relative to the 68020.  These calculations, presented in detail below, show 
that the performance of 68030 is only 18% better than that of 68020 at the same 
frequency (20MHz) even if we believe the high hit ratios Motorola claims for 
its small (256 byte) internal caches.

Using the data presented by Motorola in [2], the calculated 68030 performance 
is 3.3 MIPS at 20 MHz, while 68020 performance is 2.3 MIPS at 16.67 and 2.8 
MIPS at 20MHz.

The other 68030 problems that Motorola's papers do not stress are:
	-impossible timing for its synchronous bus protocol (very hard and 
	 expensive to run with zero wait-states at 20MHz).
	-virtual (logical) caches (see [1] for problems of virtual caches)
	-lack of hardware cache invalidation for the internal caches, which
	 makes 68030 very hard to use in multiprocessing systems or where
	 external devices (e.g. DMA) may change memory values
	-long context switching time, because it has to save temporary data
	 inside the CPU (problem inherited from 68020)
	-small TLB (Address Translation Cache) for the 68030 MMU (22 entries
	 only)

Actually 68030 has no real MMU, as the TLB misses are handled by the 68030 
Execution Unit and not by the MMU itself, transparent to the execution pipe.
The high price of a TLB miss (because it is handled in microcode, serial to 
the instruction execution), combined with the relatively low TLB hit ratio 
(should be 90 - 95% for the 22 entry TLB and not 98% as Motorola claims) 
makes the use of the on-chip 68030 MMU expensive in terms of performance.
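
To see why the hit ratio matters so much here, the following is a minimal sketch of the miss-rate arithmetic.  The hit ratios are the ones quoted above; the accesses-per-instruction and miss-penalty values are hypothetical placeholders chosen only for illustration, as no such figures are given in this letter.

```python
# TLB miss overhead as a function of hit ratio.
# Hit ratios (0.90, 0.95, 0.98) are the ones discussed in the text.
# ASSUMPTION: accesses_per_instr and miss_penalty_clocks below are
# hypothetical values; the letter does not give them.


def miss_ratio(hit_ratio):
    """Fraction of translations that miss the TLB."""
    return 1.0 - hit_ratio


def relative_miss_rate(hit_ratio, claimed_hit_ratio=0.98):
    """How many times more misses occur than at the claimed hit ratio."""
    return miss_ratio(hit_ratio) / miss_ratio(claimed_hit_ratio)


def tlb_overhead(accesses_per_instr, hit_ratio, miss_penalty_clocks):
    """Average extra clocks per instruction spent handling TLB misses."""
    return accesses_per_instr * miss_ratio(hit_ratio) * miss_penalty_clocks


if __name__ == "__main__":
    for hit in (0.90, 0.95, 0.98):
        # 1.0 access/instruction and a 20 clock penalty are hypothetical.
        extra = tlb_overhead(1.0, hit, 20)
        print(f"hit ratio {hit:.2f}: {relative_miss_rate(hit):.1f}x the misses "
              f"of 98%, {extra:.2f} extra clocks/instruction")
```

Whatever the real penalty is, a 90 - 95% hit ratio produces 2.5 to 5 times as many misses as a 98% hit ratio, so the overhead scales by the same factor.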

	68030 vs. 68020 Comparison

The improvements of 68030 versus 68020 consist of improvement of the Instruction
Cache hit ratio due to larger line size (16 bytes for 68030 vs. 4 bytes for 
68020), addition of a small Data Cache and integration of the MMU on chip.

As for the "internal Harvard architecture", the 68020 also allows the overlap of 
instruction fetches with data operand access [3].  The 68030 left unchanged 
the Execution Unit, except for the addition of some instructions to support
the TLB on chip [4].  As a consequence, the only change in the performance
model of 68030 when compared with 68020 is the improvement in the performance
penalty for addressing instructions and data in memory.

Using Motorola's own data (Table 6 in [2]), the average numbers of operand
memory accesses per instruction for the 68020 are:
	-0.384 reads per instruction
	-0.242 writes per instruction

The Data Cache with 48% hit ratio and the 2 clock bus cycle will improve the 
68030 execution time by:

0.384 * 0.48 * 2 + 0.384 * 0.52 * 1 = 0.567 clocks

where:
	- 0.384 represents the average number of operand reads per instruction,
	- 0.48 represents the Data Cache hit ratio,
	- 2 represents the difference, in clocks, between 68020 going to
	  external memory (no wait states) and 68030 finding data in the
	  internal cache,
	- 0.52 (1-0.48) represents the Data Cache miss ratio,
	- 1 represents the difference, in clocks, between the 3 clock bus
	  cycle of 68020 and the 2 clock bus cycle of 68030.
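
The expression above can be checked with a few lines of Python.  This is only a sketch of the same arithmetic with the inputs named; the constant and function names are mine, not Motorola's.  Evaluated exactly, the expression comes to about 0.568, which differs from the figure quoted above only in the last digit.

```python
# Operand-read savings per instruction, 68030 relative to 68020.
# All inputs are the figures quoted in the letter.
READS_PER_INSTR = 0.384    # operand reads per instruction (Table 6 in [2])
DCACHE_HIT_RATIO = 0.48    # Data Cache hit ratio
CLOCKS_SAVED_ON_HIT = 2    # 68020 external access vs. 68030 internal cache hit
CLOCKS_SAVED_ON_MISS = 1   # 3 clock 68020 bus cycle vs. 2 clock 68030 bus cycle


def read_savings(reads=READS_PER_INSTR, hit=DCACHE_HIT_RATIO,
                 on_hit=CLOCKS_SAVED_ON_HIT, on_miss=CLOCKS_SAVED_ON_MISS):
    """Average clocks saved per instruction on operand reads."""
    return reads * hit * on_hit + reads * (1.0 - hit) * on_miss


if __name__ == "__main__":
    print(f"{read_savings():.3f} clocks/instruction")
```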

The writes have roughly the same influence on performance for 68020 and 68030, 
as the Data Cache is a write-through cache and writes are buffered by the BIU.

The other 68030 improvement is the higher Instruction Cache hit ratio because
of 16 byte line size (the effect of the burst is already taken into account
in the hit ratio of the Instruction Cache and Data Cache).

According to Table 4 in [2], the number of clocks-per-instruction for 68020
drops from 7.159 with a 64% hit ratio instruction cache to 6.373 with a 100%
hit ratio instruction cache (an improvement of 0.786 clocks-per-instruction).
The estimated improvement given to 68030 by its 82% hit ratio instruction 
cache and its 2 clock bus protocol (no wait state) is 0.5
clocks-per-instruction.
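
The letter does not spell out how the 0.5 estimate was obtained.  As a plausibility check only, the sketch below assumes that clocks-per-instruction falls linearly with the hit ratio between the two Table 4 points; that linear model and the function name are my assumptions, not something taken from [2].

```python
# Plausibility check for the instruction-cache estimate.
CPI_AT_64 = 7.159    # 68020 clocks/instruction, 64% hit ratio (Table 4 in [2])
CPI_AT_100 = 6.373   # 68020 clocks/instruction, 100% hit ratio (Table 4 in [2])


def icache_savings_linear(hit, lo_hit=0.64, hi_hit=1.0,
                          lo_cpi=CPI_AT_64, hi_cpi=CPI_AT_100):
    """Clocks/instruction saved by raising the hit ratio from lo_hit to hit.

    ASSUMPTION: the saving grows linearly with the hit ratio between the
    two published points.  This is an illustration, not Motorola's model.
    """
    fraction = (hit - lo_hit) / (hi_hit - lo_hit)
    return fraction * (lo_cpi - hi_cpi)


if __name__ == "__main__":
    from_hit_ratio = icache_savings_linear(0.82)
    print(f"from hit ratio alone: {from_hit_ratio:.3f} clocks/instruction")
    print(f"left for the faster bus: {0.5 - from_hit_ratio:.3f} clocks/instruction")
```

Under that assumption the hit ratio alone accounts for about 0.39 clocks, leaving roughly 0.11 clocks of the 0.5 estimate to the shorter bus cycle on the remaining misses.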

The overall performance improvement due to the architectural improvements
of 68030 relative to 68020 is then:

0.567 + 0.5 = 1.067 clocks/instruction

According to Motorola's own calculations in [2], the average performance of
68020 with the Instruction Cache ON for the workload in [2] is 7.159 clocks/
instruction (Table 4 in [2]) when no wait states are present.  This translates
into 2.3 MIPS at 16.67 MHz and 2.8 MIPS at 20MHz.  

The (7.159 - 1.067) = 6.092 clocks/instruction for 68030 translates into 3.3
MIPS at 20MHz.
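
The conversion from clocks/instruction to MIPS used here is simply the clock rate in MHz divided by the clocks per instruction.  A short sketch, with names of my own choosing:

```python
# Convert clocks/instruction to MIPS: MIPS = clock (MHz) / clocks per instruction.
CPI_68020 = 7.159            # Table 4 in [2], Instruction Cache ON, no wait states
CPI_68030 = 7.159 - 1.067    # 68020 figure less the savings estimated in the letter


def mips(clock_mhz, clocks_per_instruction):
    """Millions of instructions per second at the given clock rate."""
    return clock_mhz / clocks_per_instruction


if __name__ == "__main__":
    for name, mhz, cpi in (("68020", 16.67, CPI_68020),
                           ("68020", 20.0, CPI_68020),
                           ("68030", 20.0, CPI_68030)):
        print(f"{name} at {mhz} MHz: {mips(mhz, cpi):.1f} MIPS")
```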

The relative improvement in performance of 68030 versus 68020 at the same 
frequency is then:

(3.3 - 2.8) * 100/ 2.8 = 18%

If the 16.67 MHz 68020 is compared against the 20 MHz 68030, the performance 
improvement factor is still only:

(3.3 - 2.3) * 100/ 2.3 = 43%
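
Both percentages are computed from MIPS figures already rounded to one decimal.  The sketch below (function names are mine) repeats the calculation and also redoes it from the unrounded clocks/instruction figures, which gives roughly 17.5% at the same clock and roughly 41% for 20 MHz against 16.67 MHz, so the rounding does not change the conclusion.

```python
# Relative improvement of 68030 over 68020, rounded and unrounded inputs.
CPI_68020 = 7.159
CPI_68030 = 7.159 - 1.067


def improvement_percent(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new - old) * 100.0 / old


def mips(clock_mhz, clocks_per_instruction):
    return clock_mhz / clocks_per_instruction


if __name__ == "__main__":
    # Using the rounded MIPS figures from the letter.
    print(f"same clock (rounded):   {improvement_percent(3.3, 2.8):.1f}%")
    print(f"20 vs 16.67 (rounded):  {improvement_percent(3.3, 2.3):.1f}%")
    # Using unrounded MIPS values.
    same = improvement_percent(mips(20.0, CPI_68030), mips(20.0, CPI_68020))
    mixed = improvement_percent(mips(20.0, CPI_68030), mips(16.67, CPI_68020))
    print(f"same clock (unrounded): {same:.1f}%")
    print(f"20 vs 16.67 (unrounded): {mixed:.1f}%")
```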

	68030 Synchronous Bus Timing

Even the 18% architectural improvement of 68030 versus 68020 is questionable
because of the way the 68030 2-cycle synchronous bus protocol is designed.

For a Read bus cycle, the 68030 issues the address at the beginning of the 
first clock cycle and expects the system to return the ready signal (named
STERM) at the end of the same cycle.  Data is sampled by 68030 in the middle 
of the second cycle.  According to the 68030 spec [6], the read bus timing at
20MHz is:

	- Address-to-STERM time = 25 ns
	- Address-to-Data time = 40 ns

For a Write bus cycle, 68030 issues the address at the beginning of the first
clock cycle, data at the beginning of the second clock cycle and expects the
system to return ready (STERM) at the end of the first cycle.  According to 
the 68030 spec, the write bus timing at 20 MHz is :

	- Address-to-STERM time = 25 ns
	- Write Data Valid time = 25 ns

It is hard to avoid wait-states at 20MHz with such timing even when using
the fastest and most expensive static RAMs.  No wonder Motorola is 
unwilling to commit to pushing the 68030 beyond 20MHz; with the 68030 bus
protocol, wait states are unavoidable.

		Sorin Iacobovici
		Computer and System Architecture
		National Semiconductor Corp.
		Santa Clara, Ca.
	
	REFERENCES
	----------

	1. A.J. Smith "Cache Memories", ACM Computing Surveys, vol.14,
		no. 3, September 1982, pp. 473-530
	2. D. MacGregor, J. Rubinstein "A Performance Analysis of 
		MC68020-based Systems", IEEE Micro, Dec. 1985,
		pp. 50-70
	3. D. MacGregor, D. Mothersole and B. Moyer "The Motorola MC68020",
		IEEE Micro, Vol.4, no.4 Aug. 1984, pp. 101-118
	4. J.T.Reinhart "Extra Functions and Higher Speed Push
		Microprocessor to Top", Electronic Products, 
		Oct. 1, 1986, pp.35-39
	5. D. MacGregor "Diverse Applications Put Spotlight on 68020's
		Improvements", Electronic Design, Feb. 7, 1985, 
		pp.155-164
	6. Motorola Inc. "MC68030 -Second Generation 32-Bit Enhanced
		Microprocessor", Technical Data, Motorola Inc., 1986
		pp. 1-27