[net.micro.68k] 68020 Performance Revisited Again

falcone@erlang.DEC (Joe Falcone, HLO2-3/N03, dtn 225-6059) (11/03/84)

68020 Performance Revisited Again

I am delighted to get responses from both Xerox PARC and Motorola.
As was pointed out, I am concentrating on the use of the 68020 in
networked virtual-memory workstations, so this influences my analysis.

Peter Deutsch's argument about data cache is well taken, except that
his use of the 180ns figure is wrong.  The address-valid to data-valid
window is 120ns.  Given a design with parts tolerance for mass production,
you would have to design your memory management unit, virtual address
translation, and data cache for a ~100ns window.  Given that all of this
is done OFF chip, the design would have to be very aggressive and the
result would have adverse price implications because of the high-speed
logic and memory involved.

Doug MacGregor pointed out that faster 68020 processors will follow, but these
will only exacerbate the situation.  The 24MHz part will have an 80ns window
for memory access, which will require an extremely aggressive design for
the memory management unit, virtual address translation, and data cache.
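
To put rough numbers on how that window tightens, here is a back-of-the-envelope
C sketch (my own arithmetic, not a datasheet figure): it simply scales the 120ns
window inversely with clock frequency and applies a ~17% parts-tolerance derating
(an assumed figure) to get the kind of design target mentioned above.

/* Illustrative only: scale the 16MHz 68020's 120ns address-valid to
 * data-valid window with clock frequency, and derate by an assumed
 * ~17% parts-tolerance margin.  The 16MHz line reproduces the ~100ns
 * design target above; the 24MHz line gives the 80ns window.
 */
#include <stdio.h>

int main(void)
{
    double base_mhz = 16.0;            /* 120ns window quoted at 16MHz */
    double base_window_ns = 120.0;
    double margin = 0.17;              /* assumed parts-tolerance derating */
    double mhz[] = { 16.0, 20.0, 24.0 };
    int i;

    for (i = 0; i < 3; i++) {
        double window = base_window_ns * base_mhz / mhz[i];
        printf("%2.0fMHz 68020: %5.1fns window, design for ~%5.1fns\n",
               mhz[i], window, window * (1.0 - margin));
    }
    return 0;
}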

The difference between the performance tables MacGregor and I generated
comes down to perspective: we used different bases and methods to
calculate our respective tables.

There were three points where I used conservative figures to emphasize
the ballpark nature of the comparison, as I was more interested in seeing
what league these devices were in with respect to the VAX line.

------------------------------------------------------------
                      68K MEMORY SPEED
CPU               100ns          200ns          300ns
------------------------------------------------------------
8MHz   68000      0.6 (1x)       0.6 (1x)       0.6 (1x)
16MHz  68020*     2.1 (3.5x)     1.5 (2.5x)     1.3 (2.2x)
16MHz  68020**    2.7 (4.5x)     2.3 (3.7x)     2.0 (3.3x)
------------------------------------------------------------
(figures are MIPS, with speed relative to the 8MHz 68000 in parentheses)
*  I-cache disabled
** 100% I-cache hit ratio
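
Real designs will land somewhere between the two starred rows.  As a purely
illustrative model (my own, not Motorola's or MacGregor's), the little C
program below blends the two 200ns-column figures linearly in time per
instruction for intermediate I-cache hit ratios.

/* Illustrative only: estimate effective MIPS for an intermediate
 * I-cache hit ratio by blending time-per-instruction between the
 * "I-cache disabled" and "100% hit" rows above (16MHz 68020, 200ns
 * memory).  The interpolation model is an assumption of mine.
 */
#include <stdio.h>

int main(void)
{
    double mips_miss = 1.5;   /* I-cache disabled, 200ns column */
    double mips_hit  = 2.3;   /* 100% I-cache hit, 200ns column */
    int pct;

    for (pct = 50; pct <= 100; pct += 10) {
        double h = pct / 100.0;
        double mips = 1.0 / (h / mips_hit + (1.0 - h) / mips_miss);
        printf("%3d%% hit ratio -> ~%.2f MIPS\n", pct, mips);
    }
    return 0;
}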

1. I derived my table by dividing each figure above by 3 for the upper
   bound and 5 for the lower bound, since I had decided on 3 to 5
   as scale factors between the 68000 and the 11/780.  These scale
   factors are extremely generous, judging by the performance I have seen
   from 8MHz 68000 and 11/780 systems running benchmarks.

2. Again, I must apologize for generosity. I used 0.7 MIPS instead
   of 0.6 MIPS for the 8MHz 68000, which is why I have 0.14-0.23 
   "VAX MIPS" (0.7 divided by 3 to 5).  I did this because 0.6 MIPS
   seemed on the low side of my personal observations of 68K systems.
   This scaling gives MIPS figures normalized to one 780 MIPS.

3. Because there is some dispute over 780 performance even within Digital,
   I felt obliged to loosen the 780 figures to reflect the ranges reported.
   So I set the 780 at 0.7 to 1 MIPS in the comparison, downrating it by
   30% to achieve the low figure.

MacGregor generated his figures by starting with the VAX figures, deriving
the 68000 numbers by dividing by the scale factors, and then using the
68000 numbers to derive the 68020 figures with the multipliers above.

So MacGregor and I computed our figures from different bases.  This is
a common practice in performance analysis, and it is exactly what prevents
sane comparison of one company's benchmarks with another's.
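
To make the two derivations concrete, here is a small C sketch of them for
one configuration (16MHz 68020, I-cache disabled, 100ns memory, i.e. the
3.5x entry above).  The conciliatory table below folds in the additional
judgment calls noted above (the 0.7 MIPS base, the 0.7-1.0 MIPS 780 range),
so it will not match these numbers digit for digit.

/* Sketch of the two derivations for one table entry.  Not the exact
 * arithmetic behind the published tables - just the two starting points.
 */
#include <stdio.h>

int main(void)
{
    double scale_lo = 3.0, scale_hi = 5.0;  /* 68000 : 11/780 scale factors  */
    double rel_68020 = 3.5;                 /* 68020 speed relative to 68000 */
    double m68k_mips = 0.7;                 /* my 8MHz 68000 MIPS figure     */

    /* Falcone: start from observed 68K MIPS and divide down by the scale */
    printf("Falcone:   68000 %.2f-%.2f, 68020 %.2f-%.2f VAX MIPS\n",
           m68k_mips / scale_hi, m68k_mips / scale_lo,
           rel_68020 * m68k_mips / scale_hi, rel_68020 * m68k_mips / scale_lo);

    /* MacGregor: start from the 780 at 1.0 "VAX MIPS", divide down to the
     * 68000, then apply the 68020 multiplier from the table above         */
    printf("MacGregor: 68000 %.2f-%.2f, 68020 %.2f-%.2f VAX MIPS\n",
           1.0 / scale_hi, 1.0 / scale_lo,
           rel_68020 / scale_hi, rel_68020 / scale_lo);
    return 0;
}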

The truth of the matter is probably somewhere in the range between my
figures and MacGregor's, so I offer the following conciliatory table.

                              "VAX MIPS"
---------------------------------------------------------
     68K MEMORY	   100ns	  200 ns	  300ns
CPU  ----------------------------------------------------
8MHz    68000    0.14-0.25      0.14-0.25       0.14-0.25
16MHz   68020*   0.42-0.88      0.30-0.63       0.24-0.55
16MHz   68020**  0.56-1.13      0.46-0.93       0.40-0.83
---------------------------------------------------------
VAX-11/780	    -->		0.7-1.0		<--
VAX-11/785	    -->		1.0-1.5		<--
VAX 8600	    -->		3.4-4.2		<--
---------------------------------------------------------
*  I-cache disabled
** 100% I-cache hit ratio

My initial motivation was to put an end to the practice of comparing
microprocessor chips to 7-year-old fully-configured, virtual-memory,
multi-user computer systems.  I firmly believe that the data in the
table above can be used as ballpark figures for systems built around
the 68020, but one must remain cautious of the hooks.  It is one thing
to talk about 80-100ns virtual memory management and cache; it is quite
another thing to build it.

Joe Falcone
Eastern Research Laboratory		decwrl!
Digital Equipment Corporation		decvax!deccra!jrf
Hudson, Massachusetts			tardis!

rpw3@redwood.UUCP (Rob Warnock) (11/05/84)

+---------------
| The truth of the matter is probably somewhere in the range between my
| figures and MacGregor's...
+---------------

That may be good theory, and I agree with most of your comments about
cache design (see some long stuff I posted some months ago), but my actual
experience with "real" UNIX tasks (cc, nroff, grep, vi, mail, news, etc.)
runs counter to even your "conciliatory" numbers:

+---------------
| 68K MEMORY	   100ns	  200 ns	  300ns
| 8MHz    68000    0.14-0.25      0.14-0.25       0.14-0.25
+---------------
--->  5.5MHz 68000 (Fortune 32:16)       265ns          ~0.5

My experience with the Fortune Systems 32:16, which runs a 5.5MHz clock
(no wait states) with 200ns 64K chips (plus 65ns in each cycle for ECC
that isn't used, so call them 265ns chips), is that on every CPU-intensive
benchmark I tried that did not involve (significant) floating-point, the
5.5MHz 68k ran almost EXACTLY 0.5 * VAX-11/780 speed (single-user in both
cases).  (Note that the compiler used on the 68k treats "int" == "long" ==
32 bits, as does the VAX.)

On disk intensive tasks, the speeds were very nearly the ratio of the
random access times of the drives involved. For certain tasks for which
the 68k software had been carefully tuned (e.g., tty output), it actually
outperformed the VAX (though making the same changes to the VAX/4.1bsd
kernel would surely wipe out the discrepancy).  On mixed tasks, it did
somewhat better than linear interpolation would predict (but this is
to be expected since there is a non-linear soft transition from disk-bound
to CPU-bound).
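
For reference, the "linear interpolation" I mean is nothing fancier than
this little C sketch; the 0.8 disk-bound ratio is just an assumed placeholder
for the ratio of the drives' random access times, not a measured number.

/* Blend the measured CPU-bound speed ratio with an assumed disk-bound
 * ratio by the fraction of the job that is disk-bound.  Real mixed jobs
 * overlap CPU with I/O, which is why they beat this prediction.
 */
#include <stdio.h>

int main(void)
{
    double cpu_ratio  = 0.5;   /* 5.5MHz 68k vs. 780, CPU-bound (measured) */
    double disk_ratio = 0.8;   /* assumed ratio of drive random access times */
    int pct;

    for (pct = 0; pct <= 100; pct += 25) {
        double f = pct / 100.0;
        printf("%3d%% disk-bound -> predicted ratio %.2f of a 780\n",
               pct, f * disk_ratio + (1.0 - f) * cpu_ratio);
    }
    return 0;
}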

+---------------
| My initial motivation was to put an end to the practice of comparing
| microprocessor chips to 7-year-old fully-configured, virtual-memory,
| multi-user computer systems. 
+---------------

I agree, wholeheartedly! But when you say...

+---------------
| ...I firmly believe that the data in the
| table above can be used as ballpark figures for systems built around
| the 68020, but one must remain cautious of the hooks.
| 
| Joe Falcone
+---------------

Sorry, your table isn't even close to what my stopwatch says about Fortune,
CT Miniframe (10MHz, no wait states), Callan, and others. [Hint: I think
you may have been misled by basing your VAX numbers on the theoretical
performance of the 200ns SBI -- isn't it true that a 780 processor can't
keep the SBI busy?  Also, you need to allow for bias in comparing UNIX to VMS.]

As far as my experience has led me to conclude, unless the designers screw
up in the UNIX port, the memory management, or the disk subsystem, MY
"ballpark figure" is that a straightforward 68000 system at 10MHz (with
no wait states) closely equals a VAX-11/780 in UNIX system performance with
(say) 5-25 users doing "typical UNIX" things (with the "same" lineage UNIX).

On the other hand, whipping up a blazing 68020 system's not so easy, either.
I firmly agree that getting a 20MHz 68020 to do 4 * VAX (factor of two in
clock over 10MHz times ~1.5 for instruction cache times ~1.5 for 32-bit bus)
is NOT going to be easy, and maybe not even economical!
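
That 4x is just the product of those rough factors (2 x 1.5 x 1.5 = 4.5,
call it 4), as in this trivial sketch; the 1.5 figures are the ballpark
gains above, not measurements.

/* Stack up the rough speedup factors for a 20MHz 68020 against a 10MHz
 * 68000 taken as roughly one 780.  Nothing more precise is claimed.
 */
#include <stdio.h>

int main(void)
{
    double clock  = 20.0 / 10.0;  /* 20MHz 68020 over a 10MHz 68000 */
    double icache = 1.5;          /* ~1.5x for the on-chip I-cache  */
    double bus32  = 1.5;          /* ~1.5x for the full 32-bit bus  */

    printf("ideal stack-up: %.1fx a 10MHz 68000, i.e. roughly 4x a 780\n",
           clock * icache * bus32);
    return 0;
}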

But a SYSTEM designer might settle for 2 to 2.5 times VAX, and win big in
price/performance. (E.g., DON'T use a cache, but just use the fastest 256K chips
you can get, interleaved to save power while using overlapped RAS/MMU-decode.
Use multi-processors if you still need more horses in the box.)

Summary:
	1. I agree with your general style of analysis, ...
	2. ...but I think your "baseline" is still WAY off what I have seen.
	3. "Incidental" issues like disk I/O and tty drivers can make a
	   FAR greater difference on user-perceived system performance --
	   be careful about too much fine-tuning of the CPU/memory.
	4. I am only talking about "typical UNIX" apps, not F.P. crunching.

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd}!fortune!redwood!rpw3
DDD:	(415)572-2607
Envoy:	rob.warnock/kingfisher
USPS:	510 Trinidad Ln, Foster City, CA  94404

falcone@erlang.DEC (Joe Falcone, HLO2-3/N03, dtn 225-6059) (11/06/84)

Response to redwood!rpw3

1. Your idea of cpu-intensive UNIX benchmarks sure is strange;
   Gosh, I always thought there was a fairly large I/O component to
   cc, nroff, grep, vi, mail, news, etc.  Benchmarks with significant
   I/O components measure your bus and disks, not your processor.
   And since you can put high-performance disks on most microprocessor
   systems these days, it is not surprising that your figures came out so high.

2. My experience with cpu-bound tasks (with little or no I/O and running 
   essentially core-resident) on the 8MHz HP Series 200 and the 
   10MHz SMI 2-170 is that the VAX-11/780 is anywhere from 2 to 10 times
   faster, and most tests fall between 3 and 5 times faster given comparable
   code quality.

3. Just as you have been able to find a benchmark which runs faster on
   the 68K, I have a benchmark which ran 100 times faster on the 780 and
   did not use floating-point - these benchmarks are meaningless because
   they don't measure the machinery; they measure compiler and OS
   quality.  It is very difficult to measure the real beast in these machines.

4. Your comment about the 200ns SBI is ludicrous - the 780 has a large
   cache and the SBI handles 64-bit packets, so there is no way that
   the SBI is kept busy - that is by design to allow enough bandwidth
   for other devices to do their stuff.  One of the faults of the 68K
   family is the tendency to use up nearly all available bus bandwidth
   for instruction execution, leaving very little for I/O and coprocessors.

5. I've spent 8 years working with UNIX systems.  I have yet to see
   a machine run 4.2 better than the 780 does (soon to change with the
   advent of the VAX 8600).  If you do want to get into UNIX vs. VMS
   operating system comparisons, VMS does have significantly better compilers
   and quicker I/O so a lot of benchmarks run faster on it.  While on this
   subject, no one has yet run benchmarks on the 68020 with a compiler
   which uses the extended instruction set, so this should add a few percent
   to 68020 performance.

6. I would suggest that you read the article on the 68020 in IEEE Micro.
   If you had read it, you would not have so ridiculously over-simplified the
   performance implications of clock, bus, and cache.  No, doubling the
   clock does not double performance (a rough sketch of why follows this
   list).  Sorry, you don't hit your instruction cache 100% of the time,
   so you'll have to wait around a bit more.
   Too bad, your 32-bit bus saturates just as quickly as on the 68010
   (because of more 32-bit operands and the doubled clock speed fetching them).
   And multi-processors?  With 70-90% of bus bandwidth gone, you had better
   have some really bright ideas on how to get out of this one, Ollie.
   Yes, it is going to take interleave, data cache, a blazing MMU, and all
   of the things we have come to expect from mainframes - but this all comes
   at a bigger price tag.

7. I've based my figures on 5 years of experience with VAX and 68K systems
   (most of it not at Digital) - I'm reporting on what I've seen and what
   I think you can expect from the 68020 - at best you can expect 780 
   performance given comparable compilers on CPU-intensive benchmarks.
   And that ain't too bad, if you ask me.  
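
Here is the rough sketch promised in point 6 - a crude Amdahl-style model
(my own simplification, not the analysis in the IEEE Micro article) of why
doubling the clock buys well under 2x when you miss the I-cache.

/* Only the fraction of the work that hits the on-chip I-cache speeds up
 * with the clock; the misses are still held to main-memory speed.
 */
#include <stdio.h>

int main(void)
{
    double clock_ratio = 2.0;   /* a hypothetical part at double the clock */
    int pct;

    for (pct = 60; pct <= 100; pct += 10) {
        double h = pct / 100.0;
        printf("%3d%% I-cache hits -> %.2fx effective speedup\n",
               pct, 1.0 / (h / clock_ratio + (1.0 - h)));
    }
    return 0;
}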

all in the opinion of...

Joe Falcone
Eastern Research Laboratory			decwrl!
Digital Equipment Corporation			decvax!deccra!jrf
Hudson, Massachusetts				tardis!

guy@rlgvax.UUCP (Guy Harris) (11/13/84)

> 1. Your idea of cpu-intensive UNIX benchmarks sure is strange;
>    Gosh, I always thought there was a fairly large I/O component to
>    cc, nroff, grep, vi, mail, news, etc.

Ever timed "cc" or "nroff"?  *VERY* CPU-intensive - at least the versions
we've got here on our 780.  One "make" rebuilding the kernel takes up
between 60 and 90% of an 11/780.

Also, note he only referred to the aforementioned as '"real" UNIX tasks',
not "cpu-intensive UNIX benchmarks."  He referred both to CPU-intensive
and disk-intensive tasks.

> 5. I've spent 8 years working with UNIX systems.  I have yet to see
>    a machine run 4.2 better than the 780 does (soon to change with the
>    advent of the VAX 8600).

Working for a competitor who *has* a machine that runs 4.2 better than the
780 does (unless you're beating the terminals to death - our terminal mux is,
shall we say, sub-optimal), I'm a little biased here, but there do exist
superminis out there that are faster than an 11/780.  Are you willing to make
that claim about the Power 6/32, *and* the Pyramid 90x, *and* the
top-of-the-line Gould (maybe the MV/10000, too)?  (While we're at it, how about
the 11/785?  If it isn't any improvement over the 11/780 running 4.2, *somebody*
screwed up...)  (Anybody put 4.2 up on some big IBM/Amdahl/... iron? For
terminal I/O, I dunno, but I bet it's pretty good on CPU-intensive or
disk-intensive jobs.)  If you mean you've never seen any *micro* out there run
4.2 better than the 11/780, maybe.

I agree that statements of the "wow, this supermicro is faster than a
<fill in your favorite mini>!" ilk are to be taken with a grain of salt -
we had a supermicro in house whose manufacturer boasted that it was as fast
as an 11/70.  We decided, after working some with it, that it was no doubt
true, under certain circumstances.  If you dropped it off a building, it would
fall as fast as an 11/70 (modulo air drag).

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy