falcone@erlang.DEC (Joe Falcone, HLO2-3/N03, dtn 225-6059) (11/03/84)
CC: 68020 Performance Revisited Again

I am delighted to get responses from both Xerox PARC and Motorola.  As was
pointed out, I am concentrating on the use of the 68020 in networked
virtual-memory workstations, so this influences my analysis.

Peter Deutsch's argument about data cache is well taken, except that his use
of the 180ns figure is wrong.  The address-valid to data-valid window is
120ns.  Given a design with parts tolerance for mass production, you would
have to design your memory management unit, virtual address translation, and
data cache for a ~100ns window.  Given that all of this is done OFF chip, the
design would have to be very aggressive, and the result would have adverse
price implications because of the high-speed logic and memory involved.

Doug MacGregor pointed out that faster 68020 processors will follow, but
these will only exacerbate the situation.  The 24MHz part will have an 80ns
window for memory access, which will require an extremely aggressive design
for the memory management unit, virtual address translation, and data cache.

The difference between the performance tables generated by MacGregor and
myself can be explained by our perspectives.  MacGregor and I used different
bases and methods to calculate our respective tables.  There were three
points where I used conservative figures to emphasize the ballpark nature of
the comparison, as I was more interested in seeing what league these devices
were in with respect to the VAX line.

   ------------------------------------------------------------
                            68K MEMORY
                     100ns        200ns        300ns
   CPU
   ------------------------------------------------------------
   8MHz  68000       0.6 (1x)     0.6 (1x)     0.6 (1x)
   16MHz 68020*      2.1 (3.5x)   1.5 (2.5x)   1.3 (2.2x)
   16MHz 68020**     2.7 (4.5x)   2.3 (3.7x)   2.0 (3.3x)
   ------------------------------------------------------------
    * I-cache disabled
   ** 100% I-cache hit ratio

1.  I derived my table by dividing each figure above by 3 for the upper
    bound and 5 for the lower bound, since I had decided on 3 to 5 as the
    scale factors between the 68000 and the 11/780.  These scale factors
    are extremely generous, based on what I have seen of 8MHz 68000 and
    11/780 systems running benchmarks.

2.  Again, I must apologize for generosity.  I used 0.7 MIPS instead of 0.6
    MIPS for the 8MHz 68000, which is why I have 0.14-0.23 "VAX MIPS" (0.7
    divided by 3 to 5).  I did this because 0.6 MIPS seemed on the low side
    of my personal observations of 68K systems.  This scaling gives MIPS
    figures normalized to one 780 MIPS.

3.  Because there is some dispute over 780 performance even within Digital,
    I felt obliged to loosen the 780 figures to reflect the ranges reported.
    So I set the 780 at 0.7 to 1 MIPS in the comparison, downrating it by
    30% to reach the low figure.

MacGregor generated his figures by starting with the VAX figures, deriving
the 68000 numbers by dividing by the scale factors, and then using the 68000
numbers to derive the 68020 figures with the multipliers above.  So MacGregor
and I computed our figures from different bases.  This is a common practice
in performance analysis, and it is exactly what prevents sane comparisons of
one company's benchmarks with another's.  The truth of the matter probably
lies somewhere between my figures and MacGregor's, so I offer the following
conciliatory table.
"VAX MIPS" --------------------------------------------------------- 68K MEMORY 100ns 200 ns 300ns CPU ---------------------------------------------------- 8MHz 68000 0.14-0.25 0.14-0.25 0.14-0.25 16MHz 68020* 0.42-0.88 0.30-0.63 0.24-0.55 16MHz 68020** 0.56-1.13 0.46-0.93 0.40-0.83 --------------------------------------------------------- VAX-11/780 --> 0.7-1.0 <-- VAX-11/785 --> 1.0-1.5 <-- VAX 8600 --> 3.4-4.2 <-- --------------------------------------------------------- * I-cache disabled ** 100% I-cache hit ratio My initial motivation was to put an end to the practice of comparing microprocessor chips to 7-year-old fully-configured, virtual-memory, multi-user computer systems. I firmly believe that the data in the table above can be used as ballpark figures for systems built around the 68020, but one must remain cautious of the hooks. It is one thing to talk about 80-100ns virtual memory management and cache, it is quite another thing to build it. Joe Falcone Eastern Research Laboratory decwrl! Digital Equipment Corporation decvax!deccra!jrf Hudson, Massachusetts tardis!
rpw3@redwood.UUCP (Rob Warnock) (11/05/84)
+---------------
| The truth of the matter is probably somewhere in the range between my
| figures and MacGregor's...
+---------------

That may be good theory, and I agree with most of your comments about cache
design (see some long stuff I posted some months ago), but my actual
experience with "real" UNIX tasks (cc, nroff, grep, vi, mail, news, etc.)
runs counter to even your "conciliatory" numbers:

+---------------
| 68K MEMORY      100ns        200ns        300ns
| 8MHz 68000      0.14-0.25    0.14-0.25    0.14-0.25
+---------------
  ---> 32:16 265ns --->        5.5MHz 68k ~0.5

My experience with the Fortune Systems 32:16, which runs a 5.5MHz clock (no
wait states) with 200ns 64K chips (including time in the cycle for 65ns of
ECC that's not used, so call them 265ns chips), is that on every
CPU-intensive benchmark I tried that did not involve (significant)
floating-point, the 5.5MHz 68k ran almost EXACTLY 0.5 * VAX-11/780 speed
(single-user, in both cases).  (Note that the compiler used on the 68k
treats "int" == "long" == 32 bits, as does the VAX.)

On disk-intensive tasks, the speeds were very nearly the ratio of the random
access times of the drives involved.  For certain tasks for which the 68k
software had been carefully tuned (e.g., tty output), it actually
outperformed the VAX (though making the same changes to the VAX/4.1bsd
kernel would surely wipe out the discrepancy).  On mixed tasks, it did
somewhat better than linear interpolation would predict (but this is to be
expected, since there is a non-linear soft transition from disk-bound to
CPU-bound).

+---------------
| My initial motivation was to put an end to the practice of comparing
| microprocessor chips to 7-year-old fully-configured, virtual-memory,
| multi-user computer systems.
+---------------

I agree, wholeheartedly!  But when you say...

+---------------
| ...I firmly believe that the data in the
| table above can be used as ballpark figures for systems built around
| the 68020, but one must remain cautious of the hooks.
|
| Joe Falcone
+---------------

Sorry, your table isn't even close to what my stopwatch says about Fortune,
CT Miniframe (10MHz, no wait states), Callan, and others.  [Hint: I think
you may have been misled by basing your VAX numbers on the theoretical
performance of the 200ns SBI -- isn't it true that a 780 processor can't
keep the SBI busy?  Also, you need to allow for bias in comparing UNIX to
VMS.]

As far as my experience has led me to conclude so far, unless the designers
screw up in the UNIX port or in the memory management or in the disk
subsystem, MY "ballpark figure" is that a straightforward 68000 system at
10MHz (with no wait states) closely equals a VAX-11/780 in UNIX system
performance with (say) 5-25 users doing "typical UNIX" things (with the
"same" lineage UNIX).

On the other hand, whipping up a blazing 68020 system's not so easy, either.
I firmly agree that getting a 20MHz 68020 to do 4 * VAX (factor of two in
clock over 10MHz, times ~1.5 for instruction cache, times ~1.5 for 32-bit
bus) is NOT going to be easy, and maybe not even economical!  But a SYSTEM
designer might settle for 2 to 2.5 times VAX, and win big in
price/performance.  (E.g., DON'T use a cache, but just use the fastest 256K
chips you can get, interleaved to save power while using overlapped
RAS/MMU-decode.  Use multi-processors if you still need more horses in the
box.)

Summary:

1. I agree with your general style of analysis, ...
2. ...but I think your "baseline" is still WAY off what I have seen.
3. "Incidental" issues like disk I/O and tty drivers can make a FAR greater
   difference in user-perceived system performance -- be careful about too
   much fine-tuning of the CPU/memory.
4. I am only talking about "typical UNIX" apps, not F.P. crunching.

Rob Warnock
UUCP:  {ihnp4,ucbvax!amd}!fortune!redwood!rpw3
DDD:   (415)572-2607
Envoy: rob.warnock/kingfisher
USPS:  510 Trinidad Ln, Foster City, CA 94404
falcone@erlang.DEC (Joe Falcone, HLO2-3/N03, dtn 225-6059) (11/06/84)
CC: Response to redwood!rpw3

1.  Your idea of CPU-intensive UNIX benchmarks sure is strange; gosh, I
    always thought there was a fairly large I/O component to cc, nroff,
    grep, vi, mail, news, etc.  Benchmarks with significant I/O components
    measure your bus and disks, not your processor.  And since you can put
    high-performance disks on most microprocessors these days, it is not
    surprising that your figures came out so high.

2.  My experience with CPU-bound tasks (with little or no I/O and running
    essentially core-resident) on the 8MHz HP Series 200 and the 10MHz SMI
    2-170 is that the VAX-11/780 is anywhere from 2 to 10 times faster, and
    most tests fall between 3 and 5 times faster given comparable code
    quality.

3.  Just as you have been able to find a benchmark which runs faster on the
    68K, I have a benchmark which ran 100 times faster on the 780 and did
    not use floating-point.  These benchmarks are meaningless because they
    don't measure the machinery; they measure compiler and OS quality.  It
    is very difficult to measure the real beast in these machines.

4.  Your comment about the 200ns SBI is ludicrous.  The 780 has a large
    cache and the SBI handles 64-bit packets, so there is no way that the
    SBI is kept busy - that is by design, to leave enough bandwidth for
    other devices to do their stuff.  One of the faults of the 68K family
    is the tendency to use up nearly all available bus bandwidth for
    instruction execution, leaving very little for I/O and coprocessors.

5.  I've spent 8 years working with UNIX systems.  I have yet to see a
    machine run 4.2 better than the 780 does (soon to change with the
    advent of the VAX 8600).  If you do want to get into UNIX vs. VMS
    operating system comparisons, VMS does have significantly better
    compilers and quicker I/O, so a lot of benchmarks run faster on it.
    While on this subject, no one has yet run benchmarks on the 68020 with
    a compiler which uses the extended instruction set, so that should add
    a few percent to 68020 performance.

6.  I would suggest that you read the article on the 68020 in IEEE Micro.
    If you had, you would not have so ridiculously over-simplified the
    performance implications of clock, bus, and cache.  No, doubling the
    clock does not double performance.  Sorry, you don't hit your
    instruction cache 100% of the time, so you'll have to wait around a bit
    more.  Too bad, your 32-bit bus saturates just as quickly as on the
    68010 (because of more 32-bit operands and the doubled clock speed
    fetching them).  And multi-processors?  With 70-90% of bus bandwidth
    gone, you had better have some really bright ideas on how to get out of
    this one, Ollie.  Yes, it is going to take interleave, data cache, a
    blazing MMU, and all of the things we have come to expect from
    mainframes - but it all comes at a bigger price tag.

7.  I've based my figures on 5 years of experience with VAX and 68K systems
    (most of it not at Digital).  I'm reporting on what I've seen and what
    I think you can expect from the 68020: at best you can expect 780
    performance given comparable compilers on CPU-intensive benchmarks.
    And that ain't too bad, if you ask me.

all in the opinion of...

Joe Falcone
Eastern Research Laboratory              decwrl!
Digital Equipment Corporation            decvax!deccra!jrf
Hudson, Massachusetts                    tardis!
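[For reference, a sketch of the bus-timing arithmetic that appears to lie
behind the 180ns/120ns/80ns figures in the first post of this exchange and
the clock-scaling caution in point 6 above.  The assumptions: a minimum
68020 bus cycle of three clock periods, with the address-valid to data-valid
window spanning roughly two of them; the ~20ns taken off is the
parts-tolerance margin mentioned in the first post, and the 24MHz design
point is extrapolated on the same assumption rather than quoted from any
post.]

    /*
     * Sketch (not from the original posts): bus cycle and memory window
     * as a function of clock rate, under the assumptions above.
     */
    #include <stdio.h>

    static void window(double mhz)
    {
        double clk = 1000.0 / mhz;          /* clock period in ns */
        printf("%5.1f MHz: bus cycle %3.0f ns, addr->data window %3.0f ns, "
               "design for ~%3.0f ns\n",
               mhz, 3.0 * clk, 2.0 * clk, 2.0 * clk - 20.0);
    }

    int main(void)
    {
        window(16.67);   /* ~180ns cycle, ~120ns window, ~100ns design point */
        window(24.0);    /* ~125ns cycle,  ~83ns window (the "80ns" figure)  */
        return 0;
    }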
guy@rlgvax.UUCP (Guy Harris) (11/13/84)
> 1.  Your idea of CPU-intensive UNIX benchmarks sure is strange; gosh, I
> always thought there was a fairly large I/O component to cc, nroff, grep,
> vi, mail, news, etc.

Ever timed "cc" or "nroff"?  *VERY* CPU-intensive - at least the versions
we've got here on our 780.  One "make" rebuilding the kernel takes up
between 60 and 90% of an 11/780.  Also, note he only referred to the
aforementioned as '"real" UNIX tasks', not "CPU-intensive UNIX benchmarks."
He referred both to CPU-intensive and disk-intensive tasks.

> 5.  I've spent 8 years working with UNIX systems.  I have yet to see a
> machine run 4.2 better than the 780 does (soon to change with the advent
> of the VAX 8600).

Working for a competitor who *has* a machine that runs 4.2 better than the
780 does (unless you're beating the terminals to death - our terminal mux
is, shall we say, sub-optimal), I'm a little biased here, but there do exist
superminis out there that are faster than an 11/780.  Are you willing to
make that claim about the Power 6/32, *and* the Pyramid 90x, *and* the
top-of-the-line Gould (maybe the MV/10000, too)?  (While we're at it, how
about the 11/785?  If it isn't any improvement over the 11/780 running 4.2,
*somebody* screwed up...)  (Anybody put 4.2 up on some big IBM/Amdahl/...
iron?  For terminal I/O, I dunno, but I bet it's pretty good on
CPU-intensive or disk-intensive jobs.)

If you mean you've never seen any *micro* out there run 4.2 better than the
11/780, maybe.  I agree that statements of the "wow, this supermicro is
faster than a <fill in your favorite mini>!" ilk are to be taken with a
grain of salt - we had a supermicro in house whose manufacturer boasted that
it was as fast as an 11/70.  We decided, after working some with it, that it
was no doubt true, under certain circumstances.  If you dropped it off a
building, it would fall as fast as an 11/70 (modulo air drag).

Guy Harris
{seismo,ihnp4,allegra}!rlgvax!guy
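[As a footnote to the "ever timed cc or nroff?" question, a rough sketch
(not from any of the posts) of how one might separate the CPU component of a
command from its I/O component: run it and compare the CPU time charged to
it and its children against the elapsed wall-clock time.  The stock time(1)
command reports the same breakdown; the program below just spells the idea
out.]

    /*
     * Sketch: time a command and report how much of its wall-clock life
     * was spent on the CPU (by it and its children).  A cpu/real ratio
     * near 100% means CPU-bound; a low ratio means it was mostly waiting
     * on I/O.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/times.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        struct tms t0, t1;
        clock_t    w0, w1;
        long       hz = sysconf(_SC_CLK_TCK);
        double     real, cpu;
        pid_t      pid;

        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }

        w0 = times(&t0);
        if ((pid = fork()) == 0) {         /* child: run the command */
            execvp(argv[1], &argv[1]);
            perror(argv[1]);
            _exit(127);
        }
        waitpid(pid, NULL, 0);             /* parent: wait for it to finish */
        w1 = times(&t1);

        real = (double)(w1 - w0) / hz;     /* elapsed wall-clock seconds */
        cpu  = (double)((t1.tms_cutime - t0.tms_cutime) +      /* child user */
                        (t1.tms_cstime - t0.tms_cstime)) / hz; /* child sys  */

        printf("real %.2fs  cpu %.2fs  cpu/real %.0f%%\n",
               real, cpu, real > 0 ? 100.0 * cpu / real : 0.0);
        return 0;
    }

[Run it as, say, "./cputime cc -c bigfile.c" (names hypothetical); a kernel
"make" pegging 60-90% of a 780 corresponds to exactly this kind of ratio.]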