grenley@nsc.nsc.com (George Grenley) (05/07/87)
Well, discussion on the '532 seems to have died down a bit, so I guess it's time to stir things up.

The 532 will run at 30 MHz (maybe faster), and many instructions execute in two clocks. This led some eager magazine types to claim "15 MIP Performance". I guess in the wake of the hoopla about AMD's new chip, it's understandable the editors got a little excited.

So, here are some simulated facts: Our design team has done simulations of the chip's performance, both with ideal 0 wait state memory and with "real world" typical VME bus memory. We ran some unix utilities, including our own compilers, etc. I will divulge a few of these numbers now, and more later. (If I don't get burned for this):

Grep ran at 8.4 mips from 0 ws memory, 7.9 from VME. Grep was one of the best. One of the worst was our assembler, it hit 5 mips from 0 ws, and 4.5 mips from VME. On the average (these two plus several other CPU intensive programs) the '532 hit 6.1 mips from 0 ws, 5.3 mips from VME.

So, here's the deal. I invite Mot, Intel, and other interested parties to work with me in defining some sort of realistic benchmark, which we'll run (in public). I expect to have system level hardware late this year, so if we get started now, we'll have very interesting Xmas presents...

May the best CPU win! (Not that having the best CPU is a requirement for having the most design wins - just ask Intel).

Regards,
George Grenley
Manager, '532 systems development (or something like that)

Disclaimer: I work for NSC. I used to work for Intel, selling 8086s. Before that, I sold 68000s for Mostek. I would rather drive steam trains.
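The arithmetic behind the "15 MIP" headline and the simulated figures in the post above can be sketched in a few lines of Python. This is only an illustration: it assumes "mips" means millions of native '532 instructions per second at the quoted 30 MHz clock, which the post does not state, and the function names are made up for the example.

```python
# Figures taken from the post: 30 MHz clock, best case of 2 clocks per instruction.
CLOCK_MHZ = 30.0

def peak_mips(clock_mhz, clocks_per_instruction):
    """Upper bound if every instruction took the same number of clocks."""
    return clock_mhz / clocks_per_instruction

def effective_cpi(clock_mhz, mips):
    """Average clocks per instruction implied by a measured MIPS figure.

    ASSUMPTION: 'mips' counts native instructions, not VAX-relative units.
    """
    return clock_mhz / mips

# (0 wait state, VME) simulated results quoted in the post.
results = {
    "grep": (8.4, 7.9),
    "assembler": (5.0, 4.5),
    "average": (6.1, 5.3),
}

if __name__ == "__main__":
    print(f"peak: {peak_mips(CLOCK_MHZ, 2):.1f} mips")
    for name, (zero_ws, vme) in results.items():
        print(f"{name:10s} 0 ws: {effective_cpi(CLOCK_MHZ, zero_ws):.2f} clocks/instr, "
              f"VME: {effective_cpi(CLOCK_MHZ, vme):.2f} clocks/instr")
```

Under that assumption, the quoted average of 6.1 mips corresponds to roughly 4.9 clocks per instruction, against the 2 clocks behind the 15 mips headline.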
mash@mips.UUCP (John Mashey) (05/07/87)
In article <4294@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>
>So, here are some simulated facts: Our design team has done simulations
>of the chip's performance, both with ideal 0 wait state memory and with
>"real world" typical VME bus memory. We ran some unix utilities, including
>our own compilers, etc. I will divulge a few of these numbers now, and more
>later. (If I don't get burned for this):
>Grep ran at 8.4 mips from 0 ws memory, 7.9 from VME. Grep was one of the
>best. One of the worst was our assembler, it hit 5 mips from 0 ws, and
>4.5 mips from VME. On the average (these two plus several other CPU
>intensive programs) the '532 hit 6.1 mips from 0 ws, 5.3 mips from VME.

Could you say a little more on the configurations:
    cache size, nature [write-back or write-thru]
    if write-thru, did you use write buffers, and if so, how deep.
    exactly what the assumptions were on the VME memories

It would also be interesting [although I realize this might be sensitive info] to get more info on the simulations, to be able to make a read on the accuracy of the simulations:

    instruction cycles
    TLB-miss cycles
    cache-miss cycles
    [if present] write-buffer stall & write/read interlock cycles

>So, here's the deal. I invite Mot, Intel, and other interested parties
>to work with me in defining some sort of realistic benchmark, which we'll
>run (in public). I expect to have system level hardware late this year,
>so if we get started now, we'll have very interesting Xmas presents...

I think that's a great idea and am delighted that somebody has suggested it. Presumably there will be 68030s benchmarkable in hardware by then, and certainly 386s, Clippers, and WE32200s. As a first suggestion, I'd observe that there are at least the following classes of realistic benchmarks:

    1) Large FORTRAN / C floating-point ones [and there are many of these
       that are widely available]. One probably needs at least 5-10 of these
       to cover the different sorts of things that people do.
    2) Large integer benchmarks: this is the real tough category:
       most of the larger, realistic ones tend to be proprietary codes,
       or else things where the code [like for assemblers, compilers, etc]
       inherently differs among systems. this also needs 5-10 of them,
       and could at least include a few of the larger UNIX utilities,
       although most of them fit into reasonable-sized caches, and hence
       don't stress things the way larger applications do.
    3) Multi-user and/or systems benchmarks, using UNIX. Run shell
       scripts, etc. I'd think there should at least be a few of these.

One might want to focus on 1&2, if only to avoid the arguments on 3 regarding different peripheral choices, operating system tuning, etc, unless the shootout is intended as an OS shootout also.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
lm@cottage.WISC.EDU (Larry McVoy) (05/08/87)
In article <374@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Could you say a little more on the configurations:
> cache size, nature [write-back or write-thru]
> if write-thru, did you use write buffers, and if so, how deep.
> exactly what the assumptions were on the VME memories

If they are following their data book, then

    Icache is 512 bytes, 32 lines, direct mapped.
    Dcache is 1024 bytes, 64 lines, 2 way set associative, write through.
    TLB is 64 entries, fully associative.

>It would also be interesting [although I realize this might be
>sensitive info] to get more info on the simulations, to be able to
>make a read on the accuracy of the simulations:
>
> instruction cycles
> TLB-miss cycles
> cache-miss cycles
> [if present] write-buffer stall & write/read interlock cycles

I'd like to see this too.

Also, I'm a little surprised at the figures. I just spent a bit of time going over the data book and I would have expected better numbers. Closer to 8-10 MIPS... Bummer....
---
Larry McVoy lm@cottage.wisc.edu or uwvax!mcvoy
"What a wonderful world it is that has girls in it!" -L.L.
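The on-chip cache figures Larry lists above imply a line size and a set count. A small Python sketch of that derivation follows; the helper name `geometry` and the address-bit split are illustrative assumptions, not anything taken from the NSC data book.

```python
def geometry(total_bytes, lines, ways):
    """Derive (line size, number of sets, offset bits, index bits).

    Assumes power-of-two sizes, so bit counts come from bit_length().
    """
    line_bytes = total_bytes // lines          # bytes held in one cache line
    sets = lines // ways                       # lines are grouped 'ways' per set
    offset_bits = line_bytes.bit_length() - 1  # address bits selecting a byte in a line
    index_bits = sets.bit_length() - 1         # address bits selecting a set
    return line_bytes, sets, offset_bits, index_bits

# Figures as quoted in the post above.
caches = {
    "Icache": (512, 32, 1),    # direct mapped means 1 way
    "Dcache": (1024, 64, 2),   # 2 way set associative
}

if __name__ == "__main__":
    for name, (size, lines, ways) in caches.items():
        line_bytes, sets, off, idx = geometry(size, lines, ways)
        print(f"{name}: {line_bytes}-byte lines, {sets} sets, "
              f"{off} offset bits, {idx} index bits")
```

Both caches come out with 16-byte lines and 32 sets; the data cache simply holds two lines per set. The fully associative TLB is left out because it has a single set and therefore no index bits.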
grenley@nsc.nsc.com (George Grenley) (05/08/87)
My 532 performance posting has generated some response. Good! Herewith, more details:

For those of you who wish to run the grep benchmark yourself, the test was to search for the string "int" in file chroot.c in Unix V source. I realize this means you need the source....

In article <374@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <4294@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>> [deleted lengthy reference to 32532 performance - see previous posting]
>Could you say a little more on the configurations:
> cache size, nature [write-back or write-thru]
> if write-thru, did you use write buffers, and if so, how deep.
> exactly what the assumptions were on the VME memories

Okay, good questions. Here are some answers:

The 532 itself has a 512 byte (16 byte line size) instruction cache, and a 2 way set associative data cache of 1024 bytes - see our published advance data sheet for details. I'm sure our NSC sales people would be happy to hear from you...

One purpose of building the simulator was to evaluate EXTERNAL cache designs. Because of real-estate restrictions, we settled on a direct-mapped cache of 64 Kbytes.

Both internal and external caches are write through. We use a write buffer of depth 8. Our analysis shows the probability of this filling w/ typical VME memories is < 1%.

For VME memory performance, we used published DRAM specs from several vendors such as Clearpoint, Microproject, etc. We assumed high availability of the memories (and VMEbus) - i.e., no heavy DMA or multiprocessing.

>It would also be interesting [although I realize this might be
>sensitive info] to get more info on the simulations, to be able to
>make a read on the accuracy of the simulations:
>
> instruction cycles
> TLB-miss cycles
> cache-miss cycles
> [if present] write-buffer stall & write/read interlock cycles

You are right on two counts: It would be interesting, and it is also sensitive.
Such data would give our competitors too good an idea how good our cach(es) really are, so I shall refrain from any detailed discussion of this for a while.

I will say this: I am a hardware designer, not a CPU architect or a software guy. With the rapid rise in CPU clock rate (30 meg for the '532) and memory demands (2 clock cycle = 66 nsec GROSS), internal caches are becoming a necessity. I would not want to have to design the cache for the 532 if it did not already have one - it would be expensive to do without seriously compromising performance.

But, just to tease people a bit, the combined internal and external caches used in the above simulations have an overall read hit rate of better than 93%. As a result, the system is relatively insensitive to main memory performance.

[deleted, req to the world to get together for a lil old-fashioned horse-race]

>I think that's a great idea and am delighted that somebody has suggested it.
>Presumably there will be 68030s benchmarkable in hardware by then,
>and certainly 386s, Clippers, and WE32200s. As a first suggestion,
>I'd observe that there are at least the following classes of realistic
>benchmarks:
> 1) Large FORTRAN / C floating-point ones [and there are many of these
> that are widely available]. One probably needs at least 5-10 of these
> to cover the different sorts of things that people do.

My understanding is that the LINPACK benchmark has become the standard for the number crunching guys. We used it at Jetas Technology when I worked there as our standard of reference for numeric performance. It seems to represent well the kind of array-oriented math typical of FORTRAN. (good old fortran - only HLL I was ever really good at 8-))

> 2) Large integer benchmarks: this is the real tough category:
> most of the larger, realistic ones tend to be proprietary codes,
> or else things where the code [like for assemblers, compilers, etc]
> inherently differs among systems.
> this also needs 5-10 of them,
> and could at least include a few of the larger UNIX utilities,
> although most of them fit into reasonable-sized caches, and hence
> don't stress things the way larger applications do.

How about, instead, compiles? They are usually CPU intense (unless you have a REALLY terrible disk system), and reflect the general non-numeric work-load typical of most cpus. I recall reading analyses of instruction mixes from many different non-numeric applications that show they don't vary much. Since compiling Unix source files is a "portable test", it might be suitable...

> 3) Multi-user and/or systems benchmarks, using UNIX. Run shell
> scripts, etc. I'd think there should at least be a few of these.
>One might want to focus on 1&2, if only to avoid the arguments on 3
>regarding different peripheral choices, operating system tuning, etc,
>unless the shootout is intended as an OS shootout also.

Personally, I see item 3 as being at least as important as 1 and 2, from a practical point of view. Overall system performance is ultimately the only thing that matters. Whether system A is faster than system B because the OS is a better port, or the compiler optimizes better, or the I/O subsystem is really hot, is of interest to us as system designers only insofar as it helps us design better systems. This is the primary reason why I'm talking about the 532: to help other hardware hacks do a good job designing it in.

In my experience all CPUs are the same speed when they're sitting on your desk... 8-)

All for now.

Regards,
George Grenley
usual disclaimer - either I'm lying, or not...
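Two figures from the post above lend themselves to a quick check: the "2 clock cycle = 66 nsec" bus cycle at 30 MHz, and the claim that a read hit rate above 93% makes the system relatively insensitive to main memory speed. The Python sketch below uses the standard weighted-average access time formula. The miss latencies are made-up values for illustration only, since the post deliberately withholds the real ones, and treating every hit as one 2-clock cycle is a simplifying assumption.

```python
CLOCK_MHZ = 30.0        # clock rate quoted in the post
BUS_CYCLE_CLOCKS = 2    # "2 clock cycle" quoted in the post
READ_HIT_RATE = 0.93    # "better than 93%" combined internal + external read hit rate

# ASSUMPTION: hypothetical main-memory read latencies, not from the post.
ASSUMED_MISS_NS = (200.0, 300.0, 400.0)

def cycle_ns(clock_mhz, clocks):
    """Duration in nanoseconds of 'clocks' cycles at 'clock_mhz'."""
    return clocks * 1000.0 / clock_mhz

def average_read_ns(hit_rate, hit_ns, miss_ns):
    """Weighted average read time: hits at hit_ns, misses at miss_ns."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

if __name__ == "__main__":
    hit_ns = cycle_ns(CLOCK_MHZ, BUS_CYCLE_CLOCKS)
    print(f"2-clock cycle at 30 MHz: {hit_ns:.1f} ns")
    for miss_ns in ASSUMED_MISS_NS:
        avg = average_read_ns(READ_HIT_RATE, hit_ns, miss_ns)
        print(f"assumed miss {miss_ns:.0f} ns -> average read {avg:.1f} ns")
```

With these assumed numbers, doubling the miss latency from 200 to 400 ns raises the average read time by less than a fifth, which is the kind of insensitivity the post describes. Real hits in the on-chip caches would presumably be faster than a full bus cycle, so this sketch if anything overstates the hit cost.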
grenley@nsc.nsc.com (George Grenley) (05/08/87)
In the interests of brevity, I have deleted a lot from Larry's posting, much of it comments by John Mashey. If you haven't been following the discussion, do go back and catch up.

In article <3552@spool.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
> Icache is 512 bytes, 32 lines, direct mapped.
> Dcache is 1024 bytes, 64 lines, 2 way set associative, write through.
> TLB is 64 entries, fully associative.

Correct. However, there is also an external cache, described by me in a previous posting.

>Also, I'm a little surprised at the figures. I just spent a bit of time
>going over the data book and I would have expected better numbers.
>Closer to 8-10 MIPS... Bummer....

If you can get 10 mips we may have a job for you...

Seriously, our simulation was based on a real-world Unix environment, with context switches, etc. - Difficult to model by hand. I am sure that well-written code without a lot of system overhead and cold-cache problems will hit 10 mips.