zs01+@andrew.cmu.edu (Zalman Stern) (04/07/90)
Excerpts from netnews.comp.arch: 4-Apr-90 Re: Black magic, IBM RIOS.
Donald Lindsay@MATHOM.GA (1977)

> Each cache-load cycles the data bus 8 times, so programs using large
> strides would probably want to use uncached loads and stores. A
> quick search didn't turn up any mention that the RS allows that.
> Does anyone know that for sure? And does anyone have similar numbers
> for, say, the i860, which was specifically intended to be used that
> way?
> --
> Don		D.C.Lindsay	Carnegie Mellon Computer Science

To the best of my knowledge, the only way to do uncached loads on an IBM
RISC System/6000 is to map the memory through IO space. This is slow
(rumored to be ~20 cycles per access).

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu   Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890
pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/09/90)
In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk
(John Pettitt) writes:

> pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
> >Executing the two versions, whose inner loops come to 7 instructions
> >and 10 instructions, gives me times of 1.3 seconds and 10.4 seconds.
> >Look again at the figures:
> >	register:	 16M*7  instructions in  1.3 s
> >	volatile static: 16M*10 instructions in 10.4 s
>
> Same test on a mips 3240 (25Mhz R3000)
>	register:	 1.0 s
>	volatile static: 8.9 s

This I cannot believe. The numbers above say that a 20Mhz RIOS does, in
the most favorable conditions, 70 integer native MIPS, i.e. around 3
instructions per cycle (BLACK MAGIC!). Your register timings for the
3240 (and the MIPS inner loop is 4 instructions) imply 16M*4
instructions in 1 second, that is 64 MIPS. I cannot believe that the 25
Mhz R3000 is superscalar as well.

There is also another reason I cannot believe your 3240 figures; I have
tried the same loop on a 16.67 Mhz DECstation and it takes 3.5 seconds,
and on a 5840 it takes 2.8 seconds. This is the expected value.

I will be shortly posting a full table with several machines and some
notes on them. One interesting result I can anticipate is that the IBM
machine is the fastest, and it also has the largest ratio between the
register and the memory-based loop times. Other machines usually have a
factor of 3-5 at most.

One of my interests in compiling this table is to see the ratio between
native MIPS, Mhz, and *transistor counts* (the issue is internal
vs. external parallelism).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
jfh@rpp386.cactus.org (John F. Haugh II) (04/10/90)
In article <PCG.90Apr9120504@odin.cs.aber.ac.uk> pcg@odin.cs.aber.ac.uk
(Piercarlo Grandi) writes:
>This I cannot believe. The numbers above say that a 20Mhz RIOS does in
>the most favorable conditions 70 integer native MIPS, i.e. around 3
>instructions per cycle (BLACK MAGIC!). Your register timings for the 3240
>(and the MIPS inner loop is 4 instructions) imply 16M*4 instructions in
>1 second, that is 64 MIPS. I cannot believe that the 25 Mhz R3000 is
>superscalar as well.

I tested a 20MHz S/6000 with a backlevel compiler and managed to [ only ]
produce 2.1 seconds for the register loop. I don't know if it was 7
instructions, but assuming it was yields

	4096 x 4096 x 7 / 2.1 = 55,924,053

instructions per second, which translates to 2.8 instructions per cycle.
--
John F. Haugh II        UUCP: ...!cs.utexas.edu!rpp386!jfh
Ma Bell: (512) 832-8832 Domain: jfh@rpp386.cactus.org
flee@shire.cs.psu.edu (Felix Lee) (04/11/90)
Piercarlo Grandi <pcg@cs.aber.ac.uk> wrote:
> register timings for the 3240 (and the MIPS inner loop is 4
> instructions) imply 16M*4 instructions in 1 second, that is 64 MIPS. I
> cannot believe that the 25 Mhz R3000 is superscalar as well.

Be careful when you examine the machine code. On a MIPS RC3240, the
inner loop is unrolled, so it only gets executed 4M times, not 16M, and
it contains 6 instructions, not 4. This is 4M*6 instructions in 1.0s,
which is 24 native MIPS, a reasonable number.
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee
pcg@aber-cs.UUCP (Piercarlo Grandi) (04/12/90)
The simple test I have produced is meaningless without actual
instruction counting and analysis of the inner loop. It is a CPU/memory
level test, not a system level test; the goal is not to find the
cleverest optimization to shorten the time, but to see *how* the time is
spent.

J. Pettitt wrote me that recent MIPS compilers unroll the inner loop; in
theory it can be replaced by 4096 additions of 277 executed once, or
even by 1/N as many additions, each of N*277 (where N*277 < MAX_INT). In
any case the MIPS compilers I have seen on DEC machines generate a 5
instruction inner loop, which in the best case (a 5840) is executed in
around 3 seconds. The IBM machines have 7 instructions in the inner loop
because the compiler does *very* hairy things with scheduling.

Let me repeat: my simple benchmark is not for *system* performance
analysis; by themselves the numbers are almost meaningless; analysis of
the generated code is vital. It is only a tool with which to study the
CPU and memory subsystem architectures, just like the other, very
interesting, cache busting benchmarks recently discussed.

To me, the most interesting information in my benchmark results is the
range of variation between the times for the various storage classes.
When, as in the IBM RIOS example, declaring the loop variables
differently results in a time variation of nearly 8 times, that's
interesting. It also means that actual *system* performance levels will
be incredibly dependent on memory or register access patterns even at a
very microscopic level.

The RIOS cache busting benchmarks recently posted also reveal an
incredible range of variation, at a less microscopic level. This of
course has profound consequences.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
phg@cs.brown.edu (Peter H. Golde) (04/12/90)
In article <18208@rpp386.cactus.org> jfh@rpp386.cactus.org (John F.
Haugh II) writes:
>I tested a 20MHz S/6000 with a backlevel compiler and managed to [ only ]
>produce 2.1 seconds for the register loop. I don't know if it was 7
>instructions, but assuming it was yields
>
>	4096 x 4096 x 7 / 2.1 = 55,924,053 instructions per second
>
>which translates to 2.8 instructions per cycle.

I tested my 386 machine on the register loop and it took 0.0 seconds
(it doesn't even HAVE an inner loop). Does this mean it's faster than
an S/6000? (and to think I never knew....)

I hereby name this benchmark "Dhumbstone". :-) :-) :-)

--Peter Golde (phg@cs.brown.edu)
pcg@aber-cs.UUCP (Piercarlo Grandi) (04/12/90)
In article <6438@brazos.Rice.edu> preston@titan.rice.edu (Preston
Briggs) writes:

> In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk
> (John Pettitt) writes:
> >pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
> >>	register:	 16M*7  instructions in  1.3 s
> >>	volatile static: 16M*10 instructions in 10.4 s
> >Same test on a mips 3240 (25Mhz R3000)
> >	register:	 1.0 s
> >	volatile static: 8.9 s
> >and on an ALR 486 with 128K cache
> >	register:	 4.8 s
> >	volatile static: 9.5 s
> >Hmmm interesting.
>
> Why are these numbers interesting or magic?

The RIOS numbers are interesting because we are looking at a 20Mhz 530
doing around 70 MIPS peak. It drops to about 15 MIPS if the same program
is rerun with the variables in memory. This means that not only do we
pay the cost of cache transactions, we also lose superscalarity.

The MIPS tests show that the meaning of this test has not been
understood; the crucial inner loop has been unrolled, and thus the test
has become one of language speed, not one of CPU/memory
architecture/speed. Since the program is meaningless, looking at just
how fast it runs, without looking at the generated code, is pointless.

The 486 figures show that the 486 has remarkable performance. The
typical 20 Mhz RISC chip has a register time of just over 3 seconds,
with usually 5-6 instructions in the inner loop. The 486 is not as
quick, but the difference for the variables-in-memory case is much
smaller. This means that, as expected, RISCs get bogged down (load-store
architecture) by memory accesses.

On very fast machines, keeping important values in registers is a big
saving. You execute fewer instructions, and you spend much less time
waiting on memory. This is true but uninteresting. It is the
*magnitudes* of the effect that are most interesting.

When the same loop (which mimics many hot spots in real world programs)
on a superscalar exhibits a factor of 8 in running time depending on
whether the variables are in registers or memory (for CISCs the factor
tends to be 2, for RISCs it tends to be 5), you start believing
religiously in the Von Neumann bottleneck, at least until we coax the
chip guys to deliver faster memory, not just larger. When you look at
transistor counts as well, you wonder even more whether a low level of
NUMA external parallelism (2-6 CPUs) is dearer/cheaper and
faster/slower than a low level of internal parallelism (superscalar).

However, for a better discussion of these issues, please wait for my
forthcoming table of figures for a couple dozen CPU types.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)
In article <E2&79e5@cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:

> Piercarlo Grandi <pcg@cs.aber.ac.uk> wrote:
> > register timings for the 3240 (and the MIPS inner loop is 4
> > instructions) imply 16M*4 instructions in 1 second, that is 64 MIPS. I
> > cannot believe that the 25 Mhz R3000 is superscalar as well.
>
> Be careful when you examine the machine code. On a MIPS RC3240, the
> inner loop is unrolled, so it only gets executed 4M times, not 16M, and
> it contains 6 instructions, not 4. This is 4M*6 instructions in 1.0s,
> which is 24 native MIPS, a reasonable number.

Precisely my point. Actually the difference was that I was using the
1.31 release of the MIPS compiler, which does not do loop unrolling,
while Pettitt probably was using the 2.1 version, which does.

As I have had to repeat many times, the purpose of my benchmark is to
see how the CPU and memory subsystem perform, not the compiler, so you
want to look at the generated code. If you drag the compiler into the
act, and look only at the run time, you are assuming that this is a
language level benchmark, which is a very silly (or very disingenuous --
like the IBM/DEC posting of dhrystone 1.1 data) assumption, and one that
I have carefully, loudly disclaimed many times.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)
In article <36107@brunix.UUCP> phg@cs.brown.edu (Peter H. Golde) writes:
> I tested my 386 machine on the register loop and it took 0.0 seconds
> (it doesn't even HAVE an inner loop). Does this mean it's
> faster than an S/6000? (and to think I never knew....)
> I hereby name this benchmark "Dhumbstone". :-) :-) :-)
I agree completely :-). It is a benchmark of the dumbness of many people
who don't understand the difference between benchmarks at CPU/memory
level (mine, LLNL's cache busting), language level (dhrystone,
whetstone, perfect), and system level (SPEC, various transactionals),
and the need in any case to carefully analyze their performance profile.
WHAT IS A DHUMBSTONE?
Your Dhumbstone rating is a direct function of the narrowness and
superficiality of your analysis of a CPU/memory (or other) level
benchmark, or of your blatant vested interests in misrepresenting it.
At one extreme, D. Lindsay who posted the cache busting CPU/memory level
benchmark, and the LLNL guys who authored it, score a flat zero rating.
An honorary zero rating also goes to John Mashey, for his repeated,
obnoxiously demystifying articles on benchmarking. Sorry, folks! ;-)
At the other extreme, IBM marketdroids score a nearly infinite rating,
because not only did they publish meaningless MIPS/VUPS/OOPS...! numbers
based on dhrystone 1.1 compiled by a code rewriting compiler, they also
did it for a product that does not need any hype to impress. I am sure
they have great potential to improve their rating still more. :-(
To repeat it once more: with a CPU/memory level benchmark you want it to
be small, so you can analyze it carefully, and you want to look not
just at run time, but also at instruction count, memory access modes,
instruction ordering, etc...
If you understand this, your Dhumbstone rating will unfortunately suffer
and you may no longer qualify for your extra improved, newly released,
RISCy CISC, state-of-the-art, multiprocessor, vectoring, caching,
superscalar, highly optimized, limited time special offer, for only
$9,999.95 dollars, of a complimentary subscription to this newsgroup.
:-) :-) :-) (for those that have a low MhontyPhytonstone rating).
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk