[comp.arch] Black magic, IBM RIOS.

zs01+@andrew.cmu.edu (Zalman Stern) (04/07/90)

Excerpts from netnews.comp.arch: 4-Apr-90 Re: Black magic, IBM RIOS.
Donald Lindsay@MATHOM.GA (1977)

> Each cache-load cycles the data bus 8 times, so programs using large
> strides would probably want to use uncached loads and stores.  A
> quick search didn't turn up any mention that the RS allows that.
> Does anyone know that for sure? And does anyone have similar numbers
> for, say, the i860, which was specifically intended to be used that
> way?

> -- 
> Don		D.C.Lindsay 	Carnegie Mellon Computer Science

To the best of my knowledge, the only way to do uncached loads on an IBM
RISC System/6000 is to map the memory through IO space. This is slow
(rumored to be ~20 cycles per access).
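
For reference, the large-stride access pattern Lindsay asks about can be
sketched in C roughly as follows. This is only an illustration: the name
sum_strided is made up here, and nothing in it is specific to the RS/6000.
Once the stride is at least one cache line long, every load lands on a
different line, so a full line fill buys a single useful word.

```c
#include <stddef.h>

/* Sum every `stride`-th element of a[0..n-1].
 * With a stride of one cache line or more (measured in elements), each
 * load touches a different line, so the cache helps very little. */
long sum_strided(const int *a, size_t n, size_t stride)
{
    long total = 0;
    size_t i;

    if (stride == 0)
        return 0;               /* a zero stride would never terminate */
    for (i = 0; i < n; i += stride)
        total += a[i];
    return total;
}
```

Calling it with stride 1 walks the array sequentially and uses every word of
each line; larger strides use progressively less of each line fetched.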

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890

pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/09/90)

In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk (John Pettitt) writes:

   Path: aber-cs!gdt!dcl-cs!ukc!slxsys!jpp
   From: jpp@specialix.co.uk (John Pettitt)
   Newsgroups: comp.arch
   Date: 4 Apr 90 14:07:13 GMT
   References: <PCG.90Apr3204004@odin.cs.aber.ac.uk>
   Organization: Specialix International, London
   Lines: 20

   pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
   >Executing the two versions, whose inner loops come to 7 instructions
   >and 10 instructions, gives me times of 1.3 seconds and 10.4 seconds.
   >Look again at the figures:

   >	register:		16M*7 instructions in 1.3 s
   >	volatile static:	16M*10 instructions in 10.4 s

   Same test on a mips 3240 (25Mhz R3000)

	   register:		1.0 s
	   volatile static:	8.9 s

This I cannot believe. The numbers above say that a 20Mhz RIOS does, in
the most favorable conditions, 70 integer native MIPS, i.e. around 3
instructions per cycle (BLACK MAGIC!). Your register timings for the 3240
(and the MIPS inner loop is 4 instructions) imply 16M*4 instructions in
1 second, that is 64 MIPS. I cannot believe that the 25 Mhz R3000 is
superscalar as well.

There is also another reason I cannot believe your 3240 figures; I have
tried the same loop on a 16.67 Mhz DECstation and it takes 3.5 seconds,
and on a 5840 it takes 2.8 seconds. This is the expected value.

I will shortly be posting a full table with several machines and some
notes on them. One interesting result I can anticipate is that the IBM
machine is the fastest, and it also has the largest ratio between the
memory based and the register based loop times. Other machines usually
have a factor of 3-5 at most.

One of my interests in compiling this table is to see the ratio between
native MIPS, Mhz, and *transistor counts* (the issue is internal vs.
external parallelism).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

jfh@rpp386.cactus.org (John F. Haugh II) (04/10/90)

In article <PCG.90Apr9120504@odin.cs.aber.ac.uk> pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
>This I cannot believe. The numbers above say that a 20Mhz RIOS does, in
>the most favorable conditions, 70 integer native MIPS, i.e. around 3
>instructions per cycle (BLACK MAGIC!). Your register timings for the 3240
>(and the MIPS inner loop is 4 instructions) imply 16M*4 instructions in
>1 second, that is 64 MIPS. I cannot believe that the 25 Mhz R3000 is
>superscalar as well.

I tested a 20MHz S/6000 with a backlevel compiler and managed to [ only ]
produce 2.1 seconds for the register loop.  I don't know if it was 7
instructions, but assuming it was yields

	4096 x 4096 x 7 / 2.1 = 55,924,053 instructions per second

which translates to 2.8 instructions per cycle.
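
The arithmetic above can be reproduced with a couple of one-line helpers.
This is just a sketch; the function names are invented here, and the 7
instructions per pass is the same assumption made in the post.

```c
/* Instructions per second for a loop that runs `iterations` passes of
 * `insns_per_iter` instructions each and finishes in `seconds`. */
double insns_per_second(double iterations, double insns_per_iter,
                        double seconds)
{
    return iterations * insns_per_iter / seconds;
}

/* Instructions per clock cycle at a clock rate of `clock_hz`. */
double insns_per_cycle(double ips, double clock_hz)
{
    return ips / clock_hz;
}
```

Feeding in 4096.0 * 4096.0 passes, 7 instructions and 2.1 seconds gives
roughly 55.9 million instructions per second; dividing by 20e6 gives
roughly 2.8 instructions per cycle.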
-- 
John F. Haugh II                             UUCP: ...!cs.utexas.edu!rpp386!jfh
Ma Bell: (512) 832-8832                           Domain: jfh@rpp386.cactus.org

flee@shire.cs.psu.edu (Felix Lee) (04/11/90)

Piercarlo Grandi <pcg@cs.aber.ac.uk> wrote:
> register timings for the 3240 (and the MIPS inner loop is 4
> instructions) imply 16M*4 instructions in 1 second, that is 64 MIPS. I
> cannot believe that the 25 Mhz R3000 is superscalar as well.

Be careful when you examine the machine code.  On a MIPS RC3240, the
inner loop is unrolled, so it only gets executed 4M times, not 16M,
and it contains 6 instructions, not 4.  This is 4M*6 instructions in
1.0s, which is 24 native MIPS, a reasonable number.
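
To make the unrolling concrete, here is a source-level sketch of the same
loop in rolled form and unrolled by 4. It is an illustration only: the
actual benchmark source is not shown in this thread, the function names are
made up, and a real compiler does this on the machine code, not on the C.

```c
#include <stdint.h>

/* Rolled form: one addition per pass, n passes. */
uint64_t sum_rolled(uint32_t n, uint32_t k)
{
    uint64_t total = 0;
    uint32_t i;

    for (i = 0; i < n; i++)
        total += k;
    return total;
}

/* Unrolled by 4: four additions per pass, so the loop body runs n/4
 * times. A cleanup loop handles the remainder when n is not a
 * multiple of 4. */
uint64_t sum_unrolled4(uint32_t n, uint32_t k)
{
    uint64_t total = 0;
    uint32_t i;

    for (i = 0; n - i >= 4; i += 4) {
        total += k;
        total += k;
        total += k;
        total += k;
    }
    for (; i < n; i++)
        total += k;
    return total;
}
```

Both functions return the same value, but the unrolled one pays the loop
overhead (increment, compare, branch) a quarter as often, which is why
counting "instructions per pass times 16M" overstates the work.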
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee

pcg@aber-cs.UUCP (Piercarlo Grandi) (04/12/90)

The simple test I have produced is meaningless without actual instruction
counting, and analysis of the inner loop. It is a CPU/memory level test, and
not a system level test; the goal is not to look at the cleverest optimization
to shorten the time, but *how* the time is spent.

J. Pettitt wrote me that recent MIPS compilers unroll the inner loop; in
theory it can be replaced by 4096 additions of 277 executed once, or even by
1/N additions of N*277 (where N*277 < MAX_INT).
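
The kind of replacement meant here can be sketched at source level as
follows. This is a guess at the shape of the transformation, not the output
of any particular compiler; both function names are invented, and the second
one assumes the caller guarantees that n*k does not overflow.

```c
/* What the source asks for: add k to acc, n times. */
int add_repeatedly(int acc, int n, int k)
{
    int i;

    for (i = 0; i < n; i++)
        acc += k;
    return acc;
}

/* What an optimizer is free to emit instead: one addition of n*k.
 * Assumes n*k and the final sum fit in an int. */
int add_once(int acc, int n, int k)
{
    if (n <= 0)
        return acc;             /* the loop above runs zero times */
    return acc + n * k;
}
```

Once the loop collapses like this there is nothing left to time, which is
why the run time alone says nothing without a look at the generated code.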

In any case the MIPS compilers I have seen on DEC machines get a 5
instruction inner loop, which in the best case (a 5840) is executed in
around 3 seconds. The IBM machines have 7 instructions in the inner loop
because the compiler does *very* hairy things with scheduling.

Let me repeat: my simple benchmark is not for *system* performance analysis;
by themselves the numbers are almost meaningless; analysis of generated code
is vital.  It is only a tool with which to study the CPU and memory subsystem
architectures, just like the other, very interesting, cache busting
benchmarks recently discussed.

To me, the most interesting information in my benchmark results is the range
of variation between the times for the various storage classes. When, like
in the IBM RIOS example, declaring the loop variables differently results in
a time variation of nearly 8 times, that's interesting. It also means that
actual *system* performance levels will be incredibly dependent on memory or
register access patterns even at a very microscopic level.  The RIOS cache
busting benchmarks recently posted also reveal an incredible range of
variation, at a less microscopic level.
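
For readers without the original program, a test of this kind might look
roughly like the sketch below. It is a reconstruction under assumptions: the
actual source was not posted in this thread, so the loop shape (4096 x 4096
passes adding 277), the names, and the timing helper are all guesses pieced
together from figures quoted in the discussion.

```c
#include <time.h>

#define SIDE 4096            /* 4096 x 4096 passes, as quoted in the thread */
#define STEP 277             /* the constant mentioned in the thread */

/* Loop state nominally in registers (`register` is only a hint). */
unsigned long run_register(unsigned side)
{
    register unsigned i, j;
    register unsigned long sum = 0;

    for (i = 0; i < side; i++)
        for (j = 0; j < side; j++)
            sum += STEP;
    return sum;
}

/* Loop state forced into memory: every access to a volatile object
 * must really be performed, so each pass loads and stores. */
unsigned long run_volatile_static(unsigned side)
{
    static volatile unsigned i, j;
    static volatile unsigned long sum;

    sum = 0;
    for (i = 0; i < side; i++)
        for (j = 0; j < side; j++)
            sum += STEP;
    return sum;
}

/* CPU seconds spent in one call of fn(side). */
double time_run(unsigned long (*fn)(unsigned), unsigned side)
{
    clock_t start = clock();

    (void)fn(side);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_run(run_register, SIDE) with time_run(run_volatile_static,
SIDE) gives the two figures being discussed. An optimizing compiler may well
reduce run_register to a constant, so the generated code has to be inspected
before the ratio means anything.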

This of course has profound consequences. 
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

phg@cs.brown.edu (Peter H. Golde) (04/12/90)

In article <18208@rpp386.cactus.org> jfh@rpp386.cactus.org (John F. Haugh II) writes:
>I tested a 20MHz S/6000 with a backlevel compiler and managed to [ only ]
>produce 2.1 seconds for the register loop.  I don't know if it was 7
>instructions, but assuming it was yields
>
>	4096 x 4096 x 7 / 2.1 = 55,924,053 instructions per second
>
>which translates to 2.8 instructions per cycle.

I tested my 386 machine on the register loop and it took 0.0 seconds
(it doesn't even HAVE an inner loop).  Does this mean it's
faster than an S/6000? (and to think I never knew....)

I hereby name this benchmark "Dhumbstone".  :-) :-) :-)

--Peter Golde (phg@cs.brown.edu)

pcg@aber-cs.UUCP (Piercarlo Grandi) (04/12/90)

In article <6438@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
  In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk (John Pettitt) writes:
  >pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
  >>	register:		16M*7 instructions in 1.3 s
  >>	volatile static:	16M*10 instructions in 10.4 s
  >Same test on a mips 3240 (25Mhz R3000)
  >	register:		1.0 s
  >	volatile static:	8.9 s
  >and on an ALR 486 with 128K cache
  >	register:		4.8 s
  >	volatile static:	9.5 s
  >Hmmm interesting.

  Why are these numbers interesting or magic?

The RIOS numbers are interesting because we are looking at a 20Mhz 530
doing around 70 MIPS peak. It drops to about 15 MIPS if the same program
is rerun with variables in memory. This means that not only do we pay the
cost of cache transactions, we also lose superscalarity.

The MIPS tests show that the meaning of this test has not been understood;
the crucial inner loop has been unrolled, and thus the test has become one
of language speed, not one of CPU/memory architecture/speed. Since the
program is meaningless, looking at just how fast it runs, without looking at
the generated code, is pointless.

The 486 figures show that the 486 has remarkable performance. The typical 20
Mhz RISC chip has a register time of just over 3 seconds, with usually 5-6
instructions in the inner loop. The 486 is not as quick, but the difference
for the variables in memory case is much smaller. This means that, as
expected, RISCs get bogged down (load-store architecture) by memory accesses.

  On very fast machines, keeping important values in registers is a big
  savings.  You execute fewer instructions, and you spend much less time
  waiting on memory.

This is true but uninteresting. It is the *magnitude* of the effect that
is most interesting. When the same loop (which mimics many hot spots in
real world programs) on a superscalar shows a factor of 8 in running time
depending on whether the variables are in registers or memory (for CISCs
the factor tends to be 2, for RISCs it tends to be 5), you start believing
religiously in Von Neumann's bottleneck, at least until we coax the chip
guys to deliver faster memory, not just larger.
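
The factors quoted here are simply the memory time divided by the register
time for each machine. A trivial helper (the name is made up for this
sketch) makes the comparison explicit:

```c
/* Ratio of the run time with variables in memory to the run time
 * with variables in registers. */
double slowdown(double memory_seconds, double register_seconds)
{
    return memory_seconds / register_seconds;
}
```

With the figures quoted above, slowdown(10.4, 1.3) is 8 for the RIOS and
slowdown(9.5, 4.8) is just under 2 for the 486.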

When you look at transistor counts as well you wonder even more whether a low
level of NUMA external parallelism (2-6 CPUs) is dearer or cheaper, faster or
slower, than a low level of internal parallelism (superscalar).

However, for a better discussion of these issues, please wait for my
forthcoming table of figures for a couple dozen CPU types.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)

In article <E2&79e5@cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:

   Piercarlo Grandi <pcg@cs.aber.ac.uk> wrote:
   > register timings for the 3240 (and the MIPS inner loop is 4
   > instructions) imply 16M*4 instructions in 1 second, that is 64 MIPS. I
   > cannot believe that the 25 Mhz R3000 is superscalar as well.

   Be careful when you examine the machine code.  On a MIPS RC3240, the
   inner loop is unrolled, so it only gets executed 4M times, not 16M,
   and it contains 6 instructions, not 4.  This is 4M*6 instructions in
   1.0s, which is 24 native MIPS, a reasonable number.

Precisely my point. Actually the difference was that I was using the 1.31
release of the MIPS compiler, which does not do loop unrolling, while
Pettitt was probably using the 2.1 version, which does.

As I have had to repeat many times, the purpose of my benchmark is to
see how the CPU and memory subsystem perform, not the compiler, so you
want to look at the generated code. If you drag the compiler into the
act, and look only at the runtime, you are assuming that this is a
language level benchmark, which is a very silly (or very disingenuous --
like the IBM/DEC posting of dhrystone 1.1 data) assumption, and one that
I have carefully, loudly disclaimed many times.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) (04/14/90)

In article <36107@brunix.UUCP> phg@cs.brown.edu (Peter H. Golde) writes:

   I tested my 386 machine on the register loop and it took 0.0 seconds
   (it doesn't even HAVE an inner loop).  Does this mean it's
   faster than an S/6000? (and to think I never knew....)

   I hereby name this benchmark "Dhumbstone".  :-) :-) :-)

I agree completely :-). It is a benchmark of the dumbness of many people
who don't understand the difference between benchmarks at CPU/memory
level (mine, LLNL's cache busting), language level (dhrystone,
whetstone, perfect), and system level (SPEC, various transactionals),
and the need in any case to carefully analyze their performance profile.

    WHAT IS A DHUMBSTONE?

    Your Dhumbstone rating is a direct function of the narrowness and
    superficiality of your analysis of a CPU/memory (or other) level
    benchmark, or of your blatant vested interests in misrepresenting it.

    At one extreme, D. Lindsay who posted the cache busting CPU/memory level
    benchmark, and the LLNL guys who authored it, score a flat zero rating.
    An honorary zero rating also goes to John Mashey, for his repeated,
    obnoxiously demystifying articles on benchmarking. Sorry, folks! ;-)

    At the other extreme, IBM marketdroids score a nearly infinite rating,
    because not only did they publish meaningless MIPS/VUPS/OOPS...! numbers
    based on dhrystone 1.1 compiled by a code rewriting compiler, they also
    did it for a product that does not need any hype to impress.  I am sure
    they have great potential to improve their rating still more. :-(

To repeat it once more: with a CPU/memory level benchmark you want it
to be small, so you can analyze it carefully, and you want to look not
just at run time, but also at instruction count, memory access modes,
instruction ordering, etc...

If you understand this, your Dhumbstone rating will unfortunately suffer
and you may no longer qualify for your extra improved, newly released,
RISCy CISC, state-of-the-art, multiprocessor, vectoring, caching,
superscalar, highly optimized, limited time special offer, for only
$9,999.95, of a complimentary subscription to this newsgroup.
:-) :-) :-) (for those that have a low MhontyPhytonstone rating).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk