[comp.arch] IBM RS6000

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (01/11/91)

Scientific researchers are beginning to get results on the IBM RS6000 machines.
(Some RS6000's have apparently been shipping in quantity recently.)
I heard two comments today, which correspond with other things I have heard.
These comments are (beware, hearsay coming):

1)   The machines are as fast as other micros on scalar code, and a lot faster
     on vector code (other things being equal: clock speed, cache, etc. etc).
     Many of the codes here *are* vectorizable.

2)   The machine is very, very bad at context switches.  So bad, that response
     time becomes terrible with *one* CPU bound background process running.

Again, this is *hearsay*.  But, I am particularly curious if anyone has any
insight on 2) above.  Is it as bad as these various sources have reported?
Does anyone have any numbers on the cost of a context switch?  Is it a
function of process size, or ...whatever?
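One way to put a rough number on it is the classic two-process ping-pong over a pair of pipes: each round trip forces at least two context switches, so the total time over 2*N trips bounds the switch cost from above (it includes pipe and scheduling overhead too). A minimal sketch, written in modern Python for brevity and assuming any POSIX system:

```python
import os, time

def ctx_switch_cost(iters=2000):
    """Rough per-context-switch cost: ping-pong one byte between two
    processes over a pair of pipes.  Each round trip forces at least
    two switches, so divide total time by 2 * iters."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo every byte back
        for _ in range(iters):
            os.read(p2c_r, 1)
            os.write(c2p_w, b"x")
        os._exit(0)
    t0 = time.perf_counter()
    for _ in range(iters):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - t0
    os.waitpid(pid, 0)
    return elapsed / (2 * iters)
```

The figure is an upper bound on switch-plus-pipe overhead, not the raw register save/restore; take the minimum of several runs on a quiet machine.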

If it is bad, why?  What is it about the design?  

Memory management?  

Cache?

O/S bug or feature?

How could IBM have missed something like this in the design (it should have
been obvious when the first prototype was built...?  Doesn't everyone do
big compiles as background jobs?)

Or, maybe this is just a smear campaign by IBM's rivals, who are upset that
IBM has an apparently hot product?

  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117       

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/11/91)

>>>>> On 10 Jan 91 21:41:22 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:

Hugh> Scientific researchers are beginning to get results on the IBM
Hugh> RS6000 machines. [...]  I heard two comments today, which
Hugh> correspond with other things I have heard.

Hugh> 1) The machines are as fast as other micros on scalar code, and
Hugh>    a lot faster on vector code (other things being equal: clock
Hugh>    speed, cache, etc. etc).  Many of the codes here *are*
Hugh>    vectorizable.

This is certainly consistent with the initial benchmarks, the Los
Alamos report, and the general hearsay from around here.  My RS/6000
runs vectorizable codes about twice as fast as a DECstation 5000 and
scalar floating-point codes at about the same speed as a DECstation
5000.  Unlike the MIPS-based machines, the RS/6000 often runs a bit
*faster* in 64-bit precision than in 32 bits....

Hugh> 2) The machine is very, very bad at context switches.  So bad,
Hugh>    that response time becomes terrible with *one* CPU bound
Hugh>    background process running.

Hugh> Again, this is *hearsay*.  But, I am particularly curious if
Hugh> anyone has any insight on 2) above.  Is it as bad as these
Hugh> various sources have reported?

This was also the result presented in "Personal Workstation".
However, that magazine consistently exhibits an incredibly low level
of understanding of computer performance issues, so it is difficult to
give that particular result much credibility....

A contrary result was expressed in a recent issue of "Workstation
News" (or some such title), which showed the RS/6000 as being much
faster than the DECstation 5000 for multiple jobs.  The DECstation
5000 was, in turn, much faster than the Sun 4?? (470 maybe?).  

As a side note on "scientific literacy", I threw the latter
newspaper/magazine away when I noticed that the apparently *huge*
differences in performance shown on the chart were simply an optical
illusion produced by the choice of the range.  The values were
something like
	IBM	200
	DEC	140
	SUN	100
and the ordinate of the chart was from about 80 to 220, thus making
the IBM line some 5 times as "high" as the SUN line (though the actual
speed ratio was only 2:1) and making the DEC line about 3 times as
"high" as the SUN line (with an actual ratio of 1.4:1).  Perhaps "The
Visual Display of Quantitative Information" should be required reading
for journalists as well as scientists and engineers....
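The distortion is simple to quantify: with a truncated ordinate, a bar's visual height is its value minus the axis minimum, so the apparent ratio of two bars is (a - base)/(b - base) rather than a/b.  Taking 80 as the baseline, the apparent ratios come out 6:1 and 3:1, in the ballpark of the estimates above:

```python
def apparent_ratio(a, b, axis_min):
    """Visual height ratio of two bars on a chart whose ordinate
    starts at axis_min instead of zero."""
    return (a - axis_min) / (b - axis_min)

# The chart in question: IBM 200, DEC 140, SUN 100, axis from about 80.
print(apparent_ratio(200, 100, 80))  # looks like 6x; actual ratio is 2x
print(apparent_ratio(140, 100, 80))  # looks like 3x; actual ratio is 1.4x
```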
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

trt@mcnc.org (Tom Truscott) (01/12/91)

>   I've heard the same thing about context switching. I won't get to
> benchmark a system until February, so I can't be sure, but I was shown
> figures which indicated that running processes using temp files was
> faster than pipes.

In the following test (of AIX 3.1 on low-end RS/6000),
pipelines seemed as fast as temporary files:

}  $ cat x
}  soelim /usr/dict/words > tmp1
}  eqn < tmp1 > tmp2
}  tbl < tmp2 > tmp3
}  nroff < tmp3 > /dev/null
}  rm tmp1 tmp2 tmp3
}  $ time sh x
}  
}  real    0m17.24s
}  user    0m16.11s
}  sys     0m0.39s
}  
}  
}  $ cat y
}  soelim /usr/dict/words | eqn | tbl | nroff > /dev/null
}  $ time sh y
}  
}  real    0m16.49s
}  user    0m16.10s
}  sys     0m0.35s
}  $ ^D

An interesting detail of RS/6000 pipes is that they hold 32k.
That means pipelines such as
	(cd $1 && tar cf - .) | (cd $2 && tar xpf -)
should need far fewer context switches
than are needed on systems with only a 4k pipe buffer.
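One can check a particular system's pipe capacity empirically: make the write end non-blocking and count how many bytes the kernel accepts before refusing more.  A sketch in modern Python (the underlying fcntl/write calls are the same idea on any POSIX system):

```python
import os, fcntl

def pipe_capacity():
    """Measure the kernel's pipe buffer by filling a pipe with
    non-blocking writes until the kernel refuses more bytes."""
    r, w = os.pipe()
    flags = fcntl.fcntl(w, fcntl.F_GETFL)
    fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    total = 0
    chunk = b"x" * 1024
    try:
        while True:
            total += os.write(w, chunk)
    except BlockingIOError:
        pass                     # buffer is full
    os.close(r)
    os.close(w)
    return total
```

On the systems discussed above this would report 32k or 4k; current Linux kernels typically report 64k.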
	Tom Truscott

guy@auspex.auspex.com (Guy Harris) (01/12/91)

>Bingo. The OS wants to load the entire file into memory before working with
>it. If the file is larger than memory will allow it will crash the machine!

*Load* it into memory (i.e., read every single byte into physical
memory), or *map* it into memory?  The two are inequivalent....

guy@auspex.auspex.com (Guy Harris) (01/12/91)

>Have you heard about the 'hidden' 3 character buffer when doing interprocess
>communication. Set up a socket. Write it so it will gnaw on the chars and then
>spit them back to the client. You'll notice you have to press 4 keys to see
>the first character you typed. (This may have been fixed since 3.001)

What I'd heard about that made it sound as if you needed a tty, not a
socket (you *did* say "first character you typed"; one generally types
characters into ttys or pseudo-ttys, not sockets); it sounded like they
just didn't set a couple of tty parameters right when going into
uncooked mode.  I'd also seen indications that it was fixed.

zs01+@andrew.cmu.edu (Zalman Stern) (01/12/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>[IBM RISC System/6000 is very fast on vector code and as fast as other
> processors (at equiv. clock) on scalar code.]

The hardware has lower FP cycle counts than other processors.  (One
exception is the MIPS R6000.) One place where the RIOS falls down is that
branches take too long.  (zero to three cycles with the average on the high
side.)  There is some room to improve this in the implementation.

> [Context switching is rumored to be slow.]
> If it bad, why is it?  What is it about the design?  
> 
> Memory management?  

The RIOS MMU is an exercise in complexity. The inverted page table (IPT)
with hardware reload and hardware lock bit support is too far gone. TLB
reload is somewhat slow as a result. One might see performance problems
with processes that thrash the TLB. I haven't measured this though and it
would only show up for large processes. The IPT also limits how different
address spaces can share memory. [See the IPT flamage that has shown up in
this newsgroup at least three times already.] This leads to performance
tradeoffs for Mach. In practice this isn't a problem and it certainly
shouldn't show up in AIX since it only shares 256 megabyte segments between
processes anyway. (Segment sharing is efficient on the RIOS hardware.)
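For those who missed the earlier IPT flamage: an inverted page table keeps one entry per physical frame and resolves a virtual address by hashing it to a chain head and walking the collision chain.  A toy model (the field names and hash function are hypothetical, nothing like the real RIOS layout):

```python
def ipt_lookup(ipt, anchors, vsid, vpn):
    """Toy inverted-page-table lookup: hash (segment id, virtual page
    number) to a chain head, then walk the collision chain.  The
    matching entry's index *is* the physical frame number, since an
    inverted table has exactly one entry per frame."""
    idx = anchors[(vsid ^ vpn) % len(anchors)]
    while idx is not None:
        entry = ipt[idx]
        if entry["vsid"] == vsid and entry["vpn"] == vpn:
            return idx
        idx = entry["next"]
    return None          # not mapped: take a page fault

# Two physical frames whose pages collide in the hash:
ipt = [{"vsid": 1, "vpn": 5, "next": 1},
       {"vsid": 1, "vpn": 9, "next": None}]
anchors = [0, None, None, None]
```

The chain walk is why a TLB reload takes a variable (and potentially long) time, and because the table maps physical frames, two address spaces cannot trivially map the same frame at different virtual addresses -- the sharing restriction mentioned above.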

> 
> Cache?

Shouldn't be a problem.  The cache is tagged with 52-bit virtual
addresses, so there is no need to flush anything on a context switch.
The 4-way set-associative data cache might improve cache residency
across context switches.  (That is, the next time your process gets
scheduled, there is a better chance that some of its data will still be
in the cache.)

> 
> O/S bug or feature?

Most likely. Wouldn't be the first performance bug in AIX 3.1 :-) One
way to test it would be to get context switch times for Mach 2.5 on the
RIOS and compare them to the DECstation 5000.

> 
> How could IBM have missed something like this in the design (it should have
> been obvious when the first prototype was built...?  Doesn't everyone do
> big compiles as background jobs?)

When I was doing development on a 530 (25 MHz RIOS) I didn't notice these
problems. (My MIPS Magnum feels a little better, but at least part of that
is the losing X11 performance on the RIOS.)  Of course, a single user
workload is not a good test case for context switching. In general,
performance problems are not simple and when you are working full tilt just
to get rid of OS crash bugs, they can easily be overlooked. The compilation
performance was pretty good though. It was somewhere between 20 and 30
minutes to build a full Mach kernel (with optimization turned on).

> 
> Or, maybe this is just a smear campaign by IBM's rivals, who are upset that
> IBM has an apparently hot product?

The RIOS is definitely in the game performance wise. Architecturally, other
RISC chips are getting similar performance with much simpler
implementations.  I also question the value of proprietary architectures in
this day and age.

Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94086
zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman     (408) 524 8395
       "``Ah, so,'' said Daruma the O-maker" -- Tom Robbins

ken@harvard.edu (Ken Cleary) (01/12/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

>Scientific researchers are beginning to get results on the IBM RS6000 machines.
>2)   The machine is very, very bad at context switches.  So bad, that response
>     time becomes terrible with *one* CPU bound background process running.

I did not do extensive benchmarking, but I tend to agree with the above.
I got only brief access to a demo machine, and I mostly just played with
the MIT X11 demo programs, like plaid, maze, etc.  As soon as I popped the 
first one up, I was amazed by the speed.  As I started running more copies of
these programs, I was surprised by the choppiness and slowdown of them.
I was not running more than 10 total copies of these rapidly animated graphics
demos.  (Yes, I realize that graphics updates may introduce I/O waits, so
perhaps this is not a fair benchmark, since a slow X terminal will take more
wall-clock time, for a graphically-animated client process, even though the
client won't be burning CPU cycles.)  Perhaps the choppiness is an indication
of coarse-grained time-slicing, so as to minimize context-switching?
{I may be stepping beyond the realms of my expertise, here.}

gene@zeno.mn.org (Gene H. Olson) (01/14/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

>Scientific researchers are beginning to get results on the IBM RS6000 machines.
>(Some RS6000's have apparently been shipping in quantity recently.)
>I heard two comments today, which correspond with other things I have heard.
>These comments are (beware, hearsay coming):

>1)   The machines are as fast as other micros on scalar code, and a lot faster
>     on vector code (other things being equal: clock speed, cache, etc. etc).
>     Many of the codes here *are* vectorizable.

I have one, and only one datapoint.  I wrote a text compression program
(compact) recently posted to comp.sources.misc.  I have run this program
on a wide variety of machines.  Compact has a small main program loop,
and will use any amount of data memory for the compression tables.  The
main loop fits in the cache of all competitive machines.  However the
data (default working space is about 1 meg) is accessed like a hash table,
so any size cache is hit very hard by data accesses.  Register variables
are used effectively, so stack accesses should be minimal in machines
with 16 or more general purpose registers.

After hearing what a tremendous performer the RS6000 is, I tried running
the program there.  I found it ran dead even (+/- 10%) with a SparcStation
1 (not 1+) and a 25 MHz 486 machine with a good memory subsystem.  All three
of these machines compressed comparable data at 290 to 320 Kbytes/second.

I was quite surprised.  The machine I tested is a bottom-of-the-line
desktop model,  but I was led to believe this machine was 2-to-3 times
faster than the other machines.  The machine is a very early model, and
there could be some performance compromises, although I know of none.
According to a booth IBM guy at Unix Expo, I was running the latest,
greatest, and best version of the operating system and tools.
_________________________________________________________________________
   __ 
  /  )                Gene H. Olson             gene@zeno.mn.org
 / __  _ __  _                             
(__/ _(/_//_(/_                                 gene@digibd.com

amos@SHUM.HUJI.AC.IL (amos shapir) (01/14/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

>Scientific researchers are beginning to get results on the IBM RS6000 machines.
>(Some RS6000's have apparently been shipping in quantity recently.)
>I heard two comments today, which correspond with other things I have heard.
>These comments are (beware, hearsay coming):

>1)   The machines are as fast as other micros on scalar code, and a lot faster
>     on vector code (other things being equal: clock speed, cache, etc. etc).
>     Many of the codes here *are* vectorizable.

>2)   The machine is very, very bad at context switches.  So bad, that response
>     time becomes terrible with *one* CPU bound background process running.

It's not hearsay any more!  I have just run a set of tests comparing the
RS6000 to SUN4 and Solbourne workstations.  For CPU-bound jobs, it measured
about 30 MIPS - that depends on how you define MIPS, but the other machines
measured 9 and 12 on the same test, respectively.

For context-switch bound jobs (running one byte through many pipes)
the RS6000 performs about 70% as fast as a SUN4 for up to 8 processes,
and up to about 1.5 times faster when more than 8 processes are run.
(This "ledge" at 8 processes is a feature of SUNOS, and appears on all
Sun hardware).  The latter ratio was also reported by the "iocall"
(system call overhead) test.
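A test of that shape -- one byte through a chain of pipes -- can be sketched as follows; this is an illustration of the idea in modern Python, assuming POSIX fork/pipe, not the harness actually used:

```python
import os, time

def pipe_ring(nproc=8, rounds=100):
    """Pass one byte through a chain of nproc processes connected by
    pipes; every hop forces a context switch.  Returns rough seconds
    per hop.  A zero byte is the shutdown sentinel."""
    pipes = [os.pipe() for _ in range(nproc)]
    pids = []
    for i in range(1, nproc):
        pid = os.fork()
        if pid == 0:                         # child: forward each byte
            while True:
                b = os.read(pipes[i - 1][0], 1)
                os.write(pipes[i][1], b)
                if b == b"\0":
                    os._exit(0)
        pids.append(pid)
    t0 = time.perf_counter()
    for _ in range(rounds):                  # parent feeds the head,
        os.write(pipes[0][1], b"x")          # collects from the tail
        os.read(pipes[nproc - 1][0], 1)
    elapsed = time.perf_counter() - t0
    os.write(pipes[0][1], b"\0")             # shut the chain down
    os.read(pipes[nproc - 1][0], 1)
    for pid in pids:
        os.waitpid(pid, 0)
    return elapsed / (rounds * nproc)        # ~nproc switches per round
```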

The bottom line: it's great for number crunching, but for most interactive
jobs it responds as a regular SUN4.

--
 Amos Shapir  amos@shum.huji.ac.il
The Hebrew Univ. of Jerusalem, Dept. of Comp. Science.
Tel. +972 2 584385 GEO: 35 14 E / 31 46 N city

rtaheri@hpcupt1.cup.hp.com (Reza Taheri) (01/15/91)

> 1)   The machines are as fast as other micros on scalar code, and a lot faster
>      on vector code (other things being equal: clock speed, cache, etc. etc).
>      Many of the codes here *are* vectorizable.
> 
> 2)   The machine is very, very bad at context switches.  So bad, that response
>      time becomes terrible with *one* CPU bound background process running.
> 
> Again, this is *hearsay*.  But, I am particularly curious if anyone has any
> insight on 2) above.  Is it as bad as these various sources have reported?
> Does anyone have any numbers on the cost of a context switch?  Is it a
> function of process size, or ...whatever?

    My experience confirms this.  The Dhrystones/SPEC numbers for the
520 scream, but the TPC-B numbers and a multi-user, program development
benchmark (similar to the proposed SDET of SPEC 2.0) I ran put it in
the same class as machines with a fraction of its SPEC 1.0 rating.

    My guess is that this is due to a number of factors rather than a
single one.  IBM set out to design a fast workstation, not a multi-user
computer; these machines have very small caches and TLBs for their clock
rate.  This brings about two points: 1) RIOS machines are marketed as
workstations AND as workstation servers.  A good server needs to be
a fast multi-tasking computer.  How good are the RIOS machines as
servers?  And 2) Are there any obstacles to simply increasing cache
and TLB sizes and making context switches faster; i.e., is the slowness
of multi-user performance in the architecture or in the implementation?

Reza Taheri
rtaheri@hpperf1.cup.hp.com

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/15/91)

>>>>> On 14 Jan 91 05:59:22 GMT, gene@zeno.mn.org (Gene H. Olson) said:

Gene> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
    [about the IBM RS/6000]
>1) The machines are as fast as other micros on scalar code, and a lot
>   faster on vector code (other things being equal: clock speed, cache,
>   etc. etc).  Many of the codes here *are* vectorizable.

Gene> [....]  I wrote a text compression program (compact) recently
Gene> posted to comp.sources.misc. [....]  However the data (default
Gene> working space is about 1 meg) is accessed like a hash table, so
Gene> any size cache is hit very hard by data accesses.  [....]  I
Gene> found it ran dead even (+/- 10%) with a SparcStation 1 (not 1+)
Gene> and a 25 MHz 486 machine with a good memory subsystem.

The memory access pattern is the clue.  The IBM RS/6000 architecture
is clearly optimized for sequential access patterns.  The cache line
size on the Model 320 that Gene used is 64 bytes.  The memory
interface to cache delivers 8 bytes per clock with a latency of about
8 clocks, so each cache miss is going to cost you about 16 cycles.
That is a fairly large penalty if you are going to only use one byte!
Machines with smaller cache line sizes will retrieve a lot less unused
information on each cache miss, and hence will run relatively more
efficiently.
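That arithmetic can be written down as a two-line cost model (the default parameters are the Model 320 figures above; the other line size is for comparison only):

```python
def miss_cost_cycles(line_bytes=64, bytes_per_clock=8, latency_clocks=8):
    """Cycles to refill one cache line: fixed memory latency plus
    transfer time for the whole line (defaults: RS/6000 Model 320)."""
    return latency_clocks + line_bytes // bytes_per_clock

def cost_per_useful_byte(useful_bytes, line_bytes=64):
    """Refill cost amortized over the bytes actually touched; a random
    hash-table probe may use only one byte of a 64-byte line."""
    return miss_cost_cycles(line_bytes) / useful_bytes

print(miss_cost_cycles())       # 16 cycles per miss on the Model 320
print(miss_cost_cycles(16))     # 10 cycles with 16-byte lines
print(cost_per_useful_byte(1))  # worst case: 16 cycles per byte used
```

The smaller-line machine pays nearly the same latency per miss, but hauls in a quarter of the unused data, which is why it fares relatively better on hash-table access patterns.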
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/16/91)

In article <3830014@hpcupt1.cup.hp.com> rtaheri@hpcupt1.cup.hp.com (Reza Taheri) writes:

|     My experience confirms this.  The Dhrystones/SPEC numbers for the
| 520 scream, but the TPC-B numbers and a multi-user, program development
| benchmark (similar to the proposed SDET of SPEC 2.0) I ran put it in
| the same class as machines with a fraction of its SPEC 1.0 rating.

  Has anyone got MUSBUS numbers for the RS6000 and SS1+ or SS2?
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

r0nick@arnor.uucp (01/17/91)

The comment that the RISC System/6000 (I have to use the official name) seems to slow
down too much with background jobs is contrary to my experience. I've been using a model 
320 with 32Mb RAM since June, and have found very little slowdown of interactive 
response time when running background jobs. If another job is running in the foreground, 
in another window for example, there is considerable slowdown, but the main impact that
a background job has is a small delay if the machine is left untouched for a while as 
the X code is paged back into RAM. 
These machines do seem to need lots of RAM to perform well, more than seems common on 
other workstations. You didn't mention exactly what system you were using, so I can't 
comment on that, but IBM seems to be pushing systems with far too little RAM (8-16Mb in
the desktop).

				-Nick Carter
	Note: I am a co-op employee. Don't even consider the idea of thinking that 
		what I've said is in any way IBM's opinion.

tony@jassys.UUCP (Tony Holden) (01/20/91)

 >   Has anyone got MUSBUS numbers for the RS6000 and SS1+ or SS2?
 
 We've just gotten a 520 at work, no one on but me at the moment so
 I'll be glad to give a run.  But.....  I've lost my copy of MUSBUS.
 
 If someone will send a copy I'll run it.
 
 
 -- 
 Tony Holden					Live on the edge,
 tony@jassys					Bank in Texas



mash@mips.COM (John Mashey) (01/21/91)

In article <1991Jan20.033052.10919@athena.mit.edu> jfc@athena.mit.edu (John F Carr) writes:
>
>>The RIOS MMU is an excersise in complexity. The inverted page table (IPT)
>>with hardware reload and hardware lock bit support is too far gone. TLB
>>reload is somewhat slow as a result. One might see performance problems
>>with processes that thrash the TLB. 
>
>I'm not convinced this is a problem.  The RIOS MMU is very similar to
.....  good info on RT PC refill.
..... can anybody post a corresponding analysis for RS6000?
>
>Mach 2.5 is guilty of "all the world's a VAX" thinking.  It requires
>VAX style page tables; it emulates these on the RT by taking a page
>fault each time the virtual address used to access the same physical
>page changes.  Performance may vary a lot depending on memory access
>patterns and use of shared data.  It isn't a fair test of MIPS vs. IBM
>hardware to run an operating system that requires a certain MMU when
>only one of the machines has it.

This is a fair comment, although one must also point out that:
	a) Mach was trying to be runnable on a number of different machines
	with different kinds of MMUs; however, most machines look more
	like VAXen than RT PCs.
	b) Mach hardly requires a MIPS MMU
	c) Note that the RT PC MMU was designed for & in conjunction with
	one particular OS version, with little concern for whether other
	OS's would port well or not (and there's nothing wrong with that,
	given IBM's viewpoint).  On the other hand, the MIPS MMU was
	explicitly designed with a specific flexibility in mind, that
	lots of different people would be able to port different
	operating systems to it without having to do major redesigns.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086