[comp.arch] Why the original RT seemed/was slow

sauer@auschs.UUCP (Charlie Sauer) (11/21/88)

I had hoped to just sit back and watch this discussion, but...

  - there seems to be little distinction between the original machine
    and the upgrades last year and this year

  - there are some key points which have been insufficiently noticed in the
    discussion, e.g., the role of optimizing compilers

  - there have been some significant inaccuracies, e.g., with respect to
    the implementation and impact of the VRM.  (Contrary to one assertion,
    the VRM is not half assembly code, but mostly PL.8 and C.  Though the
    VMI has negative impact in some applications, in many cases that is more 
    than compensated for by the benefits of the paging code in the VRM and the
    real time capabilities in the VRM.)

So, I'll try to provide some perspective.

The main reason, in my opinion, that the original machine seemed/was slow
when it appeared in Spring of '86 is that the design decisions were made,
for the most part, with the plan that the machine ship in Winter of '84.
If we had been able to hold that plan, then it would have seemed much faster.

The main limitations in performance in the machine when released were

  - the compilers, both for AIX and 4.2/RT, had very little optimization 
    capability.  Thus they violated a fundamental concept of RISC, that
    of exploiting the processor with highly optimizing compilers.  In 
    AIX 1.1, we began providing global optimizing C and Fortran compilers
    based on pcc/f77 but incorporating HCR's Portable Code Optimizer.  In
    '87, a C compiler based on the PL.8 compiler (the "Advanced C Compiler")
    became available as an AIX option.  Correspondingly, the Metaware High C
    compiler became available with 4.2/RT and 4.3/RT.  These compilers
    provided roughly 2 to 1 improvements in performance in many applications.

  - there was no built in floating point hardware and the optional floating
    point accelerator was not very fast

  - the disk controller used a ST-506 interface and had no DMA capability

Though the I/O bus was only 16 bits, it had a number of implementation
enhancements over the AT bus, e.g., 32 bit burst/buffering extensions, and that
bus is not used as the main memory bus, so I don't think that bus is a major
bottleneck in the machine.  It is able to sustain more than 2 megabytes/sec in
DMA transfers, and that is adequate for most applications on a machine of that
processor speed.

The original processor was a 6 MHz NMOS implementation.  It had several
instructions which were multiple cycle but should have been one cycle, 
and was not able to pipeline loads and stores with virtual memory enabled.
On our standard internal CPU kernel, it comes in at 2.1 MIPS.

Last year we started shipping a new processor implementation which reduces
the above cited multiple cycle instructions to single cycle instructions,
is able to pipeline loads and stores with virtual memory enabled, and
has a number of other minor improvements.  The 10 MHz CMOS implementation
of that chip comes in at 4.5 MIPS on the above cited kernel.  Primarily
due to memory shortages, we were unable to provide adequate quantities of
machines with that processor until early this year.  In July we started
shipping machines with a 12.5 MHz version of the new implementation.
Though these machines are not as fast as some high end workstations,
we think they are very competitive in price/performance, as do others.
See David Wilson's dollar/Khornerstone ratings in the May Unix Review,
for example.
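
A quick way to separate the clock-speed gain from the per-clock gain in the
figures above is to normalize MIPS by MHz.  The sketch below uses only the
numbers quoted in this post (2.1 MIPS at 6 MHz, 4.5 MIPS at 10 MHz); the
variable and function names are illustrative.

```python
# Figures quoted in the post: (clock in MHz, MIPS on the internal CPU kernel).
processors = {
    "6 MHz NMOS": (6.0, 2.1),
    "10 MHz CMOS": (10.0, 4.5),
}


def mips_per_mhz(clock_mhz, mips):
    """Instructions per clock, expressed as MIPS per MHz."""
    return mips / clock_mhz


for name, (clock_mhz, mips) in processors.items():
    print(f"{name}: {mips_per_mhz(clock_mhz, mips):.2f} MIPS/MHz")

old = mips_per_mhz(*processors["6 MHz NMOS"])
new = mips_per_mhz(*processors["10 MHz CMOS"])

per_clock_gain = new / old          # improvement not explained by clock rate
total_gain = processors["10 MHz CMOS"][1] / processors["6 MHz NMOS"][1]

print(f"per-clock gain: {per_clock_gain:.2f}x, total gain: {total_gain:.2f}x")
```

On these figures, roughly 1.67x of the total comes from the clock and the
remainder from the implementation changes described above.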

Besides reimplementing the processor and providing optimizing compilers,
we provided a 20MHz 68881 standard with the 10 MHz CMOS processor and
provided an optional floating point accelerator using the ADSP 3210 and 3221.
With the 12.5 MHz machines, the standard floating point unit is based
on those parts.  We also started providing DMA controllers for both
ESDI and SCSI disks, with caching on the controller cards.

For those that care, RT models 10, 15, 20 and 25 have the 6 MHz NMOS processor.
The 115 and 125 have the 10 MHz CMOS processor, and the 130 and 135 have
the 12.5 MHz CMOS processor.

Those are the machines that we have shipped.  I think it is well known that
we are working on follow on machines which support the Micro-Channel.
Other than that, I don't think much is publicly known about those machines,
so I won't say anything more about them now.
-- 
Charlie Sauer   IBM AES/ESD, D75/802     uucp: cs.utexas.edu!ibmaus!sauer
                11400 Burnet Road         822: @CS.UTEXAS.EDU:sauer@ibmaus.uucp
                Austin, Texas 78758    aesnet: sauer@auschs  
                (512) 823-3692           vnet: SAUER at AUSVM6

sauer@auschs.UUCP (Charlie Sauer) (11/30/88)

Since my previous posting, I've received several mail requests to post 
benchmark results.  Here's what I have so far from the group responsible
for running various benchmarks:

                          RT PERFORMANCE
 ___________________________________________________________________
 |          |     |    |     |    |*DHRY- | WHETSTONES|  LINPACK   |
 | MODEL    |     |    |     |    | STONES|   KWHETS  |   KFLOPS   |
 |          |     |    |FLOAT|    |_________________________________
 |          |PROC |MHZ |POINT|NOTE|       |  SP |  DP |  SP |  DP  |
 ___________________________________________________________________
 |RT  025   |ROMP | 5.9| FPA |    |  4000 | 985 | 730 | 196 |  118 |
 |    125   |APC  |10.0| FPA2| 1  |  8300 |1893 |1733 | 400 |  360 |
 |    125   |APC  |10.0| FPA2| 2  |  8300 |1893 |1733 |1040 |  700 |
 |    135   |APC  |12.4|  *  | 1  | 10400 |2298 |2087 | 490 |  410 |
 |    135   |APC  |12.4|  *  | 2  | 10400 |2298 |2087 |1210 |  810 |
 ___________________________________________________________________

  * DHRYSTONE = V1.1 W/ ADVANCED C COMPILER

  1. LINPACK PERFORMANCE = UNROLLED BLAS
  2. LINPACK PERFORMANCE = CODED BLAS
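
  For readers unfamiliar with the terms in notes 1 and 2: "unrolled BLAS"
  refers to the BLAS inner loops being unrolled in the high-level source,
  and "coded BLAS" is assumed here to mean hand-coded (assembly) versions,
  as in the usual LINPACK reporting convention.  The sketch below shows only
  the unrolling transformation on a DAXPY-style loop.  It is written in
  Python for illustration (the benchmark itself is Fortran), and the
  function names are made up.

```python
def daxpy_rolled(a, x, y):
    """y[i] += a * x[i], one element per loop iteration."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y


def daxpy_unrolled(a, x, y):
    """Same computation, four elements per iteration plus a cleanup loop."""
    n = len(x)
    m = n % 4
    # Cleanup loop: handle the leftover elements first.
    for i in range(m):
        y[i] += a * x[i]
    # Main loop: four updates per trip, so less loop overhead per element.
    for i in range(m, n, 4):
        y[i] += a * x[i]
        y[i + 1] += a * x[i + 1]
        y[i + 2] += a * x[i + 2]
        y[i + 3] += a * x[i + 3]
    return y


x = [float(i) for i in range(1, 8)]
rolled = daxpy_rolled(2.0, x, [1.0] * len(x))
unrolled = daxpy_unrolled(2.0, x, [1.0] * len(x))
print(rolled == unrolled)
```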
-- 
Charlie Sauer   IBM AES/ESD, D75/802     uucp: cs.utexas.edu!ibmaus!sauer
                11400 Burnet Road         822: @CS.UTEXAS.EDU:sauer@ibmaus.uucp
                Austin, Texas 78758    aesnet: sauer@auschs  
                (512) 823-3692           vnet: SAUER at AUSVM6

butcher@g.gp.cs.cmu.edu (Lawrence Butcher) (12/01/88)

Mr. Sauer's benchmark people should be aware that the current version of the
Dhrystone benchmark is version 2.0.  Version 1.1 numbers are not timely.

We have a bunch of RT's.  We use the Metaware High C compiler version 1.4R,
fall back to pcc when mc generates incorrect code, and run MACH.  The
Dhrystone numbers Mr. Sauer quotes are so far from the ones that I measure
here that I wonder if we are measuring different things.  If any reader can
get their RT to run the benchmark significantly faster than I can, please
send me mail.  We would love to know what to do to get a C compiler that
doubles the speed of our programs.

My measured Dhrystone numbers (Dhry 2.0) are (roughly):

2191 for the (RT, ROMP, 6150, Model 025, at 6 MHz) using mc
3176 for the 6152 using mc
3551 for the (APC, 6151, 125, at 10 MHz) using pcc
4474 for the (APC, 6151, 125 at 10 MHz) using mc
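
To put those figures side by side, here is a small sketch of the ratios.
It uses only numbers quoted in this thread; the 1.1 figures are Mr. Sauer's
(4000 and 8300), and since they come from a different Dhrystone version,
compiler, and operating system, the last two ratios are indicative at best.
The dictionary layout and names are illustrative.

```python
# Dhrystone 2.0 figures measured above, keyed by (machine, compiler).
dhry2 = {
    ("025", "mc"): 2191,
    ("6152", "mc"): 3176,
    ("125", "pcc"): 3551,
    ("125", "mc"): 4474,
}

# Dhrystone 1.1 figures quoted by Sauer (Advanced C Compiler).
sauer_v11 = {"025": 4000, "125": 8300}

mc_over_pcc = dhry2[("125", "mc")] / dhry2[("125", "pcc")]
apc_over_romp = dhry2[("125", "mc")] / dhry2[("025", "mc")]

# Cross-version ratios: different benchmark version, compiler, and OS.
sauer_over_mc_025 = sauer_v11["025"] / dhry2[("025", "mc")]
sauer_over_mc_125 = sauer_v11["125"] / dhry2[("125", "mc")]

print(f"mc vs pcc on the 125:      {mc_over_pcc:.2f}x")
print(f"125 vs 025, both with mc:  {apc_over_romp:.2f}x")
print(f"Sauer 1.1 vs mc 2.0 (025): {sauer_over_mc_025:.2f}x")
print(f"Sauer 1.1 vs mc 2.0 (125): {sauer_over_mc_125:.2f}x")
```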

The following are my personal feelings only.

The 025, running the software available to me, is about as fast as a SUN 3/50
while running the Dhrystone program.  However my experience is that an 025
machine with 4 megs of RAM, a disk, and MACH, is totally useless.  It cannot
run X10, an xterm, and a single outgoing telnet at the same time.  Yes, I am
saying that it is not as useful as a terminal.  A diskless Sun 3/50 running
SUN OS 3.5, suntools, emacs, and pcc is comfortable.  The 125 is about as
fast as a SUN 3/60, again running the software we have here.  High end 80386
and middle-end 68030 systems should be about twice as fast as the 3/60.

I also have access to a MIPS M500 running at 8 MHz.  It Dhrystones at 12000+.
This number may be hard for others to verify, because I don't think that
MIPS makes 8 MHz chips any more.  The Performance Semi version of the R3000
is available at 16, 20, and 25 MHz.  This chip is commercially available
and there is a C compiler available for it which generates correct code.  Of
course we shouldn't use the existence of a fast MIPS processor to try to
predict the speed of future IBM machines.  Just the speed of MIPS machines.

I am disappointed that there is no version of GCC for the RT.  IBM could
allocate one person to do a port and give away the results, but it seems
that they would rather have a company sell a compiler without source.

The APA8 display was a joke, and I am disappointed that IBM believes they
fixed things with the APA16 display.  Although both look great on a desk
when they are shut off, both are too small to use for editing.

I am disappointed with the IBM 2-button mouse.  It takes tremendous force
to push the buttons.  We had an informal contest and only one person here
could hold both buttons down while lifting the mouse off the table.  (For
people who have not had the pleasure of using the IBM mouse, you hold both
buttons down at once to emulate holding the middle button down on a real
mouse).

I absolutely cannot believe that the keyboard clicks when you hit the shift
or control key.  I wish that the caps lock key and the control key were the
same size so we could switch the key caps.  Thank you someone for that
layout.  Everyone here rebinds CTRL to the right place.  And the ESC key??

I am disappointed that the RT expansion bus is an AT bus.  The AT form factor
limits the size of peripheral cards, and limits the power they have available.
Ever heard of an SMD controller for the AT??  16 line serial line card??
These problems will be worse with smaller MCA cards.  We will throw away the
RT boxes when they are obsolete.  We will NOT throw away the SUN 3/160 boxes
when the 3/160 is no longer interesting.

Many of the things that I dislike about the RT could be fixed in the future.
Today we give APC's to people who cannot afford SUNs.  We give old RT's to
people to punish them :-)

			Lawrence Butcher @g.gp.cs.cmu.edu
--

njs@scifi.UUCP (Nicholas J. Simicich) (12/03/88)

In article <3736@pt.cs.cmu.edu> butcher@g.gp.cs.cmu.edu (Lawrence Butcher) writes:
>Mr. Sauer's benchmark people should be aware that the current version of the
>Dhrystone benchmark is version 2.0.  Version 1.1 numbers are not timely.
>
>We have a bunch of RT's.  We use the Metaware High C compiler version 1.4R,
>fall back to pcc when mc generates incorrect code, and run MACH.  The
>Dhrystone numbers Mr. Sauer quotes are so far from the ones that I measure
>here that I wonder if we are measuring different things.  If any reader can
>get their RT to run the benchmark significantly faster than I can, please
>send me mail.  We would love to know what to do to get a C compiler that
>doubles the speed of our programs.
>
 (.....)
>
>The 025, running the software available to me, is about as fast as a SUN 3/50
>while running the Dhrystone program.  However my experience is that an 025
>machine with 4 megs of RAM, a disk, and MACH, is totally useless.  It cannot
>run X10, an xterm, and a single outgoing telnet at the same time.  Yes, I am
>saying that it is not as useful as a terminal.  A diskless Sun 3/50 running
>SUN OS 3.5, suntools, emacs, and pcc is comfortable.  The 125 is about as
>fast as a SUN 3/60, again running the software we have here.  High end 80386
>and middle-end 68030 systems should be about twice as fast as the 3/60.

At IBM T.J. Watson Research, we have a number of RT's running Mach.  A
simple C program running CPU bound with a working set of around 300k
running niced makes it impossible to do any other work on the machine.
This does not happen on either the AOS or AIX machines we have.  The
operating system seems to be the sole difference I can come up with.
I believe that the Mach operating system runs well on a number of
other machines and suspect that it is simply a matter of tuning.

I typically run X 10, GNUemacs, outgoing telnets under Xterm, and so
forth.  I also have other people logging in to my system through
telnet, am the server for some DS style remote mounts, and so forth.
Admittedly, I have more memory and an APC.  But I first brought up AIX
on a 3 meg 025, and it was serviceable, just a tad slow.

 (.....)

>I am disappointed that there is no version of GCC for the RT.  IBM could
>allocate one person to do a port and give away the results, but it seems
>that they would rather have a company sell a compiler without source.

As far as I knew, someone in Project Athena has had a working code
generator for GCC since last year, but there were (at the time)
technical reasons why the code generated was incorrect, even though it
generated the correct results.  Something about the granularity on the
intermediate code pass.  I have no idea what the current status is.

 (.....)

>I am disappointed with the IBM 2-button mouse.  It takes tremendous force
>to push the buttons.  We had an informal contest and only one person here
>could hold both buttons down while lifting the mouse off the table.  (For
>people who have not had the pleasure of using the IBM mouse, you hold both
>buttons down at once to emulate holding the middle button down on a real
>mouse).

I believe that there is a great deal of similarity between the IBM RT
mouse and the older Microsoft mice.  The mechanisms seem to be
similar, as does the button pushing force.  I can hold the buttons
down while using the mouse and lifting it off of the table surface,
and frequently do, since I allocate the cover of a Usenix 4.2 manual
as my mouse surface.  Then again, I have large hands.  People can pick
up a Microsoft mouse and judge for themselves.

Personally, I think that one button is the right number of buttons on
the mouse.  But this doesn't fit the X model.

>I absolutely cannot believe that the keyboard clicks when you hit the shift
>or control key.  I wish that the caps lock key and the control key were the
>same size so we could switch the key caps.  Thank you someone for that
>layout.  Everyone here rebinds CTRL to the right place.  And the ESC key??

Neither can I, as mine doesn't click when I push ctrl or shift.  But I
run AIX.  The clicking is under software control, not hardware
control, so I suspect that this is a problem with MACH or the AOS
porting base again.  I won't comment on key placement.

>I am disappointed that the RT expansion bus is an AT bus.  The AT form factor
>limits the size of peripheral cards, and limits the power they have available.
>Ever heard of an SMD controller for the AT??  16 line serial line card??

I use the Anvil Systems Stallion 16 line card.  It has a 186 on it,
and all of the cooking and stuff is offloaded to the card.
Communication to the card is at the ioctl()/read()/write() level.
Requires a special I/O driver, of course, which is available from
Anvil for AIX.  This is not a product endorsement.  At the Unix Expo,
I saw a lot of 16 line serial cards that ran on the AT bus, as well as
some that fit in the smaller form factor of the MCA bus card. Of
course, they required special connectors.  We sell a SCSI adapter, but
not an SMD adapter, as far as I know, although I understand that you
can get conversion cards.

>These problems will be worse with smaller MCA cards.  We will throw away the
>RT boxes when they are obsolete.  We will NOT throw away the SUN 3/160 boxes
>when the 3/160 is no longer interesting.

Throw one my way?  I'll be glad to drive over and pick it up.  :-)

>Many of the things that I dislike about the RT could be fixed in the future.
>Today we give APC's to people who cannot afford SUNs.  We give old RT's to
>people to punish them :-)
>
>			Lawrence Butcher @g.gp.cs.cmu.edu
>-- 

I believe that you are correct in your assessment: the problems are
fixable.  Some are already fixed, in that I think that the 19 inch
Megapel is enough real-estate to edit on, and that other problems you
mentioned are being fixed, through our efforts and through the efforts
of third parties.  I also believe that at least some of these problems
you mention are software related, perhaps even Mach specific, and that
the RT can't be blamed for them.  Since this has gotten away from
comp.arch, I've directed followups to comp.sys.ibm.pc.rt.
-- 
Nick Simicich --- uunet!bywater!scifi!njs --- njs@ibm.com (Internet)

rick@pcrat.UUCP (Rick Richardson) (12/03/88)

In article <3736@pt.cs.cmu.edu> butcher@g.gp.cs.cmu.edu (Lawrence Butcher) writes:
>Mr. Sauer's benchmark people should be aware that the current version of the
>Dhrystone benchmark is version 2.0.  Version 1.1 numbers are not timely.
>
>My measured Dhrystone numbers (Dhry 2.0) are (roughly):
>
>2191 for the (RT, ROMP, 6150, Model 025, at 6 MHz) using mc
>3176 for the 6152 using mc
>3551 for the (APC, 6151, 125, at 10 MHz) using pcc
>4474 for the (APC, 6151, 125 at 10 MHz) using mc

Looking back on the last list of 1.1 results I put together,
it seems pretty clear that Sauer's "Advanced C" numbers
are running with all possible optimization turned on.
Which makes them pretty much useless as an indication
of anything other than that the optimizer people have been
at work. 

Dhrystone 2.1 is not as easily fooled by optimizers.  How
about posting some 2.1 numbers for the record?

-- 
Rick Richardson | JetRoff "di"-troff to LaserJet Postprocessor|uunet!pcrat!dry2
PC Research,Inc.| Mail: uunet!pcrat!jetroff; For anon uucp do:|for Dhrystone 2
uunet!pcrat!rick| uucp jetroff!~jetuucp/file_list ~nuucp/.    |submission forms.
jetroff Wk2200-0300,Sa,Su ACU {2400,PEP} 12013898963 "" \d\r\d ogin: jetuucp

rpd@RPD.MACH.CS.CMU.EDU (Richard Draves) (12/05/88)

In article <447@scifi.UUCP> njs@scifi.UUCP (Nicholas J. Simicich) writes:
>At IBM T.J. Watson Research, we have a number of RT's running Mach.  A
>simple C program running CPU bound with a working set of around 300k
>running niced makes it impossible to do any other work on the machine.
>This does not happen on either the AOS or AIX machines we have.  The
>operating system seems to be the sole difference I can come up with.
>I believe that the Mach operating system runs well on a number of
>other machines and suspect that it is simply a matter of tuning.

I know of one performance gotcha with Mach on RTs.  The RT MMU only allows
sharing of segments.  Mach VM is more general and allows sharing of pages.
However, it should still notice when what is being shared is in fact an
entire segment (notably, text segments) and use a common segment to
implement the sharing.  However, it doesn't do this.  (The interface between
the machine-independent and machine-dependent VM code makes it difficult
to figure out that this is possible/desirable.)

Instead, each address space is composed of different segments.  Because the
RT architecture only allows a page to be in one segment at a time, when
a process uses a shared page it may take a "translation fault" which moves
the page into the right segment.  These faults are pretty expensive; on a
Model 25 RT they take more than a millisecond.  For example, every time
our csh runs a command about 80 of these faults occur.  Rich Sanzi recently
greatly improved the translation-fault handling time, but it is still
an unfortunate performance hit.
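
The effect can be sketched with a toy model.  This is not Mach's actual
machine-dependent VM code, just an illustration of the constraint described
above: a page is mapped in at most one segment at a time, so two address
spaces with private segments keep stealing shared pages from each other.
The class and function names are hypothetical, and the toy counts the
first mapping of a page as a fault too.

```python
class ToySegmentedMemory:
    """Toy model: every page is mapped in at most one segment at a time."""

    def __init__(self):
        self.owner = {}   # page number -> segment currently mapping it
        self.faults = 0

    def touch(self, segment, page):
        # A reference through any other segment forces the page to move.
        if self.owner.get(page) != segment:
            self.faults += 1
            self.owner[page] = segment


def run(segment_of):
    """Two processes alternately touch the same three shared pages."""
    mem = ToySegmentedMemory()
    for _ in range(2):
        for process in ("p1", "p2"):
            for page in range(3):
                mem.touch(segment_of[process], page)
    return mem.faults


private_faults = run({"p1": "seg1", "p2": "seg2"})   # separate segments
shared_faults = run({"p1": "text", "p2": "text"})    # one common segment

# Lower bound from the figures above: about 80 faults per csh command at
# more than 1 ms each (the Model 25 cost before the recent improvement).
min_overhead_ms = 80 * 1.0

print(private_faults, shared_faults, min_overhead_ms)
```

With private segments every alternation re-faults all three pages; with a
common segment only the initial mappings fault.  At 80 faults and more than
a millisecond each, that is over 80 ms of overhead per command.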

I dug out my copy of Dhrystone 1.1 and tried to reproduce Sauer's numbers.
		Sauer		Draves
Model 25	 4000		 3270
Model 125	 8300		 7855
Model 135	 10400		 9765

I used hc2.1d and ran the tests single-user.  Problems with VM don't explain
the discrepancies.  (I wonder why the Model 25 number is especially far off?)
Is there some compiler better than hc2.1d?  Do AIX and AOS get different
numbers?
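
For reference, the size of each discrepancy as a percentage of Sauer's
figure can be worked out from the table above.  A minimal sketch (names are
illustrative):

```python
# (Sauer, Draves) Dhrystone 1.1 results from the table above.
results = {
    "Model 25": (4000, 3270),
    "Model 125": (8300, 7855),
    "Model 135": (10400, 9765),
}

# Shortfall of the hc2.1d measurement relative to Sauer's number, in percent.
shortfall = {
    model: (1 - draves / sauer) * 100
    for model, (sauer, draves) in results.items()
}

for model, pct in shortfall.items():
    print(f"{model}: {pct:.1f}% below Sauer's figure")
```

The Model 125 and 135 come out around 5 to 6 percent low, while the
Model 25 is about 18 percent low.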

Rich Draves
--

njs@scifi.UUCP (Nicholas J. Simicich) (12/05/88)

In article <3764@pt.cs.cmu.edu> rpd@RPD.MACH.CS.CMU.EDU (Richard Draves) writes:
  ..............
>I dug out my copy of Dhrystone 1.1 and tried to reproduce Sauer's numbers.
>		Sauer		Draves
>Model 25	 4000		 3270
>Model 125	 8300		 7855
>Model 135	 10400		 9765
>
>I used hc2.1d and ran the tests single-user.  Problems with VM don't explain
>the discrepancies.  (I wonder why the Model 25 number is especially far off?)
>Is there some compiler better than hc2.1d?  Do AIX and AOS get different
>numbers?
  ..............

I haven't asked Charlie, but I suspect that he would have used the
Advanced C Compiler under AIX for his figures.  I haven't compared the
compilers, personally.  This is a totally different compiler with
totally different numbers, I presume.  Charlie?


-- 
Nick Simicich --- uunet!bywater!scifi!njs --- njs@ibm.com (Internet)

friedl@vsi.COM (Stephen J. Friedl) (12/06/88)

In article <628@pcrat.UUCP>, rick@pcrat.UUCP (Rick Richardson) writes:
< [...]
< Which makes them pretty much useless as an indication
< of anything other than that the optimizer people have been
< at work. 
< 
< Dhrystone 2.1 is not as easily fooled by optimizers.  How
< about posting some 2.1 numbers for the record?

This looks like the traditional battle between the people
who make tank armor and those who make anti-tank weapons...

-- 
Stephen J. Friedl        3B2-kind-of-guy            friedl@vsi.com
V-Systems, Inc.                                 attmail!vsi!friedl
Santa Ana, CA  USA       +1 714 545 6442    {backbones}!vsi!friedl
Nancy Reagan on my new '89 Mustang GT Convertible: "Just say WOW!"

sauer@auschs.UUCP (Charlie Sauer) (12/06/88)

In article <471@scifi.UUCP>, njs@scifi.UUCP (Nicholas J. Simicich) writes:
> I haven't asked Charlie, but I suspect that he would have used the
> Advanced C Compiler under AIX for his figures.  I haven't compared the
> compilers, personally.  This is a totally different compiler with
> totally different numbers, I presume.  Charlie?

Yes, it was the Advanced C Compiler, with AIX 2.2, that was used for those 
numbers, as a footnote (Advanced C, which implies AIX) in the posting indicated.
-- 
C.H. Sauer IBM Advanced Workstations Div. uucp: cs.utexas.edu!ibmaus!sauer
           11400 Burnet Road, D75/802      822: @CS.UTEXAS.EDU:sauer@ibmaus.uucp
           Austin, Texas 78758-2502     aesnet: sauer@auschs  
           (512) 823-3692                 vnet: SAUER at AUSVM6