[comp.os.mach] Mach performance? Long

af@spice.cs.cmu.edu (Alessandro Forin) (12/21/89)

In article <14246@jumbo.dec.com>, discolo@jumbo.DEC.COM (Anthony Discolo) writes:
> Does anyone have benchmarks that compare Mach to BSD and/or Ultrix
> in the areas of RPC/networking/scheduling?
> 
> Any pointers would be greatly appreciated.
> 
> Anthony

Seems to me a large part of the question wants to compare apples and oranges.

1- Do you really believe BSD and derivatives to have an RPC mechanism ?
   Are you thinking of Sun's/Apollo/... RPCs or .. what ?
   How do you believe they can be correctly compared to Mach's IPC ? 

2- How do you "measure" a scheduler ?  On a multiprocessor ?
   Under which load ?  Do you believe the functionalities of
   the Mach scheduler (e.g. user processor allocation, handoff
   scheduling, fixed/timeshare priorities ) are in any way
   comparable to U*x ?

If you have any more precise definition of the comparisons you
want to make I'd be glad to provide you with an answer.

Assuming for now that you just want *some* comparison of Ultrix & Mach 2.5
on a pmax, here is some data collected some time ago on my pmax,
comparing Mach and Ultrix under the same conditions (machine, disk,
environment, programs & input, buffer cache, network, time of day, etc.).
[Since we have made progress since then, I sometimes added the numbers
I get _right_now_ on the same machine, but clearly under different and
less controlled conditions.]

As far as networking goes, when I tested our ethernet device driver I was
satisfied by measuring a binary ftp of a large file (the Ultrix image :-) into
/dev/null using Ultrix's ftp, multiuser, on the same thinwire Ethernet.
Between two Ultrix pmaxen I got no better than about 190 kb/sec; between two
Mach pmaxen I got up to 300 kb/sec, as measured by ftp itself.

We do have internal test programs for Mach IPC.
Here are the results I get right_now on my pmax (multiuser, X11
running, my emacs & news programs in the background, some 20 assorted
systems servers for the fancy CMU environment).

binding to host testarossa
Test          syscall: Iters:10000      E:228      U:31       S:197     
Test        localLoop: Iters:10000      E:0        U:0        S:0       
Test        localNull: Iters:10000      E:0        U:0        S:0       
Test         localAdd: Iters:10000      E:16       U:15       S:0       
Test       localBigIn: Iters:10000      E:203      U:203      S:0       
Test      localBigOut: Iters:10000      E:203      U:203      S:0       
Test    localBigInOut: Iters:10000      E:390      U:390      S:0       
Test       remoteNull: Iters:10000      E:1985     U:15       S:1125    
Test        remoteAdd: Iters:10000      E:2062     U:141      S:985     
Test      remoteBigIn: Iters:10000      E:2797     U:172      S:1234    
Test     remoteBigOut: Iters:10000      E:2016     U:141      S:1000    
Test   remoteBigInOut: Iters:10000      E:2765     U:390      S:1266    

The localXXX entries are for local procedure calls, the remoteXXX entries
are for the same operation invoked on a server process.
Add just adds two numbers, BigIn passes a 200-byte string (by value) as an
IN parameter, BigOut returns it (by value), BigInOut does both.
All tests ran 10000 times; the Elapsed/User/System times are totals in
milliseconds, e.g. a null RPC takes about 200 microseconds, elapsed.
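
[A minimal sketch of what such a harness presumably looks like; this is
not the actual CMU test program.  rpc_null() is a stand-in for whichever
MIG-generated stub is being exercised, and the User/System columns would
come from something like getrusage():]

#include <stdio.h>
#include <sys/time.h>

#define ITERS 10000

extern void rpc_null(void);          /* hypothetical stub under test */

static double elapsed_ms(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_usec - a.tv_usec) / 1000.0;
}

int main(void)
{
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, (struct timezone *) 0);
    for (i = 0; i < ITERS; i++)
        rpc_null();
    gettimeofday(&t1, (struct timezone *) 0);

    /* The tables report totals; divide by ITERS for the per-call cost,
       e.g. 1985 ms / 10000 iterations is about 198 microseconds/call. */
    printf("E:%.0f ms total, %.1f usec/call\n",
           elapsed_ms(t0, t1), 1000.0 * elapsed_ms(t0, t1) / ITERS);
    return 0;
}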

My standard use of the machine is for compilation, so here are two
compilation benchmarks, a small one and a large one:

Compilation benchmark, small programs.

			Ultrix			Mach

Elapsed			17.9			15.4
Breakdown:
		        1.5 real	        1.2 real
		        2.4 real	        2.1 real
		        1.5 real	        1.1 real
		        1.7 real	        1.5 real
		        2.0 real	        1.7 real
		        2.0 real	        1.7 real
		        1.6 real	        1.3 real
		        2.7 real	        2.4 real
		        1.6 real	        1.4 real

Mach kernel compilation benchmark.

		Ultrix		Mach		Mach-nbc

Elapsed		2325.0		2139.0		2180.0
User		1404.0		1471.6		1388.8
System		413.0		323.5		351.1
Utilization	78%		83%		77%
I/O		7120+17397	3278+17756	291+6271


[The Mach-nbc entry is for a no-buffer-cache kernel.]


Some specific U*x tests are performed by a suite of little programs we 
wrote ourselves.

			Ultrix			Mach		right_now

Elapsed			15.1 secs		15.6 secs

exec			16 ms			12 ms
touch			0.566 ms		0.684 ms	[0.440]
fork			4.960 ms		4.218 ms	[3.125]
getpid			2.4 secs		2.8 secs	[1.9]
iocall			8.1 secs		8.2 secs
puzzle			0.6 secs		0.6 secs
pgtest			1.0 secs		1.3 secs
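
[None of these little programs is reproduced here, but a plausible sketch
of what the "fork" entry measures (the loop count and output format below
are guesses, not the actual CMU code):]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

int main(void)
{
    struct timeval t0, t1;
    int i, n = 1000;                 /* assumed iteration count */
    double ms;

    gettimeofday(&t0, (struct timezone *) 0);
    for (i = 0; i < n; i++) {
        if (fork() == 0)
            _exit(0);                /* child exits immediately */
        wait((int *) 0);             /* parent reaps it */
    }
    gettimeofday(&t1, (struct timezone *) 0);

    ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("fork: %.3f ms per fork/exit/wait\n", ms / n);
    return 0;
}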


File system performance is compared using Satya's benchmark (see SOSP '87).

Satya's Filesystem benchmark results

(1) On local disk
				Ultrix			Mach

Total elapsed			121 secs		108 secs
Phase I:
 Creating directories		  2 secs		  4 secs
Phase II:
 Copying files			  9 secs		  9 secs
Phase III:
 Recursive directory stats	  9 secs		  7 secs
Phase IV:
 Scanning each file		 18 secs		  9 secs
Phase V:
 Compilation			 83 secs		 79 secs

Hope this helps,
sandro-
Disclaimer: The Ultrix group have different ideas on what should be tested
and by which benchmarks.  Their opinions might be very different from ours.
The opinions of our users seem to agree with our own opinions.

grunwald@foobar.colorado.edu (Dirk Grunwald) (12/21/89)

Does the version of Mach that's working at CMU use the MIPS ECOFF
loader format?  If not, does it use BSD loader formats?  Or a standard
Mach format?  That, in and of itself, would be useful - I'd like to get
debugging information from Gcc/G++ to work.

If it's not using their loader format, from whence come the compilers
and assemblers and loaders?

Dirk Grunwald -- Univ. of Colorado at Boulder	(grunwald@foobar.colorado.edu)
						(grunwald@boulder.colorado.edu)

jg@max.crl.dec.com (Jim Gettys) (12/21/89)

I don't understand why your off-machine RPC performance is so poor; you
do decently with on-machine RPC (relative to other RPC implementations
like that at DEC SRC), but your off-machine (over the net) case is only
comparable to what a MicroVAX running Topaz can do, or just running over
TCP (X round-trip times on Ultrix over the net using TCP are less than a
factor of two worse, around 3.6 ms).  Where is the bottleneck?
			- Jim Gettys
			  Digital Equipment Corporation
			  Cambridge Research Laboratory

Richard.Draves@CS.CMU.EDU (12/22/89)

Off-machine RPCs are relatively slow because they aren't handled
directly by the kernel.  A user-level process, called the netmsgserver,
handles network IPC.  When a task on machine A sends a message to a task
on machine B, the message really goes to the netmsgserver on A, which
sends it to the netmsgserver on B, which sends it to the final
destination.  (These intermediaries are transparent to the user tasks on
A & B.  Messages are sent to capabilities which don't have any location
information in their names.)  In sum, not an architecture motivated by
performance.

Rich

discolo@jumbo.pa.dec.com (Anthony Discolo) (12/22/89)

In article <7372@pt.cs.cmu.edu>, af@spice.cs.cmu.edu (Alessandro Forin) writes:
> We do have internal test programs for Mach IPC.
> Here are the results I get right_now on my pmax (multiuser, X11
> running, my emacs & news programs in the background, some 20 assorted
> systems servers for the fancy CMU environment).
> 
> binding to host testarossa
> Test          syscall: Iters:10000      E:228      U:31       S:197     
> Test        localLoop: Iters:10000      E:0        U:0        S:0       
> Test        localNull: Iters:10000      E:0        U:0        S:0       
> Test         localAdd: Iters:10000      E:16       U:15       S:0       
> Test       localBigIn: Iters:10000      E:203      U:203      S:0       
> Test      localBigOut: Iters:10000      E:203      U:203      S:0       
> Test    localBigInOut: Iters:10000      E:390      U:390      S:0       
> Test       remoteNull: Iters:10000      E:1985     U:15       S:1125    
> Test        remoteAdd: Iters:10000      E:2062     U:141      S:985     
> Test      remoteBigIn: Iters:10000      E:2797     U:172      S:1234    
> Test     remoteBigOut: Iters:10000      E:2016     U:141      S:1000    
> Test   remoteBigInOut: Iters:10000      E:2765     U:390      S:1266    
> 
> The localXXX entries are for local procedure calls, remoteXXX
> are for the same operation invoked on a server process.
> Add just adds two numbers, BigIn passes (by value) as IN parameter a 
> 200bytes string, BigOut returns it (by value), BigInOut does both.
> All tests ran 10000 times, Elapsed/User/System times are as given,
> e.g. a null RPC takes 200 microseconds, elapsed.

Thanks.  Just a couple of questions...

Do the localXXX entries refer to inter-address space/intra-machine
procedure calls?  Do the remoteXXX entries refer to inter-machine
procedure calls?

Do the BigInOut procedures touch their arguments?

Anthony
-----
Anthony Discolo
DEC Systems Research Center
130 Lytton Ave.
Palo Alto, CA  94301
ARPA: discolo@src.DEC.COM

af@spice.cs.cmu.edu (Alessandro Forin) (12/22/89)

In article <1482@crltrx.crl.dec.com>, jg@max.crl.dec.com (Jim Gettys) writes:
> I don't understand why your off-machine RPC performance is so poor;
...
> your off-machine (over the net) case is only comparable to what a
> MicroVAX running Topaz can do, or just running over TCP (X round-trip
> times on Ultrix over the net using TCP are less than a factor of two
> worse, around 3.6 ms).  Where is the bottleneck?
> 			- Jim Gettys
> 			  Digital Equipment Corporation
> 			  Cambridge Research Laboratory


I believe there is a deep misunderstanding here, and re-reading my post I 
realize that I am largely responsible for it.
[On the other hand, where did you hear of a system that does an RPC
 over the ether in 200 usecs ???? From the SOSP proceedings I see that
 the official score seems to be:
	Cedar:	1.1 MILLIsecs/call	Dorado
	Amoeba:	1.4			Tadpole (68020)
	V:	2.5			Sun 3/75
	Topaz:	2.7			Firefly (5-way multi)
	Sprite:	2.8			Sun 3/75
	Topaz:	4.8			Firefly (mono)
]

The RPC numbers I gave are LOCAL: the output from the program
is definitely misleading in its use of the terms "local" and "remote".
What the author meant is "local" for a normal procedure call (in the
same address space), "remote" for a different thread but on the same machine.
[The results should be the same for threads in the same or in separate
 address spaces; the test is across separate address spaces.]

The table is also misleading in that times are not normalized.
Here are the normalized numbers, all times in MICROSECONDS per call.

Test          syscall:    E:22       U:3        S:19
Test        localLoop:    E:0        U:0        S:0       
Test        localNull:    E:0        U:0        S:0       
Test         localAdd:    E:1        U:1        S:0       
Test       localBigIn:    E:20       U:20       S:0       
Test      localBigOut:    E:20       U:20       S:0       
Test    localBigInOut:    E:39       U:39       S:0       
Test       remoteNull:    E:198      U:1        S:112
Test        remoteAdd:    E:206      U:14       S:98
Test      remoteBigIn:    E:279      U:17       S:123
Test     remoteBigOut:    E:201      U:14       S:100
Test   remoteBigInOut:    E:276      U:39       S:126

What the test wants to show is the approximate ratio between a local
procedure call (LPC?) and a remote procedure call (RPC!) on the given
machine; in this case the remote call turns out to be slightly better
than a factor of 10 slower, which is not too bad given the amount of
optimization the MIPS guys put into their compilers.
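
[To make the arithmetic explicit: the normalization is just the total
milliseconds divided by the iteration count, and the ratio comes straight
from the tables above.  A trivial worked example:]

#include <stdio.h>

int main(void)
{
    double iters = 10000.0;
    double remote_null_ms = 1985.0;      /* raw remoteNull elapsed total, ms */
    double local_biginout_ms = 390.0;    /* raw localBigInOut elapsed total */
    double remote_biginout_ms = 2765.0;  /* raw remoteBigInOut elapsed total */

    printf("remoteNull: %.0f usec/call\n", 1000.0 * remote_null_ms / iters);
    printf("BigInOut remote/local ratio: %.1f\n",
           remote_biginout_ms / local_biginout_ms);
    return 0;
}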

It is well known that Mach network IPC is not very fast; we have always
stressed functionality over performance, e.g. the ability to transparently
replace transport protocols from TCP to UDP to VMTP to ... whatever with only
trivial changes in a single user-level process.  Maybe some day we will
put some work into making it fast, but right now we are not in the race.
Besides, Mach IPC is a no-cheat IPC system, with all the security
measures necessary for a true multiuser/time-shared machine.
For instance, I would expect the guys at Trusted Information Systems
to see about the same performance on their Secure Mach.
Comparisons with insecure systems like Topaz are therefore inappropriate, with
all due respect to all the painful work they did to get their very good
numbers.  On a multiprocessor.  [Same goes for Amoeba or the V kernel.]
We also do well on transferring large volumes of data; see for instance
what the NeXT box does with bitmaps.

Anyways, here are the times I get between two similar pmaxen, on the same
cable, right_now, multiuser etc etc. These times ARE for network IPC and 
ARE normalized as above in usecs per call.  And Rick WILL kill me for
handing them out. [Take them as lower-bounds, on a young machine]

binding to host rvb
Test       remoteNull:    E:6563     U:15       S:203     
Test        remoteAdd:    E:6421     U:0        S:265     
Test      remoteBigIn:    E:7172     U:47       S:141     
Test     remoteBigOut:    E:6563     U:15       S:94      
Test   remoteBigInOut:    E:7140     U:94       S:266     

Why is it so slow ?  Because the path is

Machine-A:	client -> kernel -> network_server -> kernel -> ether
Machine-B:	ether -> kernel -> network_server -> kernel -> server

that is, the network_servers "interpose" between kernels. On other systems
the path is typically something like

Machine-A:	client -> kernel -> ether
Machine-B:	ether -> kernel -> server

which saves 4 `copy' operations.  Topaz then cheats by only `copying'
once, from the client's stack into a preallocated packet (and then back)
in user-visible memory.  The packet is then handed over to the device driver
by reference. If I had to guess, I'd say this way they could do something like
500 MICROsecs/call on a multiprocessor pmax.  And if they double-cheated
by using the Washington version they'd probably do even better.
The term `copy' above includes context-switching overhead, if applicable.
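
[Not Topaz's actual code, of course; just a generic sketch of the
single-`copy' idea: marshal the arguments once into a preallocated,
driver-visible packet and hand it to the driver by reference.  The names
net_pkt and driver_send are invented for illustration:]

#include <string.h>

#define PKT_DATA_MAX 1400

struct net_pkt {                     /* preallocated, driver-visible buffer */
    unsigned short len;
    char data[PKT_DATA_MAX];
};

extern void driver_send(struct net_pkt *p);   /* driver takes the packet by reference */

void rpc_call_once(struct net_pkt *pkt, const void *args, unsigned short nbytes)
{
    /* The single `copy': client arguments -> packet buffer. */
    memcpy(pkt->data, args, nbytes);
    pkt->len = nbytes;

    /* No further copies: the driver transmits the packet in place. */
    driver_send(pkt);
}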

As far as X is concerned, I take the word of the people at MIT that
"It runs visibly faster under Mach than under Ultrix".  I am no X guru; for
me the machine is so fast anyway that I can't see any difference.
But I would indeed expect the better ether driver to have some positive
effects.

I'd be glad to run under Mach the benchmark you used to get the "3.6 msecs"
figure and report the findings; where do I ftp it from?
BTW, did you see the TeX previewer for X11 by Eric Cooper on a pmax?

sandro-
PS: I am setting up a TAR file with the benchmarks I mentioned,
except of course the Mach sources, for which you can substitute
your favorite BSD kernel.  By tomorrow it should be available
by anonymous FTP on host testarossa.mach.cs.cmu.edu [128.2.250.252]
in the directory /usr/pub.

Rick.Rashid@CS.CMU.EDU (12/22/89)

Actually, Rich is only partly correct.  On most systems the cost of
4.3BSD networking code (which is used by the netmsgserver) dominates all
other costs.  The netmsgserver has support for several protocols but the
standard one used at CMU (and by NeXT) is based on TCP connections.
There is code in the kernel for "short-circuiting" established Mach IPC
connections directly to the TCP code so when this option is used there
is no "intermediate" netmsgserver processing for each message.
Unfortunately the cost of IP/TCP code more than makes up the difference.
The last time I looked the cost of IP output was as high as 1600 VAX
instructions by itself.  We have experimented with various
higher-performance protocols such as David Cheriton's VMTP.  A now somewhat
dated version of VMTP is an option in our system and the netmsgserver knows
how to use it.  That version of VMTP was never really stable in CMU's
complex network environment, though, so it has only been used for
experimentation.  It is likely that the most recent VMTP release will
be re-integrated and tested in the next few months.

Anyone interested in experimenting with the use of alternate protocols or
network driver code in Mach can do so relatively easily by modifying the
existing netmsgserver (which is already set up to handle 3 different
protocols); the kernel's short-circuit code can also easily be connected
to alternate network driver code.

Richard.Draves@CS.CMU.EDU (12/22/89)

I still don't understand what Sandro's numbers measure, so I won't try
to comment on them.

Excerpts from mail: 21-Dec-89 Re: Mach performance? [Long]
Rick.Rashid@CS.CMU.EDU (1425)

> Actually, Rich is only partly correct.
 
I didn't mention the "short-circuit" path in my brief description of
remote IPC because I don't think it is usable.  It is an experimental
option.  I don't even know if the code would still compile if one turned
on the option.  It certainly isn't in use anywhere.  NeXT tried to put
the short-circuit code into their production system and found it was too
buggy to use; they had to back it out.

I think the short-circuit code was a successful experiment.  The
improved times it produced confirmed that the netmsgserver is a
bottleneck in remote Mach IPC.

Rich

jg@max.crl.dec.com (Jim Gettys) (12/23/89)

The 3.6 millisecond figure (round trip using TCP) between two PMAXen on a
local net was measured with the x11perf program, which you can get off of
expo.lcs.mit.edu or gatekeeper.dec.com.  It performs a no-op X request and
times the response (I think it is just measuring elapsed time on XSync X
library calls).
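
[x11perf itself is the authoritative source; the core of the measurement
is presumably something along these lines (a rough sketch, not x11perf
code):]

#include <stdio.h>
#include <sys/time.h>
#include <X11/Xlib.h>

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);
    struct timeval t0, t1;
    int i, n = 1000;
    double ms;

    if (dpy == NULL)
        return 1;

    gettimeofday(&t0, (struct timezone *) 0);
    for (i = 0; i < n; i++) {
        XNoOp(dpy);                  /* queue a no-op request */
        XSync(dpy, False);           /* wait for the server's reply */
    }
    gettimeofday(&t1, (struct timezone *) 0);

    ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_usec - t0.tv_usec) / 1000.0;
    printf("%.2f ms per round trip\n", ms / n);
    XCloseDisplay(dpy);
    return 0;
}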

The current version of x11perf is version 1.2.  It will also be on the
X11R4 distribution.

					- Jim

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (12/23/89)

In article <QZYJhvu00hYP4Qa1N2@cs.cmu.edu> Richard.Draves@CS.CMU.EDU writes:

   Excerpts from mail: 21-Dec-89 Re: Mach performance? [Long]
   Rick.Rashid@CS.CMU.EDU (1425)

   > Actually, Rich is only partly correct.

   I didn't mention the "short-circuit" path in my brief description of
   remote IPC because I don't think it is usable.  It is an experimental
   option.  I don't even know if the code would still compile if one turned
   on the option.  It certainly isn't in use anywhere.  NeXT tried to put
   the short-circuit code into their production system and found it was too
   buggy to use; they had to back it out.

I have been aware of Rashid's Accent IPC for almost ten years,
and about five years ago I ported it to System V and did a
netmsgserver.  It is quite possible, and I actually think (I did
not actually do it) fairly easy, to move the netmsgserver into the
kernel, so that context switch time is essentially nullified.

This is essentially what 4.xBSD does; by default its
netmsgserver is stuck right in the kernel.  Not many seem to have
noticed that this is actually not the only option you have under
4.xBSD; indeed there are two alternatives:

1) All user programs could use the Unix domain, where you can
(modulo some bugs) send file descriptors between processes (see the
sketch after this list).  All processes open Unix domain connections
to a network server process, and this is the only one that opens
TCP/IP or whatever connections to the outside world.  You can use the
existing TCP sockets, or use raw IP sockets and reimplement TCP in the
server, or whatever.  This is exactly like in Mach.

2) An unimplemented feature of 4.xBSD is user-implemented IPC
domains.  The idea was to give user processes the ability to
register with the kernel as servers for sockets of some
particular domain, and the kernel would pass to that process all
operations on sockets of that domain.  This facility has never
been implemented, just like 4.xBSD wrappers (it actually is
related to them).
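
[As promised above, a minimal sketch of sending a file descriptor over a
Unix-domain socket.  This uses the ancillary-data (SCM_RIGHTS) form of
sendmsg(); 4.3BSD itself used the older msg_accrights fields, so take it
as an illustration of the idea rather than period-exact code:]

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union {                          /* aligned ancillary-data buffer */
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    char byte = 0;

    memset(&msg, 0, sizeof msg);
    memset(&u, 0, sizeof u);
    iov.iov_base = &byte;            /* must carry at least one data byte */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof u.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;    /* "this message carries a descriptor" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}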

Interestingly enough, option 1) is also possible under streams,
and I am quite sure that the two crucial points of Rashid's IPC
(the ability to send file/port descriptors with messages, and access
to global addresses only through address-space-local file/port
descriptors) have been inspired in both cases by Accent (even if
both points are circumvented under 4.xBSD by direct access to a
kernel-based netmsgserver).

   I think the short-circuit code was a successful experiment.  The
   improved times it produced confirmed that the netmsgserver is a
   bottleneck in remote Mach IPC.

Naturally all this netmsgserver trouble happens because of a
fundamental limitation of Accent/Mach: the inability for
threads to change address space and, possibly, to have multiple
address spaces mapped together (yes, I know about sharing address
spaces, it's not quite the same thing).  I suspect that these
limitations are there also possibly because otherwise the
architecture would be very different from the Unix one, and CMU
has been badly burnt with Accent, which was too unlike Unix.

Context switching for RPC implies three distinct overheads:
security checking, address space switching, thread switching.

There are therefore three possible levels of extra sophistication
beyond Accent/Mach (which is already two levels beyond 4.xBSD):

1) If it were possible for threads to jump between address
spaces, the thread switching overhead would be nullified.

2) If it were possible to map multiple address spaces together
even address space switching would be nullified.

3) If it were possible to inform the OS that an address space trusted
another, security checking in that direction would be eliminated.

As an historical note, Multics had all three for communication
between *rings* in the same address space, and even had support
in hardware to do 3) in the reverse direction. Capability
machines with a single global address space for all threads are
best of course, and since security checking is automagically done
in both directions by hardware, point 3) is moot.

An OS called Psyche (from Rochester, not by chance) allows you to
do all three things on fairly conventional (non-capability,
non-ring) hardware; you can then have your netmsgserver comapped
with your user address space, the thread that wants to do the
network RPC just jumps to the netmsgserver code, and since you
say that you trust it, half of the checking is eliminated as
well.

I think that this should give excellent performance; I have had
correspondence with other people working on similar lines (e.g.
from AT&T), and from our limited data it is apparent you do not
pay much more than for a local procedure call (and maybe even
less than an intra-address-space thread rendezvous, as you don't
have synch and thread switch costs).

I have been working (since 1983, on and off... but I have now
apparently found a way to work on this full time) on
something that does 1) by default, but will only do 2) for
selected, statically configured modules, and 3) only for the
kernel.  While this is a less general mechanism than Psyche, I
think that the Psyche mechanism is excessively fine-grained for
my tastes, and I'd rather be more restrictive and not even offer
the option to do 2) and 3) in a general way.

There is of course a difference in perspective: I am a
minimalist, and I don't want to add mechanisms that are not
relevant, or that may even encourage programming styles at variance
with my target environment, distributed systems.  There it is
important to cut overheads, but also to encourage the programmer
to be aware of communication boundaries, and not to expect to be
able to map address spaces together, as they may be on different
machines.  It is *possible* to support transparent distributed
shared memory; indeed it is in principle possible and fairly easy
with the existing Accent/Mach architecture.  But I think that
hiding communication boundaries, while attractive from a
conceptual point of view, would also hide the underlying reality
in terms of cost and reliability.  Also, most current machines do
not have virtual addresses long enough that you can expect to map
many user address spaces together.

The Psyche people have of course a completely different attitude,
and I would dare say that to them points 2) and especially 3) are
the most important, because their target is a NUMA machine, that
is, a not-too-loosely coupled multiprocessor rather than a
(possibly wide-area) distributed system (Mach seems ever more
oriented to very closely coupled multiprocessors).  On such a
machine, hardware effort has been expended to provide some
credible, efficient illusion of global shared memory, and that
should be exploited.

In other words, my reckoning is that while in current Accent/Mach
kernels there is a "short-circuit" path as an exception to the
normal mechanism, the (software) architecture should make that
path the standard case.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

Richard.Draves@CS.CMU.EDU (12/26/89)

Excerpts from netnews.comp.os.mach: 23-Dec-89 Re: Mach performance?
[Long] Piercarlo Grandi@rupert. (7140)

> It is quite possible and actually I think (I did
> not actually do it) fairly easy to move the netmsgserver in the
> kernel, so that context switch time is essentially nullified.

I wouldn't want to move the netmsgserver into the Mach kernel.  The
netmsgserver is a pretty hefty program.  As Rick mentioned, it can
handle multiple network protocols.  It has security capabilities, to
protect port rights.  (See Robert Sansom's thesis.)  It is the
base-level name server.  (Once two parties on different machines have
exchanged send rights, they can transmit more port rights back and
forth, but the netmsgserver has to be involved in the initial bootstrap
that sets up the first remote send right.)  It reformats data in
messages (like fixing byte order), according to the type descriptors in
the message and the hardware architectures involved.

The "short-circuit" experiment did demonstrate that it is feasible to
avoid the netmsgserver on performance-critical paths.  I think this is
much more palatable than moving the entire netmsgserver into the kernel,
although it does compromise the Mach IPC model to some extent. 
(Theoretically, the netmsgserver is just another user task and doesn't
need any special kernel support.)

Rich

peter@ficc.uu.net (Peter da Silva) (12/28/89)

In article <IZYFjfK00hYPB2M7si@cs.cmu.edu> Richard.Draves@CS.CMU.EDU writes:
> Off-machine RPCs are relatively slow because they aren't handled
> directly by the kernel.  A user-level process, called the netmsgserver,
> handles network IPC.

Which is as it should be, right?  One of the design goals of Mach was to
move stuff like this out of the kernel, then improve performance by
speeding up context switches and by clever use of virtual memory.  I might
be all wet on this, but by making the user pages containing the data
to be sent copy-on-write and mapping them into the netmsgserver you
could get rid of all the extra copies. And context switch time should
already be quite low.

Merely a SMOP.
-- 
`-_-' Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
 'U`  Also <peter@ficc.lonestar.org> or <peter@sugar.lonestar.org>.
"It was just dumb luck that Unix managed to break through the Stupidity Barrier
and become popular in spite of its inherent elegance." -- gavin@krypton.sgi.com

ast@cs.vu.nl (Andy Tanenbaum) (12/29/89)

In article <7387@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
> From the SOSP proceedings I see that
> the official score seems to be:
>	Cedar:	1.1 MILLIsecs/call	Dorado
>	Amoeba:	1.4			Tadpole (68020)
>	V:	2.5			Sun 3/75
>	Topaz:	2.7			Firefly (5-way multi)
>	Sprite:	2.8			Sun 3/75
>	Topaz:	4.8			Firefly (mono)

I think it is important that everyone realize that RPC protocols are
basically CPU-limited.  Thus when making comparisons, one has to normalize
for CPU speed.  In the Oct 1988 Operating Systems Review I wrote an article
claiming that Amoeba was the fastest distributed system in the world ON
ITS CLASS OF HARDWARE (68020 at 16 MHz -- essentially Sun 3/50 type machines).
This has been widely misunderstood.  From the above list one might get the
impression that the Cedar folks wrote better software (1.1 < 1.4, etc.).  It is
important to note that the effective speed of the Dorado hardware is
something like 3 times faster than the Sun 3/50.  Getting a 20% speed gain
by using 3 times faster hardware is nice, but not Guinness Book of Records
stuff.  If one only looks at raw speed, then Kermit running on a Cray-3
is going to beat the pants off everybody.
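
[To make the normalization concrete, using only the relative speed quoted
above (the Dorado being roughly 3 times a Sun 3/50-class machine); the
other factor below is just the reference class, not a measurement:]

#include <stdio.h>

struct entry { const char *name; double raw_ms; double speed_vs_sun350; };

int main(void)
{
    struct entry tab[] = {
        { "Cedar/Dorado",  1.1, 3.0 },   /* "something like 3 times faster" */
        { "Amoeba/68020",  1.4, 1.0 },   /* reference hardware class */
    };
    int i;

    for (i = 0; i < 2; i++)
        printf("%-14s raw %.1f ms -> %.1f ms normalized to Sun 3/50 speed\n",
               tab[i].name, tab[i].raw_ms,
               tab[i].raw_ms * tab[i].speed_vs_sun350);
    return 0;
}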

Conclusion: When making rankings like the above, one must normalize for
CPU speed, or at least quote it.  It would be interesting to see an honest
list, including Mach.  I believe that Amoeba is still #1 in RPC performance
and also in reading from a remote file server (677 kilobytes/sec).  How
fast can a Mach user program read data continuously from a remote machine over
the Ethernet (assuming 100% hit rate in the file server's cache, to factor
out disk speed)?

Finally, one also has to be careful that one is measuring the same thing.
The Amoeba times are user-to-user, not kernel-to-kernel, using no special
tricks, no microcode assist, no special hardware boards, etc.  Just plain
vanilla Ethernet.  I can only hope the others measure the same thing.

Andy Tanenbaum (ast@cs.vu.nl)

P.S. With a little luck, Amoeba will be made available during the course
of 1990.  I'll post an announcement to comp.os.misc when the time comes.