[comp.archives] RPC performance measurement tests comp.os.mach

af@spice.cs.cmu.edu (Alessandro Forin) (12/22/89)

Archive-name: benchmark/af-rpc
Original-posting-by: af@spice.cs.cmu.edu (Alessandro Forin)
Original-subject: Re: Mach performance? [Long]
Archive-site: testarossa.mach.cs.cmu.edu [128.2.250.252]
Archive-directory: /usr/pub
Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)

In article <1482@crltrx.crl.dec.com>, jg@max.crl.dec.com (Jim Gettys) writes:
> I don't understand why your off-machine RPC performance is so poor;
...
> your off machine (over the net) case is only comparable to what a
> MicroVAX running Topaz can do, or just running over TCP (X round trip
> times on Ultrix over the net using TCP are less than a factor of two
> worse (around 3.6 ms)).  Where is the bottle-neck?
> 			- Jim Gettys
> 			  Digital Equipment Corporation
> 			  Cambridge Research Laboratory


I believe there is a deep misunderstanding here, and re-reading my post I 
realize that I am largely responsible for it.
[On the other hand, where did you hear of a system that does an RPC
 over the ether in 200 usecs ???? From the SOSP proceedings I see that
 the official score seems to be:
	Cedar:	1.1 MILLIsecs/call	Dorado
	Amoeba:	1.4			Tadpole (68020)
	V:	2.5			Sun 3/75
	Topaz:	2.7			Firefly (5-way multi)
	Sprite:	2.8			Sun 3/75
	Topaz:	4.8			Firefly (mono)
]

The RPC numbers I gave are LOCAL: the output from the program
is definitely misleading in the use of the terms "local" and "remote".
What the author meant is "local" for a normal procedure call (in the
same address space), "remote" for a different thread but on the same machine.
[The results should be the same for threads in the same or in separate
 address spaces; the test runs across separate address spaces.]

The table is also misleading in that times are not normalized.
Here are the normalized numbers, all times in MICROSECONDS per call.

Test          syscall:    E:22       U:3        S:19
Test        localLoop:    E:0        U:0        S:0       
Test        localNull:    E:0        U:0        S:0       
Test         localAdd:    E:1        U:1        S:0       
Test       localBigIn:    E:20       U:20       S:0       
Test      localBigOut:    E:20       U:20       S:0       
Test    localBigInOut:    E:39       U:39       S:0       
Test       remoteNull:    E:198      U:1        S:112
Test        remoteAdd:    E:206      U:14       S:98
Test      remoteBigIn:    E:279      U:17       S:123
Test     remoteBigOut:    E:201      U:14       S:100
Test   remoteBigInOut:    E:276      U:39       S:126
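
For readers who want to reproduce the normalization, here is a minimal sketch of how a per-call figure can be obtained: time a loop of N calls and divide by N. It reads E, U and S as elapsed, user and system time, which is an assumption (the post does not spell them out). The names per_call_usecs and local_null are hypothetical and are not taken from the actual test program.

```python
import os
import time

def per_call_usecs(fn, iterations=100_000):
    """Time `iterations` calls of fn and return microseconds per call."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    for _ in range(iterations):
        fn()
    wall_after = time.perf_counter()
    cpu_after = os.times()

    scale = 1e6 / iterations  # seconds for the whole loop -> usecs per call
    return {
        "E": (wall_after - wall_before) * scale,           # elapsed
        "U": (cpu_after.user - cpu_before.user) * scale,   # user CPU
        "S": (cpu_after.system - cpu_before.system) * scale,  # system CPU
    }

def local_null():
    """Stand-in for the localNull case: a call that does nothing."""
    pass

result = per_call_usecs(local_null, iterations=10_000)
print("Test localNull: E:%d U:%d S:%d" % (result["E"], result["U"], result["S"]))
```

The loop overhead is included in the figures, which is presumably what a case like localLoop is there to measure separately.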

What the test wants to show is the approximate ratio between a local
procedure call (LPC?) and a remote procedure call (RPC!) on the given
machine. In this case the remote call turns out to be slightly better
than a factor of 10 slower, which is not too bad given the amount of
optimization the MIPS guys put into their compilers.
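
The ratio can be checked directly from the elapsed (E) column of the normalized table. The sketch below is only arithmetic on the numbers quoted above; the dictionary layout and the helper name ratios are illustrative choices, not part of the test program.

```python
# Elapsed times (usecs per call) copied from the normalized table.
local = {"Null": 0, "Add": 1, "BigIn": 20, "BigOut": 20, "BigInOut": 39}
remote = {"Null": 198, "Add": 206, "BigIn": 279, "BigOut": 201, "BigInOut": 276}

def ratios(local_times, remote_times):
    """Return remote/local per test, or None when local rounds to 0 usecs."""
    out = {}
    for name, remote_time in remote_times.items():
        local_time = local_times[name]
        out[name] = remote_time / local_time if local_time > 0 else None
    return out

ratio = ratios(local, remote)
for name, value in ratio.items():
    shown = "n/a (local time below 1 usec)" if value is None else "%.1fx" % value
    print("%-9s %s" % (name, shown))
```

For the tests that move data, the ratio falls between roughly 7 and 14. The Null and Add cases are too cheap locally for a ratio at microsecond resolution to mean much.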

It is well known that Mach network IPC is not very fast; we have always
stressed functionality over performance, e.g. the ability to transparently
replace transport protocols from TCP to UDP to VMTP to ... whatever with only
trivial changes in a single user-level process.  Maybe some day we will 
put some work into making it fast, but right now we are not in the race.
Besides, Mach IPC is a no-cheat IPC system, with all the security 
measures necessary for a true multiuser/time-shared machine.
For instance, I would expect the guys at Trusted Information Systems
to see about the same performance on their Secure Mach.
Comparisons with unsecure systems like Topaz are therefore inappropriate, with
all due respect to all the painful work they did to get their very good
numbers. On a multiprocessor. [Same goes for Amoeba or V-kernel]
We also do well on transferring large volumes of data; see for instance
what the NeXT box does with bitmaps.

Anyways, here are the times I get between two similar pmaxen, on the same
cable, right_now, multiuser etc etc. These times ARE for network IPC and 
ARE normalized as above in usecs per call.  And Rick WILL kill me for
handing them out. [Take them as lower-bounds, on a young machine]

binding to host rvb
Test       remoteNull:    E:6563     U:15       S:203     
Test        remoteAdd:    E:6421     U:0        S:265     
Test      remoteBigIn:    E:7172     U:47       S:141     
Test     remoteBigOut:    E:6563     U:15       S:94      
Test   remoteBigInOut:    E:7140     U:94       S:266     
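
One way to read these numbers is to subtract the CPU time from the elapsed time. The sketch below does that arithmetic on the figures above. It assumes U and S are CPU times charged to the client process, so that the remainder is time the client spends waiting on everything else (network servers, wire, remote machine); that interpretation is an assumption, not something the test output states.

```python
# (E, U, S) in usecs per call, copied from the network IPC run above.
network = {
    "remoteNull":     (6563, 15, 203),
    "remoteAdd":      (6421, 0, 265),
    "remoteBigIn":    (7172, 47, 141),
    "remoteBigOut":   (6563, 15, 94),
    "remoteBigInOut": (7140, 94, 266),
}

# Elapsed time not accounted for by the client's own user + system CPU time.
wait = {name: e - (u + s) for name, (e, u, s) in network.items()}

for name, waited in wait.items():
    elapsed = network[name][0]
    print("%-15s waits %d of %d usecs (%.0f%%)"
          % (name, waited, elapsed, 100.0 * waited / elapsed))
```

Under that reading, well over 90% of each call is spent outside the client.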

Why is it so slow ?  Because the path is

Machine-A:	client -> kernel -> network_server -> kernel -> ether
Machine-B:	ether -> kernel -> network_server -> kernel -> server

that is, the network_servers "interpose" between kernels. On other systems
the path is typically something like

Machine-A:	client -> kernel -> ether
Machine-B:	ether -> kernel -> server

which saves 4 `copy' operations.  Topaz then cheats by only `copying'
once, from the client's stack into a preallocated packet (and then back)
in user-visible memory.  The packet is then handed over to the device driver
by reference. If I had to guess, I'd say this way they could do something like
500 MICROsecs/call on a multiprocessor pmax.  And if they double-cheated
by using the Washington version they'd probably do even better.
The term `copy' above includes context-switching overhead, if applicable.
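
The count of 4 saved `copy' operations can be checked by treating each arrow in the two path diagrams as one copy. The sketch below is just that bookkeeping for a one-way message; the names MACH_PATH, DIRECT_PATH and copies are made up for illustration.

```python
# Stages a one-way message passes through, taken from the diagrams above.
MACH_PATH = {
    "Machine-A": ["client", "kernel", "network_server", "kernel", "ether"],
    "Machine-B": ["ether", "kernel", "network_server", "kernel", "server"],
}
DIRECT_PATH = {
    "Machine-A": ["client", "kernel", "ether"],
    "Machine-B": ["ether", "kernel", "server"],
}

def copies(path):
    """Count one copy per arrow, i.e. per pair of adjacent stages."""
    return sum(len(stages) - 1 for stages in path.values())

mach_copies = copies(MACH_PATH)
direct_copies = copies(DIRECT_PATH)
saved = mach_copies - direct_copies
print("interposed: %d copies, direct: %d copies, saved: %d"
      % (mach_copies, direct_copies, saved))
```

This counts one direction only; a full call also carries the reply back along the reverse path.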

As far as X is concerned, I take the word of the people at MIT that
"It runs visibly faster under Mach than under Ultrix".  I am no X guru; for 
me the machine is so fast anyway that I can't see any difference.
But I would indeed expect the better ether driver to have some positive 
effects.

I'd be glad to run under Mach the benchmark you used to get the "3.6 msecs" 
figure and report the findings; where do I ftp it from?
BTW, did you see the tex previewer for X11 by Eric Cooper on a pmax ?

sandro-
PS: I am setting up a TAR file with the benchmarks I mentioned,
except of course the Mach sources, for which you can substitute
your favorite BSD kernel.  By tomorrow it should be available
by anonymous FTP on host testarossa.mach.cs.cmu.edu [128.2.250.252]
in the directory /usr/pub

af@spice.cs.cmu.edu (Alessandro Forin) (12/22/89)

Archive-name: benchmark/af-rpc-how-to-get
Original-posting-by: af@spice.cs.cmu.edu (Alessandro Forin)
Original-subject: Re: tests
Archive-site: testarossa.mach.cs.cmu.edu [128.2.250.252]
Archive-directory: /usr/pub
Archive-files: mach_tests.TAR.Z
Reposted-by: emv@math.lsa.umich.edu (Edward Vielmetti)


The tests I mentioned in my post are now available.  Since our ftp daemons are 
very concerned with security, I suggest you do not try anything fancy; just
follow this script:
	ftp testarossa.mach.cs.cmu.edu
	# if your name lookup fails try:
	# ftp 128.2.250.252
	...login: anonymous
	..Passwd: guest
	ftp> cd /usr/pub
	ftp> ls
	ftp> binary
	ftp> get mach_tests.TAR.Z
	ftp> quit

Then you should uncompress the file and un-tar it.
Have a Merry Christmas,
sandro-