[comp.os.mach] Mach RPC Throughput...

morse@quark.mpr.ca (Daryl Morse) (03/14/91)

Now that we have been thoroughly reminded of some of the differences
between Amoeba and Mach, I have a question regarding the throughput of
the Mach RPC. First a bit of background...

Peterson et al. published a paper in the May 1990 issue of IEEE
Computer entitled "The x-kernel: A platform for accessing Internet
resources." In that paper, a number of RPC throughput figures were
given for x-kernel, Mach, and several other OSes. (I don't have it
handy right now to give a full list, however.) Tanenbaum et al.
published a paper in the December 1990 issue of Communications of the
ACM, entitled "Experiences with the Amoeba Distributed Operating
System." In that paper, RPC throughput figures were given for Amoeba,
Cedar, x-kernel, V, Topaz, Sprite, and Mach. (The figure for Mach was
obtained from Peterson's paper.) The throughput for Mach is between 3
and 10 times slower than that of the other OSes.

My question is simple, though its answer likely is not: Why is the
throughput of the Mach RPC so much slower than the other OSes? Are the
respective RPCs different enough that throughput is a meaningless
"apples and oranges" comparison? Has the Mach RPC simply not been
optimized as heavily as that of the other OSes?

Thanks.




--
Daryl Morse                     | Voice : (604) 293-5476
MPR Teltech Ltd. 		| Fax   : (604) 293-5787
8999 Nelson Way, Burnaby, BC    | E-Mail: morse@quark.mpr.ca
Canada, V5A 4B5                 |         quark.mpr.ca!morse@uunet.uu.net

bill@cs.columbia.edu (Bill Schilit) (03/15/91)

Our measurements of Mach RPC show that TCP is the dominant cost when
sending a synchronous RPC consisting of a small request and a 4K
response.

On an i386 machine with an AT bus running Mach 2.5, we tried to account
for the measured round trip time of 33 ms by computing the cost of
Ethernet and bus transfer and separately measuring TCP loopback and
Mach RPC without going over the network.  The results are shown below.

                        4K RPC
                        ------
Packets                 7               (observed)
Bytes                   4738            (observed)

AT bus (times 2)        9.5 ms          (computed)
Ethernet                3.8 ms          (computed)
TCP Loopback            12  ms          (measured)
Mach IPC (times 2)      4   ms          (measured)
Total                  29.3 ms

Network RPC            33   ms          (measured)

Breaking the time down into these four components gets pretty
close to the observed network RPC time.
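
As a rough check of the arithmetic in the table above, the following
Python sketch sums the four components and compares the total with the
measured round trip. The 10 Mbit/s Ethernet rate is an assumption (the
post does not state it); under that assumption the 4738 observed bytes
take about 3.8 ms on the wire, which agrees with the computed Ethernet
figure.

```python
# Component costs in milliseconds, copied from the table above.
components_ms = {
    "AT bus (times 2)": 9.5,
    "Ethernet": 3.8,
    "TCP loopback": 12.0,
    "Mach IPC (times 2)": 4.0,
}
measured_ms = 33.0  # measured network RPC round trip

total_ms = sum(components_ms.values())
unaccounted_ms = measured_ms - total_ms

# Assumption: classic 10 Mbit/s Ethernet; the rate is not given in the post.
ETHERNET_BITS_PER_SEC = 10_000_000
observed_bytes = 4738
wire_ms = observed_bytes * 8 / ETHERNET_BITS_PER_SEC * 1000

print(f"sum of components: {total_ms:.1f} ms")
print(f"unaccounted:       {unaccounted_ms:.1f} ms")
print(f"wire time:         {wire_ms:.2f} ms")

# Share of the measured round trip attributed to each component.
for name, ms in components_ms.items():
    print(f"{name:<20} {ms / measured_ms:.0%}")
```

TCP loopback alone accounts for 12 of the 33 ms, a little over a third
of the round trip, and about 3.7 ms is left unexplained by the four
components.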

These tests are fully described in "Adaptive remote paging for mobile
computing" TR CUCS-004-91, available by anonymous FTP from
cs.columbia.edu::pub/reports.

- Bill
-- 
Bill Schilit
Columbia University Computer Science Department
bill@cs.columbia.edu

fkittred@bbn.com (Fletcher Kittredge) (03/15/91)

In article <MORSE.91Mar14091659@quark.mpr.ca> morse@quark.mpr.ca (Daryl Morse) writes:
>
>Now that we have been thoroughly reminded of some of the differences
>between Amoeba and Mach, I have a question regarding the throughput of
>the Mach RPC. First a bit of background...
...
>Why is the
>throughput of the Mach RPC so much slower than the other OSes? Are the
>respective RPCs different enough that throughput is a meaningless
>"apples and oranges" comparison? Has the Mach RPC simply not been
>optimized as heavily as that of the other OSes?
>

The answer is that this is an example of comparing an old version of a
piece of software with a new version of a competing package.  Given
the dates cited in your article, the Mach RPC tested would be version
2.5 or lower.  For Mach 3.0, the RPC system was completely re-written
by Richard Draves.  One of the results was a real increase in speed.
You will be wanting to read the paper "A Revised IPC Interface", Richard
Draves, Proceedings of the October 1990 "Machnix" Conference, Burlington
Vt.

An example of the increase in performance is that the null RPC now takes
125 microseconds instead of 210 microseconds.
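
To put the two figures in proportion, here is a short Python sketch of
the arithmetic. It uses only the two numbers quoted above; the post does
not say which machine they were measured on, so no hardware is assumed.

```python
old_us = 210.0  # null RPC time before the rewrite, per the post
new_us = 125.0  # null RPC time after the rewrite, per the post

speedup = old_us / new_us                # how many times faster
reduction = (old_us - new_us) / old_us   # fraction of latency removed

print(f"speedup: {speedup:.2f}x, latency reduction: {reduction:.0%}")
```

That is roughly 1.7 times faster, or about a 40% cut in null RPC time.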

>Thanks.

you're welcome ;-),
fletcher



Fletcher Kittredge
Platforms and Tools Group, BBN Software Products
10 Fawcett Street,  Cambridge, MA. 02138
617-873-3465  /  fkittred@bbn.com  /  fkittred@das.harvard.edu

Richard.Draves@cs.cmu.edu (03/16/91)

> Excerpts from netnews.comp.os.mach: 15-Mar-91 Re: Mach RPC Throughput...
> Fletcher Kittredge@bbn.c (1629)

> The answer is that this is an example of comparing an old version of a
> piece of software with a new version of a competing package.  Given
> the dates cited in your article, the Mach RPC tested would be version
> 2.5 or lower.  For Mach 3.0, the RPC system was completely re-written
> by Richard Draves.  One of the results was a real increase in speed.
> You will be wanting to read the paper "A Revised IPC Interface", Richard
> Draves, Proceedings of the October 1990 "Machnix" Conference, Burlington
> Vt.


I rewrote the kernel IPC code.  I believe the Amoeba/Mach comparison was
looking at network RPC throughput.  We are looking at ways to improve
the performance of network RPCs for Mach 3.0.

Rich

schmidt@crimee.ics.uci.edu (Doug Schmidt) (03/16/91)

In article <63274@bbn.BBN.COM> fkittred@spca.bbn.com (Fletcher Kittredge) writes:
++ You will be wanting to read the paper "A Revised IPC Interface", Richard
++ Draves, Proceedings of the October 1990 "Machnix" Conference, Burlington
++ Vt.

Speaking of documentation...  Can someone please inform me where to ftp
the latest PostScript sources of the CMU Mach documentation, e.g., the
kernel interface manual, the cthreads manual, etc.  I've looked on
cs.cmu.edu, but only the Mach 3.0 micro-kernel sources seem to be
there.  Is the documentation located someplace else?

	Thanks,

		Doug
--
His life was gentle, and the elements so            | Douglas C. Schmidt
Mixed in him that nature might stand up             | (schmidt@ics.uci.edu)
And say to all the world: "This was a man."         | (714) 856-4101
   -- In loving memory of Terry Williams (1971-1991)|

ast@cs.vu.nl (Andy Tanenbaum) (03/16/91)

In article <MORSE.91Mar14091659@quark.mpr.ca> morse@quark.mpr.ca (Daryl Morse) writes:
>My question is simple, though its answer likely is not: Why is the
>throughput of the Mach RPC so much slower than the other OSes? Are the
>respective RPCs different enough that throughput is a meaningless
>"apples and oranges" comparison? Has the Mach RPC simply not been
>optimized as heavily as that of the other OSes?

I am sure Rick can give the definitive answer for Mach, but I can speak for
Amoeba, and I think by implication for some of the others.  An RPC in
Amoeba follows the following path:

   1.  User process issues an RPC system call and traps to the kernel
   2.  Kernel mucks about with headers and sends a packet to the dest CPU
   3.  Dest kernel unmucks the headers and passes the packet to the user
   4.  User inspects the packet and traps to the kernel to send reply
   5.  More muck, another packet sent
   6.  Src machine gets the reply packet and passes it back to the user

These 6 steps take 1.1 msec on a Sun 3/60.  The protocol used on the
Ethernet is a straightforward protocol we have designed.  It is not IP.
No external servers of any kind are involved.  I believe that Mach 
involves external servers in the process, which of course is fatal for
the performance.  Thus we are comparing apples to apples.  From the user's
viewpoint, what is being measured is the time to send a message from itself
as client to a remote server over the Ethernet, and get the reply back.
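
For a sense of scale, a round trip time bounds the call rate of a client
that waits for each reply before sending the next request. The sketch
below is only arithmetic on the 1.1 msec figure above; the function name
is made up for illustration.

```python
def max_sequential_rate(round_trip_ms: float) -> float:
    """Upper bound on calls per second for one client that blocks on each reply."""
    if round_trip_ms <= 0:
        raise ValueError("round trip time must be positive")
    return 1000.0 / round_trip_ms

# 1.1 msec is the Amoeba null RPC time on a Sun 3/60 quoted above.
rate = max_sequential_rate(1.1)
print(f"at most about {rate:.0f} null RPCs per second per blocking client")
```

At 1.1 msec per round trip, a single blocking client tops out at roughly
900 null RPCs per second.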

Nevertheless, Amoeba processes can speak TCP/IP when desired.  There is an
external TCP/IP server available.  To speak TCP/IP, a client does an RPC
with the TCP/IP server and effectively says: "Please send this data as an
IP packet to such and such an IP address."  This gives full connectivity
with the Internet, but also means that the normal (local) case goes much
faster.  The loss of performance when going through the TCP/IP server is
not so important because usually the TCP connections go over a narrow-band
wide-area link anyway, so there is no way to get high-performance no matter
what.  In essence, we have chosen to optimize the local case, and accepted
worse performance when one specifically wishes to speak TCP/IP.  I believe
that Mach has chosen to do things differently.

Andy Tanenbaum (ast@cs.vu.nl)

ast@cs.vu.nl (Andy Tanenbaum) (03/17/91)

In article <4bsEGja00hsQI6K19M@cs.cmu.edu> Richard.Draves@cs.cmu.edu writes:
>I rewrote the kernel IPC code.  I believe the Amoeba/Mach comparison was
>looking at network RPC throughput.  We are looking at ways to improve
>the performance of network RPCs for Mach 3.0.

What are the current figures for the 3.0 microkernel for sending a null
message from user space on one machine over the Ethernet to another
user process and then back, i.e. the null RPC time?  Also, what is the
maximum user-to-user bandwidth in 3.0?  If possible, what are they
on Sun 3/60s, so they can be compared with the numbers I published in the
Dec. 1990 CACM?

Andy Tanenbaum (ast@cs.vu.nl)

bob@MorningStar.Com (Bob Sutterfield) (03/19/91)

In article <9332@star.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
   The protocol used [by Amoeba] on the Ethernet ... is not IP ...
   The loss of performance when going through the TCP/IP server is not
   so important because usually the TCP connections go over a
   narrow-band wide-area link anyway, so there is no way to get
   high-performance no matter what.  In essence, we have chosen to
   optimize the local case, and accepted worse performance when one
   specifically wishes to speak TCP/IP.

What about when the non-"local" case involves RPC with a machine on a
different Ethernet in the next room, accessible via a high-bandwidth
IP router?  IP is useful in environments other than wide area
networks.  Local distributed computing environments might involve
multiple networks connected by routers, rather than bridges or
repeaters, and it seems that you're designing in a penalty for Amoeba
users whose clusters grow too big for one Ethernet.

morse@quark.mpr.ca (Daryl Morse) (03/19/91)

In article <9332@star.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:

>   In article <MORSE.91Mar14091659@quark.mpr.ca> morse@quark.mpr.ca (Daryl Morse) writes:
>   >My question is simple, though its answer likely is not: Why is the
>   >throughput of the Mach RPC so much slower than the other OSes? Are the
>   >respective RPCs different enough that throughput is a meaningless
>   >"apples and oranges" comparison? Has the Mach RPC simply not been
>   >optimized as heavily as that of the other OSes?

>   I am sure Rick can give the definitive answer for Mach, but I can speak for
>   Amoeba, and I think by implication for some of the others.  An RPC in
>   Amoeba follows the following path:

>      1.  User process issues an RPC system call and traps to the kernel
>      2.  Kernel mucks about with headers and sends a packet to the dest CPU
>      3.  Dest kernel unmucks the headers and passes the packet to the user
>      4.  User inspects the packet and traps to the kernel to send reply
>      5.  More muck, another packet sent
>      6.  Src machine gets the reply packet and passes it back to the user

>   These 6 steps take 1.1 msec on a Sun 3/60.  The protocol used on the
>   Ethernet is a straightforward protocol we have designed.  It is not IP.

A number of people both posted and replied directly to say that the Mach
RPC runs over UDP/IP. In that light, the comparison was most certainly
one of "apples to oranges", but not for the reason given below. Unless
I am mistaken, that point was not very clearly indicated in either of
the comparisons. It would have been helpful if it had been.

>   No external servers of any kind are involved.  I believe that Mach 
>   involves external servers in the process, which of course is fatal for
>   the performance.  Thus we are comparing apples to apples.  From the user's

According to one posted reply <bill@cs.columbia.edu (Bill Schilit)>,
the difference is a result of the different transport, rather than the
fact that an external server is utilized (i.e., the transport, not the
architecture of Mach).

>   viewpoint, what is being measured is the time to send a message from itself
>   as client to a remote server over the Ethernet, and get the reply back.

>   Nevertheless, Amoeba processes can speak TCP/IP when desired.  There is an
>   external TCP/IP server available.  To speak TCP/IP, a client does an RPC
>   with the TCP/IP server and effectively says: "Please send this data as an
>   IP packet to such and such an IP address."  This gives full connectivity
>   with the Internet, but also means that the normal (local) case goes much
>   faster.  The loss of performance when going through the TCP/IP server is
>   not so important because usually the TCP connections go over a narrow-band
>   wide-area link anyway, so there is no way to get high-performance no matter
>   what.  In essence, we have chosen to optimize the local case, and accepted
>   worse performance when one specifically wishes to speak TCP/IP.  I believe
>   that Mach has chosen to do things differently.

Perhaps you can post some results of Amoeba RPC throughput over
TCP/IP, so we can see an "apples to apples" comparison? That would
likely be a more "fair" comparison.

>   Andy Tanenbaum (ast@cs.vu.nl)

I would also like to see an "oranges to oranges" comparison, namely
one where the Mach RPC runs over a less-expensive transport, as in the
"normal (local) case" for Amoeba. One respondent, who asked to be
identified only as a "highly placed source", hinted that such a
comparison might soon be possible:

>However, you might be interested to hear that X-kernel is being
>integrated into Mach as a basic, core system component.  The new "Mach
>network message server" will actually be the X-kernel, ported to Mach.
>So, once this is running, I would assume that Mach will run at
>X-kernel speeds.

You are correct. I am interested. Perhaps someone is willing to offer
some tangible comments on that?


--
Daryl Morse                     | Voice : (604) 293-5476
MPR Teltech Ltd. 		| Fax   : (604) 293-5787
8999 Nelson Way, Burnaby, BC    | E-Mail: morse@quark.mpr.ca
Canada, V5A 4B5                 |         quark.mpr.ca!morse@uunet.uu.net

gdtltr@brahms.udel.edu (root@research.bdi.com (Systems Research Supervisor)) (03/19/91)

In article <BOB.91Mar18130146@volitans.MorningStar.Com> bob@MorningStar.Com (Bob Sutterfield) writes:
=>In article <9332@star.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
=>   The protocol used [by Amoeba] on the Ethernet ... is not IP ...
=>   The loss of performance when going through the TCP/IP server is not
=>   so important because usually the TCP connections go over a
=>   narrow-band wide-area link anyway, so there is no way to get
=>   high-performance no matter what.  In essence, we have chosen to
=>   optimize the local case, and accepted worse performance when one
=>   specifically wishes to speak TCP/IP.
=>
=>What about when the non-"local" case involves RPC with a machine on a
=>different Ethernet in the next room, accessible via a high-bandwidth
=>IP router?  IP is useful in environments other than wide area
=>networks.  Local distributed computing environments might involve
=>multiple networks connected by routers, rather than bridges or
=>repeaters, and it seems that you're designing in a penalty for Amoeba
=>users whose clusters grow too big for one Ethernet.

   I don't want to speak for Dr. Tanenbaum, but I like to pop in every once
in a while and pretend that I know what I am talking about. :-)
   I believe that Amoeba 4.0 deals with the multiple LAN case with a
Fast Local Internet Protocol (FLIP). This adds a hint to a capability to
determine which subnet the service is on. This still wouldn't handle
IP-specific hardware, but it would handle more general high-bandwidth
gateways. Not all of us want to be slaves to existing protocols, especially
if there is greater performance to be gained through newer ones.

                                        Gary Duzan
                                        Time  Lord
                                    Third Regeneration



-- 
                            gdtltr@brahms.udel.edu
   _o_                      ----------------------                        _o_
 [|o o|]   Two CPU's are better than one; N CPU's would be real nice.   [|o o|]
  |_o_|           Disclaimer: I AM Brain Dead Innovations, Inc.          |_o_|

schmidt@crimee.ics.uci.edu (Doug Schmidt) (03/20/91)

In article <MORSE.91Mar18110412@quark.mpr.ca> morse@quark.mpr.ca (Daryl Morse) writes:
++ "normal (local) case" for Amoeba. One respondent, who asked to be
++ identified only as a "highly placed source" hinted that such a
++ comparison might soon be possible:
++
++ >However, you might be interested to hear that X-kernel is being
++ >integrated into Mach as a basic, core system component.  The new "Mach
++ >network message server" will actually be the X-kernel, ported to Mach.
++ >So, once this is running, I would assume that Mach will run at
++ >X-kernel speeds.

Hmm, did your highly placed source indicate whether the x-kernel
support would run in the kernel or in user-space as another server?
It seems that would make a big difference in terms of any significant
speed-up, since a major win of the x-kernel is that it avoids context
switches when moving from user-space to the kernel and vice versa.

Does anyone have any further info about this?

Doug
--
His life was gentle, and the elements so            | Douglas C. Schmidt
Mixed in him that nature might stand up             | (schmidt@ics.uci.edu)
And say to all the world: "This was a man."         | (714) 856-4101
   -- In loving memory of Terry Williams (1971-1991)|

dpj@CS.CMU.EDU (Daniel Julin) (03/20/91)

As indicated in several earlier posts, the major cost of the current
implementation of Mach IPC (RPC) appears to be the low-level transport
protocol used. The "standard" Mach netmsgserver uses TCP/IP over UNIX
sockets. TCP was selected because it provided the best combination of
performance and robustness on large, complicated networks. Note that
this issue of robustness is particularly important at CMU. There are
over 1700 machines on the CMU network, with almost 800 in the School
of Computer Science alone, of which more than 500 are running Mach. On
an average day, few of these machines might be actively engaged in
doing Mach network RPCs, but the chances that two communicating
machines are on the same physical cable are quite low. In addition, we
have to contend with a lot of background traffic, and non-negligible
packet losses.

We have been thinking about using different transport protocols in
different situations for quite a while, but never got around to it.
For the future, we are investigating the use of the x-kernel "virtual
protocols" mechanism to fulfill this function.
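
The idea of using different transports in different situations can be
sketched in a few lines of Python. This is only an illustration of the
general approach, not the x-kernel "virtual protocols" interface or any
Mach API: the transport names and the choose_transport function are
hypothetical, and "same IP subnet" is used as a crude stand-in for
"close enough for a cheap local protocol".

```python
import ipaddress

# Hypothetical transport labels; not part of Mach or the x-kernel.
LIGHTWEIGHT = "lightweight-local"
TCP = "tcp"

def choose_transport(dest_ip: str, local_network: str) -> str:
    """Pick a cheap transport for peers on the local subnet, TCP otherwise."""
    network = ipaddress.ip_network(local_network)
    if ipaddress.ip_address(dest_ip) in network:
        return LIGHTWEIGHT
    return TCP

# Documentation-range addresses, used purely as examples.
print(choose_transport("192.0.2.17", "192.0.2.0/24"))
print(choose_transport("198.51.100.9", "192.0.2.0/24"))
```

A real selector would also have to weigh packet loss and background
traffic, which is the robustness concern raised above for large networks.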

Our plans for the x-kernel are not finalized, but current prototypes
assume a system running at user-level in a Mach task. However, we are
also working on user-level device drivers, which should take care of
the argument against paying a protection boundary crossing for network
access. 

Finally, since someone asked, there is no "serious" netmsgserver
distributed with Mach 3.0. I have put together an adaptation of the
netmsgserver from the 2.5 system to use TCP sockets emulated in the
UNIX server, but this is clearly not an interesting long-term
solution. Again, in the longer term, we hope to use x-kernel
technology instead.


======================================================================
Daniel Julin                                            dpj@cs.cmu.edu
School of Computer Science
Carnegie Mellon University, Pittsburgh, PA 15213
======================================================================