[comp.sys.isis] Why is ISIS slower than RPC?

ken@gvax.cs.cornell.edu (Ken Birman) (02/16/91)

We get asked about performance a lot, and today I was just asked
by two different groups if I could explain why ISIS is slower both in
round-trip latency and CPU costs than SUN RPC.    This is a very
long message intended for people who, in some cases, are investing a lot
of their own time in ISIS... it will not give the whole picture, but
it should help you have a sense of what ISIS currently does and what
it costs... and why we can do much better when we rebuild the system
over a microkernel.

If you measure ISIS performance, you will find that figures vary widely
depending on exactly how you measure it.  We have some nice graphs in
the revised "fast causal multicast" paper, but let me summarize the
picture here.

First, ISIS runs in bypass or non-bypass mode, and in the latter it
is very slow indeed.  This is discussed in the manual.  Basically,
messages get sent via a protocols server and each hop costs something.

In bypass mode, within a single group, the picture is more subtle.
If you measure the system,  you should see an ISIS RPC time of
something like 7.6ms round-trip for SUN4-SUN4 on an ethernet,
with the sort of degradation one would expect for packet size changes
or processor changes.  ISIS streaming will be around 650 null messages
per second or about 600kbytes/second, which is a little better than
TCP.  This is for small packets.

Actually, some users may see closer to 9ms, because the very first
copies of ISIS V2.1 had a problem whereby extra acks went out in some
situations.  These cost a fortune and we eventually found a way to suppress
most of them, but the fix was a little too messy to post.  It is in V3.0, of
course, and has been in V2.1 since the October distribution.

ISIS will also be especially poor in local (within one machine)
performance, and somewhat closer to SUN RPC in inter-machine figures.
ISIS will do better than SUN RPC for byte swapped communications, i.e.
SUN to a MIPS system, and for large messages.

You may also find that ISIS is spending most of the extra time in
system calls and in its memory management and tasking layer.

Much worse performance than this indicates a bug in ISIS or your tests.
For example, the Los Alamos thing with acks was slowing a test of theirs
down from what should have been about 50 multicasts per second to 1
per second.  Obviously, this stood out, but if you were to hit this
just on one multicast now and then....

Basically, the picture is this.  If you don't need what ISIS is
doing, the technology is a clear performance loser relative to SUN
RPC, but if you do need it, it is much FASTER than building an
equivalent layer over SUN RPC.  (Discussion below.)  The key question
on ISIS today, as opposed to two years out, is whether your primary
need is for fault-tolerance, consistency, and distributed control -- or
if it is basically for a fast RPC.

Let me summarize the main points I usually make on this.

First, performance is a relative thing.  The rebuild of ISIS on Mach will
probably be much faster than SUN RPC can possibly be over UNIX, because
ISIS protocols add only about a 2% overhead to "transport" costs, and
UNIX severely penalizes application-level transport protocols.  By
moving ISIS into the Mach communications kernel and hence close to the
wire, we can get down to a 1.5-2ms RPC between machines like the SUN4,
where we currently are sitting over a 2.5ms UNIX system call stack just
to get data on the wire or into our application.

SUN-RPC also has some "unfair" advantages.  One is that it is optimized for
the case of a single channel between two processes and for the case of
processes on the same machine.  And, it doesn't have any reliability
guarantees when the endpoints fail, which ISIS does.  ISIS is not optimized
within a machine -- it uses UDP, which goes down to the wire and back even
for local delivery because of a poor UNIX implementation.  And, it IS
optimized for the case of multiple
destination multicasts, where it begins to do slightly better than RPC.
Thus, if you do 3 or 4 RPC's sequentially and compare with an ISIS multicast
using a cheap protocol (cbcast or fbcast in BYPASS mode), we come pretty
close.  We still lose, but not by much.

Second, you need to keep in mind that ISIS is sending big packets due
to its flexible message format and its protocol structure.  This will be
reduced in the reimplementation, but figure on a 440 byte overhead for
each packet just because it was sent via ISIS and not directly.  For
small packets this makes a huge difference; thus, a null SUN RPC versus
a null ISIS RPC compares something like 32 bytes with something like 480!
Naturally, we suffer in this comparison.

Third, SUN RPC has some advantages in the way it deals with memory management
and system calls.  ISIS tends to need to do a lot of system calls.  With
2 channels open (1 to protos, one to the interclient layer), we need a select
at least once per message.  We also need to call gettimeofday because our
protocols are timer based, and we need to call recvfrom/sendto to move the
data.  UNIX system calls are expensive and this adds up.  Further, ISIS is
task based, and needs to create the tasks (warm started, but still expensive).
SUN RPC over SUN LWP is actually slower than ISIS if you do all these system
calls too, but perhaps that is a cheat -- since you normally wouldn't use
SUN RPC with two channels, checking the time constantly, and over LWP.
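
To make this concrete, the per-message system call pattern looks roughly
like the sketch below -- this is just the shape of it, not the actual ISIS
code, and it assumes one UDP socket to protos and one for the interclient
layer:

#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>

/* One pass of the per-message pattern: a select over both channels, a
 * gettimeofday for the timer-based protocols, and a recvfrom to pull in
 * the data.  A send adds a sendto on top of this. */
void poll_once(int protos_fd, int intercl_fd)
{
    fd_set rfds;
    struct timeval now, timeout;
    char buf[8192];

    FD_ZERO(&rfds);
    FD_SET(protos_fd, &rfds);
    FD_SET(intercl_fd, &rfds);

    gettimeofday(&now, (struct timezone *) 0);          /* system call 1 */

    timeout.tv_sec = 0;
    timeout.tv_usec = 10000;
    if (select(FD_SETSIZE, &rfds, (fd_set *) 0,
               (fd_set *) 0, &timeout) > 0) {           /* system call 2 */
        if (FD_ISSET(intercl_fd, &rfds))
            (void) recvfrom(intercl_fd, buf, sizeof buf, 0,
                            (struct sockaddr *) 0, 0);  /* system call 3 */
        /* ... hand the packet to the message layer, wake a task ... */
    }
}

Each of those calls is a kernel crossing, and UNIX charges for every one
of them.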

Fourth (probably should be zeroth), ISIS is normally used asynchronously.
So, you don't often see RPC performance as a limiting factor.  RPC measures
the latency between nodes, but you get a producer-consumer pipeline 
effect and if both sides stay busy, the round-trip time is much less of
an issue than in RPC, which is not normally used this way.  Of course, if
you use ISIS as an RPC system, this point doesn't hold -- but if you use
it as recommended, say for managing replicated data -- you almost never
wait for replies from remote processes.

This last point is the one that matters in, say, a brokerage trading floor.
If you are distributing quotes, a latency of a few milliseconds between 
the sender and the destinations is a small cost to pay, and fault-tolerance
may be a very big win.  The mental picture is like a UNIX pipe...
Many applications act this way, and round-trip latency just isn't the
major factor for them.

So, the picture is that one really shouldn't think of ISIS V3.0 as a
blazingly fast data transport system, although it does fairly well
considering the powerful group abstraction it supports.  Rather, we
recommend that people think of ISIS as a "control" mechanism that can
be combined with other facilities (even SUN RPC, as in Deceit!) where
weaker semantics are needed and performance is the key thing.  The people
who are happiest with ISIS generally run applications that communicate
infrequently, perhaps a few times per second, and for this, performance
is hardly the issue.

Now, to back my first point (if you build over RPC it would cost more
than you expect), say that you need to send a message reliably to n
destinations, i.e. with all or nothing behavior.  If you plan to
garbage collect the data, it isn't hard to see that you need a 3-phase
protocol:

		sender			typical dest
Phase 1:
		1. send m		2. receive and deliver m, ack
		3. collect acks
Phase 2:
		4. send "done"		5. note "done", ack
		6. collect acks
Phase 3:
		7. send "finished"	8. forget interaction, ack
		9. forget interaction

The reason for this is that you need to be able to handle failures.
In this protocol, if the sender dies, a typical destination can take
over the protocol from the stage the sender was in when it last heard
from it.  I.e., if a destination got a message "m", it can take over
as the sender from phase 1 if the sender dies before reaching phase 2.
Of course, this could lead to a destination getting duplicates, so
destinations need to remember the interaction until it is certain that
all have the packet.  This is why you need phases 2 and 3... ANY 
fault-tolerant protocol has a structure like this.  The ISIS protocol
does too -- but the second and third phases are overlaid on subsequent
first-phase communication and so you just don't see it!
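
In code, the sender's side of such a protocol has roughly the shape
sketched below.  The types and helper routines here are invented for
illustration (each send_to_all stands for n point-to-point sends, e.g.
over SUN RPC); this is not the ISIS implementation:

/* Sketch of the sender's side of the three-phase structure above.  The
 * types and helpers are hypothetical: send_to_all() is n point-to-point
 * sends and collect_acks() blocks until all n destinations answer. */
struct msg;                     /* application payload (opaque here) */
struct dest;                    /* one destination endpoint (opaque here) */

extern void send_to_all(struct dest *dests, int n, const char *tag,
                        struct msg *m);
extern void collect_acks(struct dest *dests, int n);
extern void forget(struct msg *m);

void reliable_multicast(struct msg *m, struct dest *dests, int n)
{
    /* Phase 1: deliver m.  Every destination delivers and remembers m,
     * so any of them can take over as sender after a failure. */
    send_to_all(dests, n, "m", m);
    collect_acks(dests, n);

    /* Phase 2: announce "done".  Destinations note that everyone has m,
     * but keep state so duplicates from a takeover can be suppressed. */
    send_to_all(dests, n, "done", m);
    collect_acks(dests, n);

    /* Phase 3: announce "finished".  Destinations forget the interaction,
     * and the sender can garbage collect its own copy of m. */
    send_to_all(dests, n, "finished", m);
    forget(m);
}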

In fact, to do what ISIS is doing, you really need to do something like
3*n RPC's to send one message to n destinations, unordered.  Obviously,
ISIS is performing better than this.  And, when I say need, I mean that
if your application wants to be fault-tolerant, it MUST do the 3*n RPC's!
This is because you need enough information after a failure to still
finish the protocol and suppress duplicates, and you need to eventually
forget the interaction or the state piles up.

This raises a real question of whether one should actually compare 
ISIS performance to one SUN RPC... or to 3 sequential ones....
For one destination, this will seem to be a crazy statement, but for
4 destinations, the answer is that 1 ISIS cbcast does work comparable
to what you would need something like 12 RPC's to do!

I have some data on ISIS performance, and Pat Stephenson has more.
You will see that little of the time is spent in anything having to
do with protocols or cbcast or anything (well, abcast is a little
worse).  The performance figures are dominated by the tasking subsystem
and message library (about 8% of total costs) and by the UNIX system
calls and transport of the data (about 85%).  The remainder (about 2%)
is spent in our protocol layer.  Some nice histograms and graphs of
multicast performance as a function of number of destinations, message
size, etc. will be in the revised "fast causal multicast" TR when we
re-release it (~ next week).  

We can provide that data if people would like to see it.  It will also
be in the V3.0 manual.

All this argues that our rebuild of ISIS should be able to do much better,
but also that there isn't much hope for a much faster version of ISIS over
UNIX.  I can't reduce the number of system calls, and ISIS tasks are
two to three times faster than SUN LWP for most operations.  This leaves
the message library, which probably isn't as fast as it should be, and the
ISIS windowing communication protocol, which is fast enough but not using
a very efficient message representation and hence is putting these big
headers on things (actually, 220 bytes per message, but it always puts
the user's message inside a transport message for 440 bytes minimum).
But, fixing this would only cut that 7.6ms figure to about 7.4ms...

SO... I would tell people not to plan on building anything that needs
to run super-fast and isn't actually in need of ISIS semantics directly
over ISIS -- control it with ISIS, perhaps, but use something cheaper
for the speed critical paths.  But, I would also point out that the bet
changes if you look at ISIS in the future, over Mach or Chorus, because
we should actually be able to eliminate almost all the idiocies of the
current UNIX platform in the context of the Mach network message server
or Chorus kernel.  Pure RPC may still be faster, but not much so.

What follows are our last performance measurements on the SUN4/bypass
code.  They do reflect a bug fix you might not have, because it was
made at the last minute.  So, the earliest V2.1 copies might be a little
slower and will be seen to be sending too many ack packets if you look
at a client dump.  V3.0 should be comparable.

ISIS V2.1 cost figures (but with extra-ack problem fixed).  These
figures are for BYPASS communication only.  Non-BYPASS is about 3-5
times slower for most things, 10-times slower for some especially
unfortunate cases.

Operation				Cost
Task create-destroy (null proc.)	170us (= t_fork_urgent, null proc.)
Task switch				103us (= t_wait + t_sig + task_swtch)
Null message create/destroy		24us
Same but use msg_gen("")		29us (i.e. 5us to scan null format)
Same but use msg_gen("%d")		+14us
Same but use msg_gen("%C", &x, 0)	+12.1us
Same but use msg_gen("%C", &x, 8k)	+1130us
Same but use msg_gen("%*C", &x, 8k)	+28us
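
For context, those last few rows correspond to user code along these
lines.  I'm assuming the usual isis.h header and message type here, and
as I read the +28us vs +1130us rows, "%*C" includes the 8k bytes by
reference rather than copying them:

#include "isis.h"            /* assumed toolkit header */

void build_examples()
{
    char x[8192];
    message *m1, *m2, *m3;

    m1 = msg_gen("%d", 42);          /* one scalar field: about +14us      */
    m2 = msg_gen("%C", x, 8192);     /* 8k bytes copied in: about +1130us  */
    m3 = msg_gen("%*C", x, 8192);    /* 8k bytes by reference: about +28us */

    /* m3 only points at x, so x has to stay alive until the message has
     * actually been sent. */
    msg_delete(m1);
    msg_delete(m2);
    msg_delete(m3);
}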

RPC to self:
cbcast for request, reply (both null)	1430us (1571us before "bug fix")
cbcast but inhibit "rcv"        	 406us   (hacked ISIS for this)
... est. for rcv, cbcast reply	        1024us
... est. for rcv a cbcast, no reply      309us
   (i.e. 1430 = 2*406 + 2*309)

fbcast for both				1195us
fbcast but discard message on "rcv"      320us
... est. for rcv, fbcast reply           795us
... est. for rcv a fbcast no, reply      239us
   (i.e. 1195 = 2*320 + 2*239)

From these figures it will be unclear why sending a cbcast is so
costly, since one would think we are just generating a message,
putting some fields in it, and firing it off... actually, this is
true, but the message ends up having a lot of fields in it (at
25-50us each) and we do a task switch (at 103us) and we need
to run through the "BCAST" routine.

A profile shows that the time is spread fairly uniformly:
Top ten routines		What they do

bypass_send	95us		Puts VT stuff in message
isis_runtasks	88us		Picks task to switch to
task_swtch	65us		Pure switch
BCAST		61us		generates msg, calls bypass_send
invoke		57us		Part of new-task create
mallocate	48us		Memory allocation
qu_add		46us		queue node allocate and append
msg_deallocate	43us		action routine for msg_delete()
do_bcast	35us		parses cbcast_l options, calls BCAST()
msg_insertfield	35us		action routine for msg_put()

Total for 10	612us

I used the same approach to measure the costs associated with the
ISIS layer that does reliable interclient communication ("TCP over
UDP", more or less)

This is the layer BELOW the bypass send and ABOVE the various
UNIX system calls.  It looks like this:

        Sender                          Recv
        gen packet
        net_send
            build intercl_packet
            insert header
            gettimeofday
            UDP xmit (sendto)
                ---------------->       select
                                        gettimeofday
                                        recvfrom
                                        msg_reconstruct
                                        net_rcv
                                              unpack messages
                                              deliver

For this test the sender and destination were on the same Sparc 1.

Size (xmitted)    Cost    Overhead: gen packet    Overhead: UNIX system calls
0   (+200)        2884 us       148 us                     1264 us
512 (+200)        3245 us       148 us                     1363 us
1k  (+200)        3445 us       148 us                     1590 us
2k  (+200)        4111 us       148 us                     2055 us
4k  (+200)        5276 us       148 us                     2984 us
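
If you want to get a feel for the system-call column on your own machine,
a loop like the one below (a sketch, not the actual test program) times
sendto/recvfrom round trips through a single UDP socket on the local host:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    int i, n = 1000;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    struct timeval t0, t1;
    char buf[1024];

    memset(buf, 0, sizeof buf);
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(9999);                 /* arbitrary test port */
    if (fd < 0 || bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0)
        return 1;

    gettimeofday(&t0, (struct timezone *) 0);
    for (i = 0; i < n; i++) {
        /* one sendto plus one recvfrom per iteration, both to ourselves */
        sendto(fd, buf, sizeof buf, 0, (struct sockaddr *) &addr, sizeof addr);
        recvfrom(fd, buf, sizeof buf, 0, (struct sockaddr *) 0, 0);
    }
    gettimeofday(&t1, (struct timezone *) 0);

    printf("about %ld us per sendto+recvfrom pair\n",
           ((long) (t1.tv_sec - t0.tv_sec) * 1000000L
            + (t1.tv_usec - t0.tv_usec)) / n);
    close(fd);
    return 0;
}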

... conclusions: unclear, but we could fill in the "loop" slide
a bit more accurately from these measurements.  Here is a picture of a typical 1-way
CBCAST from a sender to the destination, split (estimated) to show
where time is spent.

		   1K IPC (two processes, same host)
		send			rcv
cbcast "gen"    191=35+61+95		24us
task subsys	206us=103*2		500us=170+103*2+misc.
		-----------------------------------
intercl code                 1855 us
		-----------------------------------
UNIX sys calls               1590 us
               = 3* gettimeofday + sendto + select + recvfrom
		-----------------------------------
Actual IO        kernel bcopy, hidden within "UNIX sys costs"

I only have one machine at home so I didn't run this for a remote
destination.  The various test programs are in my gvax account:
	itest.c -- intercl test
	udp.c   -- system call costs
	looptest.c -- gets most of the "table" of numbers
	profile.c -- for profiled test

... all the above is for a Sparc 1 with isis compiled -O3 -DBYPASS
(not a Sparc1+).