mishkin@apollo.COM (Nathaniel Mishkin) (07/25/89)
In article <6569@joshua.athertn.Atherton.COM> joshua@Atherton.COM (Flame Bait) writes:
>I've finished work on a paper entitled "A Comparison of Commercial RPC
>Protocols" which I gave at NCF: The Network Computing Forum, and also
>as a work in progress at USENIX.
>
>It compares RPC (Remote Procedure Call) systems from Apollo, Sun and
>Netwise for speed and dependability. I'm posting the paper (in troff
>format) to comp.misc, so interested people can get it.

    ...

I am seeking to end any cross-posting concerning this article by posting
my response to comp.protocols.misc. I suggest all further discussion take
place there.
--
Nat Mishkin
Apollo Computer Inc., Chelmsford, MA
mishkin@apollo.com
mishkin@apollo.COM (Nathaniel Mishkin) (07/25/89)
In article <6569@joshua.athertn.Atherton.COM> joshua@Atherton.COM (Flame Bait) writes:
>I've finished work on a paper entitled "A Comparison of Commercial RPC
>Protocols" which I gave at NCF: The Network Computing Forum, and also
>as a work in progress at USENIX.

Below are some comments on Joshua Levy's "A Comparison of Commercial RPC
Systems" recently posted on comp.misc.

In section 3.1 Levy discusses the performance implications of the size of
UDP/IP datagrams. It is important to note that, in general, use of 8K UDP
datagrams is a losing proposition. This is because systems must fragment
such datagrams into pieces that will fit into data-link (MAC) level
packets. The size of such packets on ethernet is around 1500 bytes. Thus,
on ethernet, an 8K UDP datagram is actually sent as 5 (or so -- I'm not
bothering to figure out the size of headers and such) ethernet packets.
The problem with this is that, given the way IP does fragmentation (and
reassembly), if any of the 5 fragments is lost, they might as well all
have been lost -- neither UDP nor IP has any support for figuring out
what got lost and telling the sender what to resend. The receiving
application sees none of the fragments.

I have observed systems that are simply unable to handle 5 back-to-back
packets, as they would have to in order to successfully process a call
that had 8K of input parameters. In such a situation, the caller could
retransmit its 8K UDP datagram until the cows come home and it might
never be received intact. In general, fragmentation works only when
you're using TCP/IP, which has mechanisms for retransmitting lost pieces
of TCP streams.

The problem with large UDP datagrams is exacerbated if the path between
sender and receiver traverses one or more gateways. There are two reasons
for this: (1) If many senders of large UDP datagrams are trying to use
the same gateway simultaneously, the gateway will get lots of fragments
sprayed at it at a time.
The odds that it will drop one are thus increased. (2) As the number of
gateways between sender and receiver increases, so does the probability
that a packet will be lost (the total probability being a cumulative
function of the probability of each single gateway dropping a packet).

On the basis of the preceding, I would claim that the performance numbers
for Sun RPC on UDP are, in general, irrelevant, because, in general, it
can't work. Sure, it'll work in some restricted cases, but people need to
understand the serious restrictions.

In section 3.3 Levy uses the example of (putative) machines with 9-bit
bytes and how this would present a problem for Apollo's data
representation scheme (NDR, not RMR as in Levy's posting that announced
his paper). In fact, in case it wasn't clear, Sun's scheme (XDR) and
(presumably) ASN.1 have the same problem -- none of the schemes has any
special support for 9-bit bytes. They would presumably represent them all
in some larger standard type. Thus, the fact that NDR doesn't have a tag
that says "I use 9-bit bytes" produces no problems unique to NDR.

I want to take this opportunity to clear up one misconception about NDR
(which I know Levy doesn't have but which I know many people do): NDR's
representation suite is not open-ended. NDR is best thought of as a
"multi-canonical" scheme, i.e., the sender of data can choose from a
varied (but fixed) set of representations. We believe the set is usefully
large, but not so large as to be onerous for receivers. If a sender's
native representation happens not to exist in the set, it is the sender's
responsibility to convert to one of the NDR set before sending the data.
Note, however, that all the most common representation schemes currently
in existence are covered in NDR's suite. Thus, for systems with those
representations, no conversion is required on sending.
Further, for such systems, the number of conversion routines required for
any single system is small: 1 for integers (from big to little endian, or
from little to big endian), 1 for characters (from ASCII to EBCDIC, or
from EBCDIC to ASCII), and 3 for floating point (from 3 of the 4 possible
floating representations to the native representation). A total of 5
routines. Not a big deal. (Yes, the total number of conversion routines
across all systems is larger, but that's just a matter of coding them up
once, and you're done.)

In section 4.1 Levy discusses the "silent errors" possible with Sun RPC
over UDP as a result of duplicated UDP datagrams. Two other problem cases
arise in this mode: (1) If the input or output parameters are larger than
8K, the UDP solution cannot be used at all -- you must use Sun RPC over
TCP. (2) The caller must be able to approximate the time it will take for
the call to execute at the server end. In general, this is a hard number
to know (it may be a function of the speed and load of the server and of
the complexity of the input data presented on each call). Guessing the
wrong value will lead to spurious call failures.
--
Nat Mishkin
Apollo Computer Inc., Chelmsford, MA
mishkin@apollo.com
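[The fragmentation arithmetic in the post above can be sketched in a few
lines. This is a hypothetical illustration, not code from any of the
systems discussed; the 1500-byte Ethernet MTU and 20/8-byte IP/UDP
header sizes are standard figures, and both function names are invented.]

```python
# Sketch of the fragmentation argument: an 8K UDP datagram becomes
# several Ethernet frames, and losing any one frame loses them all.
import math

ETHERNET_MTU = 1500   # bytes of IP datagram that fit in one Ethernet frame
IP_HEADER = 20        # bytes, assuming no IP options
UDP_HEADER = 8        # bytes

def fragments_needed(udp_payload):
    """How many Ethernet frames one UDP datagram becomes after IP
    fragmentation.  Fragment data lengths must be multiples of 8 bytes."""
    per_frag = (ETHERNET_MTU - IP_HEADER) // 8 * 8   # data bytes per fragment
    data = UDP_HEADER + udp_payload                  # what IP must carry
    return math.ceil(data / per_frag)

def datagram_loss_prob(p_frame, n_frags):
    """Losing any one fragment loses the whole datagram, since neither
    UDP nor IP can ask for a single fragment to be resent."""
    return 1 - (1 - p_frame) ** n_frags

print(fragments_needed(8192))                  # 6 frames for an 8K payload
print(round(datagram_loss_prob(0.01, 6), 4))   # 0.0585: a 1% frame-loss
                                               # rate becomes ~6% per datagram
```

With exact header accounting the count comes out to 6 frames, consistent
with the "5 (or so)" estimate in the post, and the loss amplification is
roughly linear in the fragment count for small loss rates.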
brent%terra@Sun.COM (Brent Callaghan) (07/29/89)
In article <449d9c67.12879@apollo.COM>, mishkin@apollo.COM (Nathaniel Mishkin) writes:
> of 8K UDP datagrams is a losing proposition. This is because systems
> must fragment such datagrams into pieces that will fit into data-link
> (MAC) level packets. The size of such packets on ethernet is around
> 1500 bytes. Thus, on ethernet, an 8K UDP datagram is actually sent as
> 5 (or so -- I'm not bothering to figure out the size of headers and such)
> ethernet packets. The problem with this is that, given the way IP does
> fragmentation (and reassembly), if any of the 5 fragments is lost, they
> might as well all have been lost -- neither UDP nor IP has any support
> for figuring out what got lost and telling the sender what to resend.
> The receiving application sees none of the fragments.
>
> I have observed systems that are simply unable to handle 5 back-to-back
> packets, as they would have to in order to successfully process a call
> that had 8K of input parameters. In such a situation, the caller could
> retransmit its 8K UDP datagram until the cows come home and it might
> never be received intact. In general, fragmentation works only when
> you're using TCP/IP, which has mechanisms for retransmitting lost pieces
> of TCP streams.
>
> The problem with large UDP datagrams is exacerbated if the path between
> sender and receiver traverses one or more gateways. There are two reasons
> for this: (1) If many senders of large UDP datagrams are trying to use
> the same gateway simultaneously, the gateway will get lots of fragments
> sprayed at it at a time. The odds that it will drop one are thus increased.
> (2) As the number of gateways between sender and receiver increases,
> so does the probability that a packet will be lost (the total probability
> being a cumulative function of the probability of each single gateway
> dropping a packet).
>
> On the basis of the preceding, I would claim that the performance numbers
> for Sun RPC on UDP are, in general, irrelevant, because, in general,
> it can't work. Sure, it'll work in some restricted cases, but people
> need to understand the serious restrictions.

I can't follow this argument at all. You seem to be saying that 8K UDP
datagrams are no good because you *might* have problems. I agree with all
your observations on the problems that might happen -- but in practice
the problems that you described are rare.

For RPCs it's a big performance win if you can encapsulate multiple
request-response packets into a single UDP packet. Even if this packet is
fragmented, it's still much more efficient if the network is reliable
enough that packet drops are not a problem. In practice local area
networks ARE reliable enough to support 8K UDP packets efficiently.

Where packet drops are a problem (NFS across slow gateways and unreliable
links) the answer is obvious -- just reduce the packet size so that you
take less of a hit from drops. NFS users can control this with "rsize"
and "wsize" mount options. NFS users will also complain that performance
is noticeably worse when the packet size limit is reduced from 8K.

Made in New Zealand -->  Brent Callaghan @ Sun Microsystems
                         uucp: sun!bcallaghan
                         phone: (415) 336 1051
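[Brent's "reliable enough" claim can be made quantitative with a toy
model. This is a hypothetical sketch: it counts only frames on the wire,
ignoring latency and per-call CPU overhead (both of which favor large
transfers even further), and the function name and loss figures are
invented for illustration.]

```python
# Expected wire cost of moving 64K of data with big vs. small UDP
# datagrams, when a datagram with any dropped fragment is resent whole.
import math

MTU_PAYLOAD = 1472   # UDP payload bytes per unfragmented Ethernet frame

def expected_frames(total_bytes, udp_size, p_frame):
    """Expected Ethernet frames transmitted to deliver total_bytes."""
    frags = math.ceil(udp_size / MTU_PAYLOAD)   # frames per datagram
    p_ok = (1 - p_frame) ** frags               # whole datagram arrives intact
    datagrams = math.ceil(total_bytes / udp_size)
    return datagrams * frags / p_ok             # geometric retry cost

# On a clean LAN (0.1% frame loss), 8K datagrams cost fewer frames
# (fewer headers, no amplification to speak of):
print(round(expected_frames(65536, 8192, 0.001)))   # 48
print(round(expected_frames(65536, 1024, 0.001)))   # 64
# At 10% frame loss the amplification reverses the ranking:
print(round(expected_frames(65536, 8192, 0.1)))     # 90
print(round(expected_frames(65536, 1024, 0.1)))     # 71
```

The crossover is the whole debate in miniature: on reliable LANs the big
datagrams win, and on lossy internets they lose.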
wesommer@athena.mit.edu (William Sommerfeld) (07/30/89)
Brent,

Are you claiming that it's acceptable for a protocol to have to be tuned
by the user in order to avoid congesting gateways or even to work *at
all*? Are you also claiming that it's acceptable for someone writing an
application using a remote procedure call package to make packet sizes
user-visible?

There's no excuse for explicitly using UDP fragmentation. It will get
you into trouble.

By the way, Brent, please check your facts. Sun RPC does NOT encapsulate
multiple request or response packets into a single UDP packet. NFS merely
makes larger single requests if told to do so.

- Bill
brent%terra@Sun.COM (Brent Callaghan) (07/31/89)
In article <WESOMMER.89Jul29153901@anubis.athena.mit.edu>, wesommer@athena.mit.edu (William Sommerfeld) writes:
> Are you claiming that it's acceptable for a protocol to have to be
> tuned by the user in order to avoid congesting gateways or even to
> work *at all*?

No, I'm not claiming that it's a desirable property of an NFS
implementation that a sysadmin has to fudge around with read and write
sizes and timeouts. Such tuning would be unnecessary if it were provided
by the transport layer (where it properly belongs) -- unfortunately, UDP
doesn't offer anything useful here. There's nothing to prevent an NFS
implementation from varying timeouts and UDP packet sizes automatically
based on server response statistics. I agree, congestion control should
be automatic.

> Are you also claiming that it's acceptable for someone writing an
> application using a remote procedure call package to make packet sizes
> user-visible?

No, I'm not claiming that. It depends on the transport you use. If you
run the RPC on UDP then you must acknowledge an upper limit on the
message size. If you use TCP instead then you don't really have to worry.
If I'm using an unreliable datagram protocol like UDP then I have to
acknowledge a tenet of datacomm theory: short messages have a better
chance of getting through OK than long ones. If you have problems sending
long messages then it's worth trying shorter ones.

> There's no excuse for explicitly using UDP fragmentation. It will get
> you into trouble.

Yes, it will get you into trouble if you have a lot of drops between
client and server. If drops are negligible then you're a fool not to
exploit big packets. I can only argue from experience -- it seems that
NFS runs quite happily in most networks with fragmentation. Users
complain about poor performance when you don't fragment. What would you
rather have?

> By the way, Brent, please check your facts. SUN RPC does NOT
> encapsulate multiple request or response packets into a single UDP
> packet.
> NFS merely makes larger single requests if told to do so.

I'm sorry if I was unclear. I was trying to say that a single fragmented
UDP packet can encapsulate the "effect" of multiple smaller packets that
are not fragmented. If you have to move lots of data between client and
server with RPCs, it's much more efficient to do it in a few big requests
than in lots of little ones. Even if the big requests get fragmented, the
fragments don't have to be acknowledged individually by the destination.
This presupposes an acceptably small number of drops, which appears to be
true for most users of our NFS implementation.

Are there any NFS implementations that constrain their users from using
fragmented UDP packets?

Made in New Zealand -->  Brent Callaghan @ Sun Microsystems
                         uucp: sun!bcallaghan
                         phone: (415) 336 1051
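[The "few big requests vs. lots of little ones" point reduces to
counting RPC round trips. This is a hypothetical back-of-the-envelope
model, not measured data: the 5 ms round-trip time is invented, and a
simple stop-and-wait RPC client is assumed.]

```python
# Time to move 1 MB via stop-and-wait RPC: each call pays one round
# trip, and the bytes themselves move at the wire rate.
import math

def rpc_transfer_time(total_bytes, request_size, rtt, wire_rate):
    """Total seconds: one round trip per call plus serialization time."""
    calls = math.ceil(total_bytes / request_size)
    return calls * rtt + total_bytes / wire_rate

ONE_MB = 1 << 20
rate = 10e6 / 8          # 10 Mbit/s Ethernet, in bytes per second
print(round(rpc_transfer_time(ONE_MB, 8192, 0.005, rate), 2))  # 1.48 s
print(round(rpc_transfer_time(ONE_MB, 1024, 0.005, rate), 2))  # 5.96 s
```

Eight times fewer requests means eight times fewer round-trip waits,
which is exactly the win Brent describes -- provided the fragmented
requests actually arrive.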
mishkin@apollo.HP.COM (Nathaniel Mishkin) (07/31/89)
In article <118445@sun.Eng.Sun.COM> brent%terra@Sun.COM (Brent Callaghan) writes:
>In article <449d9c67.12879@apollo.COM>, mishkin@apollo.COM (Nathaniel Mishkin) writes:
>> of 8K UDP datagrams is a losing proposition. ...
>
>I can't follow this argument at all. You seem to be saying that 8K UDP
>datagrams are no good because you *might* have problems. I agree
>with all your observations on the problems that might happen - but in
>practice the problems that you described are rare.

I guess I have to disagree with the "rare". I have seen some systems
systematically fail to handle 5 back-to-back packets.

>Where packet drops are a problem (NFS across slow gateways and unreliable
>links) the answer is obvious - just reduce the packet size so that you
>take less of a hit from drops. NFS users can control this with "rsize"
>and "wsize" mount options.

The idea of reducing the packet size (i.e. the number of packets sent in
a burst) is fine. What's not fine is expecting users or applications to
do the reducing. It's the business of the underlying RPC system to handle
this. How would people react to a TCP implementation that had the
property that you'd have to notice that your "send"s were failing and
then make some call to reduce (say) the TCP window size?
--
Nat Mishkin
Apollo Computer Inc., Chelmsford, MA
mishkin@apollo.com
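[Mishkin's TCP-window analogy suggests what "the underlying RPC system
handles it" might look like. This is a hypothetical sketch, not NCS
code: the class, its methods, and the fragment limits are all invented.]

```python
# RPC-layer burst control in the style of TCP window adjustment:
# the application never sees a knob.
class AdaptiveBurst:
    """Shrink the number of back-to-back fragments when a call times
    out; creep back up on successful replies."""

    def __init__(self, max_frags=6):
        self.max_frags = max_frags
        self.frags = max_frags            # start optimistic

    def on_timeout(self):
        self.frags = max(1, self.frags // 2)              # multiplicative decrease

    def on_reply(self):
        self.frags = min(self.max_frags, self.frags + 1)  # additive increase

b = AdaptiveBurst()
b.on_timeout(); b.on_timeout()
print(b.frags)        # 1: down from 6 after two lost bursts
b.on_reply()
print(b.frags)        # 2: recovering one fragment per good reply
```

The asymmetry (halve on loss, add one on success) is the same shape TCP
congestion control later standardized on; the point here is only that
the adaptation lives below the application.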
beepy%commuter@Sun.COM (Brian Pawlowski) (07/31/89)
In article <449d9c67.12879@apollo.COM>, mishkin@apollo.COM (Nathaniel Mishkin) writes:
> Below are some comments on Joshua Levy's "A Comparison of Commercial
> RPC Systems" recently posted on comp.misc.
> In section 3.1 Levy discusses the implications on performance of the
> size of UDP/IP datagrams. It is important to note that, in general, use
> of 8K UDP datagrams is a losing proposition.

I have to object to the term "in general". I would respond that "in
general, the use of 8K UDP datagrams is a win for transferring large
amounts of data, and has been shown to be a win in applications such as
NFS." Performance for smaller transfer sizes (block sizes) entails
greater processing overhead in the application layer and more traversals
through the networking layers to accomplish the same overall data
transfer. This would imply to me that the total time to transfer the same
amount of data using smaller packets is greater.

> ethernet packets. The problem with this is that, given the way IP does
> fragmentation (and reassembly), if any of the 5 fragments is lost, they
> might as well all have been lost -- neither UDP nor IP has any support
> for figuring out what got lost and telling the sender what to resend.
> The receiving application sees none of the fragments.

For "noisy" or "loaded" networks with lots of packet loss, smaller
packets could make more sense. However, it is pessimistic to put this
forward as the "general" case.

> I have observed systems that are simply unable to handle 5 back-to-back
> packets, as they would have to in order to successfully process a call
> that had 8K of input parameters. In such a situation, the caller could
> retransmit its 8K UDP datagram until the cows come home and it might
> never be received intact. In general, fragmentation works only when
> you're using TCP/IP, which has mechanisms for retransmitting lost pieces
> of TCP streams.

Hmmmm... again, the "in general".
In general, systems incapable of handling back-to-back packets (sometimes
only two, I've seen) have problems. Again, smaller packets would make
sense so as not to thrash the poorly performing system. In general, over
an extensive network at Sun, 8K UDP packets provide greater throughput
and less overhead. (You know: tastes better -- less filling.)

I would argue against proposing "least common denominators" as the
general case; this is a pessimistic strategy. I think this points to
"the general" need for flexibility in a distributed application, using a
datagram protocol, to allow static or dynamic (preferred) configuration
of parameters such as UDP packet size (and retransmissions, with backoff
strategies).

> On the basis of the preceding, I would claim that the performance numbers
> for Sun RPC on UDP are, in general, irrelevant, because, in general,
> it can't work. Sure, it'll work in some restricted cases, but people
> need to understand the serious restrictions.

I like your argument, EXCEPT FOR THE FACT THAT IT FLIES IN THE FACE OF
WHAT WE SEE IN LARGE NFS NETWORK INSTALLATIONS. I think you are
approaching this from a pessimistic, worst-case scenario, which is a bad
way to deal with network throughput. In general, I have to assure you --
it works.

Brian Pawlowski <beepy@sun.com> <sun!beepy>
Sun Microsystems, NFS Development
craig@bbn.com (Craig Partridge) (07/31/89)
In article <118603@sun.Eng.Sun.COM> beepy%commuter@Sun.COM (Brian Pawlowski) writes:
>I have to object to the term "in general". I would respond that "in general,
>the use of 8K UDP datagrams is a win for transferring large amounts
>of data, and has shown to be a win in applications such as NFS."

I think this argument about "in general" is missing the key point.
Fragmentation is known not to be robust -- see Mogul and Kent's article
on this in the 1987 Proceedings of SIGCOMM. The basic rule is that *if
you rely on fragmentation to work you are building protocols that are
guaranteed to fail in many common situations.* Indeed, this is why many
of us were pleased to see Sun experiment with methods for dynamically
determining the transmission unit size for NFS (see Bill Nowicki's
article in the April '89 issue of ACM SIGCOMM Computer Communication
Review).

>I like your argument, EXCEPT FOR THE FACT THAT IT FLIES IN THE FACE
>OF WHAT WE SEE IN LARGE nfs NETWORK INSTALLATIONS. I think you are
>approaching this from a pessimistic, worst case scenario which is
>a bad way to deal with network throughput. In general, I have to assure
>you - it works.

This is a poor counter-argument. Saying "we haven't seen problems" when
there are demonstrable research results that show you will have problems
in certain situations is an ostrich approach. It leads to cruddy
protocols.

Craig

PS: I understand that there's an element of Sun vs. Apollo protocols
here. All I'm asking is that we stick with agreed-upon basic principles
of network protocol design in the debate -- one such principle is that
fragmentation is a bad idea.
TIHOR@CMCL1.NYU.EDU (Stephen Tihor) (08/01/89)
If we are talking real-world, you-bet-your-business sorts of issues, then
protocols must not fail in the WORST practical case. From observations of
events in the Internet, I assert that to the extent that the Sun internal
network convinces you that 8K UDP barrages are reasonable, it is not a
good model for much of the Internet.

It would have been straightforward to design simple feedback loops with a
hysteresis parameter, avoiding the known network problems of too direct a
feedback loop, into such protocols as the NFS packet size/multi-packet
code or RWHO.

Networking protocols should BE DESIGNED NOT TO FAIL IN THE GENERAL CASE.
Protocol designs have to be PESSIMISTIC, because you cannot ensure that
the world is a simple catenet of ethernets with moderately compatible
machines on them. Degrade performance if you can't figure out how to
autotune down, but don't just fail and require an end user to solve an
internetwork-level problem. The more widely used a protocol is intended
to be, the better it should be at handling bad environments.

[I did not find the original explanation for using UDP instead of TCP for
NFS RPC to be very convincing once Sun started pushing NFS as a standard
for remote file systems.]
-------
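[A "simple feedback loop with a hysteresis parameter" of the kind Tihor
describes might look like the following. This is a hypothetical sketch:
the class, the size ladder, and the thresholds are all invented, and no
implementation of the era is being quoted.]

```python
# Packet-size tuner with hysteresis: drop the size quickly after a few
# consecutive losses, but demand a much longer run of successes before
# raising it, so the size doesn't oscillate near a loss threshold.
class HysteresisTuner:
    def __init__(self, sizes=(1024, 2048, 4096, 8192),
                 down_after=2, up_after=32):
        self.sizes = sizes
        self.i = len(sizes) - 1      # start optimistic, at the largest size
        self.down_after = down_after
        self.up_after = up_after
        self.losses = 0
        self.successes = 0

    @property
    def size(self):
        return self.sizes[self.i]

    def report(self, delivered):
        """Feed back one datagram's fate; adjust the size with hysteresis."""
        if delivered:
            self.successes += 1
            self.losses = 0
            if self.successes >= self.up_after and self.i < len(self.sizes) - 1:
                self.i += 1          # promote only after a long clean run
                self.successes = 0
        else:
            self.losses += 1
            self.successes = 0
            if self.losses >= self.down_after and self.i > 0:
                self.i -= 1          # demote quickly on repeated loss
                self.losses = 0

t = HysteresisTuner()
print(t.size)                        # 8192
t.report(False); t.report(False)
print(t.size)                        # 4096 after two straight losses
for _ in range(32):
    t.report(True)
print(t.size)                        # back to 8192 after a long clean run
```

The asymmetric thresholds are the hysteresis: conditions hovering near a
loss boundary pin the size low instead of flapping between levels.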
TIHOR@cmcl1.nyu.edu (Stephen Tihor) (08/01/89)
When Sun chose not to use TCP, which is intended to do much of this work,
they had to choose either to close their eyes and pray or to implement
the missing functionality in other ways. It seems that they replaced a
sledgehammer with a ball-peen hammer and a prayer to save weight.
-------
joshua@athertn.Atherton.COM (Flame Bait) (08/02/89)
Summary of the RPC wars to date:

I posted a paper showing that Apollo RPC was slower than Sun RPC. Tony
(of Netwise) posted a paper showing the same thing. Nat (of Apollo)
pointed out that Sun's UDP-based RPC uses 8K packets, which causes
fragmentation to occur at the IP layer. Apollo uses 1K packets, which
never fragment. He says this fragmentation is a major problem. Other
folks pointed out that (in general) fragmentation is a bad thing. Brian
et al. (of Sun) say that fragmentation is not a problem in most (or
almost all) configurations, and think that Apollo is needlessly slowing
everyone down to a least-common-denominator speed.

I have three comments on this:

1. If for any reason UDP does not work, a person using Sun's RPC system
   can use TCP. (This option is not available to Apollo's RPC users, but
   is available to Netwise RPC users.) Except for very small packets,
   Sun's TCP-based RPC protocol is faster than Apollo's RPC protocol.
   Because of the differences in reliability described in my paper, I
   always compare Apollo to Sun TCP. Since they are both equally
   reliable, I think this is the fairest comparison.

2. The fragmentation which Nat is worried about is happening at the IP
   layer, which is two protocol layers below RPC. I think it is unfair
   to blame the RPC implementor for fragmentation which is happening
   outside of his control in the protocol stack. I might complain to
   Sun's UDP implementation group about the fragmentation, but not to
   their RPC group.

3. Apollo is much slower than Sun:

   For half-K packets, Sun TCP is about the same speed as Apollo.
   For 8K packets, Sun TCP is about three times faster than Apollo.
   For 16K packets, Sun TCP is about four times faster than Apollo.

   The exact numbers are in my paper. If you want a copy, look through
   the back issues of comp.misc, or email me. I respond to all email, so
   if you have not heard from me, try again with a different path, or
   give me a call.
Joshua Levy
Quote: "You can stand me up at the gates of Hell, but I won't back down!"
       -- Tom Petty
Addresses: joshua@atherton.com
           {decwrl|sun|hpda}!athertn!joshua
work: (408)734-9822   home: (415)968-3718
mishkin@apollo.HP.COM (Nathaniel Mishkin) (08/03/89)
In article <9574@joshua.athertn.Atherton.COM> joshua@atherton.com writes:
> 1. If for any reason UDP does not work, a person using Sun's RPC system
>    can use TCP. (This option is not available to Apollo's RPC users,
>    but is available to Netwise RPC users.)

Look: the theory is that the option is simply not necessary. You pose the
point as if it is a deficiency that the application isn't obliged to
consider what underlying protocol to use. I just don't see it that way.
If NCS/RPC (on UDP) is slower -- is intrinsically slower -- than some
other RPC running on TCP, then that's bad for NCS. I claim that NCS/RPC
is not intrinsically slower. There's no reason it can't be at least
comparable to (if not faster than) TCP/IP in bulk throughput. And in
short interchanges (a single client making single calls in turn to a
number of servers), it should be noticeably better than TCP/IP.

> 2. The fragmentation which Nat is worried about is happening at the IP
>    layer, which is two protocol layers below RPC. I think it is
>    unfair to blame the RPC implementor for fragmentation which is
>    happening outside of his control in the protocol stack. I might
>    complain to Sun's UDP implementation group about the fragmentation,
>    but not to their RPC group.

As I thought had been made clear, the problem isn't the UDP
implementation -- it's in the nature of UDP and of IP fragmentation.

> 3. Apollo is much slower than Sun:
>
>    For half-K packets Sun TCP is about the same speed as Apollo.
>    For 8K packets Sun TCP is about three times faster than Apollo.
>    For 16K packets Sun TCP is about four times faster than Apollo.

I know you make it clear in your paper, but I want to make it explicit
here: your tests were run with a version of NCS that simply did bulk data
throughput in the most naive way conceivable, under the assumption that
the primary use of RPC was not to move tons of data. I recognize this to
be a very small-minded assumption, but that's life.
In any case, I've repented for my sins, and my observation is that the
latest version of NCS (released on Apollos, in beta on non-Apollos) has
increased its bulk data throughput by 2-3 times. Performance work
continues and I hope to be able to provide some new numbers soon.
--
Nat Mishkin
Apollo Computer Inc., Chelmsford, MA
mishkin@apollo.com
joshua@athertn.Atherton.COM (Flame Bait) (08/04/89)
Nat (mishkin@apollo.com) writes:
>In article <9574@joshua.athertn.Atherton.COM> joshua@atherton.com writes:
>> 1. If for any reason UDP does not work, a person using Sun's RPC system
>>    can use TCP. (This option is not available to Apollo's RPC users,
>>    but is available to Netwise RPC users.)
>
>Look: The theory is that the option is simply not necessary. You pose
>the point as if it is a deficiency that the application isn't obliged
>to consider what underlying protocol to use. I just don't see it that
>way.

This is an important difference between us. I say that some applications
are better suited to UDP, others to TCP or VMTP. Since the application
programmer understands his application the best, he should choose which
protocol to use. You say that NCS's beefed-up UDP protocol is always the
best, for all (or almost all) applications. This may be true in the
future. (I doubt it, though.) It is certainly not true now. If we ever do
find a single communications protocol which is best for all (or almost
all) applications, we should call it HGP, standing for "Holy Grail
Protocol." :-)

In the meantime vendors should provide flexible RPC systems.

Joshua Levy
Quote: "The Street finds its own uses for technology." -- William Gibson
Addresses: joshua@atherton.com
           {decwrl|sun|hpda}!athertn!joshua
work: (408)734-9822   home: (415)968-3718