gnb@bby.oz.au (Gregory N. Bond) (10/23/90)
Well, on a similar note.... I understand James' and Jon's arguments. Reliable datagrams are best implemented with TCP and a "write(len); write(data);" layer. I am looking for something a little different.

Consider a net with a server and many (say, 100) workstations, and a data feed that goes to each workstation. At the moment, I have to open 100 TCP streams, and so each packet of data generates 200 TCP packets, all more-or-less identical. What would be nice would be to broadcast the packet to the local net, and have the clients request missed packets, thus implementing a sort of reliable broadcast. I would be happy for some sort of overhead for the reliability (e.g. server broadcasts empty packets with the highest serial number once every 10 seconds), but before I re-invent wheels, has someone done something like this?

Greg.
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au   non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb
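[Greg's scheme -- numbered broadcast packets, client NACKs for the gaps, and a periodic "highest serial number" beacon -- is easy to prototype. Here is a toy model of the receiver side in Python; all names (Receiver, on_heartbeat, etc.) are made up for illustration, not from any real implementation.]

```python
# Toy model of the proposed reliable broadcast: the server broadcasts
# numbered packets, receivers detect gaps and NACK the missed serials.

class Receiver:
    def __init__(self):
        self.next_expected = 0      # next serial number we want to deliver
        self.buffer = {}            # out-of-order packets, keyed by serial

    def on_packet(self, serial, data):
        """Accept a broadcast packet; return the list of serials to NACK."""
        if serial < self.next_expected:
            return []               # duplicate, already delivered
        self.buffer[serial] = data
        return [s for s in range(self.next_expected, serial)
                if s not in self.buffer]

    def on_heartbeat(self, highest_serial):
        """Server's periodic 'highest serial' beacon; NACK anything missed."""
        return [s for s in range(self.next_expected, highest_serial + 1)
                if s not in self.buffer]

    def deliver(self):
        """Hand contiguous packets up to the application, in order."""
        out = []
        while self.next_expected in self.buffer:
            out.append(self.buffer.pop(self.next_expected))
            self.next_expected += 1
        return out

r = Receiver()
r.on_packet(0, "a")
nacks = r.on_packet(2, "c")   # packet 1 was dropped on the wire: NACK [1]
r.on_packet(1, "b")           # server retransmits; delivery is in order
```

[The heartbeat covers the nasty case where the *last* packet of a burst is lost and no later packet arrives to expose the gap.]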
craig@bbn.com (Craig Partridge) (10/23/90)
In article <GNB.90Oct23133132@leo.bby.oz.au> gnb@bby.oz.au (Gregory N. Bond) writes:
>Well, on a similar note....
>
>Consider a net with a server and many (say, 100) workstations, and a
>data feed that goes to each workstation. At the moment, I have to
>open 100 TCP streams, and so each packet of data generates 200 TCP
>packets, all more-or-less identical. What would be nice would be to
>broadcast the packet to the local net, and have the clients request
>missed packets, thus implementing a sort of reliable broadcast.

There's been some muttering over beers that this might be feasible in TCP. If you think about the idea, it isn't so crazy. To make sure the workstations are in sync, you'll need some sort of windowing mechanism. To be sure everyone has data, you'll need an acknowledgment scheme. That's pretty close to the basic service TCP offers.

So if you could open a TCP connection to an IP multicast address, and figure out how to handle the remote ends cleanly at the sender, you'd be pretty far along. (And 1 sending workstation gets, at worst, 100 acks from 100 receivers -- less if receivers ack every 2nd segment.)

I believe Van Jacobson and Jon Crowcroft looked at this problem in more detail and may well have something more to add.

Craig
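[Craig's sketch amounts to keeping ordinary per-receiver acknowledgment state at the sender while putting only one copy of each segment on the wire. A toy model of that sender-side bookkeeping, with made-up names, where data may leave the retransmit buffer only once *every* receiver has acked it:]

```python
# Multicast sender with per-receiver ack state: the send window is
# governed by the slowest receiver, as in a worst-case TCP group.

class MulticastSender:
    def __init__(self, receivers, window=8):
        self.acked = {r: 0 for r in receivers}  # highest seq acked per receiver
        self.window = window
        self.next_seq = 0

    def may_send(self):
        """Flow control: blocked once the slowest receiver's window is full."""
        return self.next_seq - min(self.acked.values()) < self.window

    def send(self):
        assert self.may_send()
        self.next_seq += 1          # one multicast packet on the wire

    def on_ack(self, receiver, seq):
        self.acked[receiver] = max(self.acked[receiver], seq)

    def releasable(self):
        """Prefix of the stream that every receiver has acknowledged."""
        return min(self.acked.values())
```

[Note the asymmetry Greg wanted: `send` costs one packet regardless of group size; only the acks scale with the number of receivers.]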
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/24/90)
In article <GNB.90Oct23133132@leo.bby.oz.au>, gnb@bby.oz.au (Gregory N. Bond) writes:
> ....
> Consider a net with a server and many (say, 100) workstations, and a
> data feed that goes to each workstation. At the moment, I have to
> open 100 TCP streams, and so each packet of data generates 200 TCP
> packets, all more-or-less identical. ...

I've heard that XTP does this sort of thing now. PEI in the SGI booth at Interop 90 was showing just such an application, sending video from one camera to two screens. It would be nice if the XTP "bucket algorithm" were glommed into TCP, and an RFC written for a TCP option/extension/whatever. I've heard that it would be technically possible.

Vernon Schryver, vjs@sgi.com
rpw3@rigden.wpd.sgi.com (Rob Warnock) (10/24/90)
In article <60282@bbn.BBN.COM> craig@ws6.nnsc.nsf.net.BBN.COM (Craig Partridge) writes:
+---------------
| So if you could open a TCP connection to an IP multicast address, and
| figure out how to handle the remote ends cleanly at the sender, you'd be
| pretty far along. (And 1 sending workstation gets, at worst, 100
| acks from 100 receivers -- less if receivers ack every 2nd segment).
+---------------
XTP multicast receivers send the ACKs to the multicast address too, which allows for something the XTP spec calls "damping" (which I prefer to call "stifling"). If receiver "A" hears another receiver "B" emit an ACK with "worse" news than what "A" was going to ACK, then "A" "damps" its ACK (stifles itself, stays quiet). A helpful related feature is "slotting", wherein a receiver delays a random amount of time before sending an ACK, in the hope that someone else will send "worse" news and it can stifle itself.

Damping and slotting are optional features in XTP; with a small number of receivers, they have some negative throughput implications due to the increased delay on the ACKs and the (slightly) increased overhead of running the damping algorithm. But with very large numbers of receivers they can save a good bit of network bandwidth. What would otherwise be a large burst of ACKs (one from every station) is replaced by a much smaller flurry of ACKs, each one bearing "worse news" than its predecessor. The sender retransmits based on the "worst news" (lowest ACK number) heard during each epoch of ACK-gathering (called "buckets" in the XTP spec). And all this works in the absence of a reliable group-membership protocol. However, in order to avoid "leaving a receiver behind", you have to extend your retransmission buffers and ensure that you get "enough" epochs of ACK-gathering (enough "buckets") so that the probability of losing "too many" consecutive ACKs from the receiver with the worst news is "as low as you like".
But increasing the number of epochs per unit time increases the number of ACKs, and thus the network overhead. (*sigh*) [The tradeoffs between retransmission buffer size, ACKs per RTT, and probability of losing a receiver are discussed in detail in "Appendix B" of the "XTP Protocol Definition, Revision 3.5".]

Anyway, as Vernon Schryver mentioned, it certainly *ought* to be possible to graft the XTP/multicast "bucket algorithm" onto TCP + IP/multicast...

-Rob
-----
Rob Warnock, MS-9U/510    rpw3@sgi.com rpw3@pei.com
Silicon Graphics, Inc.    (415)335-1673    Protocol Engines, Inc.
2011 N. Shoreline Blvd. Mountain View, CA 94039-7311
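[The damping/slotting behaviour Rob describes can be shown in a small simulation. This is a deliberately loose sketch, not the XTP algorithm itself: a shuffle stands in for the random slot delays, and "news" is just the highest contiguous sequence number each receiver holds (lower = worse).]

```python
# Toy simulation of XTP-style "slotting" (random ACK delay) and
# "damping" (stay quiet if overheard news is at least as bad as ours).
import random

def gather_acks(news_by_receiver, rng=random):
    """Return the ACKs actually transmitted, in transmission order."""
    order = list(news_by_receiver)
    rng.shuffle(order)              # slotting: random per-receiver delays
    sent = []
    worst_heard = None
    for rcvr in order:
        news = news_by_receiver[rcvr]
        # damping: stifle if someone already reported equal-or-worse news
        if worst_heard is not None and worst_heard <= news:
            continue
        sent.append((rcvr, news))
        worst_heard = news
    return sent

# 100 receivers, most caught up at seq 50, a few stragglers:
news = {f"r{i}": 50 for i in range(97)}
news.update({"r97": 42, "r98": 30, "r99": 30})
acks = gather_acks(news)
```

[Each ACK that does go out bears strictly worse news than its predecessor, and the sender retransmits from the worst (lowest) number heard -- here 30 -- while the burst of 100 ACKs collapses to at most a handful.]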
fddi@gec-rl-hrc.co.uk (FDDI Group (Ian Wakeman - A26)) (10/24/90)
In article <GNB.90Oct23133132@leo.bby.oz.au> gnb@bby.oz.au (Gregory N. Bond) writes: >I would be happy for some sort of overhead for the reliability (e.g. >server broadcasts empty packets with the highest serial number once >every 10 seconds), but before I re-invent wheels, has someone done >something like this? >Greg. although it may be heresy to say it in this group, XTP has some reliable multicast features inside of it, although I doubt whether they've been tested on a WAN, and claiming reliable multicast without group management facilities is a trifle absurd - how do you know that all possible respondees have replied? (Yes, I know that group management is then delegated to the session management :-) ian
rsalz@bbn.com (Rich Salz) (10/25/90)
In <GNB.90Oct23133132@leo.bby.oz.au> gnb@bby.oz.au (Gregory N. Bond) writes:
|Consider a net with a server and many (say, 100) workstations, and a
|data feed that goes to each workstation. At the moment, I have to
|open 100 TCP streams, and so each packet of data generates 200 TCP
|packets, all more-or-less identical. What would be nice would be to
|broadcast the packet to the local net, and have the clients request
|missed packets, thus implementing a sort of reliable broadcast.

I believe this is what the ISIS distributed programming environment provides. Write to ken@gvax.cs.cornell.edu for more info, or read the Usenet newsgroup comp.sys.isis.

/rich $alz
--
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.
santi@osf.org (Michael Santifaller) (10/25/90)
In article <GNB.90Oct23133132@leo.bby.oz.au>, gnb@bby.oz.au (Gregory N. Bond) writes:
> Well, on a similar note....
>
> I understand James' and Jon's arguments. Reliable datagrams are best
> implemented with TCP and a "write(len); write(data);" layer. I am
> looking for something a little different.
>
> Consider a net with a server and many (say, 100) workstations, and a
> data feed that goes to each workstation. At the moment, I have to
> open 100 TCP streams, and so each packet of data generates 200 TCP
> packets, all more-or-less identical. What would be nice would be to
> broadcast the packet to the local net, and have the clients request
> missed packets, thus implementing a sort of reliable broadcast.

I would use broadcast RPC to do this. Sun RPC, for example, allows broadcasts to several servers simultaneously; you can get a reply from each and compare this with your list of recipients. I have no idea what the overhead of such an algorithm is, since the broadcasts are done through the portmapper on each system. Give it a try and make some measurements to find out its feasibility. RPC programming is easy to do.

Michael Santifaller
------------------------------------------------------------------------
Michael Santifaller, PentaCom GmbH (Yes, OSF uses NCS, but then I'm not an OSF employee)
rpw3@rigden.wpd.sgi.com (Rob Warnock) (10/25/90)
In article <1990Oct24.094329.5037@gec-rl-hrc.co.uk> fddi@hrc63.UUCP (FDDI Group (Ian Wakeman - A26)) writes:
+---------------
| although it may be heresy to say it in this group, XTP has some reliable
| multicast features inside of it, although I doubt whether they've been
| tested on a WAN,...
+---------------
True. All of the XTP/multicast applications I know of are on LANs. But XTP incorporates most of the current thinking about "slow-open", RTT estimation, and congestion control [shamelessly borrowed from TCP!], so XTP/multicast ought to work on a WAN (except there aren't any XTP routers... yet).
+---------------
| ...and claiming reliable multicast without group management
| facilities is a trifle absurd - how do you know that all possible respondees
| have replied?
+---------------
In XTP's "mostly reliable" mode, you set a service parameter (called "E" in Appendix "B" of the XTP 3.5 spec) which is how many *consecutive* negative-ACKs from any *single* station you want to be able to tolerate losing. The XTP "bucket algorithm" then ensures that at least that many attempts have been made to hear from everyone before releasing data from retransmission buffers. Larger values of "E" require larger retransmission buffers (or you can keep the size of the retransmission buffers down by cranking up another parameter "N", at the cost of more control packet and ACK traffic -- nothing's free).

If the probability of dropping an ACK from a given station is "p" (presumably much less than 1), then the probability of that station falsely being "left behind" is not worse than p^E. As long as you have a finite error rate and enough memory for retransmission buffers (or enough spare network bandwidth), you can make p^E "as small as you like". For example, you might choose to set E such that p^E was less than the probability that the station will spontaneously crash before the connection completes. In that case, "mostly" reliable is as reliable as it gets.
;-}

+---------------
| (Yes, I know that group management is then delegated to the session
| management :-)
| ian
+---------------
Not really. There is *no* group management in the "mostly reliable" mode. Stations can join and drop out of a connection, while getting reliability "as good as they like" during the time that they're joined. Maybe the following [admittedly loose anthropomorphic] analogy will help:

When you first arrive at a cocktail party, you aren't a member of, say, "that conversation over there", and no-one pays any attention to whether you are hearing or missing what's being said. But if you like, you can walk over and stand within hearing range. Still, you have done nothing overt to "join" the conversation. But now it is possible for you to send negative-ACKs ["excuse me? what did you say?"] to cause retransmission. Provided the speaker is willing to back up often and far enough for you [his "retransmission buffers" are large enough] and your ACK traffic does not exceed what is considered good taste, you can get reliability "as good as you like". Yet all you have to do to leave the conversation is walk away and cease sending NACKs. Again, no overt group membership protocol was utilized. In fact, the only effect on the conversation may be, especially if you were slow or hard of hearing, that the average data rate goes up somewhat after you leave as the RTT or congestion estimators adapt to the new set of listeners [i.e., minus you]. (Of course, if you had a good receiver [ears], a high input processing rate, and a bit of patience, you may never have had to send a NACK -- someone else may have always beaten you to it. I mentioned this in a previous message about XTP's "damping" and "slotting" of ACKs.)

Anyway, the "cocktail party" analogy is intended to indicate why the "mostly reliable" mode of multicast might have some domains of applicability. In fact, this is the mode in which most of the known XTP multicast users are operating.

-Rob

p.s.
There is an XTP TAB sub-group activity on group management stuff to support "fully reliable" XTP multicast, but it looks to me like a longer-term standards activity... ----- Rob Warnock, MS-9U/510 rpw3@sgi.com rpw3@pei.com Silicon Graphics, Inc. (415)335-1673 Protocol Engines, Inc. 2011 N. Shoreline Blvd. Mountain View, CA 94039-7311
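[Rob's p^E argument is easy to check numerically. The figures below are illustrative, not from the XTP spec: with independent ACK losses at rate p, E consecutive losses from one station occur with probability p**E, so even modest E values make "mostly reliable" very reliable indeed.]

```python
# Numeric check of the "mostly reliable" bound: probability of falsely
# leaving a station behind is at most p**E (E consecutive lost ACKs).
import math

def p_left_behind(p, E):
    """Chance of losing E consecutive ACKs from one station."""
    return p ** E

def buckets_needed(p, target):
    """Smallest E with p**E <= target (how many ACK epochs to run)."""
    return math.ceil(math.log(target) / math.log(p))

# With 10% ACK loss, three epochs already push the risk to one in a
# thousand; the cost is retransmission buffer space, as Rob notes.
risk = p_left_behind(0.1, 3)
```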
wunder@HPSDEL.SDE.HP.COM (Walter Underwood) (10/25/90)
I'm surprised that no one has mentioned Cornell's ISIS system. ISIS is a reliable multicast system with ordering guarantees. The hard part about multicast designs is not the protocol, but proving that the resulting system behaves correctly.

ISIS provides a choice of orderings; two of the choices are causal ordering within the process group and causal ordering across all groups. This ordering makes designs much, much simpler. If you are already guaranteed that processes will see the same messages in the same order, then any deterministic calculations will get the same result. Notifications of failures in the process group (or of processes joining or leaving) are ordered like other messages.

For more info, see the newsgroup comp.sys.isis, or send mail to Ken Birman (ken@cs.cornell.edu). The ISIS code is available, and runs on LOTS of systems.

wunder
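[Walter's point -- same messages in the same order implies identical results from deterministic computations -- can be shown in miniature. This toy uses a trivial in-process sequencer to impose the order; ISIS's actual causal/total ordering protocols are of course far more sophisticated, and all names here are invented.]

```python
# Replicated deterministic state machines: identical message order
# guarantees identical state, even for order-sensitive operations.

class Replica:
    def __init__(self):
        self.log = []
        self.balance = 0

    def apply(self, op, amount):
        self.log.append((op, amount))
        if op == "add":
            self.balance += amount
        elif op == "double":
            self.balance *= 2   # order-sensitive: add-then-double != double-then-add

def sequence_and_deliver(replicas, messages):
    """Toy sequencer: every replica sees the same messages in the same order."""
    for op, amount in messages:
        for r in replicas:
            r.apply(op, amount)
```

[Without the ordering guarantee, two replicas receiving ("add", 5) and ("double", 0) in opposite orders would silently diverge -- exactly the class of bug the ISIS orderings are designed to exclude.]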
J.Crowcroft@CS.UCL.AC.UK (Jon Crowcroft) (10/26/90)
>So if you could open a TCP connection to an IP multicast address, and
>figure out how to handle the remote ends cleanly at the sender, you'd be
>pretty far along.
> (And 1 sending workstation gets, at worst, 100
>acks from 100 receivers -- less if receivers ack every 2nd segment).
>I believe Van Jacobson and Jon Crowcroft looked at this problem in
>more detail and may well have something more to add.

Craig,

we kind of figured out the small change necessary to TCP. Essentially, you send to the multicast address, but receive acks from each member of the multicast group's individual class A-C addresses. You have a tcpcb per member, and run each connection state machine as normal, but link the tcpcbs so you know it's a group communication.

To start the whole shebang, you send a SYN to the group, you get SYN-ACKs back and gradually build up the set of tcpcbs (instead of just allocating one at the start)... When you have the full group connection, you then allow the sender to do writes on the socket... Each write may block if we are still flow controlled or not acked on any one connection...

For many-to-one (i.e. lotsa folks sending to us) you can overload the readv interface, and return a vector of single reads...

It shouldn't take a bright person with BSD source more than a day to change, and a week to debug... The same thing could be done with broadcast, without the multicast IP, but that is certainly a VERY BAD IDEA :-)

jon
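[Jon's recipe -- one control block per member, built up as SYN-ACKs arrive, with writes permitted only once the group is complete and no member is flow-controlled -- can be rendered as a toy state model. The names (GroupTCB, etc.) and the flat dict standing in for a real tcpcb are made up for illustration; this is nothing like actual BSD kernel code.]

```python
# Toy model of a "group TCP" connection: linked per-member control
# blocks, a SYN-to-group open phase, and group-wide write blocking.

class GroupTCB:
    def __init__(self, expected_members):
        self.expected = set(expected_members)
        self.tcbs = {}                       # one per-member "tcpcb"

    def on_syn_ack(self, member):
        """A member answered the SYN sent to the group address."""
        if member in self.expected:
            self.tcbs[member] = {"state": "ESTABLISHED", "window": 4096}

    def connected(self):
        """Writes are allowed only once the full group has answered."""
        return set(self.tcbs) == self.expected

    def write_would_block(self, nbytes):
        """Block if any single member's connection can't take the data."""
        if not self.connected():
            return True
        return any(tcb["window"] < nbytes for tcb in self.tcbs.values())
```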
huitema@jerry.inria.fr (Christian Huitema) (10/26/90)
The subject of reliable broadcast protocols was addressed in the early 80's in two different contexts, i.e. satellite networks and distributed systems. An interesting approach to the use of broadcast addresses for distributed systems (in fact, broadcast LANs) is that of Dr. Maxemchuk, who proposed a token-passing "conference" protocol. Basically, the station which receives the token synchronizes first with the previous stations, requesting a copy of all "missed" messages; options are to deliver this copy "point to point" or globally. The procedure uses a single packet counter, and maintains a global ordering of the messages -- which is indeed very useful for consistently managing a given system. It sort of minimizes the ack flow, as acks are only transmitted during the token exchange; there is however a large overhead in semi-silent systems, due to the rotations of the token.

As far as satellite networks are concerned, a quite exhaustive work was conducted at INRIA in the NADIR project between 1981 and 1985, for developing an efficient and reliable "bulk broadcasting" protocol. In this protocol, the need for "1 ACK per station per packet" was alleviated by assuming a constant (unidirectional) message flow; the ACKs are explicitly requested at well-spaced "check points", e.g. at end of file. An option is to use unsolicited "NACKs" in order to request rapid resynchronisation of a particular recipient; another option is to pass a list of "missing packet ids" upon a check point. These protocol variants have been described in several papers by the members of the project, e.g. J-L. Grange', I. Valet, J. Radureau or myself. The project is now terminated, and I am not aware of any continuation work.
The use of either of these techniques in an internet (by contrast with a simple LAN or a controlled satellite channel) would pose at least two severe problems:

* a group composition problem: in order to control the reception by "a group", one must be able to individually identify all the members of the group. What if this membership changes in the course of time?

* a potentially severe flow control problem: when the routes to the members of the group have widely different capacities, how does one organize the slow-down of the transmission to match the most constrained channel?

Anyhow, this could be the basis of an interesting research work...

Christian Huitema
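[The NADIR-style checkpoint scheme Christian describes replaces per-packet ACKs with occasional status reports: at each check point every recipient lists the packet ids it is still missing, and the sender unions the lists into one retransmission pass. A purely illustrative sketch:]

```python
# Checkpoint-based reliable bulk broadcast: recipients report missing
# packet ids only when explicitly asked, instead of acking every packet.

def checkpoint_report(received_ids, highest_sent):
    """Recipient's reply to a check point: the ids it still needs."""
    return [i for i in range(highest_sent + 1) if i not in received_ids]

def retransmit_set(reports):
    """Sender unions all reports to plan a single retransmission pass."""
    need = set()
    for r in reports:
        need.update(r)
    return sorted(need)
```

[The appeal for satellite links is clear: with 100 stations, one check point costs 100 small reports total, rather than 100 ACKs per packet.]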
ken@gvax.cs.cornell.edu (Ken Birman) (10/30/90)
In article <9010251640.AA01218@hpsdel.sde.hp.com> wunder@HPSDEL.SDE.HP.COM (Walter Underwood) writes:
>I'm surprised that no one has mentioned Cornell's ISIS system....
...

I don't normally follow this group, but someone pointed out the two postings that mention ISIS, so I am including some information on our system, below. The version of ISIS mentioned here (V2.1) has been available since early September and is apparently quite solid. You can copy it anonymously from several places and it does solve the reliable broadcast (and datagram!) problems. As noted in this posting, there is also a newsgroup for ISIS-related discussion. It tends to run in broadcast mode, but questions can certainly be posted there and one of my group members will respond, or I will.

We have a forthcoming commercial release of ISIS, from a company called ISIS Distributed Systems Incorporated. Email to me for details and I can see to it that you get on the IDS mailing list. The company version of ISIS is somewhat extended over the public one, and faster in some modes of use, but the public copy should be fine for figuring out what ISIS is all about. In fact, many companies are using ISIS as part of distributed systems products now. The product will be priced "aggressively", and has some nice distributed application management utilities layered over it. As noted below, all the ISIS papers and the manual are available upon request, and many can be copied anonymously if you have a postscript printer.

Ken Birman (reply to me, as I can't afford to follow any more newsgroups!)

--- ISIS V2.1 blurb ---

This is to announce the availability of a public distribution of the ISIS System, a toolkit for distributed and fault-tolerant programming. The initial version of ISIS runs on UNIX on SUN, DEC, GOULD, AUX and HP systems; ports to other UNIX-like systems are planned for the future. No kernel changes are needed to support ISIS; you just roll it in and should be able to use it immediately.
The current implementation of ISIS performs well in networks of up to about 100-200 sites. Most users, however, run on a smaller number of sites (16-32 is typical) and other sites connect as "remote clients" that don't actually run ISIS directly. In this mode many hundreds of ISIS users can be clustered around a smaller set of ISIS "mother sites"; many users with large networks favor such an architecture.

--- Who might find ISIS useful? ---

You will find ISIS useful if you are interested in developing relatively sophisticated distributed programs under UNIX (eventually, other systems too). These include programs that distribute computations over multiple processes, need fault-tolerance, coordinate activities underway at several places in a network, recover automatically from software and hardware crashes, and/or dynamically reconfigure while maintaining some sort of distributed correctness constraint at all times. ISIS is also useful in building certain types of distributed real time systems. Here are examples of problems to which ISIS has been applied:

o On the factory floor, we are working with an industrial research group that is using ISIS to program decentralized cell controllers. They need to arrive at a modular, expandable, fault-tolerant distributed system. ISIS makes it possible for them to build such a system without a huge investment of effort. (The ISIS group is also working closely with an automation standards consortium called ANSA, headed by Andrew Herbert in Cambridge.)

o As part of a network file system, we built an interface to the UNIX NFS (we call ours "DECEIT") that supports transparent file replication and fault-tolerance. DECEIT speaks NFS protocols but employs ISIS internally to maintain a consistent distributed state. For most operations, DECEIT performance is at worst 50-75% of that of a normal NFS -- despite supporting file replication and fault-tolerance.
Interestingly, for many common operations, DECEIT substantially outperforms NFS (!) and it is actually fairly hard to come up with workloads that demonstrate replication-related degradation.

o A parallel "make" program. Here, ISIS was used within a control program that splits up large software recompilation tasks and runs them on idle workstations, tolerating failures and dynamically adapting if a workstation is reclaimed by its owner.

o A system for monitoring and reacting to sensors scattered around the network, in software or in hardware. This system, Meta, is actually included as part of our ISIS V2.1 release. We are adding a high level language to it now, Lomita, in which you can specify reactive control rules or embed such rules into your C or Fortran code, or whatever.

o In a hospital, we have looked at using ISIS to manage replicated data and to coordinate activities that may span multiple machines. The problem here is the need for absolute correctness: if a doctor is to trust a network to carry out orders that might impact on patient health, there is no room for errors due to race conditions or failures. At the same time, cost considerations argue for distributed systems that can be expanded slowly in a fully decentralized manner. ISIS addresses both of these issues: it makes it far easier to build a reliable, correct, distributed system that will manage replicated data and provide complex distributed behaviors. And, ISIS is designed to scale well.

o For programming numerical algorithms. One group at Cornell used ISIS to distribute matrix computations over large numbers of workstations. They did this because the workstations were available, mostly idle, and added up to a tremendous computational engine. Another group, at LANL, uses ISIS in a parallel plasma physics application.

o In a graphics rendering application.
Over an extended period, a Cornell graphics group (not even in our department) has used ISIS to build distributed rendering software for image generation. They basically use a set of machines as a parallel processor, with a server that farms out rendering tasks and a variable set of slave computing units that join up when their host machine is fairly idle and drop out if the owner comes back to use the machine again. This is a nice load sharing paradigm and makes for sexy demos too.

o In a wide-area seismic monitoring system (i.e. a system that has both local-area networks and wide-area connections between them), developed by a company called SAIC on a DARPA contract. The system gathers seismic data remotely, preprocesses it, and ships event descriptions to a free-standing analysis "hub", which must run completely automatically (their people in San Diego don't like to be phoned in the middle of the night to debug problems in Norway). The hub may request data transfers and other complex computations, raising a number of wide-area programming problems. In addition, the hub system itself has a lot of programs in various languages and just keeping it running can be a challenge.

o On brokerage and banking trading floors. Here, ISIS tends to be an adjunct to a technology for distributing quotes, because the special solutions for solving that specific problem are so fast that it is hard for us to compete with them (we normally don't have the freedom of specifying the hardware... many "ticker plant vendors" wire the whole floor for you). However, to the extent that these systems have problems requiring fault-tolerance, simple database integration mechanisms, dynamic restart of services, and in general need "reactive monitoring and control" mechanisms, ISIS works well.
And, with our newer versions of the ISIS protocols, performance is actually good enough to handle distribution of stock quotes or other information directly in ISIS, although one has to be a bit careful in super performance intensive settings. (The commercial ISIS release should compete well with the sorts of commercial alternatives listed above on a performance basis, but more than 10 trading groups are using ISIS V2.1 despite the fact that it is definitely slower!)

The problems above are characterized by several features. First, they would all be very difficult to solve using remote procedure calls or transactions against some shared database. They have complex, distributed correctness constraints on them: what happens at site "a" often requires a coordinated action at site "b" to be correct. And, they do a lot of work in the application program itself, so that the ISIS communication mechanism is not the bottleneck.

If you have an application like this, or are interested in taking on this kind of application, ISIS may be a big win for you. Instead of investing resources in building an environment within which to solve your application, using ISIS means that you can tackle the application immediately, and get something working much faster than if you start with RPC (remote procedure calls). On the other hand, don't think of ISIS as competing with RPC or database transactions. We are oriented towards online control and coordination problems, fault-tolerance of main-memory databases, etc. ISIS normally co-exists with other mechanisms, such as conventional streams and RPC, databases, or whatever. The system is highly portable and not very intrusive, and many of our users employ it to control some form of old code running a computation they don't want to touch at any price.

--- What ISIS does ---

The ISIS system has been under development for several years at Cornell University.
After an initial focus on transactional "resilient objects", the emphasis shifted in 1986 to a toolkit style of programming. This approach stresses distributed consistency in applications that manage replicated data or that require distributed actions to be taken in response to events occurring in the system. An "event" could be a user request on a distributed service, a change to the system configuration resulting from a process or site failure or recovery, a timeout, etc.

The ISIS toolkit uses a subroutine call style interface similar to the interface to any conventional operating system. The primary difference, however, is that ISIS functions as a meta-operating system. ISIS system calls result in actions that may span multiple processes and machines in the network. Moreover, ISIS provides a novel "virtual synchrony" property to its users. This property makes it easy to build software in which currently executing processes behave in a coordinated way, maintain replicated data, or otherwise satisfy a system-wide correctness property. Moreover, virtual synchrony makes even complex operations look atomic, which generally implies that toolkit functions will not interfere with one another. One can take advantage of this to develop distributed ISIS software in a simple step-by-step style, starting with a non-distributed program, then adding replicated data or backup processes for fault-tolerance or higher availability, then extending the distributed solution to support dynamic reconfiguration, etc.

ISIS provides a really unique style of distributed programming -- at least if your distributed computing problems run up against the issues we address. For such applications, the ISIS programming style is both easy and intuitive. ISIS is really intended for, and is good at, problems that draw heavily on replication of data and coordination of actions by a set of processes that know about one another's existence.
For example, in a factory, one might need to coordinate the actions of a set of machine-controlled drills at a manufacturing cell. Each drill would do its part of the overall work to be done, using a coordinated scheduling policy that avoids collisions between the drill heads, and with fault-tolerance mechanisms to deal with bits breaking. ISIS is ideally suited to solving problems like this one. Similar problems arise in any distributed setting, be it local-area network software for the office or a CAD problem, or the automation of a critical care system in a hospital.

ISIS is not intended for transactional database applications. If this is what you need, you should obtain one of the many such systems that are now available. On the other hand, ISIS would be useful if your goal is to build a front-end in a setting that needs databases. The point is that most database systems are designed to avoid interference between simultaneously executing processes. If your application also needs cooperation between processes doing things concurrently at several places, you may find this aspect hard to solve using just a database, because databases force the interactions to be done indirectly through the shared data. ISIS is good for solving this kind of problem, because it provides a direct way to replicate control information, coordinate the actions of the front-end processes, and to detect and react to failures.

ISIS itself runs as a user-domain program on UNIX systems supporting the TCP/IP protocol suite. It currently is operational on SUN, DEC, GOULD and HP versions of UNIX. Language interfaces for C, C++, FORTRAN, and Common LISP (both Lucid and Allegro) are included, and a new C-Prolog interface is being tested now. Recent ports available in V2.1 include AUX for the Apple Mac II, AIX on the IBM RS/6000 and also the older PC/RT. A Cray UNICOS port is (still) under development at LANL, and a DEC VMS port is being done by ISIS Distributed Systems, Inc.
ISIS runs over Mach on anything that supports Mach but will probably look a little unnatural to you if you use the Mach primitives. We are planning a version of ISIS that would be more transparent in a Mach context, but it will be some time before this becomes available. Meanwhile, you can use ISIS but may find some aspects of the interface inconsistent with the way that Mach does things.

The actual set of tools includes the following:

o High performance mechanisms supporting lightweight tasks in UNIX, a simple message-passing facility, and a very simple and uniform addressing mechanism. Users do not work directly with things like ports, sockets, binding, connecting, etc. ISIS handles all of this.

o A process "grouping" facility, which permits processes to dynamically form and leave symbolically-named associations. The system serializes changes to the membership of each group: all members see the same sequence of changes. Group names can be used as a location-transparent address.

o A suite of broadcast protocols integrated with a group addressing mechanism. This suite operates in a way that makes it look as if all broadcasts are received "simultaneously" by all the members of a group, and are received in the same "view" of group membership.

o Ways of obtaining distributed executions. When a request arrives in a group, or a distributed event takes place, ISIS supports any of a variety of execution styles, ranging from a redundant computation to a coordinator-cohort computation in which one process takes the requested actions while others back it up, taking over if the coordinator fails.

o Replicated data with 1-copy consistency guarantees.

o Synchronization facilities, based on token passing or read/write locks.

o Facilities for watching for a process or site (computer) to fail or recover, triggering execution of subroutines provided by the user when the watched-for event occurs.
If several members of a group watch for the same event, all will see it at the same "time" with respect to messages arriving at the group and other events, such as group membership changes.

o A facility for joining a group and atomically obtaining copies of any variables or data structures that comprise its "state" at the instant before the join takes place. The programmer who designs a group can specify state information in addition to the state automatically maintained by ISIS.

o Automatic restart of applications when a computer recovers from a crash, including log-based recovery (if desired) for cases when all representatives of a service fail simultaneously.

o Ways to build transactions or to deal with transactional files and database systems external to ISIS. ISIS itself doesn't know about files or transactions. However, as noted above, this tool is pretty unsophisticated as transactional tools go...

o A spooler/long-haul mechanism, for saving data to be sent to a group the next time it recovers, or for sending from one ISIS LAN to another, physically remote one (e.g. from your Norway site to your San Diego installation). Note: ISIS will not normally run over communication links subject to frequent failures, although this long-haul interface has no such restriction.

Everything in ISIS is fault-tolerant. Our programming manual has been written in a tutorial style and gives details on each of these mechanisms. It includes examples of typical small ISIS applications and shows how they can be solved. The distribution of the system includes demos, such as the parallel make facility mentioned above; this large ISIS application program illustrates many system features. To summarize, ISIS provides a broad range of tools, including some that rely on algorithms that would be very hard to support in other systems or to implement by hand.
Performance is quite good: most tools require between 1/20 and 1/5 of a second to execute on a SUN 3/60, although the actual numbers depend on how big process groups get, the speed of the network, the locations of the processes involved, etc. Overall, however, the system is really quite fast when compared with, say, file access over the network. For certain common operations a five- to ten-fold performance improvement is expected within two years, as we implement a collection of optimizations. The system scales well with the size of the network, and system overhead is largely independent of network size. On a machine that is not participating in any ISIS application, the overhead of having ISIS running is negligible.

In certain communication scenarios, ISIS performance can be quite good. These involve streaming data within a single group or certain client-server interaction patterns, and make use of a new BYPASS communication protocol suite. Future ISIS development is likely to stress extensions and optimizations at this level of the system. In addition, a lot of effort is going into scaling the system to larger environments.

--- You can get a copy of ISIS now ---

Version V2.1 of ISIS is now fully operational and is being made available to the public. This version consists of a C implementation for UNIX, and has been ported to AIX, SUN, MACH, ULTRIX, Gould UNIX, HP-UX, AUX and APOLLO UNIX (release 10.1). Performance is uniformly good. A 400-page tutorial and system manual containing numerous programming examples is also available, and online manual pages are provided. The remainder of this posting focuses on how to get ISIS, and how to get the manual. Everything is free except bound copies of the manual. Source is included, but the system is in the public domain, and is released on condition that any ports to other systems or minor modifications remain in the public domain.
The manual is copyrighted by the project and is available in hardcopy form or as a DVI file, with figures available for free on request. We have placed compressed TAR images in the following places:

* cu-arpa.cs.cornell.edu (anonymous login, binary mode, pub/ISISV21.TAR.Z)
* Documentation: cu-arpa.cs.cornell.edu (pub/ISISV21-DOC.TAR.Z)
* uunet.uu.net (anonymous login, binary mode, networks/ISIS/ISISV21.TAR.Z)
* mcsun.eu.net (anonymous login, binary mode, networks/ISIS/ISISV21.TAR.Z)

Also available are DVI and PS versions of our manual. Bound copies will be available at $25 each. A package of figures to glue into the DVI version will be provided free of charge. A tape containing ISIS will be provided upon payment of a charge to cover our costs in making the tape; our resources are limited and we do not wish to do much of this.

--- Copyright, restrictions ---

V2.1 of ISIS is subject to a restrictive copyright; basically, you can use it without changing it in any way you like, but you are not permitted to develop "derivative versions" without discussing this with us. V2.1 differs substantially from V1.3.1, which was released into the public domain and remains available without any restrictions whatsoever. On the other hand, whereas previous versions of ISIS required export licenses to be sent to certain Eastern-bloc countries, the present version seems not to be subject to this restriction. Contact the US Dept. of Commerce for details if you plan to export ISIS to a country that might be subject to restrictions. Anywhere in Europe, Japan, etc. should be fine, and no license is required.

--- Commercial support ---

We are working with a local company, ISIS Distributed Systems Inc., to provide support services for ISIS. This company will prepare distributions and work to fix bugs. Support contracts are available for an annual fee; without a contract, we will do our best to be helpful but make no promises.
Other services that IDS plans to provide include consulting on fault-tolerant distributed systems design, instruction on how to work with ISIS, bug identification and fixes, and contractual joint software development projects. The company is also prepared to port ISIS to other systems or other programming languages. Contact "birman@gvax.cs.cornell.edu" for more information.

--- If you want ISIS, but have questions, let us know ---

Send mail to isis@cs.cornell.edu, subject "I want ISIS", with electronic and physical mailing details. We will send you a form for acknowledging agreement with the conditions for release of the software, and will later contact you with details on how to copy the system from our machine to yours.

--- You can read more about ISIS if you like ---

The following papers and documents are available from Cornell. We don't distribute papers by e-mail; requests for papers should be sent to "isis@cs.cornell.edu".

1. Exploiting replication. K. Birman and T. Joseph. Preprint of a chapter to appear in: Arctic 88, An Advanced Course on Operating Systems, Tromso, Norway (July 1988). 50pp.

2. Reliable broadcast protocols. T. Joseph and K. Birman. Preprint of a chapter to appear in: Arctic 88, An Advanced Course on Operating Systems, Tromso, Norway (July 1988). 30pp.

3. ISIS: A distributed programming environment. User's guide and reference manual. K. Birman, T. Joseph, F. Schmuck. Cornell University, March 1988. 275pp.

4. Exploiting virtual synchrony in distributed systems. K. Birman and T. Joseph. Proc. 11th ACM Symposium on Operating Systems Principles (SOSP), Nov. 1987. 12pp.

5. Reliable communication in an unreliable environment. K. Birman and T. Joseph. ACM Transactions on Computer Systems, Feb. 1987. 29pp.

6. Low cost management of replicated data in fault-tolerant distributed systems. T. Joseph and K. Birman. ACM Transactions on Computer Systems, Feb. 1986. 15pp.

7. Fast causal multicast. K.
Birman, A. Schiper, P. Stephenson. Dept. of Computer Science TR, May 1990.

8. Distributed application management. K. Marzullo, M. Wood, R. Cooper, K. Birman. Dept. of Computer Science TR, June 1990.

We will be happy to provide reprints of these papers. Unless we get an overwhelming number of requests, we plan no fees except for the manual. We also maintain a mailing list for individuals who would like to receive publications generated by the project on an ongoing basis. The last two papers can be copied using FTP from cu-arpa.cs.cornell.edu.

If you want to learn about virtual synchrony as an approach to distributed computing, the best place to start is with reference [1]. If you want to learn more about the ISIS system itself, start with the manual. It has been written in a tutorial style and should be easily accessible to anyone familiar with the C programming language. References [7] and [8] are typical of our recent publications (there are others -- contact Maureen Robinson for details).
salzman@NMS.HLS.COM (Mike Salzman) (10/30/90)
I suspect that the notion of reliability and group membership dealt not with the ephemeral group that exists while a message is being multicast, but rather with the commercial application that requires confirmation that all members of a designated group obtained the set of messages to which they were entitled. Examples include updates of price lists, catalogs, reservations, inventories, etc. The presumption in XTP is that all those who are alive, awake and listening will receive the multicast "reliably". XTP does not, to my knowledge, address itself to the administration of the group which is addressed by a multicast.
--
-------------------- salzman@hls.com ----------------------
Michael M. Salzman          Voice (415) 966-7479
Hughes Lan Systems          Fax   (415) 960-3738
1225 Charleston Road        Mt View Ca 94043