ken@gvax.cs.cornell.edu (Ken Birman) (03/22/90)
As part of the isis_remote() mechanism I am implementing a facility by which ISIS will periodically poke a client program to see if it is alive. Here's the proposed interface; I am interested in comments or feedback:

1) Initializing the mechanism.
     application calls isis_probe(freq, timeout) where
        freq = frequency to probe, in seconds
        timeout = speed of expected reply, in sec

2) Defaults:
     disabled for "local" clients of ISIS
     freq = 60, timeout = 30 for "remote" clients

3) Implementation:
     client starts sending a HELLO message every freq. seconds
     ISIS has a timer; if it doesn't get a HELLO within freq+timeout secs
        it "kills off" the client
     killed client that was really alive calls isis_failed, then panics
        with message "killed by ISIS" unless isis_failed traps the
        failure (i.e. by reconnecting).

In addition, a remote client monitors its mother machine and calls isis_failed if the mother machine dies; the isis_probe() constants are used for this purpose, too.

Robert Cooper has an idea for a very fancy mechanism that would let a client (any client) drop "offline" for a while and then come back. For example, say that you are doing a stock trading system and the broker presses "analyze". You might want to stop monitoring stocks, crunch hard for a few seconds and do some fancy display, then show all the updates that arrived while you were crunching as a batch and go back to monitoring passively.

We are thinking about how this could be added to ISIS. It wouldn't be trivial, but could probably be done. Also being considered are ways to connect remote non-UNIX clients (PC's, etc).

I doubt that either of these features will ever be in the public version of ISIS, but my guess is that IDS will have products extending ISIS in these ways late this year...
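The server-side timer in item 3 can be sketched roughly as follows. This is an illustration in Python, not ISIS code: the class `ProbeMonitor` and its methods are hypothetical names, and time is passed in explicitly so the logic can be followed without a real clock. Only the defaults (freq = 60, timeout = 30) and the freq+timeout deadline come from the proposal above.

```python
class ProbeMonitor:
    """Hypothetical server-side bookkeeping for HELLO messages.

    A client is considered dead once more than freq + timeout seconds
    have passed since its last HELLO, as in item 3 of the proposal.
    """

    def __init__(self, freq=60.0, timeout=30.0):
        self.freq = freq
        self.timeout = timeout
        self.last_hello = {}  # client name -> time of most recent HELLO

    def hello(self, client, now):
        """Record a HELLO from a client at time `now` (seconds)."""
        self.last_hello[client] = now

    def expired(self, now):
        """Return the clients whose HELLO is overdue at time `now`."""
        deadline = self.freq + self.timeout
        return sorted(c for c, t in self.last_hello.items()
                      if now - t > deadline)

    def kill_expired(self, now):
        """Forget overdue clients and return their names."""
        dead = self.expired(now)
        for client in dead:
            del self.last_hello[client]
        return dead
```

With the defaults, a client last heard from at t = 0 is still considered alive at t = 90 and is dropped on the first check after that.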
jhanley@oracle.com (John Hanley) (03/23/90)
In article <38955@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Robert Cooper has an idea for a very fancy mechanism that would let
>a client (any client) drop "offline" for a while and then come back.
>...
>We are thinking about how this could be added to ISIS. It wouldn't be
>trivial, but could probably be done. Also being considered are ways to
>connect remote non-UNIX clients (PC's, etc)....

I hope that the "are you alive?" probe can be done by either the server or client, but that the result "machine1 poked machine2" is available to both server and client, with a timestamp. If it would mean less work for the Isis server I suppose the client should take primary responsibility for doing the pinging, though the overhead on the server is unclear to me. For sanity's sake I _think_ it would make sense for one or the other of the machines to "usually" do the vast majority of the pings, with the other machine making a query and a log entry if no word has been received from its partner in a while.

In the event that a fileserver in the machine room loses touch with 50 PC's because a thin-ether terminator came loose, it might become aware of this failure sooner, with less scheduling overhead, if it was expecting pings from its clients, randomly distributed over a 60-second interval or whatever.

It would be interesting to include a "network location" attribute in the sites file and to test the hypothesis "ether cable such-and-such failed" when a machine on that cable is observed to time out (by pinging its neighbors). This would be a biggie for detecting network partitions, say from an important router being disconnected. Loss of power to an entire building is also an interesting group-failure case.
One implication of this "usually pinging from one side" strategy is that although the interval and timeout parameters might very well be different for the two sides, at the time they are negotiated some sanity check and readjustment would be done to ensure that 99% of all pings will come from the same side. Deciding whether you prefer having the server or client do it is your option, though I suspect that having a single-tasked MeSsy-DOS client speak up rather than listen attentively is an easier proposition.

As long as you're going to the trouble of pinging, it might be worthwhile to communicate some load metric to the client, such as load avg, or # of active users, or # of pages on the free list. If the metric was scaled to a uniform range of 1 to 100, applications that distribute calculations across a lot of workstations could choose to not bog down busy hosts based on this, in a portable way. The advertised metric might be artificially increased by a workstation user, or by an agent on his behalf, to control how willing or unwilling the user is at the moment to donate resources to the Isis community.

Isis clients might test-ping alternate servers and change the host they are using as an Isis server based on excessive load (good-neighbor load levelling policy) or high observed latency on responses (selfish gimme-performance policy).

The interval and timeout parameters might normally be very small, to rapidly detect failure of the partner or the network, with adaptive increases if the server is becoming very busy. The log noting loss of contact with the server might also note the load average ramping up on responses just before the timeout. A lack of correlation between the server's load metric and observed latencies may indicate that the server is quite healthy but that the network is melting down.
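As a rough illustration of the 1-to-100 load metric idea, here is a sketch in Python. Nothing here is an ISIS interface: the function names, the choice of raw range, the `user_bias` knob, and the `busy_threshold` value of 70 are all assumptions made for the example.

```python
def scale_metric(raw, lo, hi):
    """Map a raw reading in [lo, hi] onto the integers 1..100, clamping."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    frac = (raw - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))
    return 1 + round(frac * 99)


def advertised(metric, user_bias=0):
    """Let a workstation user inflate the metric to discourage donations."""
    return min(100, max(1, metric + user_bias))


def pick_hosts(metrics, busy_threshold=70):
    """Return hosts below the (assumed) busy threshold, least loaded first."""
    idle = (h for h, m in metrics.items() if m < busy_threshold)
    return sorted(idle, key=lambda h: (metrics[h], h))
```

For example, `pick_hosts({"a": 80, "b": 20, "c": 55})` skips host "a" and returns "b" before "c". Because every host reports on the same scale, the caller does not need to know whether the underlying number was a load average or a free-page count.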
In any event, the metric returned by server would have to be very cheap to compute, and could probably be cached by the server's Isis so that the host operating system wouldn't be interrogated for the metric more often than every 5 seconds or so.

If you choose to have the server generate most of the pings, substantial efficiencies could be obtained by taking advantage of broadcast media. Another argument for having the server generate pings is that they could be generated by simply sending messages to a standard Isis group which all clients are a member of.

Also, I suspect that the "drop offline for a little while" functionality isn't too hard to do by
 A) demanding an estimate of how long you plan to crunch offline for
 B) renegotiating the interval and timeout parameters to much larger values,
    and then shrinking them back to normal when rejoining the Isis community

In trading off the lag in detecting a failure against the resources needed for pinging, widely varying client needs would argue for choosing to make (at least some) clients generate lots of server pings.

Finally, I would like to observe that the single most important server resource in our environment is physical memory. The greatest objection that people have to running a protos process on their workstation all the time is that it always has a resident set of several hundred K, regardless of whether it is doing any real work or not. Optimizing the set of pages touched whilst assuring others that "I'm up" would be a great boon.

And thanks for the good work, Ken.

--JH
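The caching suggestion at the start of this post (interrogate the operating system for the metric no more than once every 5 seconds or so) could look something like the Python sketch below. `CachedMetric` and `read_metric` are hypothetical names; the reader function stands in for whatever cheap system call would supply the number.

```python
class CachedMetric:
    """Serve a load metric from cache, refreshing at most every max_age secs."""

    def __init__(self, read_metric, max_age=5.0):
        self.read_metric = read_metric  # assumed: a cheap callable, no args
        self.max_age = max_age
        self.value = None
        self.stamp = None  # time of the last real read, or None if never read

    def get(self, now):
        """Return the metric, re-reading only if the cached copy is stale."""
        if self.stamp is None or now - self.stamp >= self.max_age:
            self.value = self.read_metric()
            self.stamp = now
        return self.value
```

However many ping replies go out in a 5-second window, the underlying reader runs once.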
ken@gvax.cs.cornell.edu (Ken Birman) (03/23/90)
In article <1990Mar23.093050.23923@oracle.com> jhanley@oracle.com (John Hanley) writes:
>I hope that the "are you alive?" probe can be done by either the server
>or client....

The implementation is pretty simple. Actually, nobody does any probing at all. The client starts sending "I'm alive" messages to the server once every <frequency> seconds. The server expects these and if one is late by <timeout> seconds it kills off the client. I'll probably tune things so that "I'm alive" messages are only sent if there has been no other traffic to the server, but this isn't likely to be a major issue.

The client expects acks from the server, so it will notice within about 30 seconds if the server isn't acknowledging an "I'm alive" transmission. This is also how it finds out that it has been killed off. In this case (client was up but got killed off, or server died but client survived) the client code calls isis_failed(), which has the option of reconnecting to ISIS or aborting.

>As long as you're going to the trouble of pinging, it might be worthwhile to
>communicate some load metric to the client, such as load avg, or # of active
>users, or # of pages on the free list...

We are planning to solve this class of problem through Meta, which has "sensors" that include the sorts of per-process load factors you cite.

Meta is a major ISIS application developed by Keith Marzullo and Mark Wood. It provides a network-wide user-extensible database of sensors (i.e. things like load, but also potentially things like the amount of space left on a disk or the length of a job-queue or even the temperature in the machine room). It also has actuators (trigger actions). Meta has several ways to query the database of sensors/actuators and supports a "when" clause (will support, at any rate) that watches in a fault-tolerant way for some event and triggers a specified action. Oh, and you can also combine sensors to build composite ones, i.e. average load or something...
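The client side described at the top of this reply ("I'm alive" sent only when there has been no other traffic, with a missing ack noticed after about 30 seconds) might be sketched like this. It is a Python illustration under stated assumptions, not the ISIS client library: `HeartbeatClient`, `ack_wait`, and the return values of `tick` are invented for the example, and a real client would call isis_failed() where this code returns "failed".

```python
class HeartbeatClient:
    """Hypothetical client-side timer for "I'm alive" messages."""

    def __init__(self, freq=60.0, ack_wait=30.0):
        self.freq = freq
        self.ack_wait = ack_wait      # assumed name for the ~30 second wait
        self.last_sent = 0.0          # time of the last message to the server
        self.pending_since = None     # oldest unacknowledged "I'm alive"
        self.failed = False

    def sent_traffic(self, now):
        """Ordinary traffic to the server postpones the next "I'm alive"."""
        self.last_sent = now

    def ack(self):
        """The server acknowledged the outstanding "I'm alive"."""
        self.pending_since = None

    def tick(self, now):
        """Return "alive" to send a message, "failed" if an ack is overdue."""
        if (self.pending_since is not None
                and now - self.pending_since > self.ack_wait):
            self.failed = True        # stand-in for calling isis_failed()
            return "failed"
        if now - self.last_sent >= self.freq:
            self.last_sent = now
            if self.pending_since is None:
                self.pending_since = now
            return "alive"
        return None
```

The same "failed" path covers both cases mentioned above: the server died, or the server is up but has already killed this client and stopped acknowledging it.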
We are mailing out a few technical reports in the next week or so, and one covers Meta (another is on the bypass code, and another is just a progress/status report for the project).

>Also, I suspect that the "drop offline for a little while" functionality
>isn't too hard to do by
> A) demanding an estimate of how long you plan to crunch offline for
> B) renegotiating the interval and timeout parameters to much larger values,
>    and then shrinking them back to normal when rejoining the Isis community

Well, this would force ISIS to buffer potentially large amounts of data. With lots of such data ISIS would congest and kill the client to shed load...

What Robert had in mind was more along the lines of a way to have the remote client tell its services (gracefully) that it wants to drop offline, a period during which it would be offline and the services would buffer or archive data for it, and a graceful re-join mechanism that would bring them up to date again. Obviously, you could implement this now, but the question is whether we couldn't come up with a "generic" tool for this purpose.

>Finally, I would like to observe that the single most important server
>resource in our environment is physical memory.... Optimizing the set of
>pages touched whilst assuring others that "I'm up" would be a great boon.

I'm confused. Doesn't the isis_remote() stuff address this? With the remote client code, protos won't be on the workstations at all, only on the servers that run things like NFS or the main database system.

The client code is down to something a bit more modest by now, 168k of ISIS related library text total. So, for the typical user of a workstation dedicated to some application and running only as a "remote" client, the cost is 168k/process, which could be further reduced using shared libraries (one of those things we ought to get around to...)

Ken
ken@gvax.cs.cornell.edu (Ken Birman) (03/24/90)
> From: rich@sendai.ann-arbor.mi.us (K. Richard Magill) (via email)

> From: ken@gvax.cs.cornell.edu (Ken Birman)
> Newsgroups: comp.sys.isis
> Date: 22 Mar 90 15:50:46 GMT
> Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)

> application calls isis_probe(freq, timeout)

>Could be tricky to set these initially. Maybe something based on
>typical round trip message time would be easier to deal with than hard
>seconds.

I've played with round-trip numbers and they really don't work for crashes. Crashes tend to be unusual events -- so are long delays -- and just because you were running fast a few seconds ago I can't safely assume that you won't be updating a display for 10 seconds sometime soon... ISIS is full of hard-coded constants at this level, as is any protocol implementation.

> 3) Implementation:
> client starts sending a HELLO message every freq. seconds
> ISIS has a timer; if it doesn't get a HELLO within freq+timeout secs
> it "kills off" the client

>I'd prefer that isis poll the client before killing. I can imagine a
>situation, perhaps macOs, where the tasking and time division is such
>that periodic hello's might be difficult, but where polls could be
>answered immediately.

Good idea. I'll add this feature.

> killed client that was really alive calls isis_failed, then panics
> with message "killed by ISIS" unless isis_failed traps the
> failure (i.e. by reconnecting).

>If the client's system has a clock, which it must in order to know
>frequency and timeout, then isis on the client can recognize when it
>has missed a poll. This implies that some kind of dynamic backoff
>might be in order before isis outright "disowns" the client.

Unclear on what you mean by this... The clib clock is driven by SIGALRM interrupts currently (once per second) but the sending of the "I am alive" message is done only when ISIS gets scheduled. So, what I am really defining as liveness is that the ISIS scheduler gets scheduled at least once every "frequency" seconds...
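One way the poll-before-kill suggestion accepted above might fit into the server's timer is sketched below in Python. This is a guess at the shape of the feature, not a description of how it was added to ISIS: the function name, its arguments, and the choice to give the poll the same `timeout` as a HELLO are all assumptions.

```python
def next_action(now, last_heard, poll_sent_at, freq=60.0, timeout=30.0):
    """Decide what the server does for one client: "wait", "poll" or "kill".

    last_heard   -- time of the last HELLO or poll reply from the client
    poll_sent_at -- time the server last polled this client, or None
    """
    poll_outstanding = poll_sent_at is not None and poll_sent_at >= last_heard
    if poll_outstanding:
        # The client was already given a second chance; kill only if the
        # poll itself has gone unanswered for more than `timeout` seconds.
        return "kill" if now - poll_sent_at > timeout else "wait"
    if now - last_heard > freq + timeout:
        # HELLO is overdue: poll first instead of killing outright.
        return "poll"
    return "wait"
```

A client that cannot send periodic HELLOs but can answer a poll promptly stays alive under this scheme, at the cost of a longer worst-case detection time (freq + 2 * timeout with these assumptions).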
> Robert Cooper has an idea for a very fancy mechanism that would let
> a client (any client) drop "offline" for a while and then come back...

>I know precisely the problem here and I'd be sorely disappointed to
>see you spend the time to solve it within isis rather than with isis.
>My recommendation would be for users with this need to do one of the
>following. Use a cross between isis bcast from the server to the client
>and an rpc-ish from the client to the server. Or, withdraw from the group
>that receives the data then rejoin. Personally, I think I'd stay in the
>group, but register and unregister for the bcasts from server. That
>is, message flow control, and persistence becomes the server's
>problem, not isis's.

I don't know; seems like it would be more in the spirit of a toolkit to provide the servers with a cleaner warning that the client wants to do this and a way to spool data conveniently. But, you might be right. After all, your company would be a typical user of this sort of facility, so if you don't see it as a necessary add-on...
rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/24/90)
In article <38995@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:

   > From: rich@sendai.ann-arbor.mi.us (K. Richard Magill) (via email)

   > From: ken@gvax.cs.cornell.edu (Ken Birman)
   > Newsgroups: comp.sys.isis
   > Date: 22 Mar 90 15:50:46 GMT
   > Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)

   > Robert Cooper has an idea for a very fancy mechanism that would let
   > a client (any client) drop "offline" for a while and then come back...

   >I know precisely the problem here and I'd be sorely disappointed to
   >see you spend the time to solve it within isis rather than with isis.

   I don't know; seems like it would be more in the spirit of a toolkit to provide the servers with a cleaner warning that the client wants to do this and a way to spool data conveniently. But, you might be right. After all, your company would be a typical user of this sort of facility, so if you don't see it as a necessary add-on...

Hmm.. I suspect that no matter how you implement such a facility, it would only be useful to me for a short while. In any case, I might be tempted to use it.

What I meant was that this is functionality that can be provided at the user/above-isis level. I'd rather see the isis project apply their resources to the items already on their wishlist. :-). For instance, I think the scaling issue would be of use to more potential users.