ken@gvax.cs.cornell.edu (Ken Birman) (11/14/89)
A while ago I proposed to post interesting problems to the newsgroup from time to time as a way for people to start thinking about how to use the system to build useful application software. In this spirit, I propose that we design a YP server replacement. Ideally, I hope that people will post contributions, but you can also email thoughts to me anonymously or just follow the discussion silently.

YP defines a fairly specific, standardized interface using SUN RPC as its facility for talking to clients. We won't want to change that. The idea is to use ISIS internally to the YP server to (hopefully) come up with a version that propagates updates a bit faster than the standard version and is a bit more robust when crashes occur. Let's also plan for the future: the YP server we design should be one suitable for scaling to fairly large systems.

Any systems design problem starts at the architecture level. So, the first question we need to consider is what an appropriate architecture would be for such a scalable YP application. The second issue, somewhat down the road, will be how to implement it without too much code.

I hope that this exercise will culminate with an ISIS "YP server" in the system utilities or demos area -- ideally one that is vastly better than the current YP package because of the clever way we built it. If a few people go as far as to implement the server, it should be interesting to compare solutions.

So, to start with, I suggest that readers interested in following this dialog read about YP:

    man ypclnt ypfiles ypserver

Question 1: is the YP server as defined by SUN suitable for use in a larger-scale high availability setting?
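For readers who haven't used the ypclnt interface, a minimal client-side lookup looks roughly like the sketch below. It assumes the stock SunOS ypclnt library; the map name and key are only illustrative, and error handling is abbreviated. This is the interface the replacement server would have to keep speaking over SUN RPC.

/* Sketch of a client-side YP lookup through the standard ypclnt
 * interface.  The map ("services.byname") and key ("isis") are just
 * examples; the point is that clients see only yp_match() and friends,
 * no matter how the server is built internally. */
#include <stdio.h>
#include <string.h>
#include <rpcsvc/ypclnt.h>

int
main()
{
    char *domain, *val;
    int vallen, err;

    if (yp_get_default_domain(&domain) != 0) {
        fprintf(stderr, "no YP domain set\n");
        return 1;
    }

    /* Ask the YP server for the entry keyed "isis" in services.byname. */
    err = yp_match(domain, "services.byname", "isis", strlen("isis"),
                   &val, &vallen);
    if (err != 0) {
        fprintf(stderr, "yp_match: %s\n", yperr_string(err));
        return 1;
    }
    printf("services entry: %.*s\n", vallen, val);
    return 0;
}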
montnaro@sprite.crd.ge.com (Skip Montanaro) (11/15/89)
In article <34173@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Question 1: is the YP server as defined by SUN suitable for use in a
>larger-scale high availability setting?
If you believe what you read in the "YP vs. the domain name server" articles
that have been posted recently to several Sun-related newsgroups and mailing
lists, the answer is "no". YP as Sun currently implements it apparently does
not interact well with the domain system. In particular, YP doesn't "know
when to say when" (sorry Spuds) and will continue attempting to get IP info
for machines not in its database (this is when configured to "work" with the
DNS, as I understand it). This has been known to flood portions of the
Internet on occasion.
A somewhat less esoteric problem is that YP has no notion of hierarchy. A
domain must have all the information in its maps that it cares about. It
can't have incomplete information and get the remainder from a "super
domain". For instance, consider a campus-wide network with the Geography,
Computer Science, and Physics departments all in the same YP domain called
"Cornell". All people in the Geography, CS, and Physics will (by default)
have accounts on all machines in the three departments. Administratively,
this is probably a "good thing". One person can administer all the global
information.
Now, suppose the heads of the Physics, CS, and Geography departments get in
an argument at a tailgate party, and they each go away saying, "I'll be
d***ed if those bozos are going to have accounts on my department's
computers." They consult their respective system administrators, who inform
them, "We'll have to either override the YP passwd database on every machine
in our department, or break off and create a new YP domain."
Either way, they all lose. Either each machine's administrative costs
increase by the amount it takes to keep each machine's passwd file
up-to-date (that's a lot - ask your users to change their passwords on all N
machines every six months), or each department's administrative costs
increase by nearly the amount it takes to administer a separate YP domain
and replicate all the truly global information (hosts and networks databases
jump to mind).
In the real world (out here in industry), you don't need arguments at
tailgate parties for people to decide they need their own YP domains. We're
just naturally unfriendly :-). We have about 400 Suns at GE CRD, with about
40-50 servers. I'll bet we have at least 20 YP domains. We don't have to
worry about how well YP scales. We have to worry about transferring hosts
and networks files around...
--
Skip Montanaro (montanaro@crdgw1.ge.com)
ken@gvax.cs.cornell.edu (Ken Birman) (12/18/89)
Well, with the holiday season approaching, I think we can wrap up phase one of our "YP redesign" project.

The initial goal seems to be to build a small-scale module for storing the YP database. It needs to support an internal interface with a lookup/update/delete ability. Data can be assumed to consist of tuples like ("isis","udp","1234") with multiple such tuples in each of a set of "files" (like "/yp/etc/services"). If it helps, you can initially assume that everything fits in memory; later the service will need to work off of disk files. Details of the tuple matching rules and content rules can be deduced from the YP documentation if necessary.

Having built this, we'll want to extend it to support the sort of import/export mechanisms that I proposed earlier; the effect will be to let us build a large-scale YP service with individual data items (tuples) living on some primary set of 2 or 3 servers, but available everywhere. I think this extension raises some interesting problems at the level of the "ISIS architecture" to use -- e.g. how groups notify one another about their desire to import/ability to export chunks of data. I suggest that we pick up this topic in January sometime.

The last step of the implementation will be to hook this together with the new long-haul facility so that our YP program can span multiple LANs running different versions of ISIS.

When I have time (not soon, since I am trying to get ISIS V2.0 into a distributable form now), I plan to throw together the basic triple-replicated YP module. Hopefully, a few of you will too... My plan is to benchmark this under ISIS V2.0 in late January and to post the code in that timeframe; if we have a few competing designs, perhaps we can compare and contrast...

Ken

PS: This first stage is trivial, in case you haven't noticed. It resembles the replication example in the ISIS Manual so closely that you can practically type it in right from there. I suggest that you use a token-based replicated update scheme, with any server in the TMR set satisfying read requests locally; to do a write, you request the token for the "file" in which the tuple resides, then CBCAST the update in a simple message that all recipients interpret in parallel. Actually, there is a minutely faster protocol that combines the token request message with the update message, but we can worry about that some other time...
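To make the phase-one module concrete, here is a minimal sketch of the in-memory store under stated assumptions: a plain C table of ("service","proto","port")-style tuples grouped by "file", with lookup/update/delete. All of the names (ypfile, yp_lookup, yp_update, yp_delete) are hypothetical, and the token-request and CBCAST steps of the replicated version are indicated only as comments, not as real ISIS calls.

/* Hypothetical sketch of the phase-one in-memory YP store: tuples
 * grouped into "files", with lookup/update/delete.  In the replicated
 * server, each update would first acquire the per-file token and then
 * be CBCAST to the other members of the TMR group (marked below);
 * reads are satisfied locally by any member. */
#include <stdlib.h>
#include <string.h>

#define NFIELDS 3                 /* e.g. ("isis", "udp", "1234") */

typedef struct tuple {
    char *field[NFIELDS];
    struct tuple *next;
} tuple;

typedef struct ypfile {
    char *name;                   /* e.g. "/yp/etc/services" */
    tuple *tuples;
    struct ypfile *next;
} ypfile;

static ypfile *files;

static ypfile *
file_get(const char *name)        /* find or create a "file" */
{
    ypfile *f;
    for (f = files; f; f = f->next)
        if (strcmp(f->name, name) == 0)
            return f;
    f = calloc(1, sizeof *f);
    f->name = strdup(name);
    f->next = files; files = f;
    return f;
}

tuple *
yp_lookup(const char *fname, const char *key)    /* match on field 0 */
{
    tuple *t;
    for (t = file_get(fname)->tuples; t; t = t->next)
        if (strcmp(t->field[0], key) == 0)
            return t;
    return NULL;
}

void
yp_update(const char *fname, char *const fld[NFIELDS])
{
    /* Replicated version: request the token for this "file", then
     * CBCAST the update; every recipient runs the code below. */
    ypfile *f = file_get(fname);
    tuple *t = yp_lookup(fname, fld[0]);
    int i;
    if (t == NULL) {
        t = calloc(1, sizeof *t);
        t->next = f->tuples; f->tuples = t;
    } else {
        for (i = 0; i < NFIELDS; i++)
            free(t->field[i]);
    }
    for (i = 0; i < NFIELDS; i++)
        t->field[i] = strdup(fld[i]);
}

void
yp_delete(const char *fname, const char *key)
{
    /* Replicated version: same token-then-CBCAST discipline as updates. */
    ypfile *f = file_get(fname);
    tuple **tp = &f->tuples, *t;
    while ((t = *tp) != NULL) {
        if (strcmp(t->field[0], key) == 0) {
            int i;
            *tp = t->next;
            for (i = 0; i < NFIELDS; i++)
                free(t->field[i]);
            free(t);
            return;
        }
        tp = &t->next;
    }
}

Reads stay purely local, which is what lets any member of the TMR set answer them; only the write path needs the token-then-CBCAST discipline described in the PS.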
stan@squazmo.solbourne.com (Stan Hanks) (12/29/89)
>> Hmmm. Interesting. My perception is that we have two basic hurdles to
>> overcome in the '90s: effective use of high speed networks, and the
>> fact that those of us (me included) who thought the future looked like
>> lots of machines coupled by message passing networks were partly
>> wrong. It's starting to look like (to me, anyway -- and you can probably
>> find others who agree) that we're going to see a partitioning of the
>> classes of computer available Real Soon. On the one hand, we're gonna
>> have small cheap desktops a la the SPARCstation but cheaper (we'll see
>> 20 MIP systems with 16 MB memory and 300 MB disk under $7k by the end
>> of 1990 -- who knows what '91 looks like!). On the other hand, we're
>> gonna have real workstations, which will be shared memory multiprocessor
>> boxes, likely to be running some MACH, SunOS, or SystemV.4 variant.
>
>Yes, I'll buy this... although I would add "massive servers" to the
>picture (lots of them).

Oh yeah.... Gotta have those. And the things coming down the pike from all the major vendors that are looking better and better. Plus stuff like the Legato NFS accelerator, and some of the RAID technology....

>> And with the real high speed networks coming soon, I expect that we're
>> going to find ourselves looking for a model which lets us treat all
>> IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>> in a manner that really works. I really expect the point-to-point
>> data reliability to happen at the hardware level exclusively by sometime
>> in the early '90s.
>
>This, I don't buy. Problem is that you are concealing the "fact" of
>physical distribution, which many applications need to know about.
>For example, your view rules out a large class of control applications
>that need to know about "local" (==realtime response) and remote (slow
>but knowledgeable).

Yeah, but there is also a much broader class of problems that neither know nor care about locality issues. Like most of the application programs that people use. And almost all commercial applications. I've always viewed control, real-time, and mission-critical fault tolerant applications as basically "special" -- we should consider them when designing things, but we should design special things to accommodate them rather than fitting accommodations for them into more general purpose things.

>Also, this approach is very weak for fault-tolerant applications.
>It's easy to recover when an RPC fails; hard to deal with a chunk of
>memory suddenly getting unmapped.

True, but we manage to handle page faults today -- I view this as sort of a "network page fault".

>My feeling is that the interesting applications would rather have
>powerful but visible tools...

Depends on where you draw the "interesting" line -- my primary interest has usually been trying to gain maximal use of network resources for "traditional" computing. If we start looking specifically at real-time and the like, then yes, you're right.

>(You might want to post this whole mail, plus your response; could
>make quite an interesting comp.sys.isis discussion if anyone follows
>up on it!)

Challenge accepted!

>> And you're right: scalability is a growing concern. As is operations over
>> distant and slow networks. The ISIS view of the world as computational
>> nodes connected by networks does real well for small numbers of nodes
>> connected by local networks; maybe some sort of paradigm that lets you
>> lump nodes into a meta-node (i.e. site? lab? etc.) connected by slower
>> networks would work to get you over that hurdle? Hmmmm. Note also that
>> if you take this sort of view, you can accommodate multiprocessors as
>> sort of a micro-meta-node where it has computational units connected
>> by very high speed network (shared memory). Not having thought about it
>> more than just to write this stuff down, it looks pretty elegant. I guess
>> I need to go off and push some chalk around the room for a while and
>> think about it some more....
>
>ISIS is moving towards hierarchical structures for just this reason.
>ISIS services would tend to have 2-3 processes per "active subgroup",
>perhaps a big envelope around the whole bunch per LAN, and inter-LAN
>tools for building WAN services. We are close to having this now; the
>commercial ISIS (mid 1990) will include such a structuring facility.
>
>And, the interesting thing is that it stays pretty simple to program;
>structure doesn't always imply complexity.

Hey, that's great. I wish more people would realize that the simplest viable solution is oftentimes the most desirable.

>I haven't looked much at multiprocessors but we are starting to think
>we should. I somewhat doubt that you would want to use process groups
>internally on such machines, but who knows...

I get asked about that all the time. We have folks who really want to put MACH or V up on our box (not the same folks, BTW) to play with stuff like that. From what I've looked at, it seems that what you'd get is (maybe) *real* concurrency for the various processes in the "active subgroup" (or thread or team or...) plus the advantages of shared memory between the components, which would let you address a whole host of interesting problems that you can't address today.

>> BTW, if you're interested in fault tolerance, you need to snarf David B.
>> Johnson's dissertation from Rice. He did some excellent work on fault
>> tolerance in message passing environments, even to the point of coming
>> up with sort of a calculus for reasoning about tolerance requirements.
>> It should be available real soon -- he defended in October, but just
>> recently got the official copy over to the dean's office. His address
>> is "dbj@rice.edu" in case you need it.
>
>As I mentioned, I've read several drafts of the paper on this. Not
>bad stuff, but there has been a lot of similar work (Borg's Auragen
>system, the Toueg & Koo checkpointing mechanism) and this stuff has many
>limitations (determinism, no lightweight threads; only tolerates a
>single failure), plus it seems to deadlock under some conditions.
>An old copy of ISIS did something called "retained results" with
>similar limitations; we don't do this anymore because it seems to
>have been a so-so idea... (But, for what it's worth, I do think
>the Johnson/Zwaenepoel paper is better than any other paper on this
>type of message logging, mostly because of the performance figures)
>
>I haven't seen the calculus, though. I'll ask for a copy of the
>thesis. My comments relate to "sender based message logging".

Right. Same stuff. He added a whole lot of work to prove that, for the cases he was considering, his solutions were necessary and sufficient to guarantee recovery. But good old "Mr. Meta-Problem" Dave went off and developed what seems to be an excellent basis for reasoning about fault tolerance in any distributed environment in order to accomplish this.

I'll be interested to hear what sort of responses people have to all this. And, of course, real interested to see how ISIS works on one of our multiprocessors.

BTW, do you have ISIS for MACH yet? For what I'm looking at, it would give finer granularity than using OS/MP (our regular multiprocessor version of SunOS).

Regards,

--
Stanley P. Hanks                       Science Advisor
Solbourne Computer, Inc.
Phone:   Corporate: (303) 772-3400     Houston: (713) 964-6705
E-mail:  ...!{boulder,sun,uunet}!stan!stan        stan@solbourne.com
ken@gvax.cs.cornell.edu (Ken Birman) (12/29/89)
(As you have probably figured out, Stan and I were discussing the "technology requirements" for systems built using gigabit lines and other hot hardware... I basically argued that this push to greater speed is creating more of a need for ISIS-like function; Stan, as you will have gathered, is more of a point-to-point person and hence skeptical of the need for ISIS group-style complexity...)

In article <1989Dec28.173847.11878@squazmo.solbourne.com> stan@squazmo.solbourne.com (Stan Hanks) writes:
>...
>>S And with the real high speed networks coming soon, I expect that we're
>>S going to find ourselves looking for a model which lets us treat all
>>S IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>>S in a manner that really works. I really expect the point-to-point
>>S data reliability to happen at the hardware level exclusively by sometime
>>S in the early '90s.
>K>
>K>This, I don't buy. Problem is that you are concealing the "fact" of
>K>physical distribution, which many applications need to know about.
>K>For example, your view rules out a large class of control applications
>K>that need to know about "local" (==realtime response) and remote (slow
>K>but knowledgeable).
>
>S>Yeah, but there is also a much broader class of problems that neither
>S>know nor care about locality issues. Like most of the application programs
>S>that people use. And almost all commercial applications. I've always
>S>viewed control, real-time, and mission-critical fault tolerant applications
>S>as basically "special" -- we should consider them when designing things, but
>S>we should design special things to accommodate them rather than fitting
>S>accommodations for them into more general purpose things.

I guess I buy this for some applications, but I think you are arguing an untenable point: namely, that there really isn't anything in a distributed system (now or anytime soon) that needs to be "controlled". If you equate control with, say, factory floor control, sure, there is a lot of commercial stuff that doesn't need much controlling.

But there is a larger and larger collection of stand-alone services out there that need to control themselves and be highly available. E.g., your average commercial outpost in Houston selling access to a proprietary database on Texas geophysics or whatever. This system may well be spread over many nodes and will want high availability. And, it needs to control load to avoid thrashing just because a few too many queries came in at once. I view this as a distributed control problem, too.

And, I think that existing technology hasn't given us much of a handle on designing these kinds of self-maintaining servers or systems. So, I see ISIS as providing the "glue" that holds together a system that might well offer its clients a very vanilla RPC interface...

>K>Also, this approach is very weak for fault-tolerant applications.
>K>It's easy to recover when an RPC fails; hard to deal with a chunk of
>K>memory suddenly getting unmapped.
>
>S>True, but we manage to handle page faults today -- I view this as sort
>S>of a "network page fault".
>
>K>My feeling is that the interesting applications would rather have
>K>powerful but visible tools...
>
>S>Depends on where you draw the "interesting" line -- my primary interest
>S>has usually been trying to gain maximal use of network resources for
>S>"traditional" computing. If we start looking specifically at real-time
>S>and the like, then yes, you're right.

I don't buy the "fault tolerance is just a page fault problem" line; I see little evidence that anyone has come up with systems able to reconfigure this gracefully. Page faults are really easy to deal with -- just fetch the page. Failures are more of a mess: you may need to clean up, restart programs, reattach programs to the new servers, etc... This is why we tend to favor services that have 2 or 3 processes cooperating and where you expect a reply from "anyone" and not some specific process...

>>S> And you're right: scalability is a growing concern....

Well, glad we agree on something!

>S>I get asked about [multiprocessors] all the time. We have folks who
>S>really want to put MACH or V up on our box...
>S>I'll be interested to hear what sort of responses people have to all this.
>S>And, of course, real interested to see how ISIS works on one of our
>S>multiprocessors.
>
>S>BTW, do you have ISIS for MACH yet? For what I'm looking at, it would
>S>give finer granularity than using OS/MP (our regular multiprocessor
>S>version of SunOS).

ISIS seems fine on MACH. I'm planning to test it under the forthcoming Mt. Xinu MACH release next week, so it should be up and solid on their Beta tape. This will be ISIS V1.3.1, but V2.0 will also get checked out on their system and will be available both from Cornell and, later, on Mt. Xinu's 2.1 release when that occurs.

Since MACH and OSF have lately become engaged, a few people asked what came of the ISIS submission under the OSF DE RFT. (How's that for acronyms?) Basically, OSF has ended up focusing on a lower level of the environment -- things like clock and name servers, RPC data encoding, and the file system. OSF seems to have decided to defer a decision on how (or if) to include ISIS in their world until after these urgent short-term questions are settled. They did this by putting ISIS into a technology category for submissions of possible interest to them (so they won't say "no") but inappropriate for the DE part of OSF/2 (so they won't say "yes").

However, if OSF/2 is really MACH based, ISIS should be able to run on it. And, I don't expect that OSF/2 will offer some competing technology -- I know enough about the RFT technology submissions to say that ISIS is aimed in a very different direction. For example, at least half a dozen submissions were concerned with linking UNIX to PCs running OS/2....
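To make the "reply from anyone" point concrete without invoking the ISIS primitives that normally hide this plumbing, here is a hedged sketch in plain C over UDP sockets: the client sends the same query to three hypothetical replica addresses and takes whichever answer arrives first. The addresses, port, and query format are made up for illustration; in ISIS the broadcast and reply collection are done by the toolkit itself.

/* Plain-sockets illustration (not ISIS code) of querying a small
 * replicated service and accepting the first reply, whichever member
 * of the group it happens to come from. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

#define NREPLICAS 3
#define PORT      7000                      /* hypothetical service port */

int
main()
{
    const char *replicas[NREPLICAS] =
        { "192.0.2.1", "192.0.2.2", "192.0.2.3" };   /* example addresses */
    const char *query = "lookup isis";
    char buf[512];
    struct sockaddr_in to;
    struct timeval tv = { 2, 0 };           /* give up after 2 seconds */
    int s, i, n;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);

    /* Send the same request to every member of the service group. */
    for (i = 0; i < NREPLICAS; i++) {
        memset(&to, 0, sizeof to);
        to.sin_family = AF_INET;
        to.sin_port = htons(PORT);
        inet_pton(AF_INET, replicas[i], &to.sin_addr);
        sendto(s, query, strlen(query), 0,
               (struct sockaddr *)&to, sizeof to);
    }

    /* Take the first reply; we do not care which replica answered. */
    n = recvfrom(s, buf, sizeof buf - 1, 0, NULL, NULL);
    if (n < 0) {
        fprintf(stderr, "no replica answered\n");
        close(s);
        return 1;
    }
    buf[n] = '\0';
    printf("got reply: %s\n", buf);
    close(s);
    return 0;
}

The point of the pattern is that a client binds to the service, not to a process: as long as one member of the group survives, the request gets answered, which is exactly what a chunk of mapped shared memory cannot promise when its owner disappears.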
laubach@aspen.IAG.HP.COM (Mark Laubach) (01/09/90)
From an earlier note:

In article <1989Dec28.173847.11878@squazmo.solbourne.com> stan@squazmo.solbourne.com (Stan Hanks) writes:
>...
>>S And with the real high speed networks coming soon, I expect that we're
>>S going to find ourselves looking for a model which lets us treat all
>>S IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>>S in a manner that really works. I really expect the point-to-point
>>S data reliability to happen at the hardware level exclusively by sometime
>>S in the early '90s.

Sorry to be so long in replying to this. The only MEMNET that I know of is the Farber/Delp research that went on at the University of Delaware, MEMNET being the networked shared memory between IBM PCs. Professor David Farber is now at the University of Pennsylvania; Gary Delp is now at the IBM Watson research center. I can get anyone more information, contacts, etc. on the continuing research if desired.

MEMNET, as published, has no connection with CMU as I know it.

Mark
ken@gvax.cs.cornell.edu (Ken Birman) (01/10/90)
In article <6860001@aspen.IAG.HP.COM> laubach@aspen.IAG.HP.COM (Mark Laubach) writes:
>MEMNET, as published, has no connection with CMU as I know it.
>

I wonder if Stan might not have had Kung's work on Nectar and iWarp in mind. This is a high speed optical interconnect; it is so fast that it makes paging off of remote machines over a network blaze in comparison with reading from a disk. This definitely does argue for a shared memory abstraction. DEC SRC has a similar interconnect.

It seems relevant that when Kung talks about this, he tends to say that software for controlling complex distributed applications with replicated data is the big obstacle, not hardware. The problem he sees is that the set of programs sharing the memory changes dynamically, and the big picture is hence a very complex and very dynamic one, which presumably has to be fault-tolerant too. And, viewed from his perspective, the technology for controlling this mess is lagging way behind the hardware.

We need an effective technology for distributed control and consistency if we are to support facilities like this sort of shared memory in a robust way. Even if everyone programs using the resulting shared objects, someone needs to implement it... In effect, shared objects become yet another "tool" in our bag of tools; the underlying issues are hidden but still relevant.