[comp.sys.isis] ISIS "homework" problem

ken@gvax.cs.cornell.edu (Ken Birman) (11/14/89)

A while ago I proposed to post interesting problems to the news group
from time to time as a way for people to start thinking about how to
use the system to build useful application software.

In this spirit, I propose that we design a YP server replacement.
Ideally, I hope that people will post contributions, but you can also
email thoughts to me anonymously or just follow the discussion silently.

YP defines a fairly specific, standardized interface using SUN RPC
as its facility for talking to clients.  We won't want to change that.
The idea is to use ISIS inside the YP server to (hopefully)
come up with a version that propagates updates a bit faster than the
standard version and is a bit more robust when crashes occur.

Let's also plan for the future.  The YP server we design should be one
suitable for scaling to fairly large systems.

Any systems design problem starts at the architecture level.  So,
the first question we need to consider is what an appropriate architecture
would be for such a scalable YP application.  The second issue, somewhat
down the road, will be how to implement it without too much code.

I hope that this exercise will culminate with an ISIS "YP server" in
the system utilities or demos area -- ideally one that is vastly better
than the current YP package because of the clever way we built it.

If a few people go as far as to implement the server, it should be
interesting to compare solutions.

So, to start with, I suggest that readers interested in following
this dialog read about YP:
	man ypclnt ypfiles ypserver

Question 1: is the YP server as defined by SUN suitable for use in a
larger-scale high availability setting?

montnaro@sprite.crd.ge.com (Skip Montanaro) (11/15/89)

In article <34173@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:

   Question 1: is the YP server as defined by SUN suitable for use in a
   larger-scale high availability setting?

If you believe what you read in the "YP vs. the domain name server" articles
that have been posted recently to several Sun-related newsgroups and mailing
lists, the answer is "no". YP as Sun currently implements it apparently does
not interact well with the domain system. In particular, YP doesn't "know
when to say when" (sorry Spuds) and will continue attempting to get IP info
for machines not in its database (this is when it's configured to "work" with
the DNS, as I understand it). This has been known to flood portions of the
Internet on occasion.

A somewhat less esoteric problem is that YP has no notion of hierarchy. A
domain must have all the information in its maps that it cares about. It
can't have incomplete information and get the remainder from a "super
domain". For instance, consider a campus-wide network with the Geography,
Computer Science, and Physics departments all in the same YP domain called
"Cornell".  All people in the Geography, CS, and Physics will (by default)
have accounts on all machines in the three departments. Administratively,
this is probably a "good thing". One person can administer all the global
information.

Now, suppose the heads of the Physics, CS, and Geography departments get in
an argument at a tailgate party, and they each go away saying, "I'll be
d***ed if those bozos are going to have accounts on my department's
computers." They consult their respective system administrators, who inform
them, "We'll have to either override the YP passwd database on every machine
in our department, or break off and create a new YP domain."

Either way, they all lose. Either each machine's administrative costs
increase by the amount it takes to keep each machine's passwd file
up-to-date (that's a lot - ask your users to change their passwords on all N
machines every six months), or each department's administrative costs
increase by nearly the amount it takes to administer a separate YP domain
and replicate all the truly global information (hosts and networks databases
jump to mind).
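
What's missing, in other words, is a fallback step in the lookup: a
department's domain should be able to answer the queries it knows about and
defer everything else to a campus-wide "super domain". Purely as a sketch of
that idea (nothing like this exists in YP, and every name below is made up
for illustration):

    /* Sketch only -- YP has no hierarchy.  A hierarchical lookup would
     * answer from the local domain's maps when it can, and defer anything
     * it doesn't know to a parent ("super") domain.  local_match() and
     * forward_query() are hypothetical helpers, not real YP routines.
     */
    struct domain {
        char          *name;        /* e.g. "cs.cornell" */
        struct domain *parent;      /* e.g. "cornell", or NULL at the top */
    };

    extern char *local_match(struct domain *d, char *map, char *key);
    extern char *forward_query(struct domain *d, char *map, char *key);

    char *
    hier_lookup(struct domain *d, char *map, char *key)
    {
        char *val;

        if ((val = local_match(d, map, key)) != NULL)
            return val;             /* answered from this domain's maps */
        if (d->parent != NULL)
            return forward_query(d->parent, map, key);  /* punt upward */
        return NULL;                /* top of the hierarchy: nobody knows */
    }

With something like that, each department could keep its own passwd map and
still inherit the campus-wide hosts and networks maps.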

In the real world (out here in industry), you don't need arguments at
tailgate parties for people to decide they need their own YP domains. We're
just naturally unfriendly :-). We have about 400 Suns at GE CRD, with about
40-50 servers. I'll bet we have at least 20 YP domains. We don't have to
worry about how well YP scales. We have to worry about transferring hosts
and networks files around...

--
Skip Montanaro (montanaro@crdgw1.ge.com)

ken@gvax.cs.cornell.edu (Ken Birman) (12/18/89)

Well, with the holiday season approaching, I think we can wrap up
phase one of our "YP redesign" project.

The initial goal seems to be to build a small-scale module for
storing the YP database.  It needs to support an internal interface
with a lookup/update/delete ability.  Data can be assumed to consist of
tuples like ("isis","udp","1234") with multiple such tuples in each
of a set of "files" (like "/yp/etc/services").  If it helps, you can
initially assume that everything fits in memory; later the service will
need to work off of disk files.  Details of the tuple matching rules
and content rules can be deduced from the YP documentation if necessary.
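
To pin down what I mean by the internal interface, here is a sketch; the
structures and names below are purely illustrative (not the real YP formats),
and everything lives in memory for now:

    /* Minimal in-memory tuple store -- a sketch, not the real YP layout.
     * Each "file" (e.g. "/yp/etc/services") holds tuples such as
     * ("isis","udp","1234"); the first field is treated as the key.
     */
    #include <string.h>
    #include <stdlib.h>

    #define MAXFIELDS 8

    typedef struct tuple {
        char         *field[MAXFIELDS];
        int           nfields;
        struct tuple *next;
    } tuple;

    typedef struct ypfile {
        char          *name;            /* e.g. "/yp/etc/services" */
        tuple         *tuples;
        struct ypfile *next;
    } ypfile;

    static ypfile *files;               /* all "files", in memory for now */

    static ypfile *
    find_file(char *name)
    {
        ypfile *f;

        for (f = files; f != NULL; f = f->next)
            if (strcmp(f->name, name) == 0)
                return f;
        return NULL;
    }

    /* Lookup: first tuple whose key (field[0]) matches, or NULL. */
    tuple *
    yp_lookup(char *name, char *key)
    {
        ypfile *f = find_file(name);
        tuple  *t;

        if (f == NULL)
            return NULL;
        for (t = f->tuples; t != NULL; t = t->next)
            if (strcmp(t->field[0], key) == 0)
                return t;
        return NULL;
    }

    /* Delete: unlink the matching tuple; 0 on success, -1 if not found. */
    int
    yp_delete(char *name, char *key)
    {
        ypfile *f = find_file(name);
        tuple **tp;

        if (f == NULL)
            return -1;
        for (tp = &f->tuples; *tp != NULL; tp = &(*tp)->next)
            if (strcmp((*tp)->field[0], key) == 0) {
                tuple *dead = *tp;

                *tp = dead->next;
                free(dead);
                return 0;
            }
        return -1;
    }

    /* Update would delete any tuple with the same key and chain the new
     * one onto f->tuples; omitted here to keep the sketch short. */

Working off of disk files later is mostly a matter of loading and rewriting
these lists behind the same three calls.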

Having built this, we'll want to extend it to support the sort of
import/export mechanisms that I proposed earlier; the effect will be
to let us build a large-scale YP service with individual data items
(tuples) living on some primary set of 2 or 3 servers, but available
everywhere.

I think this extension raises some interesting problems at the level
of the "ISIS architecture" to use -- e.g. how groups notify one
another about their desire to import/ability to export chunks of 
data.  I suggest that we pick up this topic in January sometime.

The last step of the implementation will be to hook this together with
the new long-haul facility so that our YP program can span multiple LAN's
running different versions of ISIS. 

When I have time (not soon, since I am trying to get ISIS V2.0 into a
distributable form now), I plan to throw together the basic triple-replicated
YP module.  Hopefully, a few of you will too...  My plan is to
benchmark this under ISIS V2.0 in late January and to post the code
in that timeframe; if we have a few competing designs, perhaps we can
compare and contrast...

Ken

PS: This first stage is trivial, in case you haven't noticed.  It
resembles the replication example in the ISIS Manual so closely that
you can practically type it in right from there.  I suggest that you
use a token-based replicated update scheme, with any server in the TMR
set satisfying read requests locally; to do a write you request the token
for the "file" in which the tuple resides, then CBCAST the update in
a simple message that all recipients interpret in parallel.
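
For anyone who wants that spelled out, here is a fragment of the idea;
request_token(), release_token() and cast_update() are just placeholders for
the ISIS token tool and CBCAST -- the real calls and their arguments are in
the manual:

    /* Sketch of the token-per-file update path.  The first three externs
     * stand in for the ISIS token tool and CBCAST interfaces; the real
     * names and arguments differ.  Reads never appear here: each replica
     * answers them from its local copy of the store sketched earlier.
     */
    typedef struct update {
        char file[64];                  /* e.g. "/yp/etc/services" */
        char key[64];
        char value[256];
        int  deleting;                  /* 1 = delete, 0 = store */
    } update;

    extern void request_token(char *file);      /* placeholder: token tool */
    extern void release_token(char *file);      /* placeholder: token tool */
    extern void cast_update(update *u);         /* placeholder: CBCAST to
                                                 * the server group */

    extern int  yp_delete(char *file, char *key);             /* earlier sketch */
    extern void yp_store(char *file, char *key, char *value); /* its update op */

    /* Writer side: the per-file token serializes writers, and CBCAST then
     * delivers each update in the same causal order at every replica. */
    void
    yp_write(update *u)
    {
        request_token(u->file);
        cast_update(u);
        release_token(u->file);
    }

    /* Delivery side: runs at every replica, including the sender. */
    void
    apply_update(update *u)
    {
        if (u->deleting)
            yp_delete(u->file, u->key);
        else
            yp_store(u->file, u->key, u->value);
    }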

Actually, there is a minutely faster protocol that combines the token
request message with the update message, but we can worry about that
some other time...

stan@squazmo.solbourne.com (Stan Hanks) (12/29/89)

>> Hmmm. Interesting. My perception is that we have two basic hurdles to 
>> overcome in the '90s: effective use of high speed networks, and the
>> fact that those of us (me included) who thought the future looked like
>> lots of machines coupled by message passing networks were partly 
>> wrong. It's starting to look like (to me, anyway -- and you can probably
>> find others who agree) that we're going to see a partitioning of the
>> classes of computer available Real Soon. On the one hand, we're gonna
>> have small cheap desktops a la the SPARCstation but cheaper (we'll see
>> 20 MIP systems with 16 MB memory and 300 MB disk under $7k by the end
>> of 1990 -- who knows what '91 looks like!). On the other hand, we're
>> gonna have real workstations, which will be shared memory multiprocessor
>> boxes, likely to be running some MACH, SunOS, or SystemV.4 variant.
>
>Yes, I'll buy this...  although I would add "massive servers" to the
>picture (lots of them).

Oh yeah.... Gotta have those. And the things coming down the pike
from all the major vendors are looking better and better.  Plus
stuff like the Legato NFS accelerator, and some of the RAID technology....

>> And with the real high speed networks coming soon, I expect that we're
>> going to find ourselves looking for a model which lets us treat all
>> IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>> in a manner that really works. I really expect the point-to-point
>> data reliability to happen at the hardware level exclusively by sometime
>> in the early '90s.
>
>This, I don't buy.  Problem is that you are concealing the "fact" of
>physical distribution, which many applications need to know about.
>For example, your view rules out a large class of control applications
>that need to know about "local" (==realtime response) and remote (slow
>but knowledgeable).

Yeah, but there's also a much broader class of problems that neither
know nor care about locality issues. Like most of the application programs
that people use. And almost all commercial applications. I've always
viewed control, real-time, and mission-critical fault-tolerant applications
as basically "special" -- we should consider them when designing things, but
we should design special things to accommodate them rather than fitting
accommodations for them into more general purpose things.

>Also, this approach is very weak for fault-tolerant applications.
>It's easy to recover when an RPC fails; hard to deal with a chunk of
>memory suddenly getting unmapped.

True, but we manage to handle page faults today -- I view this as sort
of a "network page fault".

>My feeling is that the interesting applications would rather have
>powerful but visible tools...

Depends on where you draw the "interesting" line -- my primary interest
has usually been trying to gain maximal use of network resources for
"traditional" computing. If we start looking specificly at real-time
and the like, then yes, you're right.

>(You might want to post this whole mail, plus your response; could
>make quite an interesting comp.sys.isis discussion if anyone follows
>up on it!)

Challenge accepted!

>> And you're right: scalability is a growing concern. As is operations over
>> distant and slow networks. The ISIS view of the world as computational
>> nodes connected by networks does real well for small numbers of nodes
>> connected by local networks; maybe some sort of paradigm that lets you
>> lump nodes into a meta-node (i.e. site? lab? etc.) connected by slower
>> networks would work to get you over that hurdle? Hmmmm. Note also that
>> if you take this sort of view, you can accommodate multiprocessors as
>> sort of a micro-meta-node where it has computational units connected
>> by very high speed network (shared memory). Not having thought about it
>> more than just to write this stuff down, it looks pretty elegant. I guess
>> I need to go off and push some chalk around the room for a while and
>> think about it some more....
>
>ISIS is moving towards hierarchical structures for just this reason.
>ISIS services would tend to have 2-3 processes per "active subgroup",
>perhaps a big envelope around the whole bunch per LAN, and inter-LAN
>tools for building WAN services.  We are close to having this now; the
>commercial ISIS (mid 1990) will include such a structuring facility.
>
>And, the interesting thing is that it stays pretty simple to program;
>structure doesn't always imply complexity.  

Hey, that's great. I wish more people would realize that the simplest
viable solution is oftentimes the most desirable.

>I haven't looked much at multiprocessors but we are starting to think
>we should.  I somewhat doubt that you would want to use process groups
>internally on such machines, but who knows...

I get asked about that all the time. We have folks who really want to 
put MACH or V up on our box (not the same folks, BTW) to play with stuff
like that. From what I've looked at, it seems that what you'd get is
(maybe) *real* concurrency for the various processes in the "active
subgroup" (or thread or team or...) plus the advantages of shared memory
between the components, which would let you address a whole host of 
interesting problems that you can't address today.

>> BTW, if you're interested in fault tolerance, you need to snarf David B.
>> Johnson's dissertation from Rice. He did some excellent work on fault
>> tolerance in message passing environments, even to the point of coming 
>> up with sort of a calculus for reasoning about tolerance requirements.
>> It should be available real soon -- he defended in October, but just 
>> recently got the official copy over to the dean's office. His address
>> is "dbj@rice.edu" in case you need it.
>
>As I mentioned, I've read several drafts of the paper on this.  Not
>bad stuff, but there has been a lot of similar work (Borg's Auragen
>system, Toueg&Koo checkpointing mechanism) and this stuff has many
>limitations (determinism, no lightweight threads; only tolerates a
>single failure), plus it seems to deadlock under some conditions.
>An old version of ISIS did something called "retained results" with
>similar limitations; we don't do this anymore because it seems to
>have been a so-so idea...  (But, for what it's worth, I do think
>the Johnson/Zwaenepoel paper is better than any other paper on this
>type of message logging, mostly because of the performance figures)
>
>I haven't seen the calculus, though.  I'll ask for a copy of the
>thesis.  My comments relate to "sender based message logging".

Right. Same stuff. He added a whole lot of work to prove that, for the
cases he was considering, his solutions were necessary and sufficient
to guarantee recovery. But good old "Mr. Meta-Problem" Dave went off
and developed what seems to be an excellent basis for reasoning about
fault tolerance in any distributed environment in order to accomplish this.

I'll be interested to hear what sort of responses people have to all this.
And, of course, real interested to see how ISIS works on one of our 
multiprocessors.

BTW, do you have ISIS for MACH yet? For what I'm looking at, it would 
give finer granularity than using OS/MP (our regular multiprocessor
version of SunOS).

Regards,

-- 
Stanley P. Hanks   Science Advisor                    Solbourne Computer, Inc.
Phone:             Corporate: (303) 772-3400           Houston: (713) 964-6705
E-mail:            ...!{boulder,sun,uunet}!stan!stan        stan@solbourne.com 

ken@gvax.cs.cornell.edu (Ken Birman) (12/29/89)

(As you have probably figured out, Stan and I were discussing the
"technology requirements" for systems built using gigabit lines
and other hot hardware... I basically argued that this push to
greater speed is creating more of a need for ISIS-like function;
Stan, as you will have gathered, is more of a point-to-point person
and hence skeptical of the need for ISIS group-style complexity...)

In article <1989Dec28.173847.11878@squazmo.solbourne.com> stan@squazmo.solbourne.com (Stan Hanks) writes:
>...
>>S And with the real high speed networks coming soon, I expect that we're
>>S going to find ourselves looking for a model which lets us treat all
>>S IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>>S in a manner that really works. I really expect the point-to-point
>>S data reliability to happen at the hardware level exclusively by sometime
>>S in the early '90s.
>K>
>K>This, I don't buy.  Problem is that you are concealing the "fact" of
>K>physical distribution, which many applications need to know about.
>K>For example, your view rules out a large class of control applications
>K>that need to know about "local" (==realtime response) and remote (slow
>K>but knowledgeable).
>
>S>Yeah, but there's also a much broader class of problems that neither
>S>know nor care about locality issues. Like most of the application programs
>S>that people use. And almost all commercial applications. I've always
>S>viewed control, real-time, and mission-critical fault-tolerant applications
>S>as basically "special" -- we should consider them when designing things, but
>S>we should design special things to accommodate them rather than fitting
>S>accommodations for them into more general purpose things.

I guess I buy this for some applications but I think you are arguing
an untenable point: namely, that there really isn't anything in a
distributed system (now or anytime soon) that needs to be "controlled".

If you equate control with, say, factory floor control, sure, there is
a lot of commercial stuff that doesn't need much controlling.

But, there is a larger and larger collection of stand-alone services
out there that need to control themselves and be highly available.
E.g., your average commercial outpost in Houston selling access to
a proprietary database on Texas geophysics or whatever.  This system
may well be spread over many nodes and will want high availability.
And, it needs to control load to avoid thrashing just because a
few too many queries came in at once.

I view this as a distributed control problem, too.  And, I think
that existing technology hasn't given us much of a handle on designing
these kinds of self-maintaining servers or systems.  So, I see
ISIS as providing the "glue" that holds together a system that
might well offer its clients a very vanilla RPC interface...

>K>Also, this approach is very weak for fault-tolerant applications.
>K>It's easy to recover when an RPC fails; hard to deal with a chunk of
>K>memory suddenly getting unmapped.
>
>S>True, but we manage to handle page faults today -- I view this as sort
>S>of a "network page fault".
>
>K>My feeling is that the interesting applications would rather have
>K>powerful but visible tools...
>
>S>Depends on where you draw the "interesting" line -- my primary interest
>S>has usually been trying to gain maximal use of network resources for
>S>"traditional" computing. If we start looking specifically at real-time
>S>and the like, then yes, you're right.

I don't buy the "fault tolerance is just a page fault problem" line;
I see little evidence that anyone has come up with systems able to
reconfigure gracefully when that happens.  Page faults are really easy to deal
with -- just fetch the page.  Failures are more of a mess: you may
need to clean up, restart programs, reattach programs to the new
servers, etc...  

This is why we tend to favor services that have 2 or 3 processes
cooperating and where you expect a reply from "anyone" and not
some specific process...
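
To make "a reply from anyone" concrete, here is a client-side fragment;
group_query() is a placeholder for an ISIS broadcast with the number of
replies wanted set to one, not a real entry point:

    /* Placeholder sketch of the "reply from anyone" idiom.  group_query()
     * stands in for an ISIS broadcast call that takes a number-of-replies
     * argument; the real call and its arguments differ.
     */
    #include <string.h>

    typedef struct query  { char map[64]; char key[64]; }  query;
    typedef struct answer { char value[256]; int found; }  answer;

    extern int group_query(int group, query *q, int nwanted, answer *a);

    int
    client_lookup(int server_group, char *map, char *key, answer *a)
    {
        query q;

        strncpy(q.map, map, sizeof(q.map) - 1);
        q.map[sizeof(q.map) - 1] = '\0';
        strncpy(q.key, key, sizeof(q.key) - 1);
        q.key[sizeof(q.key) - 1] = '\0';

        /* Ask the whole group but wait for exactly one answer.  If the
         * member that would have replied crashes, another member's reply
         * still completes the call -- the client sees no failure at all.
         */
        return group_query(server_group, &q, 1, a);
    }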

>>S> And you're right: scalability is a growing concern....

Well, glad we agree on something!

>S>I get asked about [multiprocessors] all the time. We have folks who
>S>really want to put MACH or V up on our box...
>S>I'll be interested to hear what sort of responses people have to all this.
>S>And, of course, real interested to see how ISIS works on one of our 
>S>multiprocessors.
>
>S>BTW, do you have ISIS for MACH yet? For what I'm looking at, it would 
>S>give finer granularity than using OS/MP (our regular multiprocessor
>S>version of SunOS).

ISIS seems fine on MACH.  I'm planning to test it under the forthcoming
Mt. Xinu MACH release next week, so it should be up and solid on their
Beta tape.  This will be ISIS V1.3.1, but V2.0 will also get checked
out on their system and will be available both from Cornell and, 
later, on Mt. Xinu's 2.1 release when that occurs.

Since MACH and OSF have lately become engaged, a few people asked
what came of the ISIS submission under the OSF DE RFT.  (How's that
for acronyms?)  Basically, OSF has ended up focusing on a lower
level of the environment -- things like clock and name servers, 
RPC data encoding, and the file system.  OSF seems to have 
decided to defer a decision on how (or whether) to include ISIS in their
world until after these urgent short-term questions are settled.

They did this by putting ISIS into a technology category for
submissions of possible interest to them (so they won't say "no")
but inappropriate for the DE part of OSF/2 (so they won't say
"yes").  However, if OSF/2 is really MACH based, ISIS should
be able to run on it.  And, I don't expect that OSF/2 will offer
some competing technology -- I know enough about the RFT technology
submissions to say that ISIS is aimed in a very different direction.

For example, at least half a dozen submissions were concerned
with linking UNIX to PC's running OS/2....

laubach@aspen.IAG.HP.COM (Mark Laubach) (01/09/90)

From an earlier note:

In article <1989Dec28.173847.11878@squazmo.solbourne.com> stan@squazmo.solbourne.com (Stan Hanks) writes:
>...
>>S And with the real high speed networks coming soon, I expect that we're
>>S going to find ourselves looking for a model which lets us treat all
>>S IPC as memory accesses (sort of like the CMU/IBM MEMNET stuff) but
>>S in a manner that really works. I really expect the point-to-point
>>S data reliability to happen at the hardware level exclusively by sometime
>>S in the early '90s.

Sorry to be so long in replying to this.  The only MEMNET that I know
of is the Farber/Delp research that went on at the University of
Delaware; MEMNET is the networked shared memory between IBM PCs.
Professor David Farber is now at the University of Pennsylvania, and Gary
Delp is now at the IBM Watson Research Center.  I can get anyone more
information, contacts, etc., on the continuing research if desired.

MEMNET, as published, has no connection with CMU that I know of.

Mark

ken@gvax.cs.cornell.edu (Ken Birman) (01/10/90)

In article <6860001@aspen.IAG.HP.COM> laubach@aspen.IAG.HP.COM
    (Mark Laubach) writes:
>MEMNET, as published, has no connection with CMU as I know it.
>

I wonder if Stan might not have had Kung's work on Nectar and Iwarp in
mind.  This is a high speed optical interconnect; it is so fast that
it makes paging off of remote machines over a network blaze in 
comparison with reading from a disk. This definitely does argue for
a shared memory abstraction.  DEC SRC has a similar interconnect.

It seems relevant that when Kung talks about this, he tends to say that
software for controlling complex distributed applications with replicated
data is the big obstacle, not hardware.  The problem he sees is that
the set of programs sharing the memory changes dynamically and the 
big picture is hence a very complex and very dynamic one, which presumably
has to be fault-tolerant too.  And, viewed from his perspective, the
technology for controlling this mess is lagging way behind the hardware.

We need an effective technology for distributed control and consistency
if we are to support facilities like this sort of shared memory in a robust
way.  Even if everyone programs using the resulting shared objects, 
someone needs to implement it...  In effect, shared objects become yet
another "tool" in our bag of tools; the underlying issues are hidden but
still relevant.