[comp.sys.hp] Fault-tolerant HP-UX?

randy@aplcomm.jhuapl.edu (RANDALL SCHRICKEL (NCE) x7661) (12/18/90)

Saw a short blurb about the subject in Unix Today, or some magazine. Who can
tell me anything about it? Supposedly it can be used at sites with 2 or more
HPs, so that if one crashes the other will take over with no down time.
Pricing, reviews, any details would be appreciated. Thanx.
--
	Randy Schrickel randy@aplcomm.jhuapl.edu
	Johns Hopkins Applied Physics Lab
	Laurel, MD 20723
	"Life goes on, long after the thrill of living has gone."

dhepner@hpcuhc.cup.hp.com (Dan Hepner) (12/19/90)

From: randy@aplcomm.jhuapl.edu (RANDALL SCHRICKEL (NCE) x7661)

>Saw a short blurb about the subject in Unix Today, or some magazine. Who can
>tell me anything about it? Supposedly it can be used at sites with 2 or more
>HPs, so that if one crashes the other will take over with no down time.
>Pricing, reviews, any details would be appreciated. Thanx.

HP-UX 8.0 will have an optional feature called "SwitchOver/UX",
which allows one HP-UX machine to back up as many as seven others
and take over in the event that one of the seven fails.

The backup machine "becomes" the failed machine in all important
matters, and reboots as the failed machine.  It takes over the disks, 
and thus uses exactly the same '/' directory.   A special Ethernet 
address is used for each machine in these "groups",  and the backup 
takes over the Ethernet address of the failed machine.  The IP address
is similarly unchanged.  Thus, accessing the takeover from a network
is indistinguishable from accessing the original machine.
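
The takeover idea can be sketched roughly in Python. This is not how SwitchOver/UX is implemented: the heartbeat mechanism, the class and field names, the timeout, and every address and device path below are assumptions made purely for illustration. The sketch shows one standby watching several primaries and adopting the identity (root disk, Ethernet address, IP address) of the first one whose heartbeat goes stale.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Identity:
    """What the network sees of a machine (all values here are made up)."""
    hostname: str
    root_disk: str
    ether_addr: str
    ip_addr: str


class Standby:
    """Hypothetical standby that backs up a limited number of primaries."""

    MAX_PRIMARIES = 7  # the post says one machine can back up seven others

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.primaries: Dict[str, Identity] = {}
        self.last_seen: Dict[str, float] = {}
        self.acting_as: Optional[Identity] = None

    def register(self, ident: Identity, now: float) -> None:
        if len(self.primaries) >= self.MAX_PRIMARIES:
            raise ValueError("standby already backs up seven machines")
        self.primaries[ident.hostname] = ident
        self.last_seen[ident.hostname] = now

    def heartbeat(self, hostname: str, now: float) -> None:
        self.last_seen[hostname] = now

    def check(self, now: float) -> Optional[Identity]:
        """Take over the first primary whose heartbeat is stale, if free."""
        if self.acting_as is not None:
            return self.acting_as  # one standby can cover only one failure
        for name, seen in self.last_seen.items():
            if now - seen > self.timeout:
                # "Become" the failed machine: same disks, same addresses.
                self.acting_as = self.primaries[name]
                return self.acting_as
        return None


standby = Standby(timeout=10.0)
a = Identity("hosta", "/dev/dsk/a", "08:00:09:00:00:01", "192.0.2.1")
b = Identity("hostb", "/dev/dsk/b", "08:00:09:00:00:02", "192.0.2.2")
standby.register(a, now=0.0)
standby.register(b, now=0.0)
standby.heartbeat("hosta", now=12.0)    # hosta is alive, hostb is silent
taken = standby.check(now=15.0)
print(taken.ip_addr if taken else "no takeover")
```

Because the standby keeps the failed machine's addresses, a client in this model needs no reconfiguration; it simply reconnects to the same IP address once the standby has rebooted.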

Pricing, etc, should be obtained from your local HP sales office.  If
you should have trouble locating such, send email.

Dan Hepner
dhepner@cup.hp.com

jsadler@misty.boeing.com (Jim Sadler) (12/19/90)

Ask your sales rep about SwitchOver/UX.  Also known as APR.


jim sadler
206-234-9009	email	uunet!bcstec!jsadler | jsadler@misty.boeing.com

This service is brought to you by the computing mafia of Boeing (BCS).
Oh ya
None of the above is an opinion of The Boeing Co.

rcbc@cs.cornell.edu (Robert Cooper) (12/21/90)

You may be interested in the ISIS System, a toolkit for fault tolerant
distributed programming that runs under HP-UX and over 20 other versions of
Unix. ISIS is available, free, from the Computer Science Dept., Cornell
University either via FTP, or via tape for a small handling fee. Here is
our standard blurb.  If you want to skip some of this blurb, search for
"--- How to get a copy of ISIS now ---".

-------------------------------------------------------------------

This is to announce the availability of a public distribution  of
the  ISIS  System,  a  toolkit for distributed and fault-tolerant
programming.  The initial version of ISIS runs on  UNIX  on  SUN,
DEC,  GOULD, AUX  and  HP systems; ports to other UNIX-like
systems are planned for the future.  No kernel changes are needed
to support ISIS; you just roll it in and should be able to use it
immediately.  The current implementation of ISIS performs well in
networks of up to about 100-200 sites.  Most users, however, run on
a smaller number of sites (16-32 is typical) and other sites connect
as "remote clients" that don't actually run ISIS directly. In this
mode many hundreds of ISIS users can be clustered around a smaller
set of ISIS "mother sites"; many users with large networks favor
such an architecture.


--- Who might find ISIS useful? ---

You will find ISIS useful if you  are  interested  in  developing
relatively sophisticated distributed programs under UNIX (eventu-
ally, other systems too).  These include programs that distribute
computations over multiple processes, need fault-tolerance, coor-
dinate activities  underway  at  several  places  in  a  network,
recover  automatically from software and hardware crashes, and/or
dynamically reconfigure while maintaining some  sort  of  distri-
buted  correctness  constraint at all times.  ISIS is also useful
in building certain types of distributed real time systems.

Here are examples of problems to which ISIS has been applied:

   o On the factory floor, we  are  working  with  an  industrial
     research  group  that is using ISIS to program decentralized
     cell controllers.  They need to arrive at a modular,
     expandable, fault-tolerant distributed system.  ISIS makes it
     possible for them to build such a system without a huge
     investment of effort.  (The ISIS group is also working closely with
     an automation standards consortium called  ANSA,  headed  by
     Andrew Herbert in Cambridge).

   o As part of a network file system, we built an  interface  to
     the  UNIX  NFS (we call ours "DECEIT") that supports tran-
     sparent file  replication  and  fault-tolerance.   DECEIT
     speaks NFS protocols but employs ISIS internally to maintain
     a consistent distributed state.  For  most  operations,  
     DECEIT  performance is at worst 50-75% of that of a normal NFS
     -- despite supporting file replication and fault-tolerance.
     Interestingly, for many common operations, DECEIT substantially
     outperforms NFS (!) and it is actually fairly hard to come up
     with workloads that demonstrate replication-related degradation.

   o A parallel "make" program.  Here, ISIS  was  used  within  a
     control  program that splits up large software recompilation
     tasks  and  runs  them  on  idle  workstations,   tolerating
     failures  and  dynamically  adapting  if  a  workstation  is
     reclaimed by its owner.

   o A system for monitoring and reacting to sensors scattered around
     the network, in software or in hardware.  This system, Meta, is
     actually included as part of our ISIS V2.1 release.  We are adding
     a high level language to it now, Lomita, in which you can specify
     reactive control rules or embed such rules into your C or Fortran
     code, or whatever.

   o In a hospital, we have looked at using ISIS to manage repli-
     cated data and to coordinate activities that may span multi-
     ple machines.  The problem here is  the  need  for  absolute
     correctness:  if a doctor is to trust a network to carry out
     orders that might impact on patient health, there is no room
     for  errors due to race conditions or failures.  At the same
     time, cost considerations argue for distributed systems that
     can  be  expanded  slowly  in  a fully decentralized manner.
     ISIS addresses both of these issues: it makes it far  easier
     to  build  a reliable, correct, distributed system that will
     manage  replicated  data  and  provide  complex  distributed
     behaviors.  And, ISIS is designed to scale well.

   o For programming numerical algorithms.  One group at  Cornell
     used  ISIS  to  distribute  matrix  computations  over large
     numbers of workstations.  They did this because the worksta-
     tions were available, mostly idle, and added up to a tremen-
     dous computational engine.  Another group, at LANL, uses ISIS
     in a parallel plasma physics application.

   o In a graphics rendering application.  Over an extended period,
     a Cornell graphics group (not even in our department) has used
     ISIS to build distributed rendering software for image 
     generation.  They basically use a set of machines as a parallel
     processor, with a server that farms out rendering tasks and
     a variable set of slave computing units that join up when their
     host machine is fairly idle and drop out if the owner comes
     back to use the machine again.  This is a nice load sharing
     paradigm and makes for sexy demos too.

   o In a wide-area seismic monitoring system (i.e. a system that
     has both local-area networks and wide-area connections between
     them), developed by a company called SAIC on a DARPA contract.
     The system gathers seismic data remotely, preprocesses it, and
     ships event descriptions to a free-standing analysis "hub", which
     must run completely automatically (their people in San Diego don't like
     to be phoned in the middle of the night to debug problems in Norway).
     The hub may request data transfers and other complex computations,
     raising a number of wide-area programming problems.  In addition, the
     hub system itself has a lot of programs in various languages and
     just keeping it running can be a challenge.

   o On brokerage and banking trading floors.  Here, ISIS tends to be
     an adjunct to a technology for distributing quotes, because the
     special solutions for solving that specific problem are so fast
     that it is hard for us to compete with them (we normally don't
     have the freedom of specifying the hardware... many "ticker plant
     vendors" wire the whole floor for you).  However, to the extent
     that these systems have problems requiring fault-tolerance, simple
     database integration mechanisms, dynamic restart of services, 
     and in general need "reactive monitoring and control" mechanisms,
     ISIS works well.  And, with our newer versions of the ISIS protocols,
     performance is actually good enough to handle distribution of 
     stock quotes or other information directly in ISIS, although 
     one has to be a bit careful in super performance intensive settings.
     (The commercial ISIS release should compete well with the sorts of
     commercial alternatives listed above on a performance basis, but
     more than 10 trading groups are using ISIS V2.1 despite the fact that
     it is definitely slower!).
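
The parallel make and rendering examples in the list above share one pattern: a coordinator farms out tasks to whatever workstations are idle and retries a task when its worker crashes or is reclaimed. A minimal Python sketch of that pattern follows. It uses no ISIS calls; the class, method, and task names are hypothetical, and real failure detection is replaced by an explicit dropped() call.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class TaskFarm:
    """Hypothetical coordinator: hands out tasks, requeues them on dropout."""

    def __init__(self, tasks: List[str]) -> None:
        self.pending: Deque[str] = deque(tasks)
        self.running: Dict[str, str] = {}   # worker -> task
        self.done: List[str] = []

    def request_work(self, worker: str) -> Optional[str]:
        """An idle workstation joins (or returns) and asks for a task."""
        if worker in self.running or not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker] = task
        return task

    def finished(self, worker: str) -> None:
        self.done.append(self.running.pop(worker))

    def dropped(self, worker: str) -> None:
        """Worker crashed or was reclaimed by its owner: retry its task."""
        task = self.running.pop(worker, None)
        if task is not None:
            self.pending.appendleft(task)


farm = TaskFarm(["a.o", "b.o", "c.o"])
farm.request_work("ws1")        # ws1 gets a.o
farm.request_work("ws2")        # ws2 gets b.o
farm.dropped("ws2")             # owner came back; b.o is requeued
farm.finished("ws1")
farm.request_work("ws1")        # ws1 picks up b.o
farm.request_work("ws3")        # ws3 gets c.o
farm.finished("ws1")
farm.finished("ws3")
print(sorted(farm.done))
```

The sketch keeps all state in the coordinator, so it says nothing about what happens if the coordinator itself fails; that is the part a fault-tolerance toolkit would have to supply.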

The problems above are characterized by several features.  First,
they  would all be very difficult to solve using remote procedure
calls or transactions against some shared  database.   They  have
complex,  distributed  correctness constraints on them: what hap-
pens at site "a" often requires a coordinated action at site  "b"
to  be  correct.   And,  they do a lot of work in the application
program itself, so that the ISIS communication mechanism  is  not
the bottleneck.

If you have an application like this, or are interested in taking
on  this  kind  of  application,  ISIS  may be a big win for you.
Instead of investing resources in building an environment  within
which  to  solve  your application, using ISIS means that you can
tackle the application immediately,  and  get  something  working
much faster than if you start with RPC (remote procedure calls).

On the other hand, don't think of ISIS as competing with RPC or
database transactions.  We are oriented towards online control and
coordination problems, fault-tolerance of main-memory databases, etc.
ISIS normally co-exists with other mechanisms, such as conventional
streams and RPC, databases, or whatever.  The system is highly portable
and not very intrusive, and many of our users employ it to control some
form of old code running a computation they don't want to touch at
any price.


--- What ISIS does ---

The ISIS system has been under development for several  years  at
Cornell  University.   After  an  initial  focus on transactional
"resilient objects", the emphasis shifted in 1986  to  a  toolkit
style  of  programming.   This approach stresses distributed con-
sistency in applications that  manage  replicated  data  or  that
require  distributed  actions  to  be taken in response to events
occurring in the system.  An "event" could be a user request on a
distributed service, a change to the system configuration result-
ing from a process or site failure or recovery, a timeout, etc.

The ISIS toolkit uses a subroutine call style  interface  similar
to  the interface to any conventional operating system.  The pri-
mary difference, however, is  that  ISIS  functions  as  a  meta-
operating  system.   ISIS system calls result in actions that may
span multiple processes and machines in the  network.   Moreover,
ISIS  provides  a  novel  "virtual  synchrony"  property  to  its
users.  This property makes it easy to build  software  in  which
currently  executing processes behave in a coordinated way, main-
tain replicated data, or otherwise satisfy a system-wide correct-
ness  property.   Moreover,  virtual synchrony makes even complex
operations look atomic,  which  generally  implies  that  toolkit
functions  will  not  interfere  with  one another.  One can take
advantage of this to develop distributed ISIS software in a  sim-
ple  step-by-step style, starting with a non-distributed program,
then adding  replicated  data  or  backup  processes  for  fault-
tolerance  or higher availability, then extending the distributed
solution to support dynamic reconfiguration, etc.  ISIS  provides
a  really  unique style of distributed programming -- at least if
your distributed computing problems run up against the issues  we
address.   For  such  applications, the ISIS programming style is
both easy and intuitive.
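
As a rough illustration of the property described above, the following Python toy puts membership changes and broadcasts into one agreed order and delivers each event together with the membership ("view") in force at that point. It is a single-process stand-in, not the ISIS protocols or the ISIS API; all class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # ("join" | "leave" | "msg", payload)


class Member:
    def __init__(self, name: str) -> None:
        self.name = name
        self.history: List[Tuple[Tuple[str, ...], Event]] = []

    def deliver(self, view: Tuple[str, ...], event: Event) -> None:
        # Record the event together with the membership it was delivered in.
        self.history.append((view, event))


class Group:
    """Toy stand-in for a process group with one agreed event order."""

    def __init__(self) -> None:
        self.members: Dict[str, Member] = {}

    def _view(self) -> Tuple[str, ...]:
        return tuple(sorted(self.members))

    def _deliver_all(self, event: Event) -> None:
        view = self._view()
        for member in self.members.values():
            member.deliver(view, event)

    def join(self, member: Member) -> None:
        self.members[member.name] = member
        self._deliver_all(("join", member.name))

    def leave(self, name: str) -> None:
        del self.members[name]
        self._deliver_all(("leave", name))

    def broadcast(self, text: str) -> None:
        self._deliver_all(("msg", text))


g = Group()
p, q, r = Member("p"), Member("q"), Member("r")
g.join(p)
g.join(q)
g.broadcast("x=1")
g.join(r)
g.broadcast("x=2")
g.leave("q")
g.broadcast("x=3")

# From the moment r joined, p and r have seen identical (view, event) pairs.
print(p.history[-4:] == r.history)
```

Because every surviving member sees the same events in the same views, each can update its local copy of replicated data without further agreement, which is what lets a program grow step by step from a non-distributed version.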

ISIS is really intended for, and is good at, problems  that  draw
heavily  on  replication of data and coordination of actions by a
set of processes that know about one  another's  existence.   For
example,  in  a factory, one might need to coordinate the actions
of a set of machine-controlled drills at  a  manufacturing  cell.
Each  drill  would  do  its  part of the overall work to be done,
using a coordinated  scheduling  policy  that  avoids  collisions
between  the  drill heads, and with fault-tolerance mechanisms to
deal with bits breaking.  ISIS is ideally suited to solving prob-
lems  like  this  one.  Similar problems arise in any distributed
setting, be it local-area network software for the  office  or  a
CAD  problem,  or  the  automation of a critical care system in a
hospital.

ISIS is not intended for transactional database applications.  If
this  is  what  you  need, you should obtain one of the many such
systems that are now available.  On the other hand, ISIS would be
useful  if  your  goal  is to build a front-end in a setting that
needs databases.  The point is that  most  database  systems  are
designed  to  avoid interference between simultaneously executing
processes.  If your application also  needs  cooperation  between
processes  doing  things  concurrently at several places, you may
find this aspect hard to solve  using  just  a  database  because
databases  force  the  interactions to be done indirectly through
the shared data.  ISIS is good for solving this kind of  problem,
because  it  provides  a direct way to replicate control informa-
tion, coordinate the actions of the front-end processes,  and  to
detect and react to failures.

ISIS itself runs as a user-domain program on  UNIX  systems  sup-
porting  the  TCP/IP protocol suite.  It currently is operational
on SUN, DEC, GOULD and HP versions of UNIX.  Language  interfaces
for C, C++, FORTRAN, and Common LISP (both Lucid and Allegro) are
included, and a new C-Prolog interface is being tested now.  Recent 
ports available in V2.1 include AUX for the Apple Mac II, AIX on the
IBM RS/6000 and also the older PC/RT.  A Cray UNICOS port is (still)
under development at LANL, and a DEC VMS port is being done by
ISIS Distributed Systems, Inc.

ISIS runs over Mach on anything that supports Mach but will probably
look a little unnatural to you if you use the Mach primitives.  We
are planning a version of ISIS that would be more transparent in a
Mach context, but it will be some time before this becomes available.
Meanwhile, you can use ISIS but may find some aspects of the interface
inconsistent with the way that Mach does things.

The actual set of tools includes the following:

   o High performance mechanisms supporting lightweight tasks  in
     UNIX,  a  simple message-passing facility, and a very simple
     and  uniform  addressing  mechanism.   Users  do  not   work
     directly  with things like ports, sockets, binding, connect-
     ing, etc.  ISIS handles all of this.

   o A process "grouping" facility, which  permits  processes  to
     dynamically  form and leave symbolically-named associations.
     The system serializes changes  to  the  membership  of  each
     group: all members see the same sequence of changes.  Group
     names can be used as a location-transparent address.

   o A suite of  broadcast  protocols  integrated  with  a  group
     addressing  mechanism.   This  suite  operates in a way that
     makes it look as if all broadcasts are received  "simultane-
     ously"  by  all  the members of a group, and are received in
     the same "view" of group membership.

   o Ways of obtaining distributed executions.   When  a  request
     arrives in a group, or a distributed event takes place, ISIS
     supports any of a variety of execution styles, ranging  from
     a  redundant computation to a coordinator-cohort computation
     in which one process takes the requested actions while  oth-
     ers back it up, taking over if the coordinator fails.

   o Replicated data with 1-copy consistency guarantees.

   o Synchronization   facilities,  based  on  token  passing  or
     read/write locks.

   o Facilities for watching for a process or  site  (computer)
     to fail or recover, triggering execution of subroutines pro-
     vided by the user when the  watched-for  event  occurs.   If
     several  members  of  a  group watch for the same event, all
     will see it at the same "time" with respect to arriving mes-
     sages  to  the group and other events, such as group member-
     ship changes.

   o A facility for joining  a  group  and  atomically  obtaining
     copies of any variables or data structures that comprise its
     "state" at the instant before the  join  takes  place.   The
     programmer who designs a group can specify state information
     in addition to the state automatically maintained by ISIS.

   o Automatic restart of applications when a  computer  recovers
     from  a crash, including log-based recovery (if desired) for
     cases when all representatives of a service fail  simultane-
     ously.

   o Ways to build transactions or  to  deal  with  transactional
     files  and  database  systems external to ISIS.  ISIS itself
     doesn't know about files or transactions.  However, as noted
     above, this tool is pretty unsophisticated as transactional 
     tools go...

   o Spooler/long-haul mechanism, for saving data to be sent to a
     group next time it recovers, or for sending from one ISIS LAN
     to another, physically remote one (e.g. from your Norway site
     to your San Diego installation).  Note: ISIS will not normally
     run over communication links subject to frequent failures, al-
     though this long-haul interface has no such restrictions.
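
The coordinator-cohort execution style in the list above can be sketched in a few lines of Python. This is only an illustration of the idea, not ISIS code: the names are hypothetical, ranking is simply list order, and failure detection is simulated by an exception rather than by the failure-watching facility the toolkit provides.

```python
from typing import Callable, List, Optional


class Cohort:
    def __init__(self, name: str, alive: bool = True) -> None:
        self.name = name
        self.alive = alive

    def handle(self, request: int, action: Callable[[int], int]) -> int:
        if not self.alive:
            raise ConnectionError(f"{self.name} has failed")
        return action(request)


def coordinator_cohort(members: List[Cohort], request: int,
                       action: Callable[[int], int]) -> Optional[str]:
    """Try members in rank order; the first live one acts as coordinator.

    Returns "name:result", or None if every member has failed.
    """
    for member in members:
        try:
            result = member.handle(request, action)
        except ConnectionError:
            continue  # the next cohort in rank order takes over
        return f"{member.name}:{result}"
    return None


group = [Cohort("m1", alive=False), Cohort("m2"), Cohort("m3")]
print(coordinator_cohort(group, 21, lambda x: 2 * x))  # prints "m2:42"
```

Only one member performs the action per request, so the cost is close to that of a non-replicated service; the cohorts matter only when the coordinator fails.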

Everything in ISIS is fault-tolerant.  Our programming manual has
been  written  in  a tutorial style, and gives details on each of
these mechanisms.  It includes examples  of  typical  small  ISIS
applications and how they can be solved.  The distribution of the
system includes demos, such as the parallel  make  facility  men-
tioned  above;  this  large  ISIS application program illustrates
many system features.

To summarize, ISIS provides a broad  range  of  tools,  including
some  that  require algorithms that would be very hard to support
in other systems or to implement by hand.  Performance  is  quite
good:  most  tools require between 1/20 and 1/5 second to execute
on a SUN 3/60, although the actual  numbers  depend  on  how  big
process  groups get, the speed of the network, the locations of
processes involved, etc.  Overall, however, the system is  really
quite fast when compared with, say, file access over the network.
For certain common operations  a  five  to  ten-fold  performance
improvement  is expected within two years, as we implement a col-
lection of optimizations.  The system scales well with  the  size
of  the  network,  and  system overhead is largely independent of
network size.  On a machine that is not participating in any ISIS
application, the overhead of having ISIS running is negligible.
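
To put the figures above in more familiar units, here is a small worked calculation in Python. It only restates the numbers quoted in the paragraph; the projection assumes the hoped-for five to ten-fold improvement applies uniformly, whereas the text promises it only for certain common operations.

```python
# Latency range quoted in the text for most tools on a SUN 3/60.
fast_s, slow_s = 1 / 20, 1 / 5

# Equivalent back-to-back operation rates for a single caller.
print(f"{1 / slow_s:.0f} to {1 / fast_s:.0f} operations per second")

# Projected latencies if a 5x to 10x speedup applied across the range.
best_ms = fast_s / 10 * 1000
worst_ms = slow_s / 5 * 1000
print(f"{best_ms:.0f} ms to {worst_ms:.0f} ms after optimization")
```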

In certain communication scenarios, ISIS performance can be quite
good.  These involve streaming data within a single group or certain
client-server interaction patterns, and make use of a new BYPASS
communication protocol suite.  Future ISIS development is likely
to stress extensions and optimizations at this level of the system.
In addition, a lot of effort is going into scaling the system
to larger environments.

--- How to get a copy of ISIS now ---

Version V2.1 of ISIS is now fully operational and is being made
available to the public.  This version consists of a C implementation
for UNIX, and has been ported to AIX, SUN UNIX, MACH, ULTRIX, Gould UNIX,
HP-UX, AUX and APOLLO UNIX (release 10.1).  Performance is uniformly good.
A 400 page tutorial and system manual containing numerous programming
examples is also available, and online manual pages are provided.

The remainder of this posting focuses on how to get ISIS, and how
to get the manual.  Everything is free except bound copies of the
manual.  Source is included, but the  system  is  in  the  public
domain, and is released on condition that any ports to other sys-
tems or minor modifications remain in  the  public  domain.   The
manual  is  copyrighted  by the project and is available in hard-
copy form or as a DVI file, with figures available  for  free  on
request.

We have placed compressed TAR images in the following places:
 * cu-arpa.cs.cornell.edu (anonymous login, binary mode pub/ISISV21.TAR.Z)
 * Doc: cu-arpa.cs.cornell.edu (pub/ISISV21-DOC.TAR.Z)
 * uunet.uu.net (anonymous login, binary mode networks/ISIS/ISISV21.TAR.Z)
 * mcsun.eu.net (anonymous login, binary mode networks/ISIS/ISISV21.TAR.Z)
Also available are DVI and PS versions  of  our  manual.   Bound
copies  will  be  available at $25 each.  A package of figures to
glue into the DVI version will be provided free of charge.

A tape containing ISIS will be provided upon  payment of a charge to
cover our costs in making the tape.  Our resources are limited and
we do not wish to do much of this.


--- Copyright, restrictions ---

V2.1 of ISIS is subject to a restrictive copyright; basically, you can
use it without changing it in any way you like, but are not permitted
to develop "derivative versions" without discussing this with us.
V2.1 differs substantially from V1.3.1, which was released in the public
domain and remains available without any restrictions whatsoever.

On the other hand, whereas previous versions of ISIS required export
licenses to be sent to certain Eastern Bloc countries, the present
version seems not to be subject to this restriction.  Contact the US
Dept. of Commerce for details if you plan to export ISIS to a country
that might be subject to restrictions.  Any place in Europe, Japan, etc.
should be fine and no license is required.

--- Commercial support ---

We are working with a local  company,  ISIS  Distributed  Systems
Inc.,  to  provide  support services for ISIS.  This company will
prepare distributions and work to fix  bugs.   Support  contracts
are  available  for an annual fee; without a contract, we will do
our best to be helpful but make no promises.  Other services that
IDS  plans  to  provide will include consulting on fault-tolerant
distributed systems design, instruction on how to work with ISIS,
bug  identification  and  fixes,  and  contractual joint software
development projects.  The company is also prepared to port  ISIS
to   other  systems  or  other  programming  languages.   Contact
"birman@gvax.cs.cornell.edu" for more information.


--- If you want ISIS, but have questions, let us know ---

Send mail to isis@cs.cornell.edu, subject  "I  want  ISIS",
with electronic and physical mailing details.  We will send you a
form for acknowledging agreement with the conditions for  release
of the software and will later contact you with details on how to
actually copy the system off our machine to yours.


--- You can read more about ISIS if you like ---

The following papers and documents are  available  from  Cornell.
We don't distribute papers by e-mail.  Requests for papers should
be transmitted to "isis@cs.cornell.edu".

  1. Exploiting replication.  K. Birman and T. Joseph.  This is a
     preprint  of  a  chapter  that will appear in: Arctic 88, An
     advanced course on operating systems, Tromso,  Norway  (July
     1988).  50pp.

  2. Reliable broadcast protocols.   T.  Joseph  and  K.  Birman.
     This  is a preprint of a chapter that will appear in: Arctic
     88, An advanced course on operating systems, Tromso,  Norway
     (July 1988).  30pp.

  3. ISIS: A distributed programming  environment.  User's  guide
     and  reference  manual.   K.  Birman, T. Joseph, F. Schmuck.
     Cornell University, March 1988.  275pp.

  4. Exploiting virtual synchrony  in  distributed  systems.   K.
     Birman and T. Joseph.  Proc. 11th ACM Symposium on Operating
     Systems Principles (SOSP),  Nov.  1987.  12pp.

  5. Reliable communication in  an  unreliable  environment.   K.
     Birman and T. Joseph.  ACM Transactions on Computer Systems,
     Feb. 1987.  29pp.

  6. Low cost management of  replicated  data  in  fault-tolerant
     distributed systems.  T. Joseph and K. Birman.  ACM Transac-
     tions on Computer Systems, Feb. 1986.  15pp.

  7. Fast causal multicast.  K. Birman, A. Schiper, P. Stephenson.
     Dept. of Computer Science TR, May 1990.

  8. Distributed application management.  K. Marzullo, M. Wood, R.
     Cooper, K. Birman.  Dept. of Computer Science TR, June 1990.

We will be happy to provide reprints of these papers.  Unless  we
get  an  overwhelming  number of requests, we plan no fees except
for the manual.  We also maintain a mailing list for  individuals
who  would  like to receive publications generated by the project
on an ongoing basis.  The last two papers can be copied using FTP
from cu-arpa.cs.cornell.edu.

If you want to learn about virtual synchrony  as  an  approach
to  distributed computing, the best place to start is with refer-
ence [1].  If you want to learn more about the ISIS system,  how-
ever,  start  with the manual.  It has been written in a tutorial
style and should be easily accessible to anyone familiar with the
C programming language.  References [7] and [8] are typical of our
recent publications (there are others -- contact Maureen Robinson
for details).