[mod.computers.vax] VAX Cluster Failures: Summary of Replies

McGuire_Ed@GRINNELL.MAILNET (08/04/86)

>Date: 31 Jul 86 13:21:00 PDT
>From: "ANCHOR::PEARSON" <pearson%anchor.decnet@lll-icdc>
>Subject: VAX Cluster Failures: Summary of Replies (long).
>
>A Quorum disk must be a system disk. However, note this warning:
>
>          "With a two-node cluster, you'll want to use a
>          quorum disk.  Choosing it could be tricky, if, as
>          I recall, a quorum disk can't be shadowed [ TRUE! ]:
>          If you make either of your system disks the
>          quorum disk, when that disk goes, both the system
>          that boots from it and the disk's quorum vote
>          go, so the remaining system hangs waiting for
>          quorum - exactly what having a quorum disk and
>          two system disks was supposed to avoid!  So you'd
>          need some third non-shadowed disk to use as the
>          quorum disk...."

This seems to contradict my experience with a two-node cluster.  If a disk goes
offline, I/O queued to the disk waits until the disk comes back online or until
mount verification is aborted.  Operations on other devices, however, are not
affected.  In particular, cluster communication does not stop.

For example, suppose system A boots from disk DA, system B boots from DB, and
DB is the quorum disk.  There are 3 votes--quorum is 2.  If DB goes offline,
the surviving votes are reduced to 2, which still satisfies quorum.  Many
processes on system B are liable to stall waiting for I/O completion on DB, but
system B does not withdraw its vote from the cluster.  If system A is
configured never to do I/O to the other node's system disk, it is not affected
by the disk failure.
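
The vote arithmetic is simple enough to sketch.  The following is only an
illustration in Python of the bookkeeping described above; the names, the
one-vote-per-member assignment, and the fixed quorum of 2 are assumptions
taken from this example, not anything VMS itself exposes.

    # Toy model of the voting members in the example above.
    # Each member (node A, node B, quorum disk DB) holds one vote.
    VOTES = {"A": 1, "B": 1, "DB": 1}
    QUORUM = 2          # quorum for 3 expected votes

    def cluster_continues(failed_members):
        """True if the surviving votes still meet quorum."""
        surviving = sum(v for m, v in VOTES.items()
                        if m not in failed_members)
        return surviving >= QUORUM

    print(cluster_continues({"DB"}))        # True: 2 votes remain
    print(cluster_continues({"B", "DB"}))   # False: only A's vote is left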

In the case that mount verification is aborted, my information is incomplete.
I've never had a system disk in mount verification.  If B then crashes, the
surviving votes drop to 1, below quorum, and A hangs.  It is up to you to make
sure that mount verification is not aborted, or to use the console method of
reducing the remaining system's quorum to 1 after the crash.  This would take
several minutes.
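
Continuing the same illustrative Python sketch (the quorum reassignment below
stands in for the console procedure; it is not a real VMS command):

    # After DB fails and B crashes, only A's single vote survives.
    surviving, quorum = 1, 2
    print(surviving >= quorum)   # False: A hangs waiting for quorum

    # The console method mentioned above amounts to lowering the
    # remaining system's quorum so its single vote is enough.
    quorum = 1
    print(surviving >= quorum)   # True: A can proceed alone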

>In a dual-HSC50 system, each disk is accessed by one HSC50 at a time.
>The path from that disk to the other HSC50 won't be used until the
>first HSC50 fails.  This means that you generally can't be too
>confident that your backup system will work when you need it.  If you
>want to find out whether one HSC50 is working, reboot the other one. (!)

You can also use the path select buttons on each disk to test HSC failover.  On
an RA81, the A button is lit if the A path is active. Pop the A button to
prevent I/O on the A path.  The B button will light shortly, as the other HSC
takes over I/O for the disk.

McGuire_Ed%GRINNELL.MAILNET@MIT-MULTICS.ARPA