[mod.computers.vax] VAX Cluster Failures: Summary of Replies

pearson%anchor.DECnet@lll-icdc ("ANCHOR::PEARSON") (07/31/86)

Profuse thanks to the numerous VAX veterans who responded to my request
for information on the stability of small VAX clusters. Each response
has been studied and greatly appreciated.

Here is a digest of the results. Quoted passages are direct excerpts
from responses from the net. Unquoted passages are my interpretations
of various inputs.


Reservations about the particular configuration I described:
------------------------------------------------------------

	"You have specified only a single HSC.  Should it
	fail, the whole thing drops dead!  Two HSCs would
	preclude this possibility."

My mistake.  We were, in fact, planning on two HSC50s.

	"A really stable cluster requires 2-HSCs and 3
	CPUs.  This is a very stable cofiguration and gets
	more stable as it gets bigger.  Our cluster is now
	6 nodes and 2 HSCs."


Concerns about "quorum":
------------------------

	"You have only two nodes on the cluster.  Should
	one fail you will lose quorum and the functioning
	node would need to be rebooted to restore
	operation.  The stated fix for this is to
	establish a "QUORUM DISK".  This serves as a
	deadlock breaker.  However, this still takes time.
	If one system fails, the second will lock for
	several seconds while the cluster reconfigures."

A valid suggestion. One possibly significant cost: a quorum disk adds
60 seconds to the "transition time" when the cluster loses a node.
(The vote arithmetic behind all this is sketched after the quoted
warning below.)

Quorum disks cannot be shadowed disks.

A quorum disk must be a system disk. However, note this warning:

	"With a two-node cluster, you'll want to use a
	quorum disk.  Choosing it could be tricky, if, as
	I recall, a quorum disk can't be shadowed [ TRUE! ]:
	If you make either of your system disks the
	quorum disk, when that disk goes, both the system
	that boots from it and the disk's quorum vote
	goes, so the remaining system hangs waiting for
	quorum - exactly what having a quorum disk and
	two system disks was supposed to avoid!  So you'd
	need some third non-shadowed disk to use as the
	quorum disk...."
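
To make the quorum arithmetic concrete, here is a minimal sketch (in
Python, purely for illustration), assuming the usual rule that quorum
is (expected votes + 2) divided by 2, truncated, with one vote per
node and one vote for the quorum disk. The names are illustrative
only; they are not VMS parameters.

	def quorum(expected_votes):
	    # Votes required for the cluster to keep running.
	    return (expected_votes + 2) // 2

	def survives(surviving_votes, expected_votes):
	    return surviving_votes >= quorum(expected_votes)

	# Two nodes, one vote each, no quorum disk: expected votes = 2,
	# quorum = 2.  Lose a node and the survivor holds 1 vote: it hangs.
	print(survives(surviving_votes=1, expected_votes=2))   # False

	# Add a quorum disk worth one vote: expected votes = 3, quorum = 2.
	# Lose a node; the survivor plus the quorum disk hold 2 votes, so
	# the cluster keeps running after the transition.
	print(survives(surviving_votes=2, expected_votes=3))   # True

	# But if the quorum disk was the failed node's (unshadowed) system
	# disk, its vote dies with the node: back to 1 vote, and a hang,
	# exactly the situation the quote above warns about.
	print(survives(surviving_votes=1, expected_votes=3))   # False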


Things that cause cluster crashes:
---------------------------------

Power glitches during storms.

Power outages.

Software installations and VMS upgrades.

One machine's transmitter goes crazy and blasts continuously on the
CI.

A problem with a particular disk crashed every machine that tried to
access that disk. While this did not directly crash the cluster, it
was only a matter of time before the surviving nodes no longer held a
quorum.

"[Y]our DEC-Man just powers-down the HSC without any advance
notice...(It happened to us...)."


Things that force scheduled outages:
-----------------------------------

One informant reports that disk quotas "seem to get out of whack",
requiring some ritual to get them corrected again. Another says that
if you have a lot of crashes, disk space gets lost (i.e., not in files
and not in the space-available maps), requiring a cluster outage to
recover it.

One cluster has to be rebooted about once a month because a tape gets
allocated to a nonexistent process.

About transition times:
----------------------

For a cluster of 2-3 nodes (without quorum disks), a transition time
of 20 seconds should be achievable with appropriate parameter
settings. A quorum disk adds 60 seconds to the transition time.
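
As a rough budget, using only the figures reported above (about 20
seconds base for a small, well-tuned cluster, plus about 60 seconds
if a quorum disk is configured), the expected transition times work
out as follows. These are the informants' estimates, not guarantees:

	def transition_seconds(quorum_disk, base=20, quorum_disk_penalty=60):
	    # Reported figures only: ~20 s base, plus ~60 s for a quorum disk.
	    return base + (quorum_disk_penalty if quorum_disk else 0)

	print(transition_seconds(quorum_disk=False))   # 20
	print(transition_seconds(quorum_disk=True))    # 80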

During a cluster transition, VAXes are not dead: they are merely
locked at IPL 7 or 8. Interrupts are still being serviced, and
keyboard entries are still being buffered.

HSC50 idiosyncrasies:
--------------------

Rebooting an HSC50 takes something like 3 to 5 minutes, but doesn't
produce a service outage if there's another HSC50.

	"Our HSC50 has had to be rebooted about once every
	two weeks or so.  Sometimes it reboots itself.
	This doesn't take the cluster down, but all
	programs trying to do disk i/o pause for about 5
	minutes while the HSC50 reboots."

In a dual-HSC50 system, each disk is accessed by one HSC50 at a time.
The path from that disk to the other HSC50 won't be used until the
first HSC50 fails. Since that backup path goes untested in normal
operation, you can't be very confident it will work when you need it.
If you want to find out whether one HSC50 is working, reboot the
other one. (!)

HSC50s have consoles that can run out of paper, and this is no
trivial problem.  They also have switches that should be left in
"Secure" (versus "Enable").  If the HSC50 runs out of paper while the
switch is in "Enable", it will go down and may not come up again.


Testimonials:
------------


Two years on VMS 4.x, only one full crash (power glitches during storm).

18 months, dual-750 cluster with 5 RA81s: no major problems. From
time to time, one CPU bug-checks and then recovers.

18 months, 3 785s: except for power outages, always up.

Two-node, single-HSC; also, 8 nodes, multiple HSCs: "crashes that take
down the whole thing are very, very rare - months apart, at least.
There is one exception to this: Just after a VMS upgrade, it's possible
to screw things up so that the system becomes less stable than it
normally is."

Dual-8600, single-HSC50 system crashes due to disk problems about
once every couple of months.

"One of our two CPUs has been crashing about once every two weeks. The
other system just pauses briefly when the first goes down."

"Most of our problems are with disks crashs."

"In our environment the reliability is quite high.  I do not have
a number for how many crashes we have had, but -- since 4.2 -- we
have had very few crashes.  The crashes we have had were on single
nodes, none for the entire cluster.  And all but one of our crashes
were caused by a locally written (still in development) driver or
by thunderstorm activity.  The Engineering cluster has also been
very reliable.  I do not believe they have had any cluster-wide
crashes either."

"When the first system comes back up again, it has to 'rebuild' the disk
caches of all the mounted volumes, which locks out the other system
from the volume set being rebuilt."


General consensus:
-----------------

The tenor of all the responses I received was that clustering of more
than two VAXes substantially enhances availability. Since VMS 4.2,
crashes of individual nodes have been tolerably common, while crashes
of whole clusters have been months apart or completely unknown.

As for the original notion of clustering 2 VAXes: After much additional
exploration and discussion, I've reached the conclusion that clustering
just 2 VAXes is not a good way to increase availability.

  -  Peter
pearson%anchor.decnet@lll-icdc.arpa
------