pearson%anchor.DECnet@lll-icdc ("ANCHOR::PEARSON") (07/31/86)
Profuse thanks to the numerous VAX veterans who responded to my request for information on the stability of small VAX clusters. Each response has been studied and greatly appreciated. Here is a digest of the results. Quoted passages are direct excerpts from responses from the net. Unquoted passages are my interpretations of various inputs.

Reservations about the particular configuration I described:
------------------------------------------------------------

"You have specified only a single HSC. Should it fail, the whole thing drops dead! Two HSCs would preclude this possibility."

My mistake. We were, in fact, planning on two HSC50s.

"A really stable cluster requires 2 HSCs and 3 CPUs. This is a very stable configuration and gets more stable as it gets bigger. Our cluster is now 6 nodes and 2 HSCs."

Concerns about "quorum":
------------------------

"You have only two nodes on the cluster. Should one fail, you will lose quorum, and the functioning node would need to be rebooted to restore operation. The stated fix for this is to establish a "QUORUM DISK". This serves as a deadlock breaker. However, this still takes time. If one system fails, the second will lock for several seconds while the cluster reconfigures."

A valid suggestion. One possibly significant cost is that a quorum disk adds 60 seconds to the "transition time" when the cluster loses a node. Quorum disks cannot be shadowed disks. A quorum disk must be a system disk. However, note this warning:

"With a two-node cluster, you'll want to use a quorum disk. Choosing it could be tricky if, as I recall, a quorum disk can't be shadowed [ TRUE! ]: If you make either of your system disks the quorum disk, when that disk goes, both the system that boots from it and the disk's quorum vote go, so the remaining system hangs waiting for quorum - exactly what having a quorum disk and two system disks was supposed to avoid! So you'd need some third non-shadowed disk to use as the quorum disk...."

(A small vote-counting sketch of these quorum scenarios appears after the transition-time section below.)

Things that cause cluster crashes:
---------------------------------

Power glitches during storms.

Power outages.

Software installations and VMS upgrades.

One machine's transmitter goes crazy and blasts continuously on the CI.

A problem with a particular disk crashed every machine that tried to access that disk. While this did not directly crash the cluster, it's just a matter of time before the survivors are no longer a quorum.

"[Y]our DEC-Man just powers down the HSC without any advance notice... (It happened to us...)."

Things that force scheduled outages:
-----------------------------------

One informant reports that disk quotas "seem to get out of whack", requiring some ritual to get them corrected again.

Another says that if you have a lot of crashes, disk space will get lost (i.e. not in files and not in space-available maps), requiring a cluster outage to recover it.

One cluster has to be rebooted about once a month because a tape gets allocated to a non-existent process.

About transition times:
----------------------

For a cluster of 2-3 nodes (without quorum disks), a transition time of 20 seconds should be achievable with appropriate parameter settings. A quorum disk adds 60 seconds to the transition time.

During a cluster transition, VAXes are not dead: they are merely locked at IPL 7 or 8. Interrupts are still being serviced, and keyboard entries are still being buffered.
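For the curious, here is a small vote-counting sketch of the quorum scenarios discussed above. It is mine, not from any of the responses, and the quorum formula it uses (total votes divided by two, plus one - i.e. a simple majority) is my understanding of how the cluster connection manager counts votes, so treat it as illustrative only. It is written in Python purely for compactness:

    # Vote-counting sketch for the quorum scenarios described above.
    # Assumption: quorum = (total_votes + 2) // 2, a simple majority rule,
    # which is my understanding of the connection manager's formula.

    def quorum(total_votes):
        """Minimum surviving votes needed for the cluster to keep running."""
        return (total_votes + 2) // 2

    def survives(total_votes, surviving_votes):
        """True if the surviving members still hold quorum."""
        return surviving_votes >= quorum(total_votes)

    # Two nodes, one vote each, no quorum disk: losing either node leaves
    # 1 of 2 votes, below quorum (2), so the survivor hangs.
    print(survives(2, 1))   # False

    # Two nodes plus a third, non-shadowed quorum disk, one vote each:
    # losing one node leaves 2 of 3 votes, which still meets quorum (2),
    # so the survivor keeps running after the transition pause.
    print(survives(3, 2))   # True

    # Quorum disk placed on one node's system disk: if that disk fails,
    # the node booted from it goes down too, so both its node vote and
    # the disk's vote are lost -- 1 of 3 votes, below quorum, and the
    # remaining node hangs anyway.
    print(survives(3, 1))   # False

The same arithmetic is behind the advice, quoted earlier, that a really stable cluster wants three or more voting members.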
"Our HSC50 has had to be rebooted about once every two weeks or so. Sometimes it reboots itself. This doesn't take the cluster down, but all programs trying to do disk i/o pause for about 5 minutes while the HSC50 reboots." In a dual-HSC50 system, each disk is accessed by one HSC50 at a time. The path from that disk to the other HSC50 won't be used until the first HSC50 fails. This means that you generally can't be too confident that your backup system will work when you need it. If you want to find out whether one HSC50 is working, reboot the other one. (!) HSC50s have consoles that can run out of paper, and this is no trivial problem. They also have switches that should be left in "Secure" (versus "Enable"). If the HSC50 runs out of paper while the switch is in "Enable", it will go down and may not come up again. Testimonials: ------------ Two years on VMS 4.x, only one full crash (power glitches during storm). 18 months, dual 750 cluster 5 RA81s: No major problems. From time to time, one CPU bug-checks and then recovers. 18 months, 3 785s: except for power outages, always up. Two-node, single-HSC; also, 8 nodes, multiple HSCs: "crashes that take down the whole thing are very, very rare - months apart, at least. There is one exception to this: Just after a VMS upgrade, it's possible to screw things up so that the system becomes less stable than it normally is." Dual-8600 single-HSC50 system crashes due to disk problems once every couple months. "One of our two CPUs has been crashing about once every two weeks. The other system just pauses briefly when the first goes down." "Most of our problems are with disks crashs." "In our environment the reliability is quite high. I do not have a number for how many crashes we have had, but -- since 4.2 -- we have had very few crashes. The crashes we have had were on single nodes, none for the entire cluster. And all but one of our crashes were caused by a locally written (still in development) driver or by thunderstorm activity. The Engineering cluster has also been very reliable. I do not believe they have had any cluster-wide crashes either." "When the first system comes back up again, it has to 'rebuild' the disk caches of all the mounted volumes, which locks out the other system from the volume set being rebuilt." General consensus: ----------------- The tenor of all the responses I received was that clustering of more than two VAXes substantially enhances availability. Since VMS 4.2, crashes of individual nodes have been tolerably common, while crashes of whole clusters have been months apart or completely unknown. As for the original notion of clustering 2 VAXes: After much additional exploration and discussion, I've reached the conclusion that clustering just 2 VAXes is not a good way to increase availability. - Peter pearson%anchor.decnet@lll-icdc.arpa ------