SYSMGR@IPG.PH.KCL.AC.UK (01/04/88)
We have a 4-node LAVC connected via a DELNI, which in turn is connected to thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently, we had to extend the thickwire. I was most surprised that the fairly brief interruption to the thickwire caused all LAVC satellite nodes to perform a CLUEXIT bugcheck!

I think this may be because the LAVC installation procedure reduces the SYSGEN parameter RECNXINTERVAL to 20 seconds, so that after the Ethernet is u/s for longer than that the cluster falls apart.

What I would like to ask any LAVC experts out there is:

1 - Is my diagnosis right?

2 - Is there any way to prevent a cluster crash for this reason? Assuming my diagnosis is right, setting RECNXINTERVAL to something sensible like 300 should work - but why do DEC reduce it from the default 60 to 20 in the first place? Has anyone out there actually tried this fix?

3 - Why does a DELNI cause communication through itself to fail when the only fault is on the thickwire to which it is connected? Is there any way to prevent this action?

A merry new year to all, and thanks in advance for any help offered.

Nigel Arnot (Dept. of Physics, Kings College, the Strand, London WC2R 2LS, UK)

Janet: SYSMGR@UK.AC.KCL.PH.IPG
Arpa: SYSMGR%UK.AC.KCL.PH.IPG@UKACRL.BITNET
UUCP: SYSMGR%UK.AC.STRATH.VAXA@UKC
Bitnet/NetNorth/Earn: SYSMGR@IPG.PH.KCL.AC.UK (OR) SYSMGR%IPG.PH.KCL@AC.UK
Phone: +44 1 836 6192
dp@JASPER.Palladian.COM (Jeffrey Del Papa) (01/05/88)
> 3 - Why does a DELNI cause communication through itself to fail when the only
> fault is on the thickwire to which it is connected? Is there any way to
> prevent this action?

A DELNI merely provides a substitute for transceivers. In theory it is the same as having each machine with its own transceiver on the thickwire (in fact one of the vendors labels its equivalent box a 'transceiver fanout'). The device does not 'stage' or route packets for devices not on the DELNI, nor is it slower than having separate transceivers (two common misconceptions).

It does have a mode where it does not need a wire connected, so you could operate your cluster while they are servicing the thickwire by throwing the little switch on the front to separate yourself from the backbone while they work on it. Otherwise the device provides no special 'insulation' from the thick backbone.

Consider yourself lucky that your machine does something as obvious and (by comparison) harmless as crashing. When you break the ether, a Symbolics machine can destroy data on its disk. (We once lost our file server for the better part of a day: movers knocked a badly crimped connector loose, causing the file server to update the index file equivalent shifted 14 bits left.)

<dp>
nagy%warner.hepnet@LBL.GOV (Frank J. Nagy, VAX Wizard & Guru) (01/05/88)
Nigel Arnot (Dept. of Physics, Kings College) writes:

> We have a 4-node LAVC connected via a DELNI, which in turn is connected to
> thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
> we had to extend the thickwire. I was most surprised that the fairly brief
> interruption to the thickwire caused all LAVC satellite nodes to perform a
> CLUEXIT bugcheck!

> 1 - Is my diagnosis right? (RECNXINTERVAL caused crash)

Not quite; the system parameter which controls the polling for new cluster boot nodes or failed cluster circuits is PAPOLLINTERVAL. Don't be fooled by the documentation talking about the CI; the major difference between a CI VAXCluster and an LAVC is the PEDRIVER, which provides a CI Port Emulator for the LAVC. So the same "CI" parameters apply in an LAVC also. From the V4.4 Release Notes on RECNXINTERVAL:

"This parameter specifies the amount of time that the connection manager waits between the loss of a connection to a remote node and the initiation of a cluster transition to remove the failed node from the cluster."

And since, in an LAVC, a satellite node is defunct once communication to the boot node has been lost, the satellite nodes bugcheck with CLUEXIT.

> 2 - Is there any way to prevent a cluster crash for this reason? Assuming my
> diagnosis is right, setting RECNXINTERVAL to something sensible like 300
> should work - but why do DEC reduce it from the default 60 to 20 in the
> first place? Has anyone out there actually tried this fix?

See answer #3 below. Sounds plausible and worth a try at least. Anyone want to experiment and report to the net?

> 3 - Why does a DELNI cause communication through itself to fail when the only
> fault is on the thickwire to which it is connected? Is there any way to
> prevent this action?

The DELNI is just replacing (up to) 8 transceivers and a length of EtherHose (the thick yellow/orange cable). It provides no electrical or protocol buffering and (except for a time delay) acts just like a transceiver tapped directly to the EtherHose.

Since your entire LAVC is connected to the DELNI, you could have just flipped the small switch on the DELNI to local operation before the EtherHose was opened. In this mode, the DELNI will ignore the tap on the EtherHose and the nodes on the DELNI can continue to function (sans any outside connections). When the EtherHose is online again, you just flip the switch back to establish outside connections. There are no problems with flipping the DELNI switch with the systems live; this is something I have done in the past (not on LAVCs, but there is no reason why it shouldn't work there also).

= Frank J. Nagy "VAX Guru & Wizard"
= Fermilab Research Division EED/Controls
= HEPNET: WARNER::NAGY (43198::NAGY) or FNAL::NAGY (43009::NAGY)
= BitNet: NAGY@FNAL
= USnail: Fermilab POB 500 MS/220 Batavia, IL 60510
SYSRUTH@utorphys.BITNET (01/07/88)
The DELNI has a switch on it (the only switch, in fact) which can be put in either of 2 positions: one to talk to the coax, and the other to keep communications inside itself (standalone). You can't mix the two. If you are talking to other machines on the coax, your cluster's communications are also going out onto the coax, and the DELNI retrieves all packets from the coax as well. Hence your cluster members lost touch with each other when you took the coax terminator off.

In future, when you plan to do this, you should flip the switch on the DELNI to standalone *before* killing the thickwire, and then your cluster won't be affected by the work (it will not be able to talk to anything not on the DELNI, but then it can't anyway under those conditions). Your users will only notice this if they are using terminal servers which are not on the DELNI, but at least the cluster stays up, which greatly reduces the impact of the work on the system as a whole.

Setting RECNXINTERVAL to 300 is likely not a good idea. If one of your cluster satellite nodes crashes, the remaining members will hang for 5 minutes attempting to re-establish communications. This may not be a problem if it can reboot within that time, but if it can't, it's an unnecessary wait. We don't run an LAVC, but certainly that's how a regular cluster behaves, and I would expect the LAVC to do something similar.

The most obvious reason I can think of for reducing it from the usual 60s to 20s is so that if someone shuts down their desktop VAXstation, everyone else isn't sitting around twiddling their thumbs for so long. Although, again, I'm not that familiar with LAVCs, so there might have been another, more technical, reason.

Personally, I'd have been far more surprised if your entire cluster had stayed up!

Ruth Milner
Systems Manager
University of Toronto Physics
SYSRUTH@UTORPHYS.BITNET
tihor@acf4.UUCP (Stephen Tihor) (01/08/88)
The cluster is only going to hang insofar as the dead node holds votes needed for the cluster quorum (unlikely with a satellite in an LAVC) or holds locks on resources (such as files or volumes) which other nodes need.
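
The trade-off discussed in this thread can be put into a rough toy model. The sketch below is only an illustration of the reasoning in the posts above, not a description of how the VMS connection manager is implemented. The function names, the 45-second outage, and the 600-second reboot time are all hypothetical values chosen for the example.

```python
# Toy model of the RECNXINTERVAL trade-off described in the thread.
# Assumption: a node is dropped once a link outage lasts at least
# RECNXINTERVAL seconds; the real connection manager is more involved.

def survives_outage(outage_s: float, recnxinterval_s: float) -> bool:
    """True if the link comes back before the reconnection interval expires."""
    return outage_s < recnxinterval_s


def stall_after_crash(recnxinterval_s: float, reboot_s: float,
                      holds_votes_or_locks: bool) -> float:
    """Seconds the surviving nodes may wait on a crashed node (simplified).

    Per the posts above, the wait only matters when the dead node holds
    quorum votes or locks that others need, and it ends either when the
    node reboots or when RECNXINTERVAL expires, whichever comes first.
    """
    if not holds_votes_or_locks:
        return 0.0
    return float(min(recnxinterval_s, reboot_s))


# Compare the three settings mentioned in the thread against a
# hypothetical 45 s cable outage and a hypothetical 600 s reboot.
for interval in (20, 60, 300):
    print(interval,
          survives_outage(45, interval),
          stall_after_crash(interval, 600, True))
```

Under these assumptions, a longer interval rides out longer cable interruptions but lengthens the worst-case wait when a node that holds votes or needed locks really has crashed.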
miw@uqcspe.OZ (Mark Williams) (01/14/88)
In article <8801041936.AA01332@ucbvax.Berkeley.EDU> SYSMGR@IPG.PH.KCL.AC.UK writes:
>We have a 4-node LAVC connected via a DELNI, which in turn is connected to
>thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
>we had to extend the thickwire. I was most surprised that the fairly brief
>interruption to the thickwire caused all LAVC satellite nodes to perform a
>CLUEXIT bugcheck!
> [stuff deleted]
>3 - Why does a DELNI cause communication through itself to fail when the only
> fault is on the thickwire to which it is connected? Is there any way to
> prevent this action?

The communication through the DELNI fails because it is repeating all the junk reflections, bogus collisions, and stuff that are happening out on the coaxial cable. Being on the DELNI is almost the same as being attached to the cable directly.

However, if you KNOW you are going to be disturbing the coax, you can isolate the DELNI from the cable by switching the little black switch to the right of the coax connector to the [\] position.

Mark Williams
ccwilliams%wombat.decnet.uq.oz@uunet.uu.net
--
The views expressed above are not necessarily those of my employer. In a couple of hours they may not even be my own.
Pound for pound, the amoeba is the most vicious creature on Earth.