[comp.os.vms] LAVC help/info request

SYSMGR@IPG.PH.KCL.AC.UK (01/04/88)

We have a 4-node LAVC connected via a DELNI, which in turn is connected to
thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
we had to extend the thickwire. I was most surprised that the fairly brief
interruption to the thickwire caused all LAVC satellite nodes to perform a
CLUEXIT bugcheck!

I think this may be because the LAVC installation procedure reduces the
SYSGEN parameter RECNXINTERVAL to 20 seconds, so that after the Ethernet is
u/s for longer than that the cluster falls apart. What I would like to ask
any LAVC experts out there is:

1 - Is my diagnosis right?

2 - Is there any way to prevent a cluster crash for this reason? Assuming my
    diagnosis is right, setting RECNXINTERVAL to something sensible like 300
    should work - but why do DEC reduce it from the default 60 to 20 in the
    first place? Has anyone out there actually tried this fix?

3 - Why does a DELNI cause communication through itself to fail when the only
    fault is on the thickwire to which it is connected? Is there any way to
    prevent this action?

    A merry new year to all, and thanks in advance for any help offered.


Nigel Arnot (Dept. of Physics, Kings College, the Strand, London WC2R 2LS, UK)

                 Janet: SYSMGR@UK.AC.KCL.PH.IPG
                 Arpa:  SYSMGR%UK.AC.KCL.PH.IPG@UKACRL.BITNET
                 UUCP:  SYSMGR%UK.AC.STRATH.VAXA@UKC
 Bitnet/NetNorth/Earn:  SYSMGR@IPG.PH.KCL.AC.UK (OR) SYSMGR%IPG.PH.KCL@AC.UK
                 Phone: +44 1 836 6192

dp@JASPER.Palladian.COM (Jeffrey Del Papa) (01/05/88)

    3 - Why does a DELNI cause communication through itself to fail when the only
	fault is on the thickwire to which it is connected? Is there any way to
	prevent this action?

A DELNI merely provides a substitute for transceivers. In theory it is the same
as having each machine with its own transceiver on the thickwire (in fact one of
the vendors labels its equivalent box a 'transceiver fanout'). The device does not
'stage' or route packets for devices not on the DELNI, nor is it slower than
having separate transceivers (two common misconceptions). It does have a mode
where it does not need a wire connected, so you could operate your cluster when
they are servicing the thickwire by throwing the little switch on the front to
separate yourself from the backbone while they work on it. Otherwise the device
provides no special 'insulation' from the thick backbone.

Consider yourself lucky that your machine does something as obvious and (by
comparison) harmless as crashing. When you break the ether, a Symbolics machine
can destroy data on its disk. (We once lost our file server for the better part
of a day: movers knocked a badly crimped connector loose, causing the file
server to update the index file equivalent shifted 14 bits left.)

<dp>

nagy%warner.hepnet@LBL.GOV (Frank J. Nagy, VAX Wizard & Guru) (01/05/88)

Nigel Arnot (Dept. of Physics, Kings College) writes:

> We have a 4-node LAVC connected via a DELNI, which in turn is connected to
> thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
> we had to extend the thickwire. I was most surprised that the fairly brief
> interruption to the thickwire caused all LAVC satellite nodes to perform a
> CLUEXIT bugcheck!

> 1 - Is my diagnosis right? (RECNXINTERVAL caused crash)

Not quite; the system parameter which controls the polling for new cluster
boot nodes or failed cluster circuits is PAPOLLINTERVAL.  Don't be fooled
by the documentation talking about the CI; the major difference between
a CI VAXCluster and an LAVC is the PEDRIVER which provides a CI Port
Emulator for the LAVC.  So the same "CI" parameters apply in an LAVC
also.  

From the V4.4 Release Notes on RECNXINTERVAL: "This parameter specifies
the amount of time that the connection manager waits between the loss of
a connection to a remote node and the initiation of a cluster transition
to remove the failed node from the cluster."  And since, in an LAVC, a
satellite node is defunct once communication to the boot node has been lost,
the satellite nodes bugcheck with CLUEXIT.
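
As a rough illustration of the release-note wording quoted above, here is a
toy model in Python. It is not VMS code: the function name and the 45-second
outage are invented for the example, and the behaviour exactly at the
boundary is an assumption.

```python
def connection_recovered_in_time(outage_seconds: float, recnxinterval: float) -> bool:
    """Toy model of the quoted release-note text.

    The connection manager waits `recnxinterval` seconds after losing a
    connection before it starts a cluster transition to remove the node.
    Assumption: an outage of exactly `recnxinterval` counts as too long.
    """
    return outage_seconds < recnxinterval


# Hypothetical 45-second cable interruption, checked against the three
# RECNXINTERVAL values mentioned in this thread (20, 60 and 300 seconds).
for interval in (20, 60, 300):
    ok = connection_recovered_in_time(45, interval)
    state = "node stays in cluster" if ok else "node removed from cluster"
    print(f"RECNXINTERVAL={interval:>3}: {state}")
```

Under this simplified model a 45-second break is survived at settings of 60
or 300 but not at 20; a real cluster has other timers that the model ignores.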

> 2 - Is there any way to prevent a cluster crash for this reason? Assuming my
>     diagnosis is right, setting RECNXINTERVAL to something sensible like 300
>     should work - but why do DEC reduce it from the default 60 to 20 in the
>     first place? Has anyone out there actually tried this fix?

See answer #3 below.  Sounds plausible and worth a try at least.  Anyone
want to experiment and report to the net?

> 3 - Why does a DELNI cause communication through itself to fail when the only
>     fault is on the thickwire to which it is connected? Is there any way to
>     prevent this action?

The DELNI is just replacing (up to) 8 transceivers and a length of EtherHose
(the thick yellow/orange cable).  It provides no electrical or protocol
buffering and (except for a time delay) acts just like a transceiver tapped
directly to the EtherHose.  Since your entire LAVC is connected to the
DELNI, you could have just flipped the small switch on the DELNI to local
operation before the EtherHose was opened.  In this mode, the DELNI
will ignore the tap on the EtherHose and the nodes on the DELNI could
continue to function (sans any outside connections).  When the EtherHose
is online again, you just flip the switch back to establish outside
connections.  No problems with flipping the DELNI switch with the systems
live; this is something I have done in the past (not on LAVCs, but no
reason why it shouldn't work there also).

= Frank J. Nagy   "VAX Guru & Wizard"
= Fermilab Research Division EED/Controls
= HEPNET: WARNER::NAGY (43198::NAGY) or FNAL::NAGY (43009::NAGY)
= BitNet: NAGY@FNAL
= USnail: Fermilab POB 500 MS/220 Batavia, IL 60510

SYSRUTH@utorphys.BITNET (01/07/88)

The DELNI has a switch on it (the only switch, in fact) which can be put
in either of 2 positions: 1 to talk to the coax, and the other to keep
communications inside itself (standalone). You can't mix the two. If you are
talking to other machines on the coax, your cluster's communications are
also going out onto the coax, and the DELNI retrieves all packets from
the coax as well. Hence your cluster members lost touch with each other
when you took the coax terminator off. In future, when you plan to do this,
you should flip the switch on the DELNI to standalone *before* killing the
thickwire, and then your cluster won't be affected by the work (it will
not be able to talk to anything not on the DELNI, but then it can't anyway
under those conditions). Your users will only notice this if they are
using terminal servers which are not on the DELNI, but at least the cluster
stays up, which greatly reduces the impact of the work on the system as a
whole.
     
Setting RECNXINTERVAL to 300 is likely not a good idea. If one of your cluster
satellite nodes crashes, the remaining members will hang for 5 minutes
attempting to re-establish communications. This may not be a problem if
it can reboot within that time, but if it can't, it's an unnecessary wait.
We don't run an LAVC, but certainly that's how a regular cluster behaves,
and I would expect the LAVC to do something similar. The most obvious reason
I can think of for reducing it from the usual 60s to 20s is so that if someone
shuts down their desktop VAXstation, everyone else isn't sitting around
twiddling their thumbs for so long. Although, again, I'm not that familiar
with LAVCs, so there might have been another, more technical, reason.
     
Personally, I'd have been far more surprised if your entire cluster had
stayed up!
     
Ruth Milner
Systems Manager
University of Toronto Physics
     
SYSRUTH@UTORPHYS.BITNET

tihor@acf4.UUCP (Stephen Tihor) (01/08/88)

The cluster is only going to hang insofar as the dead node holds votes needed
for the cluster quorum (unlikely with a satellite in an LAVC) or holds locks
on resources (such as files or volumes) which other nodes need.
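
The quorum half of this point can be sketched in Python. This is a toy model
only: it assumes the usual VMS rule that quorum is (EXPECTED_VOTES + 2) / 2
rounded down, which is not stated in this thread, and it assumes
EXPECTED_VOTES equals the sum of all votes. The node names and vote
assignments are hypothetical, and lock ownership is not modelled.

```python
def quorum(expected_votes: int) -> int:
    """Assumed VMS rule: quorum = (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def hangs_after_loss(votes: dict, lost: str) -> bool:
    """Return True if losing node `lost` drops the cluster below quorum.

    `votes` maps node name -> votes held. EXPECTED_VOTES is assumed to be
    the sum of all votes (an assumption made for this sketch).
    """
    expected = sum(votes.values())
    remaining = sum(v for node, v in votes.items() if node != lost)
    return remaining < quorum(expected)


# Hypothetical 4-node LAVC: the boot node holds the only vote,
# the three satellites hold none.
votes = {"BOOT": 1, "SAT1": 0, "SAT2": 0, "SAT3": 0}
print("lose SAT1:", hangs_after_loss(votes, "SAT1"))
print("lose BOOT:", hangs_after_loss(votes, "BOOT"))
```

With these made-up vote assignments, losing a voteless satellite leaves quorum
intact, while losing the boot node does not.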

miw@uqcspe.OZ (Mark Williams) (01/14/88)

In article <8801041936.AA01332@ucbvax.Berkeley.EDU> SYSMGR@IPG.PH.KCL.AC.UK writes:
>We have a 4-node LAVC connected via a DELNI, which in turn is connected to
>thick wire Ethernet for comms. to other systems (not LAVC nodes). Recently,
>we had to extend the thickwire. I was most surprised that the fairly brief
>interruption to the thickwire caused all LAVC satellite nodes to perform a
>CLUEXIT bugcheck!
>                 [stuff deleted]
>3 - Why does a DELNI cause communication through itself to fail when the only
>    fault is on the thickwire to which it is connected? Is there any way to
>    prevent this action?

The communication through the DELNI fails because it is repeating all the
junk reflections, bogus collisions, and stuff that are happening out on
the coaxial cable. Being on the DELNI is almost the same as being attached
to the cable directly.
	However, if you KNOW you are going to be disturbing the co-ax,
you can isolate the DELNI from the cable by switching the little black
switch to the right of the co-ax connector to the [\] position.

Mark Williams
ccwilliams%wombat.decnet.uq.oz@uunet.uu.net

-- 
The views expressed above are not necessarily those of my employer. In a
couple of hours they may not even be my own.

Pound for pound, the amoeba is the most vicious creature on Earth.