[comp.unix.ultrix] Pair of DS5400's with cross-mounted RA90's in AB mode

"Ian! D. Allen [CGL]" <idallen@watcgl.waterloo.edu> (08/23/90)

Is what I'm doing here safe?

I have a pair of DS5400 machines, each with a pair of RA90's and a
pair of controllers.  They connect as expected:

    I mount cabinet 0 disk 0 port A on controller 0 of cpu0
    I mount cabinet 0 disk 1 port A on controller 1 of cpu0

    I mount cabinet 1 disk 0 port A on controller 0 of cpu1
    I mount cabinet 1 disk 1 port A on controller 1 of cpu1

That's the main stuff; each machine is connected to its two local disks.
Now for the cross-mount in case of failure of one of the cpu's:

    I mount cabinet 0 disk 0 port B as the second disk on controller 1 of cpu1
    I mount cabinet 0 disk 1 port B as the second disk on controller 0 of cpu1

    I mount cabinet 1 disk 0 port B as the second disk on controller 1 of cpu0
    I mount cabinet 1 disk 1 port B as the second disk on controller 0 of cpu0

Thus, each disk is connected to each cpu; each controller has one
disk named "0" and one disk named "1".
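
As a sanity check on the cabling just described, the map can be written down as a table and verified mechanically. This is only a sketch: the tuple layout and the function names are made up for illustration, and it assumes the disk number within a cabinet is also the unit number the controller sees.

```python
# Cabling transcribed from the description above:
# (cabinet, disk, port) -> (cpu, controller).
# Assumption: the disk number doubles as the unit number on the controller.
CABLES = {
    (0, 0, "A"): (0, 0),
    (0, 1, "A"): (0, 1),
    (1, 0, "A"): (1, 0),
    (1, 1, "A"): (1, 1),
    (0, 0, "B"): (1, 1),
    (0, 1, "B"): (1, 0),
    (1, 0, "B"): (0, 1),
    (1, 1, "B"): (0, 0),
}


def units_per_controller(cables):
    """Map each (cpu, controller) to the sorted unit numbers cabled to it."""
    result = {}
    for (_cabinet, disk, _port), ctrl in cables.items():
        result.setdefault(ctrl, []).append(disk)
    return {ctrl: sorted(units) for ctrl, units in result.items()}


def cpus_per_disk(cables):
    """Map each (cabinet, disk) to the set of cpus that can reach it."""
    result = {}
    for (cabinet, disk, _port), (cpu, _ctrl) in cables.items():
        result.setdefault((cabinet, disk), set()).add(cpu)
    return result


if __name__ == "__main__":
    for (cpu, ctrl), units in sorted(units_per_controller(CABLES).items()):
        print(f"cpu{cpu} controller {ctrl}: units {units}")
    for (cabinet, disk), cpus in sorted(cpus_per_disk(CABLES).items()):
        print(f"cabinet {cabinet} disk {disk}: reachable from cpus {sorted(cpus)}")
```

If the table is right, every controller carries exactly one unit 0 and one unit 1, and every disk is reachable from both cpus.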

If I leave all the RA90 "AB" switches both enabled, booting either 5400
alone lets it find all four disks, which are configured as ra0, ra1,
ra2, ra3.  Once one 5400 has found all four disks, it mounts the local
two in its cabinet.  If I then boot the second 5400, it only finds the
remaining two unmounted disks (which are in its cabinet), but not the
two disks mounted by the first 5400.  If I then shut down the first
5400, the second 5400 suddenly (without even a reboot) finds the two
external disks and prints a message to that effect on its console.
Booting the first 5400 again, it only finds its two local disks
(because the second 5400 has its local two mounted).
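
The boot sequence above can be restated as a toy model. It is purely a restatement of what I observe, not a description of how the RA90 or the controllers actually arbitrate the ports; the class name, the disk labels, and the rule "a mounted disk is invisible to the other cpu" are all assumptions made for the sketch.

```python
class DualPortModel:
    """Toy model: a mounted disk can be found only by the cpu that mounted it."""

    def __init__(self, local_disks):
        # local_disks: cpu name -> list of disks in that cpu's cabinet
        self.local = local_disks
        self.all_disks = sorted(d for disks in local_disks.values() for d in disks)
        self.holder = {}                                  # disk -> cpu that mounted it
        self.seen = {cpu: set() for cpu in local_disks}   # disks each cpu knows about
        self.up = set()                                   # cpus currently running

    def boot(self, cpu):
        """Boot a cpu: it finds every disk not mounted elsewhere, mounts its own."""
        self.up.add(cpu)
        self.seen[cpu] = {d for d in self.all_disks
                          if self.holder.get(d, cpu) == cpu}
        for disk in self.local[cpu]:
            if disk in self.seen[cpu]:
                self.holder[disk] = cpu
        return sorted(self.seen[cpu])

    def shutdown(self, cpu):
        """Shut a cpu down: its disks are released and running cpus notice them."""
        self.up.discard(cpu)
        self.seen[cpu] = set()
        released = [d for d, h in self.holder.items() if h == cpu]
        for disk in released:
            del self.holder[disk]
        for other in self.up:
            self.seen[other] |= set(released)   # found without a reboot
        return sorted(released)


if __name__ == "__main__":
    m = DualPortModel({"cpu0": ["c0d0", "c0d1"], "cpu1": ["c1d0", "c1d1"]})
    print("boot cpu0 finds", m.boot("cpu0"))
    print("boot cpu1 finds", m.boot("cpu1"))
    print("shutdown cpu0 releases", m.shutdown("cpu0"))
    print("cpu1 now knows", sorted(m.seen["cpu1"]))
    print("reboot cpu0 finds", m.boot("cpu0"))
```

In this model the rebooted cpu0 finds only its own two disks, while cpu1 is left knowing about all four even though it has mounted only its local pair.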

My question is: is giving each 5400 access simultaneously to a disk going
to cause problems, even though I don't actually mount any disk on more
than one CPU?  That is, should I be running with both AB switches
enabled, or should I really only enable access to one cpu at a time?

Having both AB enabled is a great convenience, since I don't have to
be physically at the machine to push buttons and switch disks over if one
cpu fails; I just mount the other disks.  But is letting both kernels
know about the disks at the same time safe?
-- 
-IAN! (Ian! D. Allen) idallen@watcgl.uwaterloo.ca idallen@watcgl.waterloo.edu
 [129.97.128.64]  Computer Graphics Lab/University of Waterloo/Ontario/Canada

alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (08/23/90)

In article <1990Aug23.021922.13346@watcgl.waterloo.edu>, idallen@watcgl.waterloo.edu (Ian! D. Allen [CGL]) writes:
} Is what I'm doing here safe?
} 
} [ A long description of how two RA90s are dual ported between
}   two different ULTRIX systems. ]
} 
} My question is: is giving each 5400 access simultaneously to a disk going
} to cause problems, even though I don't actually mount any disk on more
} than one CPU?  That is, should I be running with both AB switches
} enabled, or should I really only enable access to one cpu at a time?

	First, a standard semi-official comment.  This is probably
	an untested and therefore unsupported configuration.  If it
	breaks or doesn't work as expected don't be surprised when
	a DEC support person says, "Sorry, not our problem".

	Now for a more useful answer.  One of the nice things about
	the RA series disk is that the A and B ports appear to be
	mutually exclusive.  If you have a disk mounted through one
	you CAN'T get at it from the other.  I say "appear" because I
	can't quote chapter and verse from some specification that
	this is the way it will ALWAYS work.  It's probably a feature
	of the hardware electronics, but it may be possible for it to
	break.

	If you pay a great deal of attention to which system has a
	disk mounted and only try touching it when the other doesn't
	see it, then you'll probably be safe.  There will come a point
	when both systems will be able to see the drives, but neither
	has it mounted.  For example:

	System A crashes at some obnoxious hour of the morning and upon
	releasing the port (probably via a controller timeout) system
	B sees the drives.  Nobody is around to do the manual failover
	procedure though and in a few minutes A reboots and sees the
	drives...  Now you have both systems able to access the drives.

	System A should, as part of its reboot, fsck the file systems (assuming
	you have file systems on them) and will remount them when it 
	finishes coming up.  There is though a period of time when both 
	systems have equal access to the drives without either having 
	them mounted.  If the procedure is manual and the operators know
	what to expect when part of it fails, then you might not have
	too much of a problem.
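
The window described in this example can be made explicit with a small classifier. It is a sketch only: the labels, the function name, and the idea of tracking "visible to" and "mounted by" as sets are illustrative assumptions, not anything ULTRIX or the drives report.

```python
def classify(visible_to, mounted_by):
    """Give a coarse label for one disk.

    visible_to: set of systems that can currently get at the disk
    mounted_by: set of systems that have it mounted
    """
    if len(mounted_by) > 1:
        return "double-mounted"        # the case to avoid at all costs
    if mounted_by and visible_to - mounted_by:
        return "exclusivity-broken"    # a mounted disk is reachable elsewhere
    if len(visible_to) > 1:
        return "open-window"           # both can reach it, neither has mounted it
    return "exclusive"


# The crash-and-reboot example above, step by step (hypothetical systems A and B).
TIMELINE = [
    ("A running with the disk mounted",       {"A"},      {"A"}),
    ("A crashes, port released, B sees disk", {"B"},      set()),
    ("A reboots and sees the disk too",       {"A", "B"}, set()),
    ("A finishes fsck and remounts",          {"A"},      {"A"}),
]

if __name__ == "__main__":
    for step, visible_to, mounted_by in TIMELINE:
        print(f"{step}: {classify(visible_to, mounted_by)}")
```

A manual failover procedure would want to refuse to mount anything while a disk is in the "open-window" state until the operator has confirmed which system is supposed to own it.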

	I haven't had the opportunity to spend a lot of time with
	such a configuration so I don't know all the problems that
	might occur.  My personal inclination is that without a lot
	of testing (at each new release of ULTRIX) I'd ensure access
	from only one port at a time, by using the port buttons.
} 
} Having both AB enabled is a great convenience, since I don't have to
} be physically at the machine to push buttons and switch disks over if one
} cpu fails; I just mount the other disks.  But is letting both kernels
} know about the disks at the same time safe?

	"How safe?" is the real question.  Having two systems that can get
	to a disk doubles the chance that something can go wrong.  What
	if the hardware that makes A and B exclusive breaks at the same
	time one of the controllers also breaks and starts writing random
	bits?  Not very likely, but it could happen.  Actually you don't
	even have to have the disk break.  If both controllers break at
	about the same time, and one allows the disk to go offline, letting
	the other have access to it, you get the same result.

	Have I been negative enough?  I suspect it's probably safe
	enough, compared to other supported configurations.  I have
	two systems with access to a common HSC.  There is very little
	to prevent me from accidentally mounting a file system on a disk
	while the other already has it mounted.  V4.0 has hooks in radisk(8)
	for making this safer.

} -IAN! (Ian! D. Allen) idallen@watcgl.uwaterloo.ca idallen@watcgl.waterloo.edu
}  [129.97.128.64]  Computer Graphics Lab/University of Waterloo/Ontario/Canada


-- 
Alan Rollow				alan@nabeth.enet.dec.com

idallen@watcgl.waterloo.edu (Ian! D. Allen [CGL]) (08/26/90)

>	finishes coming up.  There is though a period of time when both 
>	systems have equal access to the drives without either having 
>	them mounted.
[...]
>	"How safe?" is the real question.  Having two systems that can get
>	to a disk doubles the chance that something can go wrong.  What
>	if the hardware that makes A and B exclusive breaks at the same
>	time one of the controllers also breaks and starts writing random
>	bits?  Not very likely, but it could happen.  Actually you don't
>	even have to have the disk break.  If both controllers break at
>	about the same time, and one allows the disk to go offline, letting
>	the other have access to it, you get the same result.

Yes, one 5400 usually ends up knowing about its own two disks and the
two on the second machine (which it sees when the second machine goes
down or reboots).  The second 5400 only knows about its own two disks
(because the first has its own disks mounted, preventing the second
machine from even finding them).

Since even when a kernel knows about all four disks, it only mounts
its local two, I think I'm pretty safe from having the same disk mounted
on two machines simultaneously by mistake.

I've had some mysterious "hang" situations with the machines.  I'm
running them now with only the A ports selected, to see if the hangs
recur.  I have the nasty feeling that having two kernels recognize the
same disk is causing problems, even if only one system actually mounts
the disk.  Perhaps there are things that kernels do even to unmounted
disks that would interfere with those disks while they are being mounted
and used by another kernel.  I can imagine that when a kernel goes to
find out if a disk is there, it might do something that would interfere
with the concurrent use of that disk by another kernel.  Or, a disk
might generate some message or interrupt to the kernel that would end
up being fielded by *both* kernels, and funny things might happen.

Oh well.  It would have been so convenient.
I'll try A+B mode again in a few days.
-- 
-IAN! (Ian! D. Allen) idallen@watcgl.uwaterloo.ca idallen@watcgl.waterloo.edu
 [129.97.128.64]  Computer Graphics Lab/University of Waterloo/Ontario/Canada