[comp.sys.apollo] Updating Large Rings

collins@nvpna1.UUCP (Donie Collins 44091) (09/28/87)

At our site we have roughly 120 Apollo stations, which are split into
4 rings connected by EtherBridges.  Most of the machines are used for
IC design and the rest are used to develop CAD software.

The operating system, local software and third-party software are
controlled by our system managers, so when we plan to go to a new
release of the OS (as we are doing now, for sr9.6) the upgrade is done
by the system managers on all machines.  Because of our environment we
must keep all production machines on the same level of the OS, i.e. at
the moment all machines are at 9.2.x.

We estimate that there will be at least a full week's disruption while
we upgrade all the machines.  I think that this is unacceptable.  Given
that Apollo has at least one major and two minor releases per year, we
are talking about three weeks per year.  God only knows what it will be
like for sr10.

I am interested in the reactions/experiences of other system managers
on this.  120 nodes is NOT a big site, especially by U.S. standards, so
how do system managers of large rings handle this problem?

-- 
Donal O'Coileain.    Go n-eirigh an bothar leat
..mcvax!prle!nvpna1!collins

krowitz@mit-richter.UUCP (David Krowitz) (09/29/87)

I have a much smaller ring, but I use a technique which may be useful
when doing large installations.  I read the release tapes of all the
base and optional software onto one of our DSPs and get it set up the
way we like it.  From that DSP I then install the software onto another
DSP (I now have two nodes with complete sets of software).  I then pop
up a second window and do installations to two more nodes
simultaneously.  Next I pop up two more windows and do four
installations at once.  At this point I've run out of space for
comfortably sized windows on my screen, so I grab another node and pop
up four windows on it.  Using two nodes I now do eight installations at
once, and I'm finished.  I've done 16 nodes in 4 steps (5 if you count
reading the original tapes), each of which takes somewhat less than 2
hours.  I could finish 128 nodes in 7 steps (8 counting the tape
reading) -- less than 16 hours for the complete installation.  The
final installation (64 nodes installing to 64 additional nodes)
requires four or five people, each controlling about 4 nodes with
about four windows on each node.  I can't keep track of more than about
a dozen installations at once by myself, but the installation scripts
are reasonably automatic once you get the first couple of installations
put together correctly.
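
In outline, one doubling round looks something like this (the node
names are hypothetical, and the exact install invocation depends on
the release):

     # //a already has the software; CRP onto it and install to //b:
     crp -on //a -me
     # ... in that window, run the install scripts with //b as the
     # target, answering the questions as they come up ...
     # Next round //a and //b each load one more node (2 -> 4), then
     # four load four (4 -> 8), and so on: after k rounds 2**k nodes
     # are done -- 4 rounds for 16 nodes, 7 rounds for 128.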


 -- David Krowitz

mit-erl!mit-kermit!krowitz@eddie.mit.edu
mit-erl!mit-kermit!krowitz@mit-eddie.arpa
krowitz@mit-mc.arpa
(in order of decreasing preference)

dennis@PEANUTS.NOSC.MIL (Dennis Cottel) (09/30/87)

[referring to Dave Krowitz's binary installation scheme...]

Clever, Dave.  But how do you answer all those questions endlessly
asked by the installation programs -- with a script made up ahead of
time?  If so, then every installation needs a different script, right?
This also assumes that every node gets the same software (which I've
just about decided is the only way to maintain sanity).

Still, it seems to me what you would like to have is a master network
configuration description.  Then an installation for Pascal, for
instance, would look at the network configuration to find which nodes
get it (and which only get links) and would go ahead and install it.
The installation program could use your binary installation scheme to
speed things up if the configuration allowed for it.
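
The description itself could be as simple as a flat file, one line per
node (this format is entirely made up):

     # node       install               links only
     //ic001      aegis ftn pas         cc
     //ic002      aegis pas             ftn cc
     //cad001     aegis ftn pas cc      -

The Pascal installer would then scan the file, install onto the nodes
that list "pas" in the install column, and just create links on the
others.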

	Dennis Cottel  Naval Ocean Systems Center, San Diego, CA  92152
	(619) 225-2406      dennis@nosc.MIL      sdcsvax!noscvax!dennis

lnz@rainbow-warrior.UUCP (Leonard Zubkoff) (09/30/87)

I can't say that I handle a large network, but I must admit that software
installation can be a significant undertaking even in a 10-15 node ring.  Since
our environment is mostly homogeneous, and we currently use most local disks
only for paging, I solved this problem as follows:

I prepare a prototype node, installing all software in a normal way, and
verifying that all our necessary supporting software operates properly on the
new release.  When I am ready to propagate the new release throughout the
network, I create one or more master tapes by dumping the prototype node to
cartridge tape.  I then restore the master tapes to each node in turn, either
from a local tape drive or via the network.  Minor fixups are required to
compensate for cloning the node under a new name, but I've been using this type
of procedure for almost a year without difficulties.  It typically lets
me completely replace the software on a node in about 45 minutes.
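
The heart of it is just a WBAK of the prototype's volume and an RBAK
on each target -- something like the following, though I am quoting
the option syntax from memory and it may not match your release
exactly:

     wbak / -dev ct -f 1 -sacl        # prototype: dump the whole volume
     rbak / -dev ct -f 1 -sacl -pdt   # each target: restore it wholesale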

Another approach which I have yet to try is to use COPY_TREE (CPT) to copy the
prototype node over the network.  I suspect that for nodes without local tape
drives, this would be faster than CRPing an RBAK from tape as I do now.
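
Untried, but presumably it would look something like this (//proto and
//target are hypothetical node names):

     crp -on //target -me       # shell on the freshly INVOLed target
     cpt //proto/sys /sys       # pull the system trees over the net;
     cpt //proto/com /com       # CPT carries the subsystem stamps
     cpt //proto/lib /lib       # along by default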

If the local disks are not to be INVOLed each time because they contain
user software beyond what cloning installs, I expect I would follow the
same procedure except for the INVOL step, and would then merge the new
system software into the old disk, replacing only the system
directories.

What do other people do to install software?  Do most people just run the
APOLLO install scripts for every node?

		Leonard Zubkoff
		LUCID, Incorporated

krowitz@mit-richter.UUCP (David Krowitz) (09/30/87)

Well, I do admit that I still have to sit in front of the node
for several hours, but since I'm generally doing about 4 installations
at a time (working from 4 windows on my node that have been CRP'd
onto the nodes which already contain the software) I don't get
quite so bored -- I have all of those questions to answer. While
one machine is doing, say, a Pascal installation I can be answering
the questions for another machine doing a Fortran installation.
It's still a pain in the rear -- but a shorter-lasting pain than
doing the installations one by one.


 -- David Krowitz

mit-erl!mit-kermit!krowitz@eddie.mit.edu
mit-erl!mit-kermit!krowitz@mit-eddie.arpa
krowitz@mit-mc.arpa
(in order of decreasing preference)

collins@nvpna1.UUCP (Donie Collins 44091) (10/01/87)

I have received a few helpful messages about my first posting.  The
general idea is to build a master copy and start multiple installations
from there.  This is OK for a small net, but how many simultaneous
installations can one hope to monitor without making mistakes?  Then
again, how many simultaneous installations can the net handle before
performance drops?  No one has mentioned sr10, when we will have to
INVOL ALL the disks.

I brought this point up at a recent meeting with
some people from various Apollo offices in Europe (England, France and The
Netherlands) and they finally agreed that this was a weak point in their
setup.

Something else that annoys me is the fact that it is impossible to
keep a homogeneous network.  Each time Apollo releases a new piece of
hardware we get a new version of Aegis to go with it.

Our net has grown from 4 to 120 nodes in the last 16 months, and in
that time we have received 9.2.3, 9.2.5+, 9.2.6, 9.2.7, 9.5.1 and 9.6.
Now I hear we need 9.6.1 for the DN4000 and that 9.7 will be released
in Europe sometime around January '88.  How do we keep up??
As I say, I've tried to talk to Apollo about this, but all I get is
another presentation on NCS!!!
-- 
Donal O'Coileain.    Go n-eirigh an bothar leat

"I believe in the bells of the Christchurch - Ringing out for this land
 I believe in the powers that be - But they won't overpower me" A Celebration

bares@apollo.uucp (Vittorio Bares) (10/01/87)

[referring to Leonard Zubkoff's installation scheme...]

This installation scheme is fine, but only under certain controlled
conditions.  For instance, at present the only way to preserve a
subsystem manager stamp on an object restored from magnetic tape using
RBAK is to include the -SACL option on each restore.  If you are
restoring a network which is 'open' (i.e. one that does not use
registries or does not protect its files using ACLs), this is not a
problem.  Otherwise, the node being restored inherits all the ACLs
presently on the tape.  If your network is one in which each developer
or group keeps their own protections on their disks, you would surely
receive complaints from people having to re-ACL their nodes each time
a restore was done.  With CPT you are safe in this respect, because CPT
carries subsystem stamps over to the new destination by default.
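
In other words, every restore would have to name the option
explicitly:

     rbak ... -sacl     # keeps the subsystem manager stamps, but also
                        # puts the tape's ACLs onto everything restored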

Secondly, what happens if there are links on the node you are restoring
that point to another volume?  RBAK and CPT will both copy through the
link, possibly leaving another node with inconsistent software, which
could crash that node.

Thirdly, what happens to template files which have been customized by
the node owner?  Do you have a scheme that helps users protect their
customized files from being overwritten by the standard template files?

As stated earlier, the scheme outlined by Leonard Zubkoff is
undoubtedly quicker than the standard Apollo scripts, but there is no
such thing as a free lunch :-) -- this scheme may not be applicable to
all Apollo networks.

Cheers,

    Vittorio Bares
    bares@apollo.UUCP

bob@sdcsvax.UCSD.EDU (Robert Hofkin) (10/05/87)

The 9.2.x -> 9.5.1 upgrade was the first one we did.  Our system
manager built "one true copy" of the installed system, then he hacked
up the install scripts so they ran more or less unattended.  We cut
the network traffic by installing a few selected nodes on the first
pass, then partitioning the network into one ring per "selected" node.
(It seemed that 10 installs on one ring was about right.)  We managed
to upgrade 140 or so nodes in a weekend.

By the way, our system manager gave a talk on this very subject at
ADUS.  If your boss didn't let you go, tell him/her how stupidly
shortsighted he/she/it is!

lnz@edsel.UUCP (Leonard Zubkoff) (10/13/87)

   Date: 1 Oct 87 18:01:00 GMT
   From: navajo!apollo!bares@eddie.mit.edu  (Vittorio Bares)
   Organization: Apollo Computer, Chelmsford, Mass.
   Sender: navajo!apollo-request@YALE.ARPA

   [referring to Leonard Zubkoff's installation scheme...]

   This installation scheme is fine, but only under certain controlled
   conditions.  For instance, at present the only way to preserve a
   subsystem manager stamp on an object restored from magnetic tape
   using RBAK is to include the -SACL option on each restore.  If you
   are restoring a network which is 'open' (i.e. one that does not use
   registries or does not protect its files using ACLs), this is not a
   problem.  Otherwise, the node being restored inherits all the ACLs
   presently on the tape.  If your network is one in which each
   developer or group keeps their own protections on their disks, you
   would surely receive complaints from people having to re-ACL their
   nodes each time a restore was done.  With CPT you are safe in this
   respect, because CPT carries subsystem stamps over to the new
   destination by default.

Exactly.  I decided that the reduced maintenance overhead in our
environment would be worth the trouble of enforcing uniformity.  While
individual groups keep their own directories protected as they please,
I install the system software with secure system ACLs.  We let anyone
who needs it know a root password, but I strongly encourage them not to
mung the system software.  The RBAK command does indeed use -sacl and
-pdt.  I left most of the exact details of the procedure out,
preferring instead to explain the basic idea.  I'll happily mail out an
example if anyone requests it.

A more obnoxious problem is that the local registry must be recreated.

   Secondly, what happens if there are links on the node you are
   restoring that point to another volume?  RBAK and CPT will both copy
   through the link, possibly leaving another node with inconsistent
   software, which could crash that node.

I normally do this particular restore procedure to virgin disks, as a
way of loading up the system software cheaply.  In the past, most of
our machines had only small disks, so only system software and paging
space came from the local disk; with the larger disks that people now
use for user software, I haven't yet decided on the best procedure.

   Thirdly, what happens to template files which have been customized
   by the node owner?  Do you have a scheme that helps users protect
   their customized files from being overwritten by the standard
   template files?

Actually, we do not allow node owners to customize the startup scripts.
Rather, for anything other than starting up system servers, the login
startup scripts expect the user's login script to create the initial
shell.  The system software is uniform; the user side is not.  The few
node-specific files in `node_data (thishost, networks, siomonit_file
(if present) and startup_sio.sh (if present)) are saved away and
restored at the end of the procedure.
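
The save/restore step is nothing fancy -- roughly the following, where
//master/saves is a hypothetical stash and `node_data resolves to
/sys/node_data on a disked node:

     cpf //target/sys/node_data/thishost //master/saves/thishost
     cpf //target/sys/node_data/networks //master/saves/networks
     # ... INVOL and RBAK the target as described above ...
     cpf //master/saves/thishost //target/sys/node_data/thishost
     cpf //master/saves/networks //target/sys/node_data/networks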

   As stated earlier, the scheme outlined by Leonard Zubkoff is
   undoubtedly quicker than the standard Apollo scripts, but there is
   no such thing as a free lunch :-) -- this scheme may not be
   applicable to all Apollo networks.

I concur.  It took several tries to make this whole process work, but
it does have advantages where it is applicable.  For example, the same
technology allows me to dump the system software from a standalone
machine so that I can restore it without the normal installation, and
without needing an Apollo bootable cartridge tape (I make my own as
part of the dump).  I used this when I knew I would be having a disk
replaced on a standalone node, and again at the AAAI show to clone a
copy of an environment onto a node we would be demonstrating on.

   Cheers,

       Vittorio Bares
       bares@apollo.UUCP