curley@wharton-10.UUCP (03/05/87)
War Stories from TSO Financial.

First, the introduction. My name is Paul D. Clayton and I work for TSO Financial Corp. in Horsham, Pennsylvania. At the urging of Bob Curley, whose account is being used as the point of entry to INFO-VAX, I have written this account of 'lessons learned the hard way' over the last three weeks, for general information and comment.

Our computers: one 8700, one 8500, one 8200, three 785's and one 750. Disk drives: one HSC50, one HSC70, eighteen RA-81's, four RA-60's and some SI gear. All but the 8200 and 750 are connected to a star coupler and attempting to perform various functions, either separately or in tandem.

I have discovered the following in the past three weeks:

1. Late last year we ordered a memory upgrade option for the 8500 to increase the usable memory to 44MB from the shipped size of 20MB. Clearly a step in the right direction in light of the size of VMS and the software/users we are running. The memory upgrade package consists of two 16MB memory arrays and a new revision of the memory backplane. We have the second 'production' unit off the 8500 assembly line, and the result was many installation problems. After three attempts, spread over two months (the last of which included tearing the machine apart and then finding a required element was missing), the upgrade was completed. An interesting note here is that while the documentation highlights the various revision levels of the diags and MCL card throughout, NOWHERE does it discuss the required revision level of the PRO 380 console software. It turns out that the memory upgrade REQUIRES version 4.0 of the PRO 380 console. Another note is that the Version 4 software is distributed to your field service office, not to you, and consists of 17 floppies that take 3+ hours to install. The obvious question here is why on earth it is not distributed on RD disk media, so that only a ten-minute swap of disk drives would be required.
Another point is that having more than one 16MB memory card on an 8500 CPU with the 'right' amount of user load will cause the system to belly up. The current fixes coming from DEC are an updated microcode version for the 8500 CPU, and two (or more) 16MB memory arrays that have been 'tuned' to one another.

Moral of this section: BEWARE the box that contains parts and documentation marked 'PRELIMINARY'. In other words, wait for version 2 of everything.

2. During the course of the last two episodes of attempted installation of the memory upgrade, the machine went through a truly gut-wrenching series of teardowns and rebuilds. For the first three days after the final rebuild, every morning between 8:45 and 9:30 the CPU would display the message dear to every system manager's heart: 'CPU DBL ERR HALT'. After three days of constant surveillance by us and the DEC CEs, I had a brainstorm. Attempting, and failing, to keep my cool, I strutted down the hall on the second floor just outside the computer room with ALL the weight my 175 pounds could muster. Lo and behold, the console read 'CPU DBL ERR HALT'. After booting the machine back to life, I grabbed a DEC CE and together we strode forcefully down the hall in an attempt to prove my idea. To our glee, upon entering the computer room there waiting for us was the infamous 'CPU DBL ERR HALT'. When asked to explain how I came to the proper conclusion, I noted that for the past three days the data communications manager, who weighs over 210 pounds, and the voice communications manager, whose due date was any day, had both come in around those times. The result was three boards that were found to be vibration sensitive. Most of these boards were ones that had been removed by DEC to clean the edge connectors, since ZIF connectors are used.
A couple of notes here: first, we found a 'QA' sticker covering part of one pad that a ZIF connector would touch; second is the large use of 'cooling towers' on top of the various chips in order to keep them cool. By my reasoning, these cooling towers can put excessive vibrational loads on the chips, causing premature failures.

Moral of this section: BEWARE of oversized people and pregnant women who waddle around a computer room, if the computer room is not on a slab on the ground.

3. Concurrently, during the three days of ups and downs, we were pushing our local office for a replacement CPU, since ours was clearly not ready for long-term workloads. The mean time between failures was two hours, for three days. After many more calls than should have been needed, including some by us to Maynard, a startling revelation came to light. After a world-wide search for a replacement 8500 CPU, the answer came back that there was NONE to be had. We order everything with DEC's 'Recover-All' option, and thoughts of pushing the damn CPU down a stairwell were starting to enter our dreams. The result was that we had to wait until the next 'build' cycle was completed at DEC for a replacement CPU to be shipped to us. The 'build/test/burn-in' cycle for the 8500 is four (4) weeks. The only thing that saved us was that DEC was in the fourth week of the build cycle, and we needed only to wait for the end of the week to get a replacement. It became VERY clear that we would have been in a ton of the brown stuff had our failure happened three weeks earlier. It has since been indicated to us that having 'Recover-All' does NOT mean instant replacement of parts that are no longer functioning, for whatever reason. You have to wait your turn and NO guarantees are given on the speed of delivery.

Moral of this section: BEWARE the DEC sales person who spouts that RECOVER-ALL is God's answer to disaster recovery.

4.
At the same time as all this, I also upgraded to VAX/VMS 4.5 to take advantage of disk shadowing through our HSCs. During the course of upgrading ALL our systems to 4.5, I encountered a problem with the 8200 system disk. VMSINSTAL kept spitting out messages that images were not at the expected revision levels. Researching the problem, I discovered that the system programmer had built the system disk from a cluster common disk using VMSKITBLD and answered the questions according to the instructions. The resulting system disk is, as it turns out, bootable but not upgradeable. You do not know you have a problem until it's way past the point of easy fixing. In talking to DEC in Colorado about this, it came to light that the 'correct' answer to the question about what to use for the input disk to VMSKITBLD should be 'SYS$SYSROOT', NOT 'SYS$SYSDEVICE'. The net result is having to do a complete rebuild of the disk from scratch and then attempting to update the various parts, such as the SYSUAF file, to get it current again.

Moral of this section: BEWARE VMSKITBLD when playing with fire such as cluster common system disks.

5. The whole point of going to version 4.5 of VMS was shadowing on the HSC controllers. Our local contract DEC guru, Mr. Al Piccolo, has found a number of interesting items concerning shadowing. The first is that version 1.0, the current version, provides a MAXIMUM of 4 shadow sets PER HSC. Now, we have more than 4 disks that we want to shadow, so the quick thought is: NO PROBLEM, we will put 4 shadow sets on one HSC and the rest on the other. I remind you that the HSC is a PDP-11 based machine and is subject to little problems like hardware failures and software crashes. That is why you buy two of the little creatures, at $50,000+ each, from your local DEC sales person, along with duplicate requester cards at $9,000+ each. This way the disks can 'fail over' to the other HSC when one goes down.
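Before going further, it is worth doing the back-of-the-envelope arithmetic on what that 4-sets-per-HSC limit implies when everything must fail over to one survivor. Here is a hypothetical Python sketch of that arithmetic; the model (all sets converge on the surviving HSC, sets beyond the limit are dropped) is my own reading, not anything DEC published:

```python
# Hypothetical model of HSC fail-over under the Version 1.0 shadowing limit.
# Assumption (mine, not DEC's): when one HSC dies, every shadow set tries to
# land on the surviving HSC, and any sets beyond the per-HSC maximum are lost.

MAX_SETS_PER_HSC = 4  # Version 1.0 limit described above

def after_failover(sets_on_a, sets_on_b):
    """Return (kept, lost) shadow-set counts after one HSC fails."""
    total = sets_on_a + sets_on_b
    kept = min(total, MAX_SETS_PER_HSC)
    return kept, total - kept

# Our situation: 4 sets on one HSC, 3 on the other, and one HSC goes down.
print(after_failover(4, 3))  # (4, 3)
```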
Now the interesting part: what happens when one HSC has 4 shadow sets, the other has 3 shadow sets, and one of the HSCs decides it is time to go to lunch for a while? The ONLY answer Mr. Piccolo could come up with from DEC internal is that any shadow sets over the maximum per HSC would go to never-never land, and you would be hard put to find them. SO MUCH FOR REDUNDANCY. They also suggested that we wait for Version 2.0, appearing soon, when the maximum goes to 7 shadow sets per HSC. Big whoop; a maximum of 7 per HSC in my reality means that I can ONLY have a TOTAL of 7 PER CLUSTER, since I have no control over which shadow set fails over to which HSC in the event of a failure. The other item is to FORGET doing automatic reboots when you use shadow sets, since you cannot control which HSC is going to be used when you issue the MOUNT command. The result here is that one disk of a set may be on HSC A while the other is on HSC B, an unsupported condition for shadowing at the present time, although support was hinted at for the future.

Moral of this section: BEWARE the incompatibilities between SHADOWING, for speed/integrity enhancement, and FAIL-OVER, for redundancy.

6. There was a little tidbit of information passed along by Mr. Piccolo concerning quorum disks as they are used in VAX/VMS in support of clusters. We are currently using the standard 'VOTES' in addition to having a quorum disk. It turns out that having a quorum disk enables each CPU in the cluster to place its 'MARK' in the quorum file, so that other members know who is out there, thereby avoiding the case of cluster partitioning. In our particular cluster configuration, some members mount certain disks while other members mount different disks. In the interest of security, the two 'types' of cluster members will never mount the same disks. The question then is where to put the quorum file.
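As background on why those quorum-disk votes matter: the rule, as I understand it from the VMS cluster documentation, is that QUORUM = (EXPECTED_VOTES + 2) / 2 with integer division, and the quorum disk's QDSKVOTES count toward the total. A hypothetical Python sketch of the arithmetic (the vote assignments below are invented for illustration):

```python
# Hypothetical sketch of VAXcluster quorum arithmetic.
# The formula QUORUM = (EXPECTED_VOTES + 2) // 2 is the documented rule;
# the vote assignments below are made up for illustration.

def quorum(expected_votes):
    return (expected_votes + 2) // 2

def cluster_runs(present_votes, expected_votes):
    """A cluster keeps running only while the present votes meet quorum."""
    return present_votes >= quorum(expected_votes)

# Two CPUs at 1 vote each plus a quorum disk at 1 vote (QDSKVOTES = 1):
EXPECTED = 3                          # quorum(3) == 2
print(cluster_runs(1 + 1, EXPECTED))  # True: one CPU plus the disk survive
print(cluster_runs(1, EXPECTED))      # False: a lone CPU cannot partition off
```

The quorum disk, in other words, is what lets either half of a two-member cluster keep running when the other half dies, which is exactly why every member must be able to read the file.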
It has ALWAYS been the STRONGEST advice from DEC that you do not touch a disk unless it is mounted, so as not to corrupt it. But I want NO disks to be mounted by both member types. HAVE NO FEAR, DEC IS HERE. VMS is 'smart' enough to interpret the ODS-2 structure of a disk even if said disk is not mounted by the system. This to me is truly scary stuff, and not to be attempted by the weak of heart. The sad part of this little foray is that while VMS has the wherewithal to understand unmounted disk structures, it does not place the data into the file in a format that RMS can understand. Hence the process abort when a hapless user issues a TYPE command on the cluster quorum file. The question here is: why not have VMS take it one step further and not have us mount ANY disks at all? Let's eliminate the overhead and drudgery of maintaining the command files that mount disks in the first place.

Moral for this section: BEWARE the system that starts to out-think itself in the interest of better abilities.

Enough for now. If you would like to discuss any points made above, I can be contacted at (215) 657-4000 Ext. 5187.

Paul D. Clayton
TSO Financial Corporation
Horsham, Pennsylvania

..........................................................................
. Bob Curley                                                             .
. Department of Radiation Therapy                                        .
. University of Pennsylvania                                             .
. P.O. Box 7806                        (voice) 215-662-3083              .
. Philadelphia, PA 19101               (ARPA ) CURLEY@WHARTON            .
..........................................................................