curley@wharton-10.UUCP (03/05/87)
War Stories from TSO Financial.

First, the introduction. My name is Paul D. Clayton and I work for TSO Financial Corp. in Horsham, Pennsylvania. At the urging of Bob Curley, whose account is being used as the point of entry to INFO-VAX, I have written this account of 'lessons learned the hard way' over the last three weeks, for general information and comment.

Our computers: one 8700, one 8500, one 8200, three 785's and one 750. Disk drives: one HSC50, one HSC70, eighteen RA-81's, four RA-60's and some SI gear. All but the 8200 and 750 are connected to a star coupler and attempting to perform various functions, either separately or in tandem.

I have discovered the following in the past three weeks:

1. Late last year we ordered a memory upgrade option for the 8500 to increase the usable memory to 44MB from the shipped size of 20MB. Clearly a step in the right direction in light of the size of VMS and the software/users we are running. The memory upgrade package consists of two 16MB memory arrays and a new revision of the memory backplane. We have the second 'production' unit off the 8500 assembly line, and the result was many installation problems. After three attempts, spread over two months (the last of which included tearing the machine apart and then finding a required element was missing), the upgrade was completed. An interesting note here is that while the documentation highlights the various revision levels of the diags and MCL card throughout, NOWHERE does it discuss the required revision level of the PRO 380 console software. It turns out that the memory upgrade REQUIRES version 4.0 of the PRO 380 console. Another note is that the Version 4 software is distributed to your field service office, not to you, and consists of 17 floppies that take 3+ hours to install. The obvious question here is why on earth it is not distributed on RD disk media, so that only a ten-minute swap of disk drives would be required.
Another point is that having more than one 16MB memory card on an 8500 CPU with the 'right' amount of user load will cause the system to belly up. The current fixes coming from DEC are an updated microcode version for the 8500 CPU, and two (or more) 16MB memory arrays that have been 'tuned' to one another.

Moral of this section: BEWARE the box that contains parts and documentation marked 'PRELIMINARY'. In other words, wait for version 2 of everything.

2. During the course of the last two episodes of attempted installation of the memory upgrade, the machine went through a truly gut-wrenching series of teardowns and rebuilds. For the first three days after the final rebuild, every morning between 8:45 and 9:30 the CPU would display the message dear to every system manager's heart: 'CPU DBL ERR HALT'. After three days of constant surveillance by us and the DEC CEs, I had a brainstorm. Attempting, and failing, to keep my cool, I strutted down the hall on the second floor just outside the computer room with ALL the weight my 175 pounds could muster. Lo and behold, the console read 'CPU DBL ERR HALT'. After booting the machine back to life, I grabbed a DEC CE and together we strode forcefully down the hall in an attempt to prove my idea. To our glee, upon entering the computer room there waiting for us was the infamous 'CPU DBL ERR HALT'. When asked to explain how I came to the proper conclusion, I noted that for the past three days the data communications manager, who weighs over 210 pounds, and the voice communications manager, whose due date was any day, had both come in around those times. The result was three boards that were found to be vibration sensitive. Most of these boards were ones that had been removed by DEC to clean the edge connectors, since ZIF connectors are used.
A couple of notes here: first, we found a 'QA' sticker covering part of one pad that a ZIF connector would touch; second is the large use of 'cooling towers' on top of the various chips in order to keep them cool. By my reasoning, these cooling towers can put excessive vibrational loads on the chips, causing premature failures.

Moral of this section: BEWARE of oversized people and pregnant women who waddle around a computer room, if the computer room is not on a slab on the ground.

3. Concurrently, during the three days of ups and downs, we were pushing our local office for a replacement CPU, since ours was clearly not ready for long-term workloads. The mean time between failures was two hours, for three days. After many more calls than should have been needed, including some by us to Maynard, a startling revelation came to light. After a world-wide search for a replacement 8500 CPU, the answer came back that there was NONE to be had. We order everything with DEC's 'Recover-All' option, and thoughts of pushing the damn CPU down a stairwell were starting to enter our dreams. The result was that we had to wait until the next 'build' cycle was completed at DEC for a replacement CPU to be shipped to us. The 'build/test/burn-in' cycle for the 8500 is four (4) weeks. The only thing that saved us was that DEC was in the fourth week of the build cycle, and we needed only to wait for the end of the week to get a replacement. It became VERY clear that we would have been in a ton of the brown stuff had our failure happened three weeks earlier. It has since been indicated to us that having 'Recover-All' does NOT mean instant replacement of parts that are no longer functioning, for whatever reason. You have to wait your turn and NO guarantees are given on the speed of delivery.

Moral of this section: BEWARE the DEC sales person who spouts that RECOVER-ALL is God's answer to disaster recovery.

4.
At the same time as all this, I also upgraded to VAX/VMS 4.5 to take advantage of disk shadowing through our HSCs. During the course of upgrading ALL our systems to 4.5, I encountered a problem with the 8200 system disk. VMSINSTAL kept spitting out messages that images were not at the expected revision levels. Researching the problem, I discovered that the system programmer had built the system disk from a cluster common disk using VMSKITBLD and answered the questions according to the instructions. The resulting system disk is, as it turns out, bootable but not upgradeable. You do not know you have a problem until it's way past the point of easy fixing. In talking to DEC in Colorado about this, it came to light that the 'correct' answer to the question about what to use for the input disk to VMSKITBLD should be 'SYS$SYSROOT', NOT 'SYS$SYSDEVICE'. The net result is having to do a complete rebuild of the disk from scratch and then attempting to update the various parts, such as the SYSUAF file, to get it current again.

Moral of this section: BEWARE VMSKITBLD when playing with fire such as cluster common system disks.

5. The whole point of going to version 4.5 of VMS was shadowing on the HSC controllers. Our local contract DEC guru, Mr. Al Piccolo, has found a number of interesting items concerning shadowing. The first is that version 1.0, the current version, provides a MAXIMUM of 4 shadow sets PER HSC. Now, we have more than 4 disks that we want to shadow, so the quick thought is: NO PROBLEM, we will put 4 shadow sets on one HSC and the rest on the other. I remind you that the HSC is a PDP-11 based machine and is subject to little problems like hardware failures and software crashes. That is why you buy two of the little creatures, at $50,000+ each, from your local DEC sales person, along with duplicate requester cards at $9,000+ each. This way the disks can 'fail over' to the other HSC when one goes down.
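Before going further, it is worth doing the back-of-the-envelope arithmetic on what that 4-sets-per-HSC limit implies when everything must fail over to one survivor. Here is a hypothetical Python sketch of that arithmetic; the model (all sets converge on the surviving HSC, sets beyond the limit are dropped) is my own reading, not anything DEC published:

```python
# Hypothetical model of HSC fail-over under the Version 1.0 shadowing limit.
# Assumption (mine, not DEC's): when one HSC dies, every shadow set tries to
# land on the surviving HSC, and any sets beyond the per-HSC maximum are lost.

MAX_SETS_PER_HSC = 4  # Version 1.0 limit described above

def after_failover(sets_on_a, sets_on_b):
    """Return (kept, lost) shadow-set counts after one HSC fails."""
    total = sets_on_a + sets_on_b
    kept = min(total, MAX_SETS_PER_HSC)
    return kept, total - kept

# Our situation: 4 sets on one HSC, 3 on the other, and one HSC goes down.
print(after_failover(4, 3))  # (4, 3)
```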
Now the interesting part: what happens when one HSC has 4 shadow sets, the other has 3 shadow sets, and one of the HSCs decides it is time to go to lunch for a while? The ONLY answer Mr. Piccolo could come up with from DEC internal is that any shadow sets over the maximum per HSC would go to never-never land, and you would be hard put to find them. SO MUCH FOR REDUNDANCY. They also suggested that we wait for Version 2.0, appearing soon, when the maximum goes to 7 shadow sets per HSC. Big whoop; a maximum of 7 per HSC in my reality means that I can ONLY have a TOTAL of 7 PER CLUSTER, since I have no control over which shadow set fails over to which HSC in the event of a failure. The other item is to FORGET doing automatic reboots when you use shadow sets, since you cannot control which HSC is going to be used when you issue the MOUNT command. The result here is that one disk of a set may be on HSC A while the other is on HSC B, an unsupported condition for shadowing at the present time, although support was hinted at for the future.

Moral of this section: BEWARE the incompatibilities between SHADOWING, for speed/integrity enhancement, and FAIL-OVER, for redundancy.

6. There was a little tidbit of information passed along by Mr. Piccolo concerning quorum disks as they are used in VAX/VMS in support of clusters. We are currently using the standard 'VOTES' in addition to having a quorum disk. It turns out that having a quorum disk enables each CPU in the cluster to place its 'MARK' in the quorum file, so that other members know who is out there, thereby avoiding the case of cluster partitioning. In our particular cluster configuration, some members mount certain disks while other members mount different disks. In the interest of security, the two 'types' of cluster members will never mount the same disks. The question then is where to put the quorum file.
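As background on why those quorum-disk votes matter: the rule, as I understand it from the VMS cluster documentation, is that QUORUM = (EXPECTED_VOTES + 2) / 2 with integer division, and the quorum disk's QDSKVOTES count toward the total. A hypothetical Python sketch of the arithmetic (the vote assignments below are invented for illustration):

```python
# Hypothetical sketch of VAXcluster quorum arithmetic.
# The formula QUORUM = (EXPECTED_VOTES + 2) // 2 is the documented rule;
# the vote assignments below are made up for illustration.

def quorum(expected_votes):
    return (expected_votes + 2) // 2

def cluster_runs(present_votes, expected_votes):
    """A cluster keeps running only while the present votes meet quorum."""
    return present_votes >= quorum(expected_votes)

# Two CPUs at 1 vote each plus a quorum disk at 1 vote (QDSKVOTES = 1):
EXPECTED = 3                          # quorum(3) == 2
print(cluster_runs(1 + 1, EXPECTED))  # True: one CPU plus the disk survive
print(cluster_runs(1, EXPECTED))      # False: a lone CPU cannot partition off
```

The quorum disk, in other words, is what lets either half of a two-member cluster keep running when the other half dies, which is exactly why every member must be able to read the file.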
It has ALWAYS been the STRONGEST advice from DEC that you do not touch a disk unless it is mounted, so as not to corrupt it. But I want NO disks to be mounted by both member types. HAVE NO FEAR, DEC IS HERE. VMS is 'smart' enough to interpret the ODS-2 structure of a disk even if said disk is not mounted by the system. This to me is truly scary stuff, and not to be attempted by the weak of heart. The sad part of this little foray is that while VMS has the wherewithal to understand unmounted disk structures, it does not place the data into the file in a format that RMS can understand. Hence the process abort when a hapless user issues a TYPE command on the cluster quorum file. The question here is: why not have VMS take it one step further and not have us mount ANY disks at all? Let's eliminate the overhead and drudgery of maintaining the command files that mount disks in the first place.

Moral for this section: BEWARE the system that starts to out-think itself in the interest of better abilities.

Enough for now. If you would like to discuss any points made above, I can be contacted at (215) 657-4000 Ext. 5187.

Paul D. Clayton
TSO Financial Corporation
Horsham, Pennsylvania

..........................................................................
. Bob Curley                                                             .
. Department of Radiation Therapy                                        .
. University of Pennsylvania                                             .
. P.O. Box 7806                        (voice) 215-662-3083              .
. Philadelphia, PA 19101               (ARPA ) CURLEY@WHARTON            .
..........................................................................