CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (07/21/87)
Information From TSO Financial - The Saga Continues...
Chapter 13 - July 20, 1987 (Monday)

The following is a horror story. Names have NOT been changed to protect the GUILTY.

We have a VAXCluster that has been the topic of many previous blips from me and has provided yet another basis to write on. Our disk farm, which consists of SI83C and RA-81 disks, is dual-ported, for the most part, between a HSC50 and a HSC70. Because the HSC70 has 7 requestor boards, a secondary power supply is required and was installed some time ago.

Recently the HSC70 has been acting FLAKY. For example, the CRT does not respond, and the only way to get it to respond is to turn the HSC off and back on. The problem here is that in so doing the HSC performs a REBOOT. The result is a TON of paper used on each system console in the VAXCluster, printing all the mount verification messages from all six systems in the cluster. I would love a couple of private minutes with the individual(s) who made the decision that the OPCOM messages NEED to be printed on ALL consoles, regardless of system origination.

But as I said in the beginning, MOST of the disks are dual-ported. The problem is that when the HSC70 rebooted and the disks failed over, the I/O load and one disk drive that is acting up CRASHED the HSC50. The result is a RACE to see which completes first, MOUNT VERIFICATION TIMEOUT or a BOOT FROM A TU58. This is known as a CLOSE CALL.

It is also the basis for my suggesting that people change the SYSGEN parameter MVTIMEOUT to something HUGE, like 65000, which (at one second per unit) is about 18 hours before a timeout occurs. This is needed considering that a TU58 is SLOW. Make this change in the MODPARAMS.DAT file and then run AUTOGEN from SAVPARAMS through SETPARAMS. MVTIMEOUT is a DYNAMIC parameter and will save your butt as soon as it's set.

Getting back to the HSC70, which was acting flaky: a service call was placed on it at 9:45 AM.
At 13:45 I called the local unit manager and threatened to blow his house up and make ONE MORE CALL, which would NOT be to him. Someone else immediately called back and gave me an update. The contract calls for four (4) hours to BE IN MY FACILITY; a phone call did NOT make the grade. In the past I have even logged a HSC and disk problem against my 8700's hoping to get two (2) hour response, but no dice, yet.

At 14:30 the 'last available person' showed up. You know these types, they are the ones who immediately after introducing themselves call RDC to find out how to open the gizmo up and LOOK at the boards. Anyway, the problem to start with was a fan in the secondary power supply which was not working. Now this fan is INSIDE the power supply, and to fix the fan, the ENTIRE secondary power supply has to be replaced. To take the power supply out, the front AND back doors have to come off, which is tricky considering that there is cabling for the infamous 'ENABLE/SECURE' switch. Once the doors are off, the assembly simply slides out and the new one slides in. Putting the front door BACK ON takes 15 minutes due to having the damn ENABLE/SECURE switch covering the pin that is used to hinge the top of the door.

Now the supply is working and the fan is merrily spinning. I then raised the question of heat-sensitive boards, given that the fan had not been spinning, and after more calls to RDC the CPU card was swapped as a 'precautionary' measure. The reason being that the CPU is in the slot that is closest to the power supplies. Now I get to wait and see if more problems come up with the HSC70. The question here is: where is a sensor to detect air flow and temperature in the secondary power supply, like the one in the primary supply??

The flaky disk drive is a SI83C controller problem. Now BEFORE everyone gets the screamies, the problem has a fix, and I have to limp along till Sunday when I get the machines to myself. Or sooner if everything dies on me.
The problem here is that SI has a controller between the FUJI drives and the HSC requestor cards. The firmware has a bug in it that results in error messages being sent to a HSC which does not have the drive selected. The result, if there are a lot of errors, can be a CRASHED HSC. The HSC tries to handle the problem by invoking ILDISK, and ILDISK says 'what drive, I see NO drive'. The result gets ugly. The fix IS available, and anyone who currently has SI83C drives is advised to get it. There are new front panels that are part of the upgrade also. I am writing an article on the SI83C drives and my experiences with them to date. Stay tuned.

The next item is that anyone with HSC's should be at 3.5 of the HSC code. The 3.0 code reported a TON of error messages and 3.5 does not. The release notes say that a number of SDI errors are handled better/correctly now. Who knows, they may just be not printing the messages to make things look better. I used to get a LOT of messages BEFORE the SI83C drives arrived, so the errors can NOT be chalked up to them.

The next item is an unfortunate installation of a SNA Gateway to 'ALLOW' us access to an IBM shop. I tried to blow up the delivery van, but the driver said his insurance would not replace the truck. The model we ordered AND received was the DECSA-FA, which is a PDP 11/23 in a nondescript brown box. The box is the same one used for the terminal server version of the DECSA as well as an Ethernet router. The difference is the software loaded into it. The things that bother me are the following.

1. The box is about 15 inches high, 24 inches wide and 24 inches deep. There are indents on the top of the box in which to place the rubber feet of another DECSA so that they can be 'stacked'. Okay so far. There are two (2) CABLE troughs, one per side, at the bottom of the cabinet on the outside edges.
These are for the cables to pass through BECAUSE the line cards that the DECSA uses are put in from the FRONT and the DB25 or whatever type connector is ON the line card. You MIGHT be saying okay then, the DECSA can be put against a wall, out of the way, since the line card cables are attached in the front. WRONG. The ETHERNET connector, which is how we are connecting the DECSA to our systems, is ON THE BACK, along with the POWER PLUG, CABLE, FUSE and BREAKER. I like a clean design.

2. The only way for the DECSA to tell the world that there are problems, when no software is loaded and no outside connections are made, is through a 4 digit LED display on the front of the machine. This might not sound bad, but consider that the SNA Gateway software is loaded by local DEC, then a group from DEC Corp comes down to us and configures it to talk with IBM. Therefore the initial hardware installation, which is also done by local DEC, can only use the 4 digit display.

The test sequence takes close to 30 minutes. Just BEFORE the end of the test, the LED's FLASH and the next information displayed is the Ethernet address that the DECSA has defined for use. Now the lights flash to tell you it's coming, but if you miss it, you wait another 30 minutes, trying to stay awake, and catch it the next time around. It's also tough reading the display when the address is in HEX and a decimal '6' shows up next to a HEX 'B', which is displayed as a lowercase letter. All this in a seven segment display.

The CE also said that if they get tired of looking at the LED's for the address, they connect the system up to Ethernet and let it scream over the network in search of a LOAD HOST. The resulting error messages, also printed on every VAX console in the network, display the Ethernet address. A solution that definitely LACKS class.

And we paid $15,000.00 for the BOX; the software was extra. I can hardly wait for the next step in the installation. I am sure the future steps will provide fodder for the cannons.
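To see why a '6' and a lowercase 'b' are so easy to confuse on a seven segment display, here is a minimal Python sketch. It assumes the conventional segment naming (a = top, b = upper right, c = lower right, d = bottom, e = lower left, f = upper left, g = middle) and the common glyph shapes for the two characters; the actual patterns used by the DECSA front panel are not documented in this post, so treat the table as an assumption.

```python
# Assumed glyphs (conventional seven segment shapes, not taken from DECSA docs).
# Segment names: a=top, b=upper right, c=lower right, d=bottom,
#                e=lower left, f=upper left, g=middle.
SEGMENTS = {
    "6": set("acdefg"),
    "b": set("cdefg"),
}


def differing_segments(x: str, y: str) -> set:
    """Return the segments that are lit in one glyph but not the other."""
    return SEGMENTS[x] ^ SEGMENTS[y]


def render(ch: str) -> list:
    """Draw a glyph as three rows of ASCII art."""
    on = SEGMENTS[ch]
    top = " " + ("_" if "a" in on else " ") + " "
    mid = ("|" if "f" in on else " ") + ("_" if "g" in on else " ") + ("|" if "b" in on else " ")
    bot = ("|" if "e" in on else " ") + ("_" if "d" in on else " ") + ("|" if "c" in on else " ")
    return [top, mid, bot]


if __name__ == "__main__":
    for ch in ("6", "b"):
        print(ch)
        print("\n".join(render(ch)))
    print("segments that differ:", differing_segments("6", "b"))
```

Under these assumptions the two glyphs differ only in the top segment, and some displays draw '6' without that segment at all, in which case the two would be identical.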
Stay tuned.

All these neat things are happening to a site in a district that has received the 'Excellence' award for Field Service within DEC, and that has also made it POSSIBLE for our sales lady to vacation in Hawaii, compliments of DEC, for reaching far and above the sales targets. I would hate to be in another district; who knows what service we would be getting??

I love Mondays.. :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU
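As a footnote to the MVTIMEOUT suggestion above, the conversion from the parameter value to wall-clock time can be checked with a few lines of Python. This is only a sketch and assumes MVTIMEOUT is expressed in seconds; the function names are made up for illustration.

```python
# Assumption: the SYSGEN parameter MVTIMEOUT is a count of seconds.

def mvtimeout_hours(seconds: int) -> float:
    """Convert an MVTIMEOUT value (assumed seconds) to hours."""
    return seconds / 3600


def describe(seconds: int) -> str:
    """Format an MVTIMEOUT value as hours, minutes and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    value = 65000
    print(f"MVTIMEOUT={value} -> {describe(value)} ({mvtimeout_hours(value):.2f} hours)")
```

With the suggested value of 65000 this works out to a little over 18 hours, which should be plenty of time for an HSC to boot from a TU58.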
SIT.BUSH@CU20B.COLUMBIA.EDU (Nick Bush) (07/22/87)
Actually, the PDP11 in the DECSA box does have a console terminal port. If you remove the trim panel on the front of the box, there is a DB25 connector there which is the PDP11's console port. Of course, there isn't much you can do with it if you don't have a system set up to load something into the DECSA; the only things in ROM are the minimal diags and ODT. One useful purpose it does serve is displaying real error messages from the loadable diagnostics.

- Nick Bush
Sterling-Winthrop Research Institute
Rensselaer, NY
ARPA: SIT.BUSH@CU20B.COLUMBIA.EDU
BITNET: SIT.BUSH@CU20B
-------