[comp.os.vms] Another Horror Chapter In The Continuing Saga. Subtitle=I Hate Mondays.

CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (07/21/87)

Information From TSO Financial - The Saga Continues...
Chapter 13 - July 20, 1987 (Monday)

The following is a horror story. Names have NOT been changed to protect the
GUILTY.

We have a VAXCluster that has been the topic of many previous blips from me 
and has provided yet another basis to write on.

Our disk farm, which consists of SI83C and RA-81 disks, is dual-ported, for the
most part, between an HSC50 and an HSC70. Because the HSC70 has 7 requestor
boards, a secondary power supply is required and was installed some time ago.
Recently the HSC70 has been acting FLAKY. For example, the CRT stops responding
and the only way to get it back is to turn the HSC off and back on. The problem
here is that doing so makes the HSC REBOOT. The result is a TON of paper
used on each system console in the VAXCluster printing all the mount 
verification messages from all six systems in the cluster. I would love a 
couple of private minutes with the individual(s) who made the decision that the
OPCOM messages NEED to be printed on ALL consoles, regardless of system 
origination. But as I said in the beginning MOST of the disks are dual ported.
The problem is that when the HSC70 rebooted and the disks failed over, the I/O
load and one disk drive that is acting up CRASHED the HSC50. The result is a 
RACE to see who completes first, MOUNT VERIFICATION TIMEOUT or a BOOT FROM A 
TU58. This is known as a CLOSE CALL. It is also the basis for my suggesting 
that people change the SYSGEN parameter MVTIMEOUT to something HUGE, like 
65000, which is roughly 18 hours before a timeout occurs. This is needed
considering that a TU58 is SLOW. Make this change in the MODPARAMS.DAT file and
then run AUTOGEN from SAVPARAMS through SETPARAMS. MVTIMEOUT is a DYNAMIC
parameter and will save your butt as soon as it's set.
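
As a rough check on the arithmetic, here is a short Python sketch that converts
a candidate MVTIMEOUT setting into hours. It assumes the parameter is counted
in seconds. The helper name and the two smaller sample values are made up for
illustration; only 65000 comes from the text above.

```python
SECONDS_PER_HOUR = 3600


def mvtimeout_hours(seconds):
    """Convert a timeout given in seconds into hours.

    Assumption: MVTIMEOUT is expressed in seconds.
    """
    return seconds / SECONDS_PER_HOUR


if __name__ == "__main__":
    # 600 and 3600 are illustrative comparison values, not site settings.
    for value in (600, 3600, 65000):
        print(f"MVTIMEOUT = {value:>5} -> {mvtimeout_hours(value):.2f} hours")
```

Under that assumption, 65000 leaves a good many hours of margin for a slow
TU58 boot to finish before mount verification gives up.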

Getting back to the HSC70, which was acting flaky: a service call was placed
on it at 9:45 AM. At 13:45 I called the local unit manager and
threatened to blow his house up and make ONE MORE CALL, which would NOT be to
him. Someone else immediately called back and gave me an update. The contract
calls for someone to BE IN MY FACILITY within four (4) hours; a phone call does
NOT make the grade. In the past I have even logged an HSC and disk problem
against my 8700's hoping to get two (2) hour response, but no dice, yet. At 14:30 the 'last 
available person' showed up. You know these types, they are the ones who 
immediately after introducing themselves call RDC to find out how to open
the gizmo up and LOOK at the boards. Anyway, the problem to start with was a 
fan in the secondary power supply which was not working. Now this fan is INSIDE
the power supply and to fix the fan, the ENTIRE secondary power supply has to 
be replaced. To take the power supply out, the front AND back doors have to come
off, which is tricky considering that there is cabling for the infamous
'ENABLE/SECURE' switch. Once the doors are off, the assembly simply slides out
and the new one slides in. Putting the front door BACK ON takes 15 minutes due
to having the damn ENABLE/SECURE switch covering the pin that is used to hinge
the top of the door. Now the supply is working and the fan is merrily spinning.
I then raised the question of heat damage to sensitive boards as a result of
the fan not spinning, and after more calls to RDC, the CPU card was swapped as
a 'precautionary' measure. The reason being that the CPU is in the slot that 
is closest to the power supplies. Now I get to wait and see if more problems 
come up with the HSC70. The question here is: where is the sensor to detect air
flow and temperature in the secondary power supply, like the one in the primary 
supply??

The flaky disk drive is a SI83C controller problem. Now BEFORE everyone gets
the screamies, the problem has a fix and I have to limp along till Sunday 
when I get the machines to myself. Or sooner if everything dies on me. The 
problem here is that SI has a controller between the FUJI drives and the HSC
requestor cards. The firmware has a bug in it that results in error messages
being sent to an HSC which does not have the drive selected. The result, if 
there are a lot of errors, can be a CRASHED HSC. The HSC tries to handle the
problem by invoking ILDISK and ILDISK says 'what drive, I see NO drive'. The
result gets ugly. The fix IS available and anyone who currently has SI83C drives
should get the fix. There are new front panels that are part of the
upgrade also. I am writing an article on the SI83C drives and my experiences 
with them to date. Stay tuned.

The next item is that anyone with HSC's should be at version 3.5 of the HSC
code. The 3.0 code reported a TON of error messages and 3.5 does not. The
release notes say that a number of SDI errors are handled better/correctly now.
Who knows, they may just be not printing the messages to make things look
better. I used to get a LOT of messages BEFORE the SI83C drives arrived so the
errors can NOT be chalked up to them.

The next item is an unfortunate installation of an SNA Gateway to 'ALLOW' us
access to an IBM shop. I tried to blow up the delivery van, but the driver 
said his insurance would not replace the truck. The model we ordered AND
received was the DECSA-FA, which is a PDP 11/23 in a nondescript brown box. The
box is the same one used for the terminal server version of the DECSA as well
as an Ethernet router. The difference is the software loaded into it. The things
that bother me are the following.

1. The box is about 15 inches high, 24 inches wide and 24 inches deep. There are
indents on the top of the box in which to place the rubber feet of another DECSA
so that they can be 'stacked'. Okay so far. There are two (2) CABLE troughs, one
per side, at the bottom of the cabinet on the outside edges. These are for the
cables to pass through BECAUSE the line cards that the DECSA uses are put in 
from the FRONT and the DB25 or whatever type connector is ON the line card. You
MIGHT be saying okay then, the DECSA can then be put against a wall, out of the
way since the line card cables are attached in the front. WRONG. The ETHERNET
connector, which is how we are connecting the DECSA to our systems, is ON THE
BACK, along with the POWER PLUG, CABLE, FUSE and BREAKER. I like a clean 
design.

2. The only indicator for the DECSA to tell the world if there are problems
when no software or outside connections are made is through a 4 digit LED
display in the front of the machine. This might not sound bad but, consider
that the SNA Gateway software is loaded by local DEC, then a group from DEC Corp
comes down to us and configures it to talk with IBM. Therefore the initial 
hardware installation which is also done by local DEC can only use the 4 digit
display. The test sequence takes close to 30 minutes. Just BEFORE the end of 
the test, the LED's FLASH and the next information displayed is the Ethernet
address that the DECSA has defined for use. Now the lights flash to tell you
it's coming, but if you miss it, you wait another 30 minutes, trying to stay
awake, and catch it the next time around. It's also tough reading the display
when the address is in HEX and a decimal '6' shows up alongside a HEX 'B', which
is displayed as a lowercase letter. All this in a seven segment display. The
CE also said that if they get tired of looking at the LED's for the address,
they connect the system up to Ethernet and let it scream over the network
in search of a LOAD HOST. The resulting error messages, also printed on every
VAX console in the network, display the Ethernet address. A solution that
definitely LACKS class. And we paid $15,000.00 for the BOX; the software
was extra. I can hardly wait for the next step in the installation. I am 
sure the future steps will provide fodder for the cannons. Stay tuned.
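
To illustrate why a '6' and a lowercase 'b' are easy to confuse on a seven
segment display, here is a small Python sketch. It assumes the conventional
segment labels a through g (a = top, b = upper right, c = lower right,
d = bottom, e = lower left, f = upper left, g = middle) and the common glyph
shapes for the two characters. The glyphs the DECSA actually uses are not
documented here, so treat the table as hypothetical.

```python
# Lit segments per character, using the conventional a-g labels.
# Assumption: common glyph shapes, not taken from DECSA documentation.
SEGMENTS = {
    "6": frozenset("acdefg"),
    "b": frozenset("cdefg"),
}


def differing_segments(x, y):
    """Return the segments lit for exactly one of the two characters."""
    return SEGMENTS[x] ^ SEGMENTS[y]


def render(ch):
    """Draw a character as three rows of ASCII art."""
    s = SEGMENTS[ch]
    top = " _ " if "a" in s else "   "
    mid = ("|" if "f" in s else " ") + ("_" if "g" in s else " ") + ("|" if "b" in s else " ")
    bot = ("|" if "e" in s else " ") + ("_" if "d" in s else " ") + ("|" if "c" in s else " ")
    return "\n".join([top, mid, bot])


if __name__ == "__main__":
    for ch in ("6", "b"):
        print(render(ch))
        print()
    print("segments that differ:", sorted(differing_segments("6", "b")))
```

With these shapes the two characters differ only in the top bar, which is
easy to miss at a glance.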

All these neat things are happening to a site that is in a district that has 
received the 'Excellence' award for Field Service within DEC and has also
made it POSSIBLE for our sales lady to vacation in Hawaii compliments of DEC
for reaching far and above the sales targets.

I would hate to be in another district; who knows what service we would be
getting??

I love Mondays.. :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU

SIT.BUSH@CU20B.COLUMBIA.EDU (Nick Bush) (07/22/87)

Actually, the PDP11 in the DECSA box does have a console terminal port.
If you remove the trim panel on the front of the box there is a DB25
connector there which is the PDP11's console port.  Of course, there isn't
much you can do with it if you don't have a system set up to load something
into the DECSA - the only things in ROM are the minimal diags and ODT.
One useful purpose it does serve is in displaying real error messages from
the loadable diagnostics.

- Nick Bush
  Sterling-Winthrop Research Institute
  Rensselaer, NY

ARPA: SIT.BUSH@CU20B.COLUMBIA.EDU
BITNET: SIT.BUSH@CU20B
-------