[comp.os.vms] Horrors Continue. Subtitle = When Do You Declare A Disaster??

CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (08/23/87)

Information From TSO Financial - The Saga Continues...
Chapter 19 - August 22, 1987

I am giving serious consideration to renaming my computer room the 'Little 
House Of Horrors'. This chapter is a continuation of Chapters 13 and 15. It is 
also the basis for my wondering at what point you declare a disaster and go 
someplace else and start anew.

My HSC70 appears to be running well, the HSC50 is still making everyone 
question its sanity, one 11/785 is now running a system disk all by itself (and 
trashing it on a regular basis), the SNA Gateway is functioning (though the IBM 
systems it's connected to are having problems), I have a loaner RA81 
disk from DEC to make up for the additional system disk THEY asked me to make 
and use, and the new 8350 is shipping TODAY.

We have had 'Area Support' in conjunction with local support in our computer 
room for the past three weeks now and things have not settled down. The 
details of the horrors are as follows.

The HSC and disk problems were detailed earlier; the recent sequence is 
one HDA replacement, one whole RA81 replacement (as in everything but the 
slide racks goes), six requestor cards, three L100 HSC/CI link cards, 
and two HSC CPU cards. Two other disks need a format/verify to clean 
them up, and then it 'should' be okay. Maybe. 

One 11/785 has been booted from its own system disk in the hope of limiting 
the damage when it gets trashed, which it still does, and the FPA was pulled 
in hopes of eliminating the corruption problem. It seems that the FPA failed 
some diag tests while VMS was still running, so DEC suggested the FPA be 
replaced. As they had NO spares at the time (another sore point), the 
FPA was simply pulled and the machine rebooted. We are waiting for the 
corruption to show up again, at which point they want to swap the entire CI 
interface. For some reason DEC seems satisfied that the problem is not with the 
HSC50, on which this is the ONLY disk being worked. I have NO confidence in the 
HSC50, so the rest of the farm is on the HSC70. I also have to wonder about the 
SNA Gateway software, but more on that later. So now I am burdened with two 
system disks on which to maintain the production boot files and executables, 
and the duration of this test is not known. It has also come to light that the 
FPA diags always report failure on certain tests when run while VMS is still 
up. This was learned only after the FPA had been pulled and we had suffered 
the down time needed to pull it. And DEC wonders why I have no faith in their diags??
Sigh...

The SNA Gateway software was a REAL TRIP to get loaded and working. First, I 
am to this day not convinced it is not part of my problems. I chose the 
11/785 that DEC has since asked me to boot from its own disk, as the load host 
for the SNA Gateway software. The physical Gateway is located in my computer 
room at Horsham, Pa. The hardware was installed by the local Blue Bell office 
of DEC, which is 25 minutes away. The software was installed by a guy in the 
Washington DC office of DEC who traveled 4+ hours to get to Horsham. The final 
configuration setup and checkout was done by a lady from the Boston, Mass. ACS 
group of DEC. She wasted a day and flew in. Now I know why the cost was over 
$25K. Anyway, before the ACS gal showed up, we were having problems getting 
the Gateway to load the third and final file, THE OPERATING SYSTEM. The first 
two are small, less than 20 blocks each, while the third is over 1,000 blocks. 
The errors we were getting from DECnet had the message 'Device Timeout'. 
After trying various configurations, it was found to load without problems if the 
Gateway and the 11/785 were the ONLY two things on a DELNI. All other 
configurations failed with the timeout error. We also have Terminal Server 
Manager (TSM) and it was showing timeout errors when talking with some 
terminal servers on the network. This network is rather large, going through 
T1s and Vitalink Ethernet bridges to eight buildings along the east coast. 
The network appeared to have no problems except for TSM and the Gateway. After 
spending a week of nights between 23:00 and 01:00, we got it down to one 
T1/bridge that was causing the problem. We then went crawling around ceilings 
and floors and found that one Vitalink bridge had a hardware problem, and 
another Vitalink bridge had software problems due to crapped-up floppy load 
media. The strange part 
is that with all these problems, this link maintained LAT communications. Once 
this was corrected, the Gateway loaded all three files without problems and 
then the ACS group came in to finish the job. It only took a couple of hours, 
and all was working on our end. The IBMs on the other end were having 
problems of their own. I have to wonder if the condition is contagious??
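
For anyone chasing similar DECnet timeouts, the isolation work boils down to
zeroing counters, pushing traffic through the suspect path, and watching what
the failure counters do. A rough sketch of that kind of NCP session follows;
the circuit name UNA-0 and the node name FARVAX are placeholders, not our
actual configuration.

  $ MCR NCP
  NCP> ZERO CIRCUIT UNA-0 COUNTERS
  NCP> LOOP NODE FARVAX COUNT 100 LENGTH 512
  NCP> SHOW CIRCUIT UNA-0 COUNTERS
  NCP> SHOW LINE UNA-0 COUNTERS
  NCP> EXIT

Loop tests with big blocks are what make an intermittent link show itself in
the counters; small LAT frames can slip through unmolested, which may be why
LAT kept right on working across the sick bridge.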

So all was fine with the Gateway and the ACS gal wanted to go home. I raised 
the issue about having her install BUT not enable the Gateway software on the 
'normal' system disk, since all this was done to a system disk that is going to 
be trashed (hopefully). She asked around and the decision was NOT to install 
the software, citing license issues. Now this I feel is TOTALLY inappropriate 
considering it was DEC who wanted me to run the current system disk 
configuration. I did not win the argument at the time so she went home and the 
software is loaded and running from a disk that I have every intention of 
trashing. My local office, both sales and field service, has been made aware 
of my feelings on this, and they will foot the bill to have everyone back on a 
return visit when the time comes. I knew the Gateway was trouble. Should have 
blown up the delivery truck.

The 'normal' display on the front of the Gateway consists of two circles, side
by side, with different segments of the LEDs lighting up in sequence. The 
appearance is that of two circles NEVER touching but always going around one
another. I think it fits DEC and IBM as corporations perfectly. I have to 
wonder if this was a marketing decision or whether a programmer stumbled into it??

Then my HIGH capacity tape drives from Emulex showed up. The ones I am getting 
hold 650+MB (formatted) per cartridge. They use the Emulex TC13 controller and 
look like an 'MS' device to VMS. Well, the guy who showed up to do the 
installation was not a UNIBUS address guru, and the result was a machine down 
all day and still no tape drives. The UNIBUS these first drives are going on 
has a TS11 already on it, plus a couple of other devices that you have to do 
the 'connect' for yourself, and he could not understand what was going on. They 
try again on Monday, and another person who knows about UNIBUS addressing is 
scheduled to show up. They are slick little drives and cartridges. More on these 
in the future.
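
For those who have not played this game: the 'connect' in question is the
SYSGEN CONNECT you do for any UNIBUS device that autoconfigure does not handle.
A rough sketch is below, assuming the TC13 really does come up as a second MS
(TS11-class) controller under TSDRIVER; the CSR and vector shown are
placeholders, since computing the correct floating addresses given everything
else already on that UNIBUS is exactly the part that needs the guru.

  $ MCR SYSGEN
  SYSGEN> SHOW/ADAPTER
  SYSGEN> CONNECT MSB0 /ADAPTER=UB0 /CSR=%O760404 /VECTOR=%O300 /DRIVER=TSDRIVER
  SYSGEN> EXIT

Get the floating CSR or vector wrong and the driver either never sees the
controller or steps on something else on the bus, which is how you end up with
a machine down all day and still no tape drives.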

Then there was the lunch with the VP of Field Service from System Industries 
to go over some recent problems with the local office and the new SI83/93 
drives that I have. There was also the final test of a product we beta tested 
for DEC, to be announced at DECWorld, that should help some of you out there. 
More on this after DECWorld. Then there was a couple of days' vacation to 
recover from the tribulations and to see if TSO still needed me. They didn't. 
Luckily I am not 
looking for job security.

The question remains. At what point do you declare a disaster??

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU