[comp.os.vms] Chapter 13 Continued... Subtitle = When It Rains It Pours.

CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (07/30/87)
Information From TSO Financial - The Saga Continues...
Chapter 15 - July 24, 1987 (Friday)

When it rains it pours. I have TRULY come to appreciate those words of wisdom
from my parents. I also believe that they have NO idea to what depths the
saying is applicable to the world of computers and system management. I will
only hit the high points (?) of the past five days of hell. There are things
to learn from this!!

Monday was documented in Chapter 13 of 'The Saga...' and therefore is skipped
over here.

Tuesday was somewhat fun. I placed an order for the 'cornerstone' of a second 
VAXCluster for use in performing an analysis of a 500K record file that will
'consume' three (3) SI93C drives for ONE copy of the file. That translates to
1.95 GigaBytes of storage. HHHHEEEELLLLLLOOOOO VOLUME SETS. The worst part is
the file has to be SORTED umpteen different ways. The processor being used, not
by choice, is an 8350 with 32MB of memory. Should be interesting to see what the
performance is and how SI93C drives handle. Also purchased Volume Shadow for
this machine. It should also make a nice test bed for VMS 5.0 and SMP checkout.
Now for the bad news. We have two Datagrafix 4800 ionization printers that do 
90+ pages per minute when working. The quality of print is very good and they 
handle forms overlays nicely. One of the reasons for choosing them was the
proximity of their office to ours, which is about 1.5 BLOCKS or 900 YARDS. One
printer broke on Friday of last week and was waiting for parts. On Monday, the
second printer broke, naturally, so we logged a service call at 13:00. By 
19:00, not a word had been heard so I logged another call. Tuesday morning I
called again and explicitly laid the facts on the table and inquired about
a remedy. After much researching on their part they pointed to phone logs,
which are maintained by Corp in Calif., and said that I must be mistaken. The
13:00 call was made at 15:00 and the 19:00 call was NEVER made, the printer
that was broken on Friday is fixed and that they ALWAYS are on site within 
1.5 HOURS of a service call. After I pointed out the errors in their logs and
their response time, and offered a PROBATIONARY period to get back in good
graces, they came right out to fix the printers. It's a shame, the printers 
are good but the service department has failed me on a number of occasions.

Also on Tuesday and Wednesday the VAXCluster has been playing 'ping-pong'
between HSC's in doing both HSC reboots, with NO messages displayed indicating
why the reboot is occurring, and passing disks back and forth. I was beginning
to think that DIGITAL was using me to test FAILOVER. Thank GOD it works. I 
have also placed a LARGE order for greenbar to use for ALL the console
messages that were printed out. We USED to have LA50 printers attached to the
PRO consoles to get a hardcopy. The problem is that the messages occur SO 
FREQUENTLY that the LA50 gives up and prints garbage. I can NOT read garbage so the
LA50's are not used anymore. I have ALSO come to the conclusion that the 
VAX CLUSTER CONSOLE (VCS) is DEC's response to the VOLUME of messages on the
consoles and they NEEDED a FASTER system to capture, and/or print, the 
console messages. The logic is as follows. Every system prints the messages
from EVERY other system in the cluster in ADDITION to its own. If the console
is slow, i.e. LA120, then OPCOM backs up. When OPCOM backs up, system managers 
do NOT get the message the CLUSTER IS DYING. By not getting the message, the
CLUSTER heads south, quickly. My 8700's clock 6 MIPS, while the LA120's and
the PRO console clock 120 characters per second. The PRO is probably doing
LESS than 120. CONSIDER the resulting damage from a crazed 8700. By the time
you get the message, you've had it. This is what BACKUP's are for.
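
To put rough numbers on the backlog argument, here is a minimal Python sketch.
Only the 120 characters per second figure comes from the text above. The node
count, message rate, and message length are made-up assumptions for
illustration, and the model is deliberately crude: every node's messages land
on one console that drains at a fixed rate.

```python
def console_backlog(nodes, msgs_per_node_per_sec, chars_per_msg,
                    console_cps=120, seconds=60):
    """Return characters still queued for the console after `seconds`.

    Assumes every node's messages are echoed on this one console and that
    the console drains at a fixed `console_cps` (assumed model, not OPCOM).
    """
    arrival_cps = nodes * msgs_per_node_per_sec * chars_per_msg
    backlog = 0.0
    for _ in range(seconds):
        # Queue grows by what arrives and shrinks by what the console prints.
        backlog = max(0.0, backlog + arrival_cps - console_cps)
    return backlog


def seconds_behind(backlog_chars, console_cps=120):
    """How long the console needs just to print what is already queued."""
    return backlog_chars / console_cps


# Hypothetical storm: 4 nodes, 1 message per second each, 80 characters each.
queued = console_backlog(nodes=4, msgs_per_node_per_sec=1, chars_per_msg=80)
print(queued, seconds_behind(queued))
```

With those assumed numbers the console is 12,000 characters (100 seconds)
behind after one minute, and it only gets worse from there.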

Anyway back to the ping-pong game. DEC field service came out, and RDC poked
and prodded and the consensus is that the HSC requestor cards are in the WRONG
order. The basis for this is that the HSC requestor bus is similar to the 
UNIBUS in that the SLOWER devices SHOULD be closer to the CPU card to respond to
interrupts faster. The thought here is that the slower devices do not 
interrupt as often but they need quick attention so they are closer to the CPU
and the attention they desire. You also eliminate the 'time-outs' that would
occur otherwise. As you get further from the CPU, the interrupts increase 
(maybe) but the response time may not be AS critical. The order of the HSC bus
then is as follows.
	CPU/CI  Interface
	Slow devices ( tape drives ) 		Requestor # 1
	Mediocre devices ( RA-60 drives )
	Quick devices ( RA-81 drives )		Middle Requestors
	Fast devices ( RA-82, SI83/93 drives )	High Requestors #8
As it turns out, this sequence is now HIGHLY RECOMMENDED along with Version
3.5 of the HSC code. NO ONE told me, or my local field service office, about 
this so as fate would have it both my HSC backplanes are in damn near reverse 
order. The type of errors you get if the order is not as recommended is a 
SIGNIFICANT number of SDI collision errors ( I have this ) and possible time 
out problems ( I have this also ). In regards to the time out, let me add a 
new wrinkle. My Cluster QUORUM disk is (was) a RA-81 on the FAR side of the SI 
drives. I have been getting the following messages, which always send chills.
	%CNXMAN, Error reading quorum disk
	%CNXMAN, Lost 'connection' to quorum disk
	MOUNT VERIFICATION message for quorum disk
	%CNXMAN, Established "connection" to quorum disk
This usually results in system(s) hanging in wait for the RA-81 to show
its hand. I happen to like my Quorum disk for safety reasons and am not planning
to get rid of it to stop the messages. The fix is that my local field service
rep. and an accomplice will get to spend a Sunday in air conditioning 
redoing the sequence of requestor cards.
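
The recommended requestor ordering above (slowest device class nearest the
CPU/CI interface) can be sketched in a few lines of Python. This is only an
illustration of the ordering rule. The rank table is my reading of the list
above, the sample backplane is hypothetical, and the positions are relative
(0 = nearest the CPU/CI), NOT real HSC requestor numbers.

```python
# Speed classes from the list above, slowest first. Ranks are an assumption
# made for this sketch; equal rank means "same speed class".
SPEED_RANK = {"tape": 0, "RA-60": 1, "RA-81": 2, "RA-82": 3, "SI83/93": 3}


def recommended_order(cards):
    """Sort requestor cards so the slowest device class sits nearest the CPU/CI.

    `cards` lists the device class served by each card, in current backplane
    order. sorted() is stable, so cards of equal rank keep their relative order.
    """
    return sorted(cards, key=lambda dev: SPEED_RANK[dev])


def misplaced(cards):
    """Return the positions whose card differs from the recommended layout."""
    target = recommended_order(cards)
    return [i for i, (have, want) in enumerate(zip(cards, target)) if have != want]


# Hypothetical backplane in nearly reverse order.
current = ["SI83/93", "RA-82", "RA-81", "RA-60", "tape"]
print(recommended_order(current))
print(misplaced(current))
```

For a backplane in reverse order, nearly every card has to move, which is why
the job takes a Sunday.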

One other thing to note about the HSC software. We have one HSC50 and one HSC70.
The 3.5 update showed up for the HSC70 but I have yet to get the kit for the
HSC50. I REQUESTED Field Service to perform the upgrade to both, then told them
about the non-delivery. They got a TU58 kit from area and did the update. There
is a patch to 3.5 for TA81 users (I do not know about TA78 users) that is needed
to correct how the HSC does HSC BACKUPS. This puts the final rev level at 3.51.
There is also an ECO for TA81 DRIVES with 3.5 of the HSC code that will 'correct'
the following two problems.
	Blocks Lost
	Position Lost
Since I also had both of these, I appreciated them installing the kits. The
ECO is number 'EQ-01443-01'.

Then I lost two requestor cards that resulted in corrupted files. The worst file
took 18 hours to recreate from pieces. I found an interesting problem with VMS
SORT in doing the fix. The file is 1,098,690 records of 44 bytes each with the 
key being the first 17 ASCII characters of the record. If ALL the records are 
sorted at once, the primary work file CONSUMES 950K+ blocks of work space and 
wanted more. Unfortunately I had no more to give it so it aborted. When I split 
the file into two roughly equal segments, the work space for each half consumed 
about 150K blocks. Now the keys are close to being in order, so I have to
wonder what reason is behind the request for such large amounts of work space
when doing the whole file. I can already tell what the SPR response is going
to say, "Thanks...". One thing is for sure, you do NOT write SPR's to get a 
TECHNICAL answer to a problem that will get you out of a bind. This was doing
a 'record' sort, and from now on 'tag' sorts will be used.
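
For scale, and to show what a 'tag' sort means in general, here is a small
Python sketch. It assumes the file is nothing more than packed 44-byte records
in 512-byte blocks (any file overhead is ignored), and the tag sort shown is
the textbook idea of sorting small (key, offset) pairs and copying each record
once at the end. It is NOT a description of how VMS SORT works internally, and
the sample records are invented.

```python
RECORD_LEN = 44   # bytes per record (from the text)
KEY_LEN = 17      # leading ASCII key bytes (from the text)
BLOCK = 512       # bytes per disk block


def data_blocks(record_count, record_len=RECORD_LEN):
    """Blocks needed for the raw record data alone, ignoring file overhead."""
    return (record_count * record_len + BLOCK - 1) // BLOCK


def tag_sort(data):
    """Sort fixed-length records by sorting small (key, offset) tags.

    Only the tags move during the sort; each full record is copied exactly
    once when the output is assembled.
    """
    if len(data) % RECORD_LEN:
        raise ValueError("data is not a whole number of records")
    tags = [(data[off:off + KEY_LEN], off)
            for off in range(0, len(data), RECORD_LEN)]
    tags.sort()
    return b"".join(data[off:off + RECORD_LEN] for _, off in tags)


def make_record(key):
    """Build an invented 44-byte test record with a space-padded key."""
    return key.ljust(KEY_LEN).encode("ascii") + b"." * (RECORD_LEN - KEY_LEN)


sample = b"".join(make_record(k) for k in ("CHARLIE", "ALPHA", "BRAVO"))
out = tag_sort(sample)
keys = [out[i:i + KEY_LEN].decode().strip() for i in range(0, len(out), RECORD_LEN)]
print(data_blocks(1_098_690), keys)
```

Under that assumption the raw data is 94,419 blocks, so a 950K+ block work
file is roughly ten times the size of the data being sorted.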

This also brings up a good point in that a TRULY generic program is needed that
will let a user specify the file layout and key size/placement/type and a few
other details and then do sequential reads. The TRICK needed is to be able to
specify how the keys work so that the program can SKIP over bad buckets and then
continue reading the records. All the records read should be written to a 
sequential file that can then be used to create a new version of the file. I am
getting tired of writing special programs to do this every time a file heads
west. The program should also let you start at a different key and step through
a file and pick out the records you want. This is needed if you have to use 
old BACKUPs to get the records you NO longer have.
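
A rough Python sketch of the salvage program described above follows. It is a
toy model under loud assumptions: the file is treated as a flat run of
fixed-size buckets holding fixed-length records, which is NOT the real RMS
on-disk layout, and the "bad bucket" test (keys inside a bucket must ascend)
is a stand-in for whatever check would really apply. The Layout class and
every name in it are hypothetical.

```python
import io
from dataclasses import dataclass


@dataclass
class Layout:
    """Hypothetical file description supplied by the user."""
    record_len: int   # bytes per fixed-length record
    key_offset: int   # where the key starts within a record
    key_len: int      # key length in bytes
    bucket_len: int   # bytes per bucket; assumed a multiple of record_len


def key_of(record, layout):
    return record[layout.key_offset:layout.key_offset + layout.key_len]


def bucket_ok(bucket, layout):
    """Assumed sanity check: keys within a bucket must be in ascending order."""
    keys = [key_of(bucket[i:i + layout.record_len], layout)
            for i in range(0, len(bucket), layout.record_len)]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def salvage(src, dst, layout, start_key=b""):
    """Copy records from good buckets to dst; return (kept, skipped_buckets)."""
    kept = skipped = 0
    while True:
        bucket = src.read(layout.bucket_len)
        if not bucket:
            break
        if len(bucket) % layout.record_len or not bucket_ok(bucket, layout):
            skipped += 1      # bad bucket: step over it and keep reading
            continue
        for i in range(0, len(bucket), layout.record_len):
            record = bucket[i:i + layout.record_len]
            if key_of(record, layout) >= start_key:
                dst.write(record)   # sequential output for rebuilding the file
                kept += 1
    return kept, skipped


# Invented data: 8-byte records, 2-byte keys, 2 records per bucket.
layout = Layout(record_len=8, key_offset=0, key_len=2, bucket_len=16)
damaged = b"AA......AB......" + b"ZZ......AC......" + b"AD......AE......"
out = io.BytesIO()
result = salvage(io.BytesIO(damaged), out, layout, start_key=b"AB")
print(result, out.getvalue())
```

The middle bucket fails the check and is skipped, the start key drops the
first record, and the three surviving records land in the sequential output.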

I want to thank Matt Madison for the reply about my installing the SNA Gateway.
He has also made a VERY good point. It turns out that there is a board in the 
Gateway box that is ONLY accessible from the TOP of the box. The implication 
here being that unless you lift weights and have no fear of hernias, do NOT 
stack the Gateway boxes. With my luck, that is the first board that would pack
it in. Like I said earlier, I love a clean design.

Now I also hear that DEC has released for installation the parts needed to 
upgrade the memory on 8500 systems. The last time we tried, I got a NEW 8500
out of it. I have to ask myself if I am ready for another trade in??

The only other issues were operator problems, a computer room air conditioning
failure and a 26 hour work day on Wednesday. Other than that life was calm.

See, things CAN go from bad to worse. :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU