CLAYTON@XRT.UPENN.EDU ("Clayton, Paul D.") (07/30/87)
Information From TSO Financial - The Saga Continues...

Chapter 15 - July 24, 1987 (Friday)

When it rains it pours. I have TRULY come to appreciate those words of wisdom from my parents. I also believe that they have NO idea to what depths the saying applies to the world of computers and system management. I will only hit the highpoints (?) of the past five days of hell. There are things to learn from this!!

Monday was documented in Chapter 13 of 'The Saga...' and therefore is skipped over here.

Tuesday was somewhat fun. I placed an order for the 'cornerstone' of a second VAXCluster for use in performing an analysis of a 500K record file that will 'consume' three (3) SI93C drives for ONE copy of the file. That translates to 1.95 GigaBytes of storage. HHHHEEEELLLLLLOOOOO VOLUME SETS. The worst part is the file has to be SORTED umpteen different ways. The processor being used, not by choice, is an 8350 with 32MB of memory. Should be interesting to see what the performance is and how the SI93C drives handle. Also purchased Volume Shadow for this machine. It should also make a nice test bed for VMS 5.0 and SMP checkout.

Now for the bad news. We have two Datagrafix 4800 ionization printers that do 90+ pages per minute when working. The quality of print is very good and they handle forms overlays nicely. One of the reasons for choosing them was the proximity of their office to ours, which is about 1.5 BLOCKS or 900 YARDS. One printer broke on Friday of last week and was waiting for parts. On Monday, the second printer broke, naturally, so we logged a service call at 13:00. By 19:00, not a word had been heard so I logged another call. Tuesday morning I called again and explicitly laid the facts on the table and inquired about a remedy. After much researching on their part they pointed to phone logs, which are maintained by Corp in Calif., and said that I must be mistaken.
According to the logs, the 13:00 call was made at 15:00, the 19:00 call was NEVER made, the printer that broke on Friday is fixed, and they are ALWAYS on site within 1.5 HOURS of a service call. After I told them about the errors in their logs and their response time, and offered a PROBATIONARY period to get back in good graces, they came right out to fix the printers. It's a shame; the printers are good but the service department has failed me on a number of occasions.

Also on Tuesday and Wednesday the VAXCluster was playing 'ping-pong' between HSCs, both rebooting the HSCs, with NO messages displayed indicating why the reboot is occurring, and passing disks back and forth. I was beginning to think that DIGITAL was using me to test FAILOVER. Thank GOD it works.

I have also placed a LARGE order for greenbar to use for ALL the console messages that were printed out. We USED to have LA50 printers attached to the PRO consoles to get a hardcopy. The problem is that the messages occur SO FREQUENTLY that the LA50 gives up and prints garbage. I can NOT read garbage so the LA50s are not used anymore.

I have ALSO come to the conclusion that the VAX CLUSTER CONSOLE (VCS) is DEC's response to the VOLUME of messages on the consoles and they NEEDED a FASTER system to capture, and/or print, the console messages. The logic is as follows. Every system prints the messages from EVERY other system in the cluster in ADDITION to its own. If the console is slow, i.e. an LA120, then OPCOM backs up. When OPCOM backs up, system managers do NOT get the message that the CLUSTER IS DYING. By not getting the message, the CLUSTER heads south, quickly. My 8700s clock 6 MIPS, while the LA120s and the PRO console clock 120 characters per second. The PRO is probably doing LESS than 120. CONSIDER the resulting damage from a crazed 8700. By the time you get the message, you've had it. This is what BACKUPs are for.

Anyway, back to the ping-pong game.
DEC field service came out, and RDC poked and prodded, and the consensus is that the HSC requestor cards are in the WRONG order. The basis for this is that the HSC requestor bus is similar to the UNIBUS in that the SLOWER devices SHOULD be closer to the CPU card to respond to interrupts faster. The thought here is that the slower devices do not interrupt as often but they need quick attention, so they are closer to the CPU and the attention they desire. You also eliminate the 'time-outs' that would occur otherwise. As you get further from the CPU, the interrupts increase (maybe) but the response time may not be AS critical. The order of the HSC bus then is as follows.

    CPU/CI Interface
    Slow devices     ( tape drives )             Requestor #1
    Mediocre devices ( RA-60 drives )
    Quick devices    ( RA-81 drives )            Middle Requestors
    Fast devices     ( RA-82, SI83/93 drives )   High Requestors #8

As it turns out, this sequence is now HIGHLY RECOMMENDED along with Version 3.5 of the HSC code. NO ONE told me, or my local field service office, about this, so as fate would have it both my HSC backplanes are in damn near reverse order. The type of errors you get if the order is not as recommended is a SIGNIFICANT number of SDI collision errors ( I have this ) and possible time out problems ( I have this also ).

In regard to the time out, let me add a new wrinkle. My Cluster QUORUM disk is (was) an RA-81 on the FAR side of the SI drives. I have been getting the following messages, which always send chills.

    %CNXMAN, Error reading quorum disk
    %CNXMAN, Lost 'connection' to quorum disk
    MOUNT VERIFICATION message for quorum disk
    %CNXMAN, Established "connection" to quorum disk

This usually results in system(s) hanging while waiting for the RA-81 to show its hand. I happen to like my Quorum disk for safety reasons and am not planning to get rid of it to stop the messages. The fix is that my local field service rep.
and an accomplice will get to spend a Sunday in air conditioning redoing the sequence of requestor cards.

One other thing to note about the HSC software. We have one HSC50 and one HSC70. The 3.5 update showed up for the HSC70 but I have yet to get the kit for the HSC50. I REQUESTED that Field Service perform the upgrade on both, then told them about the non-delivery. They got a TU58 kit from area and did the update. There is a patch to 3.5 for TA81 users (I do not know about TA78 users) that is needed to correct how the HSC does HSC BACKUPS. This puts the final rev level at 3.51. There is also an ECO for TA81 DRIVES with 3.5 of the HSC code that will 'correct' the following two problems.

    Blocks Lost
    Position Lost

Since I also had both of these, I appreciated them installing the kits. The ECO is number 'EQ-01443-01'.

Then I lost two requestor cards, which resulted in corrupted files. The worst file took 18 hours to recreate from pieces. I found an interesting problem with VMS SORT in doing the fix. The file is 1,098,690 records of 44 bytes each, with the key being the first 17 ASCII characters of the record. If ALL the records are sorted at once, the primary work file CONSUMED 950K+ blocks of work space and wanted more. Unfortunately I had no more to give it so it aborted. When I split the file into two roughly equal segments, the work space for each half consumed about 150K blocks. Now the keys are close to being in order, so I have to wonder what reason is behind the request for such large amounts of work space when doing the whole file. I can already tell what the SPR response is going to say, "Thanks...". One thing is for sure, you do NOT write SPRs to get a TECHNICAL answer to a problem that will get you out of a bind. This was doing a 'record' sort, and from now on 'tag' sorts will be used.
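
To put the SORT numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic on the figures quoted above and illustrates the general idea of a tag sort (sort small key-plus-pointer entries, then fetch the full records in key order). It is NOT a model of how VMS SORT sizes its work files. The 512-byte block is the standard VMS disk block; the 6-byte record pointer and the names `blocks`, `tag_sort`, `file_blocks`, and `tag_blocks` are assumptions made up for this illustration.

```python
BLOCK_BYTES = 512      # standard VMS disk block size

RECORDS = 1_098_690    # record count quoted in the post
RECORD_BYTES = 44      # bytes per record, from the post
KEY_BYTES = 17         # key is the first 17 ASCII characters
POINTER_BYTES = 6      # ASSUMPTION: size of a per-record pointer in a tag entry


def blocks(n_records, bytes_per_record):
    """Blocks needed to hold the records packed end to end, with no overhead."""
    total = n_records * bytes_per_record
    return -(-total // BLOCK_BYTES)   # integer ceiling division


def tag_sort(records, key_len=KEY_BYTES):
    """Illustrative tag sort: sort (key, position) tags, then fetch records.

    Only the short tags are shuffled during the sort; the full records are
    read once more, in key order, to build the output.
    """
    tags = sorted((rec[:key_len], pos) for pos, rec in enumerate(records))
    return [records[pos] for _key, pos in tags]


file_blocks = blocks(RECORDS, RECORD_BYTES)
tag_blocks = blocks(RECORDS, KEY_BYTES + POINTER_BYTES)

print("raw file size (blocks):      ", file_blocks)
print("raw tag data size (blocks):  ", tag_blocks)
print("950K work blocks / file size:", round(950_000 / file_blocks, 1))
print("2 x 150K work blocks / file: ", round(300_000 / file_blocks, 1))
```

By this arithmetic the raw data is a little under 95K blocks, so the 950K+ blocks requested for the single-pass record sort is on the order of ten times the file itself, while the two half-file sorts together used roughly three times the file size. Under the assumed 6-byte pointer, the tags alone come to about half the size of the full records, which is the attraction of a tag sort when work space is tight.
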
Recreating that file also brings up a good point in that a TRULY generic program is needed that will let a user specify the file layout and key size/placement/type and a few other details and then do sequential reads. The TRICK needed is to be able to specify how the keys work so that the program can SKIP over bad buckets and then continue reading the records. All the records read should be written to a sequential file that can then be used to create a new version of the file. I am getting tired of writing special programs to do this every time a file heads west. The program should also let you start at a different key and step through a file and pick out the records you want. This is needed if you have to use old BACKUPs to get the records you NO longer have.

I want to thank Matt Madison for the reply about my installing the SNA Gateway. He has also made a VERY good point. It turns out that there is a board in the Gateway box that is ONLY accessible from the TOP of the box. The implication here being that unless you lift weights and have no fear of hernias, do NOT stack the Gateway boxes. With my luck, that is the first board that would pack it in. Like I said earlier, I love a clean design.

Now I also hear that DEC has released for installation the parts needed to upgrade the memory on 8500 systems. The last time we tried, I got a NEW 8500 out of it. I have to ask myself if I am ready for another trade in??

The only other issues were operator problems, a computer room air conditioning failure and a 26 hour work day on Wednesday. Other than that life was calm.

See, things CAN go from bad to worse. :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU