CLAYTON@xrt.upenn.EDU ("Paul D. Clayton") (06/30/87)
Information From TSO Financial - The Saga Continues...
Chapter 6 - June 29, 1987

The subtitle for this letter would read, 'How Large Are The Hidden Costs When Using New Technologies'.

To date, I have not read many messages on this network about users' problems with the new 8XXX machines. Logically there are three explanations for this. The first is that the problems I am experiencing with my machines are unique to me; in other words, the two 8700's and one 8500 we have are lemons from the factory. The second is that many people are having problems but each feels that they are alone with the new machines and thus 'lives' with the problems. The third is that the local offices of DEC are doing everything they can to catch up with the new machines and placate the sites that have them.

In the interest of placing issues in the open for discussion, I will list the various problems that we have had to date on the 8XXX systems, along with the current shortfalls that I perceive.

Problems:

1. The 8500 memory upgrade package that DEC sells for a solid four-figure cost cannot be installed on our system. The problem is defined as 'Under Extreme Loads The System Will Crash'. We bought this package back in October of 1986 and have yet to take advantage of it. DEC made five attempts to install it and ALL attempts failed. The net result is that the original 8500 we had, the 12th production unit, was replaced with a new CPU because they could not get it to work. The upgrade package installation is frozen by DEC until the new MCL card is released from central engineering. They are redesigning the board for sub-10-nanosecond switching times. The release date is sometime next month (July).

2. The RDC capability of the current machines is only a SUBSET of the RDC capability that the 7XX series enjoys. With the current version, you have to have someone on site to enter commands that deposit data and perform certain other actions. The next upgrade is said to correct this deficiency.
3. We are constantly getting what look like ECC correctable memory errors on both the 4MB and 16MB memory arrays. The trouble is that the system does not log the errors until a multiple of 16 errors has occurred. My systems have crashed before reaching the trigger point, and thus the errors are lost to recording. They may in fact point to the cause of the crashes, but we will never know.

4. I have encountered a two line message on the PRO that reads: 'Excessive number of interrupts received by the PPI interface. Closing down the PPI interface.' The net result is that the 8XXX is in a 'hung' state (I know this because it drops all DECnet and LAT connections) and the PRO refuses to talk to the 8XXX. This is also known as a mini 'cluster partition', or 'Mexican Standoff'. The only way to recover that I have found to date is to completely power down both the 8XXX and the PRO, then power up the PRO and then the 8XXX. The result is a bootable machine that will then do what it was intended for. The side effect is that NO crash dump is made and NO knowledge is gained to help in preventing this from happening again.

5. Do not do 'CONNECT CONSOLE' from SYSGEN on the 8XXX. The result is that the system will crash in 18 hours, guaranteed. This is rumored to be fixed in V4.6.

6. Do not use 'HOLD SCREEN' on the PRO to stop the output so that you can read what is being displayed before it scrolls off the top. The PRO and 8XXX get 'confused' about where you are and the result is garbage on the screen from that point on. The workaround is Control S/Q sequences.

7. Do not put the PRO in 'Control' mode, at the '>>>' prompt, while the CPU is running. My local office has always said that this causes no problems. But every time I have tried it, the 8XXX has crashed.

8. The jumpers that connect the memory backplane to the NMI backplane are not keyed, so they can very easily be put on incorrectly.
It took our local and area support 1.5 days to realize that the cables they had just put on were put on incorrectly.

9. The micro diagnostics that are delivered with the machines are good for starters only. I have run the same sequence of tests one after the other and had many DIFFERENT causes reported. The only thing that seems to ALWAYS be reported is a problem with the MCL card for ANY memory subsystem error. I have had 2MB daughter cards on the 16MB arrays show up as MCL card problems from the diags.

10. NEVER power down the PRO on a running 8XXX system. DEC has told me that the 8XXX will continue to function, apart from OPCOM problems. The three times that the PRO has gone south on me all resulted in the 8XXX going with it.

11. The new DEC hardware additions that are available to the public are not supported by VMS. The case here is the 16MB memory arrays. The error logger for VMS does NOT handle the errors that are reported by the 16MB arrays. The CE has to decode the error by hand. The question here is why hardware upgrades are released for sale before VMS can 'fully' support them. The policy for the release of software layered products is totally opposite to this.

12. The connection from the PRO to the MDS01 RDC box is a cable that comes off the shelf. The only thing is that a gender changer is REQUIRED to connect the cable to the PRO. What prevented DEC from making sure that the cables worked with a minimum of fuss? It took our CE a SPECIAL order to get the changer since it was NOT bundled in with the MDS01.

13. The newer systems are using multi-level ZIF connectors to make all the necessary connections from the boards themselves to the backplane. The days of the gold strips on the card edges are limited. These connectors appear to be VERY susceptible to dirt, dust and oxide build up. My CEs are constantly removing cards and wiping the 'pads' that are located on the card edge.
During these periods, I have watched problems, as reported by the diags, 'move' up and down the backplane just from wiping contact pads. I also had a 'Quality Control' stamp, which is an ink stamp, placed DIRECTLY over the pads on one board.

14. The boards themselves use a significant number of static sensitive devices, such as PALs and MOS chips. These boards are shipped inside a plastic case that is lined with anti-static foam and has allowances for connecting a static strap and a window to read the board id number without opening the case. These cases can NOT be cheap, even when you buy thousands. We have gotten a large number of replacement boards from DEC repair depots that have an 8 1/2 by 11 piece of paper inside the case, lying on top of the board. A closer inspection of these papers shows that they are repair logs detailing any work done to the board. Unless I am mistaken, paper is a carrier for static charge. In DEC's defense, I have also received boxes that had the same piece of paper inside an anti-static bag, then placed on top of the board.

15. On one occasion, I was having a system crash every day around the same time. It turned out to be vibration sensitive boards, five of them. The new systems make considerable use of leadless chip carriers and cooling towers to use the new chips that allow all the transistors to be packed in a very small area. I have to wonder about the ability of this configuration to withstand vibrational loads over a long period of time. There is nothing worse than a flaky machine. I always pray that when a machine dies, it dies solidly.

This list has been compiled based on problems that I have had since January 1, 1987. I am sure that other problems are left out. Maybe I will create an updated list in the future; surely I have not found all the problems by now.

The primary question is, at what cost to the user community are new machines being sold?
When we lost the 8500 for a week the cost to us was $4.5 million per day, all due to a $500,000 computer.

What other problems with the 8XXX systems that are not listed above have occurred in your shops? Are there significant problems now, or on the horizon, for DEC's sales force regarding new systems? Or is all this just happening to me?? :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU