[comp.os.vms] 8XXX Systems - Are They Systems Released Before Their Time ??

CLAYTON@xrt.upenn.EDU ("Paul D. Clayton") (06/30/87)

Information From TSO Financial - The Saga Continues...
Chapter 6 - June 29, 1987

The subtitle for this letter would read, 'How Large Are The Hidden Costs When
Using New Technologies'. To date, I have not read many messages on this
network about users' problems with the new 8XXX machines. Logically there are
three possible explanations for this. The first is that the problems I am
experiencing with my machines are unique to me; in other words, the two 8700's
and one 8500 we have are lemons from the factory. The second is that many
people are having problems, but each feels alone with the new machines and
thus 'lives' with the problems. The third is that the local offices of DEC
are doing everything they can to catch up with the new machines and placate the
sites that have them.

In the interest of placing issues in the open for discussion, I will list the
various problems that we have had to date on the 8XXX systems and the current
shortfalls that I perceive.

Problems:
1. The 8500 memory upgrade package that DEC sells for a solid four-figure cost
cannot be installed on our system. The problem is defined as 'Under Extreme
Loads The System Will Crash'. We bought this package back in October of 1986
and have yet to take advantage of it. DEC made five attempts to install it
and ALL of them failed. The net result is that the original 8500 we had, the
12th production unit, was replaced with a new CPU because they could not get it
to work. The upgrade package installation is frozen by DEC until the new MCL
card is released from central engineering. They are redesigning the board for
sub-10-nanosecond switching times. The release date is sometime next month
(July).

2. The RDC capability of the current machines is only a SUBSET of the RDC
capability that the 7XX series enjoys. With the current version, you have to
have someone on site to enter commands that deposit data and perform certain
other actions. The next upgrade is said to correct this deficiency.

3. We are constantly getting what look like ECC correctable memory errors on
both the 4MB and 16MB memory arrays. The trouble is that the system does not
log the errors until a multiple of 16 errors has occurred. My systems have
crashed before reaching the trigger point, and thus the errors are lost to
recording. They may in fact point to the fault that caused the crash, but we
will never know.

4. I have encountered a two line message on the PRO that reads:
	'Excessive number of interrupts received by the PPI interface.
	 Closing down the PPI interface.'
The net result of this is that the 8XXX is in a 'hung' state (I know this
because it drops all DECnet and LAT connections) and the PRO refuses to talk
to the 8XXX. This is also known as a mini 'cluster partition', or 'Mexican
Standoff'. The only way to recover that I have found to date is to completely
power down both the 8XXX and the PRO, then power up the PRO and then the 8XXX.
The result is a bootable machine that will then do what it was intended for.
The side effect is that NO crash dump is made and NO knowledge is gained to
help in preventing this from happening again.

5. Do not do 'CONNECT CONSOLE' from SYSGEN on the 8XXX. The result is that the
system will crash in 18 hours, guaranteed. This is rumored to be fixed in V4.6.

6. Do not use 'HOLD SCREEN' on the PRO to stop the output so that you can
read what is being displayed before it scrolls off the top. The PRO and
8XXX get 'confused' about where you are and the result is garbage on the screen
from that point on. The workaround is Control-S/Control-Q sequences.

7. Do not put the PRO in the 'Control' mode, at the '>>>' prompt, while the
CPU is running. My local office has always said that this causes no problems.
But every time I have tried it, the 8XXX has crashed.

8. The jumpers that connect the memory backplane to the NMI backplane are not
keyed, so they can very easily be put on incorrectly. It took our local and
area support 1.5 days to realize that the cables they had just put on were put
on incorrectly.

9. The micro diagnostics that are delivered with the machines are good for 
starters only. I have run the same sequence of tests one after the other and 
had many DIFFERENT causes reported. The only thing that seems to ALWAYS be 
reported is a problem with the MCL card for ANY memory subsystem errors. I have
had 2MB daughter cards on the 16MB arrays show up as MCL card problems from the 
diags.

10. NEVER power down the PRO from a running 8XXX system. DEC has told me that
the 8XXX will continue to function, taking into consideration OPCOM problems.
The three times that the PRO has gone south on me all resulted in the 8XXX 
going with it.

11. The new DEC hardware additions that are available to the public are not 
supported by VMS. The case here is with the 16MB memory arrays. The error
logger for VMS does NOT handle the errors that are reported by the 16MB 
arrays. The CE has to decode the error by hand. The question here is why are
hardware upgrades released for sale before VMS can 'fully' support them? The
policy for the release of software layered products is totally opposite to this.

12. The connection from the PRO to the MDS01 RDC box is a cable that comes
off the shelf. The only thing is that a gender changer is REQUIRED to connect
the cable to the PRO. What prevented DEC from making sure that the cables
worked with a minimum of fuss? It took our CE a SPECIAL order to get the
changer since it was NOT bundled in with the MDS01.

13. The newer systems are using multi-level ZIF connectors to make all the
necessary connections from the boards themselves to the backplane. The days
of the gold strips on the card edges are limited. These connectors appear to
be VERY susceptible to dirt, dust and oxide build up. My CEs are constantly
removing cards and wiping the 'pads' that are located on the card edge. During
these periods, I have witnessed problems, as reported by the diags, 'move' up
and down the backplane just from wiping contact pads. I also had one board
with a 'Quality Control' stamp, which is an ink stamp, placed DIRECTLY over
the pads.

14. The boards themselves use a significant number of static sensitive devices,
such as PALs and MOS chips. These boards are shipped inside a plastic case
that is lined with anti-static foam and has allowances for connecting a
static strap and a window to read the board id number without opening the case.
These cases can NOT be cheap, even when you buy thousands. We have gotten a
large number of replacement boards from DEC repair depots that have an 8 1/2
by 11 piece of paper inside them lying on top of the board. A closer inspection
of these papers shows that they are repair logs detailing any work done to the
board. Unless I am mistaken, paper is a carrier for static charge. In DEC's
defense, I have also received boxes that had the same piece of paper inside
an anti-static bag, then placed on top of the board.

15. On one occasion, I was having a system crash every day around the same time.
It turned out to be vibration sensitive boards, five of them. The new systems
make considerable use of leadless chip carriers and cooling towers to use
the new chips that allow all the transistors to be packed in a very small
area. I have to wonder about the ability of this configuration to withstand
vibrational loads over a long period of time. There is nothing worse than a
flaky machine. I always pray that when a machine dies, it dies solidly.
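
To make the failure mode in problem 3 concrete, here is a minimal sketch of an
error logger that only writes its records out once 16 of them have piled up.
This is NOT how the 8XXX or VMS actually implements it; the class name, the
in-memory 'pending' buffer and the crash() method are all hypothetical, and
only the threshold of 16 comes from the description above. It just shows why
a crash short of the trigger point leaves nothing behind to look at.

```python
LOG_THRESHOLD = 16  # from the post: errors are logged only at multiples of 16


class BatchedErrorLog:
    """Toy model (an assumption, not DEC's design) of threshold-based logging."""

    def __init__(self, threshold=LOG_THRESHOLD):
        self.threshold = threshold
        self.pending = []  # errors seen but not yet written anywhere durable
        self.logged = []   # errors that made it into the persistent log

    def record(self, error):
        """Count one correctable error; flush only when the threshold is hit."""
        self.pending.append(error)
        if len(self.pending) == self.threshold:
            self.logged.extend(self.pending)
            self.pending.clear()

    def crash(self):
        """Simulate a system crash: whatever is still pending is gone."""
        lost = len(self.pending)
        self.pending.clear()
        return lost


if __name__ == "__main__":
    log = BatchedErrorLog()
    for n in range(15):  # one short of the trigger point
        log.record(f"ECC correctable error #{n}")
    lost = log.crash()
    print(f"logged before crash: {len(log.logged)}, lost in crash: {lost}")
```

With 15 errors recorded before the crash, the model has logged none of them
and lost all 15, which is the situation described in problem 3.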

This list has been compiled based on problems that I have had since January 1,
1987. I am sure that other problems are left out. Maybe I will create an
updated list in the future; surely I have not found all the problems by now.

The primary question is, at what cost to the user community are new machines
being sold? When we lost the 8500 for a week the cost to us was $4.5 million
per day, all due to a $500,000 computer. What other problems with the 8XXX
systems that are not listed above have occurred in your shops? Are there
significant problems now, or on the horizon, for DEC's sales force regarding
new systems?
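
For a rough sense of scale, the arithmetic behind those two figures can be
sketched as below. The $4.5 million per day and the $500,000 price are the
numbers quoted above; the function names are made up for illustration, and
since 'a week' could mean five business days or seven calendar days, both
are shown as assumptions rather than as the actual outage length.

```python
DOWNTIME_COST_PER_DAY = 4_500_000  # dollars per day, as quoted in the post
MACHINE_PRICE = 500_000            # dollars, as quoted in the post


def downtime_cost(days, cost_per_day=DOWNTIME_COST_PER_DAY):
    """Total cost of an outage lasting the given number of days."""
    return days * cost_per_day


def cost_ratio(days, machine_price=MACHINE_PRICE):
    """How many times the machine's price the outage cost."""
    return downtime_cost(days) / machine_price


if __name__ == "__main__":
    # 5 and 7 are assumed readings of 'a week'; the post does not say which.
    for days in (1, 5, 7):
        print(f"{days} day(s): ${downtime_cost(days):,} "
              f"= {cost_ratio(days):.0f}x the machine price")
```

Even a single day of downtime works out to nine times the price of the
machine itself.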

Or is all this just happening to me?? :-)

Paul D. Clayton - Manager Of Systems
TSO Financial - Horsham, Pa. USA
Address - CLAYTON%XRT@CIS.UPENN.EDU