[comp.sys.sun] Hardware problems?

lapyun@uunet.uu.net (Lap Yun Yau) (11/30/90)

Hi, can someone help us with the following problems on Sun 4/490s?

One of our 4/490s crashed and rebooted automatically about 6 weeks ago.
Since then, it crashed every other day for nearly 4 weeks.  After we
replaced the CPU board 5 times, the 128M memory boards 4 times, the 32M
memory board 2 times, and the IPI controllers, the system ran fine for
about 10 days.  Then it crashed again last Saturday, and we had to power
the machine down and back up in order to reboot it.  Today (Tuesday, 3
days later) it crashed again.

We have turned on 'savecore' in /etc/rc.local, but we have been able to
capture a core dump only three times, and only one of those is complete.
(Does anybody know why?)  We sent the core file to Sun, and their people
found that there was a panic in the ie module.

The little green lights at the back of the CPU board show different
things in different crashes.  Most of the time the fifth one is blinking
and the rest are dark; sometimes the 3rd and the 6th lights are on (not
blinking).  Does anybody know how to interpret these signals?

The Sun software people said it may be the NFS problems on the 4/490, and
that NFS patch version 2, which they are going to send us, may solve them.
The Sun hardware people said there may be some hardware problems, and they
are swapping parts (virtually everything except the chassis, the IPI
disks, and the SCSI controller, which we seldom use).

We are confused and frustrated.  Our production work is severely affected
and management is very upset.  Does anybody have any advice or insight?
The following describes the details of that 4/490.

* It is one of the first group of 4/490s Sun shipped (we got it the same
  week Sun made the announcement).

* there are 2 * 128M memory boards, 1 * 32M memory board, 1 IPI controller
  with 2 disks.

* The OS version is 4.0.3 and the swap space is 620M.  We ran simulation
  jobs on it for months before the first crash.

* Other 4/490s with a similar hardware/software setup are running fine
  without any problems.

* It was set up as a file server with the appropriate server setup;
  however, after many crashes, we reloaded the OS and made it a standalone
  machine (versus the file server setup), and it still crashes.

This is not the end of the story: today, another 4/490 crashed for no
apparent reason!  What will be next?  Help, help.