[comp.sys.hp] 9000/370 Problems...

krishnan@uceng.UC.EDU (Ramaswamy Krishnan) (07/20/90)

In article <13484@udenva.cair.du.edu> news@udenva.cair.du.edu (netnews) writes:
> ...
> Since about March we have been experiencing system crashes.  Generally the
> system panics about a parity error, dumps stuff to the console, then
> hangs.
>
> Our hp people here have replaced everything in the box, and we still have
> errors.  It seems to be crashing during a compile (at least that's what it
> was doing last...).

Yes - similar symptoms to what we had here in May.  But since the error
messages you mention seem generic for a system crash, I am not sure if
it is the same kind of problem.

Our configuration : An 840 running 7.0 with a 7963B 0.9Gig diskbox and 4
                    7935 (400Meg each).

It all started to happen one fine day in May - a couple of months after
we went to 7.0 - I had not changed anything much during that period.

Here is what happened :

Even as I was working, the system slowed down - and after a few minutes
of such slow activity, it came to a state where even my cursor wouldn't
move.  And after a couple of minutes the system rebooted itself.
The message in the adm file was similar to (sorry for listing the whole
message - but I am doing so in the hope that some HP-UX guru there may use it) :

===============
Jul 16 12:05
trap type 15, pcsq.pcoq = 0.49434, isr.ior = 0.1c

PANIC:  please wait for core dump to complete.
@(#)9245XA HP-UX (sys.A.B7.00.3L/S800) #1: Mon Oct 30 17:59:05 PST 1989
panic: (display==0xb000, flags==0x0) Data segmentation fault

PC-Offset Stack Trace (read across, most recent is 1st):
stktrc: can't find rp
  0x000d0f78  0x000d1160  0x000d12ec  0x00082854  0x000790a8  0x00049434
End Of Stack

sync'ing disks (90 buffers to flush): 90 76 67 54 43 34 26 22 19 14 11 7 4 1
0 buffers not flushed
0 buffers still dirty

dumping 25165824 bytes to dev 0x207, offset 18326 ...
Dump successfully completed.
Beginning I/O System Configuration.
cio_ca0 address = 8
   hpib0 address = 0
      disc0 lu = 0 address = 0
      disc0 lu = 1 address = 1
      disc0 lu = 2 address = 2
      disc0 lu = 3 address = 3
   mux0 lu = 0 address = 1
   hpib0 address = 2
      lpr0 lu = 1 address = 0
      lpr0 lu = 0 address = 1
      tape1 lu = 0 address = 3
      tape1 lu = 1 address = 4
      lpr1 lu = 2 address = 5
      instr0 lu = 0 address = 6
      instr0 lu = 2 address = 2
   mux0 lu = 1 address = 3
   lan0 lu = 0 address = 4
   gpio0 lu = 0 address = 5
   hpib0 address = 6
      disc0 lu = 4 address = 0
      disc0 lu = 5 address = 1
      disc0 lu = 6 address = 2
      disc0 lu = 7 address = 3
   hpib0 address = 7
      lpr0 lu = 3 address = 1
      tape1 lu = 2 address = 3
      lpr1 lu = 4 address = 5
      instr0 lu = 1 address = 7
   mux0 lu = 2 address = 8
   mux0 lu = 3 address = 9
   mux0 lu = 4 address = 10
   mux0 lu = 5 address = 11
I/O System Configuration complete.
Configure called
Beginning Subsystem Initialization
   nsnsipc0 initialized
   nsrfa0 initialized
Subsystem Initialization Complete
Beginning Filesystem Initialization
   ufs initialized
   nfs initialized
Filesystem Initialization Complete
@(#)9245XA HP-UX (sys.A.B7.00.3L/S800) #1: Mon Oct 30 17:59:05 PST 1989
real mem = 25165824
lockable mem = 17342464
avail mem = 19243008
using 614 buffers containing 2514944 bytes of memory

===============

So basically it crashed because of some data segmentation fault and
rebooted itself.
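
The byte counts in the log above are easier to read once converted. What follows is only a rough sketch, not part of the original post: the helper name `mem_summary` is made up for illustration, and the regular expressions assume the messages look exactly like the lines quoted above.

```python
import re

# Lines copied from the console log quoted above.
LOG = """\
dumping 25165824 bytes to dev 0x207, offset 18326 ...
real mem = 25165824
lockable mem = 17342464
avail mem = 19243008
using 614 buffers containing 2514944 bytes of memory
"""

MIB = 1024 * 1024


def mem_summary(log):
    """Pull the byte counts out of the boot messages (hypothetical helper)."""
    fields = {
        name: int(value)
        for name, value in re.findall(r"(real|lockable|avail) mem = (\d+)", log)
    }
    nbuf, nbytes = map(
        int, re.search(r"using (\d+) buffers containing (\d+) bytes", log).groups()
    )
    fields["dump"] = int(re.search(r"dumping (\d+) bytes", log).group(1))
    fields["buffer_size"] = nbytes // nbuf
    return fields


summary = mem_summary(LOG)
for name in ("real", "lockable", "avail", "dump"):
    print(f"{name:9s} {summary[name] / MIB:6.2f} MiB")
print("bytes per buffer:", summary["buffer_size"])
```

Run against these lines, it shows that the dump size equals real memory (24 MiB) and that the 614 buffers work out to 4096 bytes each.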

Well, I found that every morning this happened, pathalias was still
running - it should have finished in 5 mins at night, but would carry on
till morning.  That was a hint that pathalias might indeed be the problem,
as it seemed to be stuck somewhere and was just hogging memory.

So I replaced pathalias with a new version and the system stopped crashing.

Yes - I took a core dump along with the pathalias version we were running
and also the maps and shipped them to HP about 2 months back.  They are
yet to call us back.  Incidentally, the crash occurred again this week
(as the log above shows) and this time it was not pathalias - someone
was running a large program.

So I feel that it is something to do with memory utilization - not hardware.

Any HP-UX gurus listening and can shed some light (at least someone who
can make my confidence in HP support build up) ?

> The one thing that seems to happen 90% of the time after the crash the LED
> marked 1 on the 7963 flashes constantly.  Does this mean anything?

Hmm.. I did not notice that - but could it be just that the disk is not
clean and/or is getting fsck'd when rebooting?

> Anyway we are getting near the end of the rope on this box, we've replaced
> the power coming in (which HP keeps insisting is our problem...)

I wouldn't spend a dime on that power stuff if I were you - it could be
another goose chase that the response center folks had to come up with.

Though we have had some help from the HP folks on this net at times, I
guess we haven't chanced into 'that right person' in the response center
yet who would boost my confidence that 'they know their bugs'.

> Anyone have any similar experiences???
> ----------------------
> Randy Welch               UUCP    :  ...!ncar!scicom!bldr!randy  (work)

Thanks in advance for any more light some HP-UX guru can shed on this.

--
Ramaswamy Krishnan				Krishnan@UC.EDU  (ARPA)
College of Engineering				uceng!krishnan   (UUCP)
Univ. of Cincinnati				krishnan@ucbeh   (BITNET)

rjn@hpfcso.HP.COM (Bob Niland) (07/20/90)

re: > Perhaps you can help solve a mystery...

Perhaps, but we'll need more information.


> We have a 9000/370 w/16M 2disks one is a 7963B the other a 7937. ( Whether
> this matters or not, who knows... ) Attached to an ethernet network and a
> Novell Network.

Can you give us the whole configuration: cards, slots, addresses, etc.
In particular, do you have one of your disks on a 98625A card?


> Since about March we have been experiencing system crashes.  Generally the
> system panics about a parity error, dumps stuff to the console, then
> hangs.

The exact message(s) would be helpful.  Do you mean memory parity errors?
Do you have parity or ECC RAM?  When my workstation was a 16M 350 with
parity RAM, I was experiencing random transient parity errors and crashes
about once every 3 months (cosmic rays, alpha particles or whatever).  Must
be the Colorado altitude.  Since converting to ECC, the problem disappeared.

Regards,                                              Hewlett-Packard
Bob Niland      Internet: rjn@hpfcrjn.FC.HP.COM       3404 East Harmony Road
                UUCP: [hplabs|hpfcse]!hpfcrjn!rjn     Ft Collins CO 80525-9599

burdick@hpspdra.HP.COM (Matt Burdick) (07/21/90)

> The one thing that seems to happen 90% of the time after the crash the LED
> marked 1 on the 7963 flashes constantly.  Does this mean anything?

Have you tried replacing the disk, or just all of the cards in the cpu
box?  If you have a bad disk (especially if it's used for swap), it
could cause the machine to panic.

							-matt
-- 
Matt Burdick                |   Hewlett-Packard
burdick@hpspd.spd.hp.com    |   Intelligent Networks Operation

rwelch@diana.cair.du.edu (RANDY S WELCH) (07/23/90)

In article <13330002@hpspdra.HP.COM> burdick@hpspdra.HP.COM (Matt Burdick) writes:

>   Have you tried replacing the disk, or just all of the cards in the cpu
>   box?  If you have a bad disk (especially if it's used for swap), it
>   could cause the machine to panic.

Actually the disks, and the mux card, are the only things that haven't
been swapped, yet...

And today I have seen the ultimate in swaps.  Our HP people brought in a
whole 370 to swap with ours. *sigh*

As Bob Niland requested:
>  Perhaps, but we'll need more information.

I'll try to have that Monday (hardware & software specs).

Just as a point of interest the system was running ok for 6-8 months prior
to the memory problems...

----------------------
Randy Welch               UUCP    :  ...!ncar!scicom!bldr!randy  (work)
Vitel International       INTERNET:  rwelch@du.edu               (read)
Boulder, CO               VOICE   :  303-442-6717
"Unfortunately, life contains an unavoidable element of unpredictability"
	       -David Lynch "The Angriest Dog in the World"

glen@hpfcmgw.HP.COM (Glen Robinson) (07/27/90)

The H-P field guys keep insisting that it is a power problem because that
is the most likely cause of parity panics on a 370.  A parity error can ONLY
be detected on a read from memory (that is the only time parity is checked,
and the check is done by hardware, not software).  Therefore the corruption
must have happened either during the previous write to the location or at
some time after that write.  Problems that occur during writes to a memory
location can quite reliably be found with memory diagnostics, including the
boot-time memory diagnostic.
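
To illustrate why the error only shows up on a read, here is a minimal sketch. It assumes one even-parity bit per byte, which is a guess made for illustration and not the 370's documented scheme; the class `ParityMemory` and its methods are hypothetical names.

```python
def parity_bit(byte):
    """Even-parity bit: 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") & 1


class ParityMemory:
    """Toy memory that stores a parity bit with every byte (assumed layout)."""

    def __init__(self):
        self.cells = {}  # address -> (data byte, stored parity bit)

    def write(self, addr, byte):
        # Parity is generated at write time and stored next to the data.
        self.cells[addr] = (byte & 0xFF, parity_bit(byte))

    def flip_bit(self, addr, bit):
        """Simulate a soft error: one data bit changes after the write."""
        data, par = self.cells[addr]
        self.cells[addr] = (data ^ (1 << bit), par)

    def read(self, addr):
        # Parity is checked only here, so corruption stays hidden until a read.
        data, par = self.cells[addr]
        if parity_bit(data) != par:
            raise RuntimeError(f"parity error at address {addr:#x}")
        return data


mem = ParityMemory()
mem.write(0x1000, 0x5A)
mem.write(0x2000, 0x3C)
mem.flip_bit(0x2000, 3)   # nothing notices the corruption at this point
ok = mem.read(0x1000)     # an untouched cell reads back fine
try:
    mem.read(0x2000)      # the error surfaces only on this read
    detected = False
except RuntimeError:
    detected = True
```

In this toy model the flipped cell can sit corrupted for any length of time. The panic is tied to whichever read touches it first, not to the moment the bit changed.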

Problems occurring after a cell is written usually are caused by one of two
things:
    1.  A cell changes due to a soft error caused for instance by an alpha 
        particle hit.  
    2.  A cell changes due to a voltage transient on logic ground.  In this 
        scenario the specific cells affected by such a transient are those
        that are 'weakest' at the time.   While many cells might be changed
        you will only know about the first one that a read attempt is made
        on (i.e., the one that generates the parity panic).
Note that in the two cases above the location of failure will probably
be random.

The design is extremely robust in handling spikes or large transients
across AC neutral and AC phase.  However, in order to pass VDE, class B,
et al., the design separates AC neutral, Safety Ground and Logic Ground.
In normal user ac power situations this is no problem.  However, when the
user has problems such as floating grounds, or peripherals on one phase
and the computer on another phase (or whatever) a measurement of the
rms voltage between AC neutral and Safety ground will indicate the problem.
The Model 370 will NOT tolerate voltage greater than 1 volt rms between
these two lines.  Often a power line monitor is required in order to
catch transients across these two lines which sometimes occur as the
result of an external event (elevator motor, or ..).
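
As a rough sketch of the rms check described above (the sampled waveforms are invented for illustration, and only the 1 volt rms limit comes from this post), the neutral-to-ground voltage could be evaluated like this:

```python
import math

LIMIT_V_RMS = 1.0  # limit quoted above for the Model 370


def rms(samples):
    """Root-mean-square of a list of voltage samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def sine(amplitude, n=1000):
    """n evenly spaced samples over one full cycle (synthetic test signal)."""
    return [amplitude * math.sin(2 * math.pi * i / n) for i in range(n)]


quiet = rms(sine(0.5))   # 0.5 V peak, about 0.35 V rms
noisy = rms(sine(2.0))   # 2.0 V peak, about 1.41 V rms

for label, value in (("quiet", quiet), ("noisy", noisy)):
    verdict = "over limit" if value > LIMIT_V_RMS else "within limit"
    print(f"{label}: {value:.2f} V rms, {verdict}")
```

An rms figure averaged over many cycles will smooth out a short spike, which is presumably why a power line monitor is needed to catch transients.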

To put all of this into perspective: there are a lot of Model 370s out
there (in the tens of thousands).  You can count the sites that have
experienced recurring parity problems on one hand.  In every previous
case we have found that curing input power problems solved the parity 
problems.

The normal comments about this not being an official position of H-P etc.
apply.

Glen Robinson

rjn@hpfcso.HP.COM (Bob Niland) (07/28/90)

re: > The H-P field guys keep insisting that it is a power problem because
    > that is the most likely cause of parity panics on a 370.

It could also be a [rare] defective backplane (bent connector, cold/loose
solder joint, etc.), in which case the most recent action of swapping out
the whole box will probably correct it.

rwelch@diana.cair.du.edu (RANDY S WELCH) (07/30/90)

In article <7370182@hpfcso.HP.COM> rjn@hpfcso.HP.COM (Bob Niland) writes:

>   It could also be a [rare] defective backplane (bent connector, cold/loose
>   solder joint, etc.), in which case the most recent action of swapping out
>   the whole box will probably correct it.

Well so far the machine seems to be working ok.  It's been a week and the
only thing that killed it was a power outage.  Hope it works.  I'd like to
get on this box someday :-) ( if you only knew my office... )

Thanks to everyone who has given answers on this problem!
-- 
Randy Welch   Mail to :  ...!ncar!scicom!bldr!randy or rwelch@du.edu
Boulder, CO   VOICE   :  303-442-6717
"Unfortunately, life contains an unavoidable element of unpredictability"
-David Lynch "The Angriest Dog in the World"