krishnan@uceng.UC.EDU (Ramaswamy Krishnan) (07/20/90)
In article <13484@udenva.cair.du.edu> news@udenva.cair.du.edu (netnews) writes:
> ...
> Since about March we have been experiencing system crashes. Generally the
> system panics about a parity error, dumps stuff to the console, then
> hangs.
>
> Our hp people here have replaced everything in the box, and we still have
> errors. It seems to be crashing during a compile ( at least that's what it
> was doing last... ).

Yes - similar symptoms to what we had here in May. But since the error
messages you mention here seem generic for a system crash, I am not sure
if it is the same kind.

Our configuration: an 840 running 7.0 with a 7963B 0.9Gig diskbox and
4 7935 (400Meg each).

It all started to happen one fine day in May - a couple of months after we
went to 7.0 - I had not changed anything much during that period. Here is
what happened: even as I was working, the system slowed down - and after a
few minutes of such slow activity, it came to a state where even my cursor
wouldn't move. And after a couple of minutes the system rebooted itself.
The message in the adm file was similar to this (sorry for listing the
whole message - but I am doing so in the hope that some HP-UX guru there
may use it):

===============
Jul 16 12:05
trap type 15, pcsq.pcoq = 0.49434, isr.ior = 0.1c
PANIC: please wait for core dump to complete.
@(#)9245XA HP-UX (sys.A.B7.00.3L/S800) #1: Mon Oct 30 17:59:05 PST 1989
panic: (display==0xb000, flags==0x0) Data segmentation fault
PC-Offset Stack Trace (read across, most recent is 1st):
stktrc: can't find rp
0x000d0f78 0x000d1160 0x000d12ec 0x00082854 0x000790a8 0x00049434
End Of Stack

sync'ing disks (90 buffers to flush): 90 76 67 54 43 34 26 22 19 14 11 7 4 1
0 buffers not flushed
0 buffers still dirty

dumping 25165824 bytes to dev 0x207, offset 18326
...
Dump successfully completed.

Beginning I/O System Configuration.
cio_ca0 address = 8
hpib0 address = 0
disc0 lu = 0 address = 0
disc0 lu = 1 address = 1
disc0 lu = 2 address = 2
disc0 lu = 3 address = 3
mux0 lu = 0 address = 1
hpib0 address = 2
lpr0 lu = 1 address = 0
lpr0 lu = 0 address = 1
tape1 lu = 0 address = 3
tape1 lu = 1 address = 4
lpr1 lu = 2 address = 5
instr0 lu = 0 address = 6
instr0 lu = 2 address = 2
mux0 lu = 1 address = 3
lan0 lu = 0 address = 4
gpio0 lu = 0 address = 5
hpib0 address = 6
disc0 lu = 4 address = 0
disc0 lu = 5 address = 1
disc0 lu = 6 address = 2
disc0 lu = 7 address = 3
hpib0 address = 7
lpr0 lu = 3 address = 1
tape1 lu = 2 address = 3
lpr1 lu = 4 address = 5
instr0 lu = 1 address = 7
mux0 lu = 2 address = 8
mux0 lu = 3 address = 9
mux0 lu = 4 address = 10
mux0 lu = 5 address = 11
I/O System Configuration complete.
Configure called
Beginning Subsystem Initialization
nsnsipc0 initialized
nsrfa0 initialized
Subsystem Initialization Complete
Beginning Filesystem Initialization
ufs initialized
nfs initialized
Filesystem Initialization Complete
@(#)9245XA HP-UX (sys.A.B7.00.3L/S800) #1: Mon Oct 30 17:59:05 PST 1989
real mem = 25165824
lockable mem = 17342464
avail mem = 19243008
using 614 buffers containing 2514944 bytes of memory
===============

So basically it crashed because of some data segmentation fault and
rebooted itself.

Well, I found that when this happened every morning, pathalias was
running - pathalias should have been done in 5 mins at night, but would
carry on till morning. Then came a hint that pathalias might indeed be the
problem, as it seemed to be stuck somewhere and was just hogging memory.
So I replaced pathalias with a new version and the system stopped
crashing.

Yes - I took a core dump along with the pathalias version we were running
and also the maps and shipped them to HP about 2 months back. They are yet
to call us back.

Incidentally, the crash occurred again this week (as the log above shows)
and this time it was not pathalias - someone was running a large program.
So I feel that it is something to do with memory utilization - not
hardware. Are any HP-UX gurus listening who can shed some light (at least
someone who can build up my confidence in HP support)?

> The one thing that seems to happen 90% of the time after the crash the LED
> marked 1 on the 7963 flashes constantly. Does this mean anything?

Hmm.. I did not notice that - but could it be just that the disk is not
clean and/or is getting fsck'd when rebooting?

> Anyway we are getting near the end of the rope on this box, we've replaced
> the power coming in ( Which HP keeps insisting that's our problem... )

I wouldn't spend a dime on that power stuff if I were you - it could be
another goose chase that the response center folks had to come up with.
Though we have had some help from the HP folks on this net at times, I
guess we haven't chanced upon 'that right person' in the response center
yet who would boost my confidence that 'they know their bugs'.

> Anyone have any similar experiences???
> ----------------------
> Randy Welch UUCP : ...!ncar!scicom!bldr!randy (work)

Thanks in advance for any more light some HP-UX guru can shed on this.

--
Ramaswamy Krishnan        Krishnan@UC.EDU (ARPA)
College of Engineering    uceng!krishnan (UUCP)
Univ. of Cincinnati       krishnan@ucbeh (BITNET)
rjn@hpfcso.HP.COM (Bob Niland) (07/20/90)
re:
> Perhaps you can help solve a mystery...

Perhaps, but we'll need more information.

> We have a 9000/370 w/16M 2disks one is a 7963B the other a 7937. ( Whether
> this matters or not, who knows... ) Attached to an ethernet network and a
> Novell Network.

Can you give us the whole configuration: cards, slots, addresses, etc.?
In particular, do you have one of your disks on a 98625A card?

> Since about March we have been experiencing system crashes. Generally the
> system panics about a parity error, dumps stuff to the console, then
> hangs.

The exact message(s) would be helpful. Do you mean memory parity errors?
Do you have parity or ECC RAM?

When my workstation was a 16M 350 with parity RAM, I was experiencing
random transient parity errors and crashes about once every 3 months
(cosmic rays, alpha particles or whatever). Must be the Colorado altitude.
Since converting to ECC, the problem has disappeared.

Regards,
Bob Niland
Hewlett-Packard, 3404 East Harmony Road, Ft Collins CO 80525-9599
Internet: rjn@hpfcrjn.FC.HP.COM
UUCP: [hplabs|hpfcse]!hpfcrjn!rjn
burdick@hpspdra.HP.COM (Matt Burdick) (07/21/90)
> The one thing that seems to happen 90% of the time after the crash the LED
> marked 1 on the 7963 flashes constantly. Does this mean anything?

Have you tried replacing the disk, or just all of the cards in the cpu
box? If you have a bad disk (especially if it's used for swap), it
could cause the machine to panic.

-matt
--
Matt Burdick              | Hewlett-Packard
burdick@hpspd.spd.hp.com  | Intelligent Networks Operation
rwelch@diana.cair.du.edu (RANDY S WELCH) (07/23/90)
In article <13330002@hpspdra.HP.COM> burdick@hpspdra.HP.COM (Matt Burdick) writes:
> Have you tried replacing the disk, or just all of the cards in the cpu
> box? If you have a bad disk (especially if it's used for swap), it
> could cause the machine to panic.

Actually the disks, and the mux card, are the only things that haven't
been swapped, yet... And today I have seen the ultimate in swaps. Our HP
people brought in a whole 370 to swap with ours. *sigh*

As Bob Niland requested:
> Perhaps, but we'll need more information.

I'll try to have that Monday ( hardware & software specs ).

Just as a point of interest, the system was running ok for 6-8 months
prior to the memory problems...

----------------------
Randy Welch            UUCP     : ...!ncar!scicom!bldr!randy (work)
Vitel International    INTERNET : rwelch@du.edu (read)
Boulder, CO            VOICE    : 303-442-6717
"Unfortunately, life contains an unavoidable element of unpredictability"
  -David Lynch "The Angriest Dog in the World"
glen@hpfcmgw.HP.COM (Glen Robinson) (07/27/90)
The H-P field guys keep insisting that it is a power problem because that
is the most likely cause of parity panics on a 370.

A parity error can ONLY occur on a read from memory (that is the only time
parity is checked, and the check is done by hardware, not software).
Therefore the problem could only have happened during the previous write
to the location or at some time after that write. Problems that occur
during writes to a memory location can quite reliably be found with memory
diagnostics, including the boot time memory diagnostic. Problems occurring
after a cell is written are usually caused by one of two things:

1. A cell changes due to a soft error caused, for instance, by an alpha
   particle hit.

2. A cell changes due to a voltage transient on logic ground. In this
   scenario the specific cells affected by such a transient are those
   that are 'weakest' at the time. While many cells might be changed, you
   will only know about the first one that a read attempt is made on
   (i.e., the one that generates the parity panic).

Note that in the two cases above the location of failure will probably be
random.

The design is extremely robust in handling spikes or large transients
across AC neutral and AC phase; however, in order to pass VDE, class B,
et al., the designers separated AC neutral, Safety Ground and Logic
Ground. In normal user AC power situations this is no problem. However,
when the user has problems such as floating grounds, or peripherals on one
phase and the computer on another phase (or whatever), a measurement of
the rms voltage between AC neutral and Safety Ground will indicate the
problem. The Model 370 will NOT tolerate a voltage greater than 1 volt rms
between these two lines. Often a power line monitor is required in order
to catch transients across these two lines, which sometimes occur as the
result of an external event (elevator motor, or ..).

To put all of this into perspective: there are a lot of Model 370's out
there (in the tens of thousands). You can count the sites that have
experienced recurring parity problems on one hand. In every previous case
we have found that curing input power problems solved the parity problems.

The normal comments about this not being an official position of H-P etc.
apply.

Glen Robinson
rjn@hpfcso.HP.COM (Bob Niland) (07/28/90)
re:
> The H-P field guys keep insisting that it is a power problem because
> that is the most likely cause of parity panics on a 370.

It could also be a [rare] defective backplane (bent connector, cold/loose
solder joint, etc.), in which case the most recent action of swapping out
the whole box will probably correct it.
rwelch@diana.cair.du.edu (RANDY S WELCH) (07/30/90)
In article <7370182@hpfcso.HP.COM> rjn@hpfcso.HP.COM (Bob Niland) writes:
> It could also be a [rare] defective backplane (bent connector, cold/loose
> solder joint, etc.), in which case the most recent action of swapping out
> the whole box will probably correct it.

Well, so far the machine seems to be working ok. It's been a week and the
only thing that killed it was a power outage. Hope it works. I'd like to
get on this box someday :-) ( if you only knew my office... )

Thanks to everyone who has given answers on this problem!

--
Randy Welch   Mail to : ...!ncar!scicom!bldr!randy or rwelch@du.edu
Boulder, CO   VOICE   : 303-442-6717
"Unfortunately, life contains an unavoidable element of unpredictability"
  -David Lynch "The Angriest Dog in the World"