[comp.unix.ultrix] 5000/200 HANGS intermittently. No error messages. ugh. help.

van@triton.unm.edu (Van Rauch) (10/17/90)

Well, I just caught up on about 130 articles in unix.ultrix
and didn't see any mention of the problem we have, so I
will keep this short and ask for suggestions, no
matter how ambiguous.

We have a DEC System 5000/200 running Ultrix 4.0. We tried
3.1d on it first, but could not build a kernel
under 3.1d. So what the heck, let's go with 4.0.

Besides 4.0's inability to function as a YP Server we have a 
very strange problem with our 5000.  About every 3 to 7 days, 
the system will HANG.  No messages on the console, nothing in 
syserr.hostname.# (uerf), no core file in /usr/adm/crash (savecore is 
turned on). - nuthin!

Each reboot is complaint-free, except of course for
all the open files that fsck has to deal with.

DEC has since replaced the system board twice. Originally
we were running 4.0 of the firmware; now we have
5.3. (When is DEC going to start shipping updated 
Hardware Operator Guides with new revs of firmware??)
Console commands are very different.

I do not believe the problem lies in the Ethernet interface,
because when the system hangs the console hangs along with
it. None of our other systems (all running 3.1x) are 
exhibiting any sympathetic behavior.

Suggestions (obvious/ambiguous) on troubleshooting procedures would be 
gratefully accepted. Thanks.

Van Rauch			van@triton.unm.edu
Application/Systems		
University of NM, CIRT

carter@ferrari.mst6.lanl.gov (Dave Carter) (10/17/90)

In article <1990Oct16.211229.18767@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>
>DEC has since replaced the system board twice, originally
>we were running 4.0 of the firmware, now we have
>5.3. (When is DEC going to start shipping updated 
>Hardware Operator Guides with new rev.s of firmware??).
>Console commands are very different.

oh please!  this is dec we're dealing with - you actually expect to
receive documentation that resembles the software they ship, or even
software that works on the machine it is shipped with?  :-)

i had to beg for a hardware operator guide.  (i also got an old
hardware manual, but new firmware.)  my sales rep told me he'd "get
back to me with a quote" for this documentation.  by the time i got
the manual, i had already made some phone calls, and managed to at
least boot the thing up.

i don't mean to put down dec (we use dec stuff exclusively) but it
does seem strange to ship new workstations (the 5000s) along with an
operating system (3.x) which doesn't even support the 5000!  :-)

						- dave

jack@cscdec.cs.com (Jack Hudler) (10/18/90)

In article <1258@mustang.mst6.lanl.gov> carter@ferrari.mst6.lanl.gov (Dave Carter) writes:
>
>i had to beg for a hardware operator guide.  (i also got an old
>hardware manual, but new firmware.)  my sales rep told me he'd "get
>back to me with a quote" for this documentation.  by the time i got
>the manual, i had already made some phone calls, and managed to at
>least boot the thing up.

I had the same experience, but I still don't have any documentation
on what goodies I can attach to this thing.

-- 
Jack           Computer Support Corporation             Dallas,Texas
Hudler         Internet: jack@cscdec.cs.com

mellon@fenris.pa.dec.com (Ted Lemon) (10/18/90)

Just to reiterate Ed Santiago's statement of about a month ago, if you
look on gatekeeper:pub/DEC, you will find the following useful files:

	DS5000_boot_commands.ps
	DS5000_console_command_comparison.ps
	DS5000_newboot_stuff.tar

There's some other stuff there, too, but I don't have Ed's original
message, so I don't know if it's useful.

			       _MelloN_

van@triton.unm.edu (Van Rauch) (10/19/90)

In article <1990Oct16.211229.18767@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>
>very strange problem with our 5000.  About every 3 to 7 days, 
>the system will HANG.  No messages on the console, nothing in 
>syserr.hostname.# (uerf), no core file in /usr/adm/crash (savecore is 
>turned on). - nuthin!
>

I should have rtfm'd before I posted. There exists a doc
called:

"Starting the Crash Dump Routine Manually on RISC Processors"
in volume 3 of System and Network Management.

As far as I can tell there is a bug in the 4.0 kernel 
that innocuous user and system processes are tripping over.

After spending a few hours with crash vmcore.# vmunix.#,
a trace of the runnable processes at the time of
each crash shows different processes that 
eventually end up calling panic and boot. For example:

> proc -r
SLT S   PID  PPID  PGRP  UID  PY CPU   SIGS    EVENT FLAGS
...
 80 r  4324  3999  4324 7341 113 255      0              in trace pagi
...

> trace 80
Stack trace -- last called first
   0 boot (paniced = 0, arghowto = 0) [../../machine/mips/machdep.c: ,545 0x80109ea8]
   1 panic (s = 80159828) [../../sys/subr_prf.c: ,1159 0x800a3c18]
   2 kn02trap_error (ep = ffffdcf8, code = 80112fcc, sr = 0008, signo = ffffdcd4
...
> ps 80
SLOT   PID   UID   COMMAND
  80  4324  7341    (sml)
>

where "sml" is a program made available to students for a cs class.

The $60,000 question is, how does one get the text string for 
the argument to panic, e.g. panic (s = 80159828)? Or more 
plainly, where do I go from here?

The consensus here is that without adb, one can't get it.
Does anyone know differently? 

Each time our 5000 has hung, a different process leads to the 
panic and boot, i.e. there is no consistency at the csh level in 
what command is tripping the ?kernel? bug. Without more 
help from /bin/crash I'm at a loss for how to find
the instruction that does the damage.

---

And now for something completely different...

cmp behaves differently under 4.0

Given two files, foo1 and foo2; foo1 is NONempty and foo2 is empty.

And the script, "cmp.csh":

#! /bin/csh
set x = `cmp foo1 foo2`
echo $x
echo $x[1]

---
under 3.x:
----
fornax.unm.edu:van -> cmp.csh
cmp: EOF on foo2
cmp:

---
under 4.0
----
triton.unm.edu:van -> cmp.csh
cmp: EOF on foo2
Subscript out of range.

This happens because cmp under 4.0 was changed to 
write EOF diagnostics to stderr instead of stdout, as
3.x does. The backquotes capture only stdout, so under
4.0 $x comes back empty and $x[1] is out of range.
Yes, I'm splitting hairs here, but 
when your favorite prof comes to you pulling his/her
hair out because a homegrown script breaks on the "new" 
system, it makes you appreciate consistency ;-)
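
One way to sidestep this kind of breakage is to test cmp's exit
status instead of parsing its message. A minimal sketch follows, in
Bourne shell rather than csh. It assumes that "cmp -s" and the usual
exit codes (0 = same, 1 = different, anything else = trouble) behave
the same on both releases, which I have not checked on Ultrix.
The name compare_files is made up for illustration.

```sh
#!/bin/sh
# compare_files FILE1 FILE2
# Prints "same", "different", or "error" based only on cmp's exit
# status, so it does not care whether diagnostics go to stdout or stderr.
# ASSUMPTION: cmp supports -s (silent) and returns 0/1/>1 as in standard cmp.
compare_files() {
    cmp -s "$1" "$2" 2>/dev/null
    case $? in
        0) echo same ;;        # files are identical
        1) echo different ;;   # includes EOF reached on the shorter file
        *) echo error ;;       # e.g. a file is missing or unreadable
    esac
}
```

With foo1 non-empty and foo2 empty, "compare_files foo1 foo2" should
report "different" on either release. In csh the same status is
available in $status right after the cmp command.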
---
Van Rauch			van@triton.unm.edu
Application/Systems
University of NM, CIRT

pavlov@canisius.UUCP (Greg Pavlov) (10/20/90)

In article <1258@mustang.mst6.lanl.gov>, carter@ferrari.mst6.lanl.gov (Dave Carter) writes:
> i don't mean to put down dec (we use dec stuff exclusively) but it
> does seem strange to ship new workstations (the 5000s) along with an
> operating system (3.x) which doesn't even support the 5000!  :-)
> 
  Gee, I guess those weren't 5000s we were running on 2 months ago,
  after all..........



    greg pavlov, fstrf, amherst,  ny

    pavlov@stewart.fstrf.org

gringort@wsl.dec.com (Joel Gringorten) (10/23/90)

In article <1258@mustang.mst6.lanl.gov>, carter@ferrari.mst6.lanl.gov (Dave Carter) writes:
|> i don't mean to put down dec (we use dec stuff exclusively) but it
|> does seem strange to ship new workstations (the 5000s) along with an
|> operating system (3.x) which doesn't even support the 5000!  :-)

Sorry but this just isn't true.

The first version of Ultrix to support the DS5000/200 was 3.1D.

-joel

aem@aber-cs.UUCP (Alec D.E. Muffett) (10/24/90)

In article <1990Oct18.213749.29975@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>In article <1990Oct16.211229.18767@ariel.unm.edu> van@triton.unm.edu (Van Rauch) writes:
>>
>>very strange problem with our 5000.  About every 3 to 7 days, 
>The $60,000  question is, how does one get the text string for 
>the  argument to  PANIC eg.  panic (s = 80159828)? Or more 
>plainly, where do I go from here?
>
>The consensus here is that without adb, one can't get it.
>Does anyone know differently? 
Try 'where' under  dbx -k vmunix.# vmcore.#
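
If no debugger is handy, the address can in principle be chased by
hand. On MIPS, kernel addresses in kseg0 (0x80000000 to 0x9fffffff)
map directly onto physical memory starting at 0, so s = 80159828
is physical address 0x159828. The sketch below turns that into a
byte offset and pulls out the NUL-terminated string found there. It
ASSUMES the dump file is a flat image of physical memory beginning at
address 0, which may well not hold for Ultrix 4.0 dumps. The function
names are hypothetical.

```sh
#!/bin/sh
# kseg0_to_phys HEXADDR
# Convert a MIPS kseg0 virtual address (hex, no 0x prefix) to a
# physical address, printed in decimal.
kseg0_to_phys() {
    echo $(( 0x$1 - 0x80000000 ))
}

# string_at FILE OFFSET
# Print the NUL-terminated string starting at byte OFFSET (decimal) in FILE.
# Reads at most 128 bytes; a string containing a newline is cut at the newline.
# ASSUMPTION: FILE is a flat memory image, so OFFSET == physical address.
string_at() {
    dd if="$1" bs=1 skip="$2" count=128 2>/dev/null | tr '\000' '\n' | head -n 1
}
```

Usage would be along the lines of
string_at vmcore.# "$(kseg0_to_phys 80159828)"
and if what comes back is not readable text, the flat-image assumption
is probably wrong for that dump.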

alec

JANET	aem@uk.ac.aber
INET:	aem@cs.aber.ac.uk or aem@aber.ac.uk
UUCP:	...!mcsun!ukc!aber-cs!aem
ARPA:	aem%uk.ac.aber.cs@nsfnet-relay.ac.uk,aem%uk.ac.aber@nsfnet-relay.ac.uk
BITNET:	<play around with aem%aber@ukacrl, ok?>
SNAIL:	Alec Muffett, Computer Unit, Llandinam Building, 
	UCW Campus, Aberystwyth, UK, SY23 3DB