[net.unix-wizards] 'munhash' panic

george@weitek.UUCP (George White) (05/22/85)

We have recently installed a VAX780 which every once in a while
crashes with a 'munhash' panic message.  I have determined that
this occurs only while someone is using dbx.  I suspect a
hardware problem since the same dbx works on a VAX750 and the
occurrence of the panic is frequent but not always deterministic.
It would be useful to have more info on what dbx does that could
cause this problem so that I can point the DEC repairman in the 
right direction (their first reaction will be 'I can't run the 
diags without VMS').  Any help or insight into the problem would be 
greatly appreciated.

				Thanks

				George White
				Weitek Corp.

				..!{cae780,turtlevax}!weitek!george

wedgingt@udenva.UUCP (Will Edgington/Ejeo) (05/28/85)

In article <> george@weitek.UUCP (George White) writes:
>
>We have recently installed a VAX780 which every once in a while
>crashes with a 'munhash' panic message.  I have determined that
>this occurs only while someone is using dbx.  I suspect a
>hardware problem since the same dbx works on a VAX750 and the
>occurrence of the panic is frequent but not always deterministic.
>It would be useful to have more info on what dbx does that could
>cause this problem so that I can point the DEC repairman in the 
>right direction (their first reaction will be 'I can't run the 
>diags without VMS').  Any help or insight into the problem would be 
>greatly appreciated.
>
>				Thanks
>
>				George White
>				Weitek Corp.
>
>				..!{cae780,turtlevax}!weitek!george

  We here at the Univ. of Denver have recently been getting the same
panic on a VAX 11/750 running BSD 4.2.  It appeared only after we
recompiled the kernel to include the QUOTA routines.  Adb and dbx
both cause the panic but only when they are used to run a program;
not when they are used to get a back-trace.  The problem cannot be
reproduced on our VAX 11/750 running Ultrix with quotas; both machines
use 'ra' disks, though the BSD 4.2 machine has root on an ra80 while
the Ultrix machine has root on an ra81, its only disk (the BSD machine
also has an ra81 with most of the users on it).  We don't get the panic
very often, but that's because we only have three people that use
dbx and adb at all (most of our users either don't write programs at
all or don't need debuggers to find their bugs since they're just
learning programming).
  The really strange thing about ours (at least; my first guess would
be that weitek!george has the same 'glitch') is that there are certain
users that can't crash the system this way.  The two users I have
confirmed this with are both in group 'staff'; /vmunix, /dev/mem,
/dev/kmem, and /dev/kUmem are all group staff; /vmunix is mode 755,
while /dev/mem and /dev/kmem are both 640 and /dev/kUmem is 600.
Noticing this, I changed /dev/kmem and /dev/mem to 644 temporarily
to see if that got rid of the problem; it didn't.  There's something
else funny related to protections, also.  Users klamb and wedgingt
(myself) are the two users in group staff I've mentioned; if I compile
a program as klamb or wedgingt and then run it under dbx as klamb or
wedgingt, no panic.  If I compile a program as klamb or wedgingt and
run it under dbx as user 'support' (not in group staff), no panic.  If
I compile a program as support and run it under dbx as support, I
get the panic.  Since our drivers (vanilla BSD 4.2) for ra disks don't
include dump routines, I can't get a dump of the kernel's memory when
the crash occurs.  However, the system apparently is up long enough
for the program being run under dbx to dump core; it's always at text
address 0x3 (just starting up ?).
  I then wrote a tiny little C program that does more-or-less what dbx
does when running another program; I hacked it up from dbx's source:

#include <sys/wait.h>
#include <stdio.h>
#include <signal.h>
#include <errno.h>
#include <sys/param.h>
#include <machine/reg.h>
#include <sys/stat.h>

#define STOPPED 0177
#define NREG 16

extern int errno;

/*
 * This magic macro enables us to look at the process' registers
 * in its user structure.
 */

#define regloc(reg)     (ctob(UPAGES) + ( sizeof(int) * (reg) ))

#define WMASK           (~(sizeof(Word) - 1))
#define cachehash(addr) ((unsigned) ((addr >> 2) % CSIZE))

#define FIRSTSIG        SIGINT
#define LASTSIG         SIGQUIT
#define ischild(pid)    ((pid) == 0)
#define traceme()       ptrace(0, 0, 0, 0)
#define setrep(n)       (1 << ((n)-1))
#define istraced(p)     (p->sigset&setrep(p->signo))

/*
 * Ptrace options (specified in first argument).
 */

#define UREAD   3       /* read from process's user structure */
#define UWRITE  6       /* write to process's user structure */
#define IREAD   1       /* read from process's instruction space */
#define IWRITE  4       /* write to process's instruction space */
#define DREAD   2       /* read from process's data space */
#define DWRITE  5       /* write to process's data space */
#define CONT    7       /* continue stopped process */
#define SSTEP   9       /* continue for approximately one instruction */
#define PKILL   8       /* terminate the process */

main()
{
    int status, pid;
    char *prog = "/bin/cat";

    pid = vfork();
    if (pid == -1) {
	write(2, "can't vfork\n", 12);
	_exit(errno);
    }
    if (ischild(pid)) {
	if (traceme() != 0) {
	    write(2, "ptrace call in child failed\n", 28);
	    _exit(errno);
	}
	execl(prog, "cat", "/etc/fstab", 0);
	write(2, "can't exec ", 11);
	write(2, prog, strlen(prog));
	write(2, "\n", 1);
	_exit(errno);
    }
    pwait(pid, &status);
    getinfo(pid, status);
    if ((status&0177) != STOPPED) {
	write(2, "program could not begin execution\n", 34);
	_exit(errno);
    }
    pcont(pid);
    exit(0);
}

int rloc[] ={
    R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, AP, FP, SP, PC
};

getinfo(pid, status)
register int pid;
register int status;
{
    register int i;

    fprintf(stderr, "pid = %d, status = %o\n", pid, status);
    fprintf(stderr, "signo = %o\n", status&0177);
    fprintf(stderr, "exitval = %o\n", (status >> 8)&0377);
    if ((status&0177) != STOPPED) {
	fprintf(stderr, "%s\n", "Not stopped");
    } else {
	fprintf(stderr, "mask = %d\n",
	    ptrace(UREAD, pid, regloc(PS), 0));
	for (i = 0; i < NREG; i++) {
	    fprintf(stderr, "reg[%d] = %d\n", i,
		ptrace(UREAD, pid, regloc(rloc[i]), 0));
	}
    }
}

pwait(pid, statusp)
int pid, *statusp;
{
    int pnum, status;

    for (pnum=wait(&status); pnum!=pid && pnum>=0; pnum=wait(&status))
	fprintf(stderr, "%s\n%s%d%s%o\n",
	    "Found another child !!!", "\tpid = ", pnum,
	    "\tstatus = ", status);
    if (pnum < 0) {
	fprintf(stderr, "error in pwait:  wait returned %d\n", pnum);
	exit(errno);
    } else
        *statusp = status;
}

pcont(pid)
int pid;
{
    int status;

    do {
	fprintf(stderr, "About to call CONT ptrace\n");
	if (ptrace(CONT, pid, 1, 0) < 0) {
	    fprintf(stderr, "error %d trying to continue process\n",
		errno);
	    exit(errno);
	}
	pwait(pid, &status);
	getinfo(pid, status);
    } while ((status&0177) == STOPPED);
}

  I then compiled this as wedgingt and as support using 'cc -g file.c';
even running it as support after compiling it as support did NOT give
the panic; it gave essentially the same output as when compiled and run
as wedgingt.  Running it under dbx gave the same results as any other
program run under dbx; i.e., if the program was compiled as
support, it crashed the system; as wedgingt, it didn't.
  Lastly, I recompiled ALL of dbx, thinking that it must have some
subtle dependency on the kernel.  Nothing changed; I tried all of the
above again with identical results.  Seeing weitek!george's request
to the network, I thought I would post this, both to help him track his
problem down (and make sure it's the same one) and to see if there are
any gurus out there that know what is going on.  Sorry about the length
of this, but I figured that in a case like this it's better to post too
much info than not enough.  Thanks !!!
-- 
Will Edgington		 | Phone: (303) 871-2081 (work), 772-5738 (home)
Computing Services Staff | USnail: BA 469, 2020 S. Race, Denver CO 80210
University of Denver	 | Home: 2035 S. Josephine #312, Denver CO 80210
Electronic Address (UUCP only): {hplabs, seismo}!hao!udenva!wedgingt
or {boulder, cires, denelcor, ucbvax!nbires, cisden}!udenva!wedgingt

mccallum@opus.UUCP (Doug McCallum) (05/28/85)

This bug and fix have been reported several times before.  Here is the
fix:


> From: RWS%mit-xx@sri-unix.UUCP
> Newsgroups: net.unix-wizards
> Subject: sundry 4.2 bugs
> Message-ID: <13280@sri-arpa.UUCP>
> Date: Wed, 2-Nov-83 17:15:00 EST
> Article-I.D.: sri-arpa.13280
> Posted: Wed Nov  2 17:15:00 1983
> Date-Received: Fri, 4-Nov-83 09:05:00 EST
> Lines: 64
> Status: RO
> 
> Despite claims to the contrary, the block number sign extension problem still
> exists.  Berkeley put in a fix that should have worked, but a C compiler bug
> apparently keeps it from working.  In /sys/sys/vm_mem.c in memall() the code
>       swapdev : mount[c->c_mdev].m_dev, (daddr_t)(u_long)c->c_blkno
> should be changed to
>       swapdev : mount[c->c_mdev].m_dev, c->c_blkno
> and in /sys/vax/vm_machdep.c in chgprot() the code
>           munhash(mount[c->c_mdev].m_dev, (daddr_t)(u_long)c->c_blkno);
> should be changed to
>           munhash(mount[c->c_mdev].m_dev, c->c_blkno);
> because the C compiler apparently incorrectly folds the (daddr_t) and (u_long)
> together and sign extends anyway.  Simply taking out the (daddr_t)(u_long)
> works, although lint will probably complain about it.
> 
> ---