george@weitek.UUCP (George White) (05/22/85)
We have recently installed a VAX780 which every once in a while
crashes with a 'munhash' panic message. I have determined that
this occurs only while someone is using dbx. I suspect a
hardware problem since the same dbx works on a VAX750 and the
occurrence of the panic is frequent but not always deterministic.
It would be useful to have more info on what dbx does that could
cause this problem so that I can point the DEC repairman in the
right direction (their first reaction will be 'I can't run the
diags without VMS'). Any help or insight into the problem would be
greatly appreciated.

	Thanks

	George White
	Weitek Corp.

	..!{cae780,turtlevax}!weitek!george
wedgingt@udenva.UUCP (Will Edgington/Ejeo) (05/28/85)
In article <> george@weitek.UUCP (George White) writes:
>
>We have recently installed a VAX780 which every once in a while
>crashes with a 'munhash' panic message. I have determined that
>this occurs only while someone is using dbx. I suspect a
>hardware problem since the same dbx works on a VAX750 and the
>occurrence of the panic is frequent but not always deterministic.
>It would be useful to have more info on what dbx does that could
>cause this problem so that I can point the DEC repairman in the
>right direction (their first reaction will be 'I can't run the
>diags without VMS'). Any help or insight into problem would be
>greatly appreciated.
>
>	Thanks
>
>	George White
>	Weitek Corp.
>
>	..!{cae780,turtlevax}!weitek!george

We here at the Univ. of Denver have recently been getting the same
panic on a VAX 11/750 running BSD 4.2. It appeared only after we
recompiled the kernel to include the QUOTA routines. Adb and dbx both
cause the panic, but only when they are used to run a program, not
when they are used to get a back-trace. The problem cannot be
reproduced on our VAX 11/750 running Ultrix with quotas. Both machines
use 'ra' disks, though the BSD 4.2 machine has root on an ra80 while
the Ultrix machine has root on an ra81, its only disk (the BSD machine
also has an ra81 with most of the users on it).

We don't get the panic very often, but that's because we only have
three people that use dbx and adb at all (most of our users either
don't write programs at all or don't need debuggers to find their bugs
since they're just learning programming).

The really strange thing about our case (my first guess would be that
weitek!george has the same 'glitch') is that there are certain users
who can't crash the system this way. The two users I have confirmed
this with are both in group 'staff'. /vmunix, /dev/mem, /dev/kmem, and
/dev/kUmem are all group staff; /vmunix is mode 755, while /dev/mem
and /dev/kmem are both 640 and /dev/kUmem is 600.
Noticing this, I changed /dev/kmem and /dev/mem to 644 temporarily to
see if that got rid of the problem; it didn't.

There's something else funny related to protections, also. Users klamb
and wedgingt (myself) are the two users in group staff I've mentioned.
If I compile a program as klamb or wedgingt and then run it under dbx
as klamb or wedgingt, no panic. If I compile a program as klamb or
wedgingt and run it under dbx as user 'support' (not in group staff),
no panic. If I compile a program as support and run it under dbx while
support, I get the panic.

Since our drivers (vanilla BSD 4.2) for ra disks don't include dump
routines, I can't get a dump of the kernel's memory when the crash
occurs. However, the system apparently is up long enough for the
program being run under dbx to dump core; it's always at text address
0x3 (just starting up ?).

I then wrote a tiny little C program that does more-or-less what dbx
does when running another program; I hacked it up from dbx's source :

#include <sys/wait.h>
#include <stdio.h>
#include <signal.h>
#include <errno.h>
#include <sys/param.h>
#include <machine/reg.h>
#include <sys/stat.h>

#define STOPPED 0177
#define NREG 16

extern int errno;

/*
 * This magic macro enables us to look at the process' registers
 * in its user structure.
 */
#define regloc(reg) (ctob(UPAGES) + ( sizeof(int) * (reg) ))

#define WMASK (~(sizeof(Word) - 1))
#define cachehash(addr) ((unsigned) ((addr >> 2) % CSIZE))

#define FIRSTSIG SIGINT
#define LASTSIG SIGQUIT
#define ischild(pid) ((pid) == 0)
#define traceme() ptrace(0, 0, 0, 0)
#define setrep(n) (1 << ((n)-1))
#define istraced(p) (p->sigset&setrep(p->signo))

/*
 * Ptrace options (specified in first argument).
 */
#define UREAD 3     /* read from process's user structure */
#define UWRITE 6    /* write to process's user structure */
#define IREAD 1     /* read from process's instruction space */
#define IWRITE 4    /* write to process's instruction space */
#define DREAD 2     /* read from process's data space */
#define DWRITE 5    /* write to process's data space */
#define CONT 7      /* continue stopped process */
#define SSTEP 9     /* continue for approximately one instruction */
#define PKILL 8     /* terminate the process */

main()
{
    int status, pid;
    char *prog = "/bin/cat";

    pid = vfork();
    if (pid == -1) {
        write(2, "can't vfork", 11);
        _exit(errno);
    }
    if (ischild(pid)) {
        /* child: ask to be traced, then exec the target program */
        if (traceme() != 0) {
            write(2, "ptrace call in child failed\n", 28);
            _exit(errno);
        }
        execl(prog, "cat", "/etc/fstab", 0);
        write(2, "can't exec ", 11);
        write(2, prog, strlen(prog));
        write(2, "\n", 1);
        _exit(errno);
    }
    /* parent: wait for the traced child to stop, then dump its state */
    pwait(pid, &status);
    getinfo(pid, status);
    if ((status&0177) != STOPPED) {
        write(2, "program could not begin execution", 33);
        _exit(errno);
    }
    pcont(pid);
    exit(0);
}

int rloc[] ={
    R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, AP, FP, SP, PC
};

/* Print the wait status and, if the child is stopped, its registers. */
getinfo(pid, status)
register int pid;
register int status;
{
    register int i;

    fprintf(stderr, "pid = %d, status = %o\n", pid, status);
    fprintf(stderr, "signo = %o\n", status&0177);
    fprintf(stderr, "exitval = %o\n", (status >> 8)&0377);
    if ((status&0177) != STOPPED) {
        fprintf(stderr, "%s\n", "Not stopped");
    } else {
        fprintf(stderr, "mask = %d\n", ptrace(UREAD, pid, regloc(PS), 0));
        for (i = 0; i < NREG; i++) {
            fprintf(stderr, "reg[%d] = %d\n", i,
                ptrace(UREAD, pid, regloc(rloc[i]), 0));
        }
    }
}

/* Wait for the given child, reporting any other children that show up. */
pwait(pid, statusp)
int pid, *statusp;
{
    int pnum, status;

    for (pnum=wait(&status); pnum!=pid && pnum>=0; pnum=wait(&status))
        fprintf(stderr, "%s\n%s%d%s%o\n", "Found another child !!!",
            "\tpid = ", pnum, "\tstatus = ", status);
    if (pnum < 0) {
        fprintf(stderr, "error in pwait: wait returned %d\n", pnum);
        exit(errno);
    } else
        *statusp = status;
}

/* Keep continuing the child for as long as it stops again. */
pcont(pid)
int pid;
{
    int status;

    do {
        fprintf(stderr, "About to call CONT ptrace\n");
        if (ptrace(CONT, pid, 1, 0) < 0) {
            fprintf(stderr, "error %d trying to continue process\n", errno);
            exit(errno);
        }
        pwait(pid, &status);
        getinfo(pid, status);
    } while ((status&0177) == STOPPED);
}

I then compiled this as wedgingt and as support using 'cc -g file.c'.
Even running it as support after compiling it as support did NOT give
the panic; the output was essentially identical to that from compiling
and running it as wedgingt. Running it under dbx gave the same results
as any other program being run under dbx; i.e., if the program was
compiled as support, it crashed the system; as wedgingt, it didn't.

Lastly, I recompiled ALL of dbx, thinking that it must have some
subtle dependency on the kernel. Nothing changed; I tried all of the
above again with identical results.

Seeing weitek!george's request to the network, I thought I would post
this, both to help him track his problem down (and make sure it's the
same one) and to see if there are any gurus out there that know what
is going on.

Sorry about the length of this, but I figured that in a case like this
it's better to post too much info than not enough.

Thanks !!!
--
Will Edgington           | Phone: (303) 871-2081 (work), 772-5738 (home)
Computing Services Staff | USnail: BA 469, 2020 S. Race, Denver CO 80210
University of Denver     | Home: 2035 S. Josephine #312, Denver CO 80210
Electronic Address (UUCP only): {hplabs, seismo}!hao!udenva!wedgingt
or {boulder, cires, denelcor, ucbvax!nbires, cisden}!udenva!wedgingt
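For anyone reading the output of the test program above: it picks the
wait status apart with (status&0177) and (status>>8)&0377. A minimal
sketch of that traditional encoding follows, as stand-alone C rather
than anything taken from dbx. The names decode_status, struct decoded
and the enum are made up for illustration. It assumes the classic
layout: low seven bits 0177 means stopped (signal in the high byte),
low seven bits zero means exited (exit code in the high byte), and
anything else is the terminating signal, with 0200 as the core flag.

```c
#include <assert.h>
#include <stdio.h>

#define STOPPED 0177            /* same value the test program uses */

/* Hypothetical names, for illustration only. */
enum kind { EXITED, SIGNALED, STOPPED_BY_SIGNAL };

struct decoded {
    enum kind kind;             /* how the child changed state */
    int value;                  /* exit code or signal number */
    int core;                   /* nonzero if the core flag (0200) is set */
};

/* Split a traditional wait status into its parts. */
struct decoded decode_status(int status)
{
    struct decoded d;
    int low = status & 0177;

    d.core = 0;
    if (low == STOPPED) {
        d.kind = STOPPED_BY_SIGNAL;
        d.value = (status >> 8) & 0377;   /* signal that stopped it */
    } else if (low == 0) {
        d.kind = EXITED;
        d.value = (status >> 8) & 0377;   /* exit code */
    } else {
        d.kind = SIGNALED;
        d.value = low;                    /* signal that killed it */
        d.core = (status & 0200) != 0;
    }
    return d;
}

/* Print one decoded status in a form similar to getinfo()'s output. */
void show_status(int status)
{
    struct decoded d = decode_status(status);

    printf("status = %o, kind = %d, value = %d, core = %d\n",
        status, (int)d.kind, d.value, d.core);
}
```

Calling show_status() from a main() with the status values printed by
the test program is enough to try it. A traced child that has just
exec'd normally shows up as stopped, which is the case the program's
(status&0177) != STOPPED check is looking for.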
mccallum@opus.UUCP (Doug McCallum) (05/28/85)
This bug and fix have been reported several times before. Here is
the fix:

> From: RWS%mit-xx@sri-unix.UUCP
> Newsgroups: net.unix-wizards
> Subject: sundry 4.2 bugs
> Message-ID: <13280@sri-arpa.UUCP>
> Date: Wed, 2-Nov-83 17:15:00 EST
> Article-I.D.: sri-arpa.13280
> Posted: Wed Nov 2 17:15:00 1983
> Date-Received: Fri, 4-Nov-83 09:05:00 EST
> Lines: 64
> Status: RO
>
> Despite claims to the contrary, the block number sign extension problem still
> exists. Berkeley put in a fix that should have worked, but a C compiler bug
> apparently keeps it from working. In /sys/sys/vm_mem.c in memall() the code
>	swapdev : mount[c->c_mdev].m_dev, (daddr_t)(u_long)c->c_blkno
> should be changed to
>	swapdev : mount[c->c_mdev].m_dev, c->c_blkno
> and in /sys/vax/vm_machdep.c in chgprot() the code
>	munhash(mount[c->c_mdev].m_dev, (daddr_t)(u_long)c->c_blkno);
> should be changed to
>	munhash(mount[c->c_mdev].m_dev, c->c_blkno);
> because the C compiler apparently incorrectly folds the (daddr_t) and (u_long)
> together and sign extends anyway. Simply taking out the (daddr_t)(u_long)
> works, although lint will probably complain about it.
>
> ---
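To make the "sign extends anyway" point concrete, here is a small
stand-alone C sketch; it is not kernel code. It assumes, purely for
illustration, a 20-bit block-number field (the thread does not state
the actual width of c_blkno), and the names BLKNO_BITS, zero_extend
and sign_extend are made up. It shows how the same bit pattern turns
into two different values once the top bit of the field is set, so a
sign-extended block number no longer matches the one it came from.

```c
#include <assert.h>
#include <stdio.h>

#define BLKNO_BITS 20   /* assumed field width, for illustration only */

/* Zero-extend: treat the low BLKNO_BITS bits as an unsigned block number. */
long zero_extend(unsigned long field)
{
    return (long)(field & ((1UL << BLKNO_BITS) - 1));
}

/* Sign-extend: treat bit BLKNO_BITS-1 of the field as a sign bit. */
long sign_extend(unsigned long field)
{
    unsigned long mask = (1UL << BLKNO_BITS) - 1;
    unsigned long sign = 1UL << (BLKNO_BITS - 1);

    field &= mask;
    /* Flip the sign bit, then subtract its weight to get the signed value. */
    return (long)(field ^ sign) - (long)sign;
}

/* Print both interpretations of one field value side by side. */
void show_field(unsigned long field)
{
    printf("field 0x%05lx: zero-extended %ld, sign-extended %ld\n",
        field, zero_extend(field), sign_extend(field));
}
```

Calling show_field(0x7FFFFUL) and show_field(0x80000UL) from a main()
shows the two interpretations agreeing for the first value and
disagreeing for the second. Under the width assumed here, only block
numbers in the upper half of the field's range are affected, which
would fit a panic that depends on where on the disk a file happens to
live; that connection is a guess and is not stated in the quoted
article.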