jsloan@wright.UUCP (John Sloan) (03/01/88)
We're hoping someone has seen this before, and knows more about it than anyone we've been able to talk to at DEC so far. (We're still talking, so by the time you read this the problem may have been resolved. If anyone else is having this problem, and we get some useful information, we'll pass it along.)

Crash:  panic: bad rmfree
System: VAX 785, Ultrix-32 2.0-1

Background: We're running 2.0-1 on a 785 with two RA81s. We had the second RA81 installed a couple of weeks ago, and installed Ultrix immediately after that, after which we began to experience the panic described above. We also run Ultrix on a 750 (and have been since the 1.0 days), and have never seen this problem before.

During the install on the _785_, we changed the partitioning on BOTH RA81s. Both RA81s are partitioned exactly the same. Among other things we increased the swap area on both drives. Our _750_ has an RA80 (ra0) and an RA81 (ra1); we increased the swap space on the RA81 on the 750 in the same fashion when 2.0-1 first came out, and have been running it without incident since then.

The new partitioning and file system structure on the 785 looks like this (the NEW RA81 is ra1):

    /dev/rra1a
    Current partition table:
    partition      bottom       top      size    overlap
        a               0     15883     15884    c
        b           15884     82763     66880    c
        c               0    891071    891072    a,b,d,e,f,g,h
        d          131404    254396    122993    c,g,h
        e          254397    377389    122993    c,h
        f          377390    891071    513682    c,h
        g           82764    246923    164160    c,d
        h          246924    891071    644148    c,d,e,f

    Filesystem     total    kbytes    kbytes   percent
       node       kbytes      used      free     used    Mounted on
    /dev/ra0a       7423      5692       989      85%    /
    /dev/ra0g      77983     52209     17976      74%    /usr
    /dev/ra0h     306551     28387    247509      10%    /usr/local
    /dev/ra1a       7423         9      6672       0%    /tmp
    /dev/ra1g      77983     13927     56258      20%    /usr/spool
    /dev/ra1h     306551        11    275885       0%    /thor/users

We believe that our /etc/fstab, /etc/rc.local, and our kernel are set up correctly for swapping on the second disk (as it is on our 750). When we first started having this panic, the new RA81 was ra0.
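For anyone who wants to sanity-check a similar setup: as we understand it, swapping on a second drive takes both a kernel configuration entry and a swapon at boot time. Roughly like this (a sketch from memory in generic BSD-style config syntax, not a copy of our actual files; the b partition is the conventional swap partition):

    # In the kernel configuration file, declare both drives as swap devices:
    config  vmunix  root on ra0  swap on ra0 and ra1

    # In /etc/rc (or rc.local), after the file system checks,
    # enable all configured swap devices:
    swapon -a

If the kernel was never rebuilt with the second swap device configured, swapon should complain rather than panic, so we don't think that alone explains what we're seeing.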
To see if the problem followed the drive, we backed up, swapped drive plugs, and restored, and the problem appears to follow the new drive (which is now ra1). In fact, it panic'ed during the restore to the new RA81.

Of the three crash dumps we've examined (out of perhaps a dozen), this is typical of what dbx -k tells us:

    csh> dbx -k /usr/adm/crash/vmunix.5 /usr/adm/crash/vmcore.5
    dbx version 2.0 of 4/2/87 22:10.
    Type 'help' for help.
    reading symbolic information ...
    [using memory image in /usr/adm/crash/vmcore.5]
    sbr 80061470 slr 7e00
    p0br 804ffa00 p0lr 160 p1br 7fd00200 p1lr 1fffdc
    (dbx) where
    sleep(0x80110aa0, 0x14) at 0x80025e04
    biowait(0x80110aa0) at 0x80006436
    bwrite(0x80110aa0) at 0x80005d3b
    dirremove(0x7fffed94) at 0x80041077
    ufs_unlink(0x800bd5d8, 0x7fffed94) at 0x8004179e
    unlink() at 0x8000a5b5
    syscall() at 0x8004ea5f
    Xsyscall(0x7fffe15c) at 0x80001d7b
    (dbx) q
    csh>

This is consistent with tracing by hand the pc's in the kernel stack panic dump that is printed on the console, against a printout of the namelist from /vmunix. In all three cases the system was doing a bread or bwrite on the new RA81.

We think the problem might be swap space related on the new disk: rmfree is used to deallocate entries in a resource map; resource maps seem to be used mostly in virtual memory management; and the system dies in a sleep while it is presumably trying to switch processes to wait for the I/O to complete. We don't have Ultrix source, so this is all conjecture.

At all times the system had only one or no users (but things may have been running in the background). We can't get it to fail consistently. We ran the DEC standalone disk formatter and scrubber many times. The first time they both reported 68 bad blocks. The second time, 66. The third, fourth, and fifth times, 67 bad blocks. The blocks reported bad are always a subset of the originally reported 68. We ran the DEC s/a disk exerciser for 14 hours without incident.
We ran the Ultrix disk exerciser and it panic'ed within a few minutes, but it does not do so consistently. DDC and Field Service so far haven't been able to tell us anything else, although DDC seems unusually knowledgeable about Ultrix lately (which is nice to see).

We think we may have done something really stupid with the repartitioning or changing the swap space, but we've RTFM'ed until we're blue in the face, talked to the DEC diagnostic center, swapped disks, and done similar (but not identical) things with our 750 without problems. We are reluctant to repartition the disks because [1] since we can't get it to fail consistently, we won't know if it's fixed or not (although that is our next fallback position), and [2] we need the extra virtual memory.

Has anyone else seen this? Any ideas? Have we missed something obvious? Thanks. A lot.

John Sloan (CSNET: jsloan@SPOTS.Wright.edu, USENET: ...!cbosgd!wright!jsloan)
gilgut@cg-atla.UUCP (Mr. Uucp X5277) (03/03/88)
In article <23255@felix.UUCP> jsloan@wright.UUCP (John Sloan) writes:
>We're hoping someone has seen this before, and knows more about it than
>
>Crash: panic: bad rmfree

Yup. We've seen it. Know what it is, too. It's not your disks, it's your tape driver software (if in fact what you are getting is what we had). Do you see tape drive errors in the errlog? You mentioned "while doing restore", which is what keyed it for me. We installed a new patch from Ultrix Support, tmscp.o, and the next time the tape drive was accessed, wham! The same panic. I suggest, if you can, backing out the copy of tmscp.o etc. and replacing it with the previous version. Good luck.....

Steve
--
Steve Gilgut, Compugraphic Corp.  Wilmington, Mass. 01887  (617)658-5600 X5277
UNISIG Suite/Campground Coordinator; DECUS U.S. Chapter
"Of all the things I've lost, I miss my mind the most."
.!{decvax,ima,ism780c,ulowell,laidbak,denning,wizvax,cgeuro,cg-f}!cg-atla!gilgut