idallen@watcgl.waterloo.edu (10/08/89)
From: "Ian! D. Allen [CGL]" <idallen>

Has anyone seen this?

File system /tmp on our 4.3BSD vax8600's has a block size equal to its
frag size equal to 8192. When I compile and run different programs in
this file system, sometimes they mysteriously die of Illegal instruction
faults. If I compile them ten times in a row, half the time the
resulting a.out won't run. Copying the a.out to another file in /tmp
often fixes the problem. Copying the file to another file system and
running it from there always fixes the problem. Only on /tmp do I
have this problem. If I run a faulting a.out under adb, it will fault
and when I examine instructions near where it faults I see zeroes!

An edited example:

Script started on Fri Oct 6 22:43:14 1989
% rm a.out
% cc ian.c
% ./a.out </dev/null
% cc ian.c
% ./a.out < /dev/null
8708 Bus error ./a.out < /dev/null (core dumped)
% cc ian.c
% ./a.out < /dev/null
% cc ian.c
% ./a.out < /dev/null
8740 Illegal instruction ./a.out < /dev/null (core dumped)
% cc ian.c
% ./a.out < /dev/null
% cc -g ian.c
% ./a.out < /dev/null
8756 Illegal instruction ./a.out < /dev/null (core dumped)
% adb a.out
$c
_fstat(0,7fffd9a8) from filbuf.o+5a
filbuf.o(2de0) from fgets.o+22
fgets.o(7fffde40,400,2de0) from ian.o+22
ian.o(1,7fffe270,7fffe278) from crt0.o+3d
crt0.o()
data address not found
<pc?i
_fstat+2:               chmk    $3e
_fstat+4:               blssu   fstat.o
:r </dev/null
a.out: running
illegal instruction (priviliged instruction fault)
stopped at      _fstat+2:       halt
<pc?i
_fstat+2:               halt
_fstat+3:               halt
_fstat+4:               halt
_fstat+5:               halt
_fstat+6:               halt
_fstat+7:               halt
getdtablesize.o:
getdtablesize.o:        halt
getdtablesize.o+1:      halt
getdtablesize.o+2:      halt
getdtablesize.o+3:      halt
getdtablesize.o+4:      halt
getdtablesize.o+5:      halt
getdtablesize.o+6:      halt
getdtablesize.o+7:      halt
_getdtablesize:
_getdtablesize:         0
_getdtablesize+2:       halt
_getdtablesize+3:       halt
_getdtablesize+4:       halt
% exit
script done on Fri Oct 6 22:45:29 1989
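The zeroes reported above were seen in the running image under adb. A separate question is whether the file, as returned by ordinary reads, also contains zeroed blocks. The following is a minimal sketch of such a check, not anything from the original post: the function name count_zero_blocks is hypothetical, and the 8192-byte block size is simply the figure quoted for /tmp. Since copying the a.out reportedly cures the fault, a scan like this would presumably come back clean on an affected file, which would point at the paging path rather than at the file contents. A healthy executable may also contain zero-filled blocks of its own, so a nonzero count is a hint, not proof.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLKSIZE 8192	/* block size quoted in the post; an assumption elsewhere */

/*
 * Count the BLKSIZE-byte blocks of `path` that ordinary buffered reads
 * return as entirely zero.  A short final block is checked over the
 * bytes actually read.  Returns -1 if the file cannot be opened.
 */
long count_zero_blocks(const char *path)
{
	static const unsigned char zeros[BLKSIZE];	/* all-zero reference block */
	unsigned char buf[BLKSIZE];
	long nzero = 0;
	size_t n;
	FILE *fp;

	assert(path != NULL);
	fp = fopen(path, "rb");
	if (fp == NULL)
		return -1;
	while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
		if (memcmp(buf, zeros, n) == 0)
			nzero++;
	fclose(fp);
	return nzero;
}
```

Calling count_zero_blocks("a.out") on a good build and on a faulting build, and comparing the two counts, is the intended use.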
chris@mimsy.UUCP (Chris Torek) (10/08/89)
In article <11827@watcgl.waterloo.edu> idallen@watcgl.waterloo.edu writes:
>File system /tmp on our 4.3BSD vax8600's has a block size equal to its
>frag size equal to 8192. ... If I compile ... ten times in a row, half
>the time the resulting a.out won't run. Copying the a.out to another
>file in /tmp often fixes the problem. Copying the file to another file
>system and running it from there always fixes the problem. ... If I run
>a faulting a.out under adb, it will fault and when I examine instructions
>near where it faults I see zeroes!

Sounds like munhash() is either not being called properly, or not
doing its job. There was a small change to realloccg() between 4.3BSD
and 4.3BSD-tahoe, along the following lines:

[old]
    count = roundup(osize, CLBYTES / DEV_BSIZE);
    for (i = 0; i < count; i += CLBYTES / DEV_BSIZE)
        ...
            munhash(..., bn + i);
[new]
    count = roundup(osize, CLBYTES);
    for (i = 0; i < count; i++)
        ...
            munhash(..., bn + i * CLBYTES / DEV_BSIZE);

As far as I can tell, this change has no actual effect (on both Vax
and Tahoe). Also, with fsize==bsize, realloccg() should not be called
at all since there are no fragments.

The other likely possibility is the buffer size-changing code, for which
a fix was posted from Berkeley. Here is a version of that fix.

*** /tmp/,RCSt1003260 Sun Oct 8 11:19:12 1989
--- ufs_bio.c Tue Nov 8 00:19:24 1988
***************
*** 4,8 ****
   * specifies the terms and conditions for redistribution.
   *
!  *  @(#)ufs_bio.c 7.1 (Berkeley) 6/5/86
   */
  
--- 4,8 ----
   * specifies the terms and conditions for redistribution.
   *
!  *  @(#)ufs_bio.c 7.3 (Berkeley) 11/12/87
   */
  
***************
*** 34,38 ****
          panic("bread: size 0");
      bp = getblk(dev, blkno, size);
!     if (bp->b_flags&B_DONE) {
          trace(TR_BREADHIT, pack(dev, size), blkno);
          return (bp);
--- 34,38 ----
          panic("bread: size 0");
      bp = getblk(dev, blkno, size);
!     if (bp->b_flags&(B_DONE|B_DELWRI)) {
          trace(TR_BREADHIT, pack(dev, size), blkno);
          return (bp);
***************
*** 68,72 ****
      if (!incore(dev, blkno)) {
          bp = getblk(dev, blkno, size);
!         if ((bp->b_flags&B_DONE) == 0) {
              bp->b_flags |= B_READ;
              if (bp->b_bcount > bp->b_bufsize)
--- 68,72 ----
      if (!incore(dev, blkno)) {
          bp = getblk(dev, blkno, size);
!         if ((bp->b_flags&(B_DONE|B_DELWRI)) == 0) {
              bp->b_flags |= B_READ;
              if (bp->b_bcount > bp->b_bufsize)
***************
*** 85,89 ****
      if (rablkno && !incore(dev, rablkno)) {
          rabp = getblk(dev, rablkno, rabsize);
!         if (rabp->b_flags & B_DONE) {
              brelse(rabp);
              trace(TR_BREADHITRA, pack(dev, rabsize), blkno);
--- 85,89 ----
      if (rablkno && !incore(dev, rablkno)) {
          rabp = getblk(dev, rablkno, rabsize);
!         if (rabp->b_flags & (B_DONE|B_DELWRI)) {
              brelse(rabp);
              trace(TR_BREADHITRA, pack(dev, rabsize), blkno);
***************
*** 150,159 ****
      register struct buf *bp;
  {
-     register int flags;
  
      if ((bp->b_flags&B_DELWRI) == 0)
          u.u_ru.ru_oublock++;        /* noone paid yet */
!     flags = bdevsw[major(bp->b_dev)].d_flags;
!     if(flags & B_TAPE)
          bawrite(bp);
      else {
--- 150,157 ----
      register struct buf *bp;
  {
  
      if ((bp->b_flags&B_DELWRI) == 0)
          u.u_ru.ru_oublock++;        /* noone paid yet */
!     if (bdevsw[major(bp->b_dev)].d_flags & B_TAPE)
          bawrite(bp);
      else {
***************
*** 261,264 ****
--- 259,269 ----
   * for the oldest non-busy buffer and reassign it.
   *
+  * If we find the buffer, but it is dirty (marked DELWRI) and
+  * its size is changing, we must write it out first. When the
+  * buffer is shrinking, the write is done by brealloc to avoid
+  * losing the unwritten data. When the buffer is growing, the
+  * write is done by getblk, so that bread will not read stale
+  * disk data over the modified data in the buffer.
+  *
   * We use splx here because this routine may be called
   * on the interrupt stack during a dump, and we don't
***************
*** 306,309 ****
--- 311,323 ----
          splx(s);
          notavail(bp);
+         if (bp->b_bcount != size) {
+             if (bp->b_bcount < size && (bp->b_flags&B_DELWRI)) {
+                 bp->b_flags &= ~B_ASYNC;
+                 bwrite(bp);
+                 goto loop;
+             }
+             if (brealloc(bp, size) == 0)
+                 goto loop;
+         }
          if (bp->b_bcount != size && brealloc(bp, size) == 0)
              goto loop;
***************
*** 365,369 ****
  
      /*
!      * First need to make sure that all overlaping previous I/O
       * is dispatched with.
       */
--- 379,383 ----
  
      /*
!      * First need to make sure that all overlapping previous I/O
       * is dispatched with.
       */
***************
*** 502,505 ****
--- 516,522 ----
  /*
   * Insure that no part of a specified block is in an incore buffer.
+ #ifdef SECSIZE
+  * "size" is given in device blocks (the units of b_blkno).
+ #endif SECSIZE
   */
  blkflush(dev, blkno, size)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
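The rule spelled out in the patch's new getblk comment (a dirty buffer whose size is changing must be written before it is reused) can be modelled outside the kernel. What follows is a toy sketch under stated assumptions, not kernel code: struct toybuf, enum action and getblk_action are hypothetical names, and the B_DELWRI value here is arbitrary rather than the kernel's. It only mirrors the branch structure of the added lines in the getblk hunk.

```c
#include <assert.h>
#include <stddef.h>

#define B_DELWRI 0x1	/* toy flag value, NOT the kernel's definition */

/* What the patched getblk would do with a buffer it found in the cache. */
enum action {
	KEEP,			/* size unchanged: use the buffer as it is */
	WRITE_THEN_RETRY,	/* dirty and growing: bwrite(bp); goto loop */
	RESIZE			/* otherwise: hand the buffer to brealloc() */
};

/* Toy stand-in for the two struct buf fields the rule looks at. */
struct toybuf {
	long b_bcount;		/* current size of the cached buffer */
	int  b_flags;		/* B_DELWRI set when the buffer is dirty */
};

enum action getblk_action(const struct toybuf *bp, long size)
{
	assert(bp != NULL);
	if (bp->b_bcount == size)
		return KEEP;
	if (bp->b_bcount < size && (bp->b_flags & B_DELWRI))
		return WRITE_THEN_RETRY;
	/* Shrinking, or growing while clean.  Per the patch comment,
	 * brealloc itself writes out a dirty buffer that is shrinking. */
	return RESIZE;
}
```

The one case the patch adds is WRITE_THEN_RETRY: without it, a later bread could read stale disk data over the modified data still sitting in the buffer, as the comment in the diff explains.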
mike@thor.acc.stolaf.edu (Mike Haertel) (10/09/89)
In article <11827@watcgl.waterloo.edu> idallen@watcgl.waterloo.edu writes:
>File system /tmp on our 4.3BSD vax8600's has a block size equal to its
>frag size equal to 8192. When I compile and run different programs in
>this file system, sometimes they mysteriously die of Illegal instruction
>faults. If I compile them ten times in a row, half the time the
>resulting a.out won't run. Copying the a.out to another file in /tmp
>often fixes the problem. Copying the file to another file system and
>running it from there always fixes the problem. Only on /tmp do I
>have this problem. If I run a faulting a.out under adb, it will fault
>and when I examine instructions near where it faults I see zeroes!

I've never seen that happen on a 4.3BSD vax, but something like it
happens on my 3b1. Often when an executable has just been built,
immediately executing it will get a trap. Waiting a few seconds
or running sync seems to cure the problem.

My guess would be that VM paging isn't cooperating with the Unix
block cache, in the case of the 3b1. I *really doubt* Berkeley would
have that problem, but it's something you can look for.
-- 
Mike Haertel <mike@stolaf.edu>
``There's nothing remarkable about it. All one has to do is hit the
right keys at the right time and the instrument plays itself.''
    -- J. S. Bach
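The 3b1 workaround mentioned above (run sync, then execute) can be scripted from C. This is a minimal sketch assuming a POSIX-like system; run_after_sync is a hypothetical helper name, and nothing in the thread says this helps on 4.3BSD. Note also that sync(2) is allowed to return before the writes it schedules have finished, which may be why waiting a few seconds was the other cure.

```c
#define _XOPEN_SOURCE 700	/* ask for sync(), fork(), execv() declarations */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Schedule all dirty buffers for writing, then run `path` with `argv`
 * in a child process.  Returns the raw wait status of the child (to be
 * examined with WIFEXITED and friends), or -1 if fork or waitpid fails.
 */
int run_after_sync(const char *path, char *const argv[])
{
	pid_t pid;
	int status;

	assert(path != NULL && argv != NULL);

	sync();			/* may return before the writes complete */

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execv(path, argv);
		_exit(127);	/* reached only if execv itself failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return status;
}
```

A freshly built program would be started as run_after_sync("./a.out", argv) in place of a bare exec; a WIFSIGNALED status then indicates that the trap still occurred.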
tbray@watsol.waterloo.edu (Tim Bray) (10/09/89)
In article <11827@watcgl.waterloo.edu> idallen@watcgl.waterloo.edu writes:
>Has anyone seen this?
>
>File system /tmp on our 4.3BSD vax8600's has a block size equal to its
>frag size equal to 8192. When I compile and run different programs in
>this file system, sometimes they mysteriously die ...

Yes, I've seen this behaviour. It was caused by a bad block on the disk.
In that case, it was in the middle of a large static executable and
behaved in such a fashion that read(2) from the filesystem sometimes
returned what you wrote, sometimes not. Actually, even if the block was
100% shot (effectively write-only), you could get that effect on a small
volatile filesystem like /tmp as that particular block circulated back &
forth between the free list and your a.out.

When that happened, it surprised me that the disk (RP06 under 4.1bsd (!),
I believe) would let that happen without shrieking. Still surprising.

Tim Bray, New OED Project, U of Waterloo
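A crude user-level probe for the bad-block case described above is to write known data and check that read(2) hands the same data back. The sketch below makes several assumptions: verify_readback is a hypothetical name, the 8192-byte block size is borrowed from the original post, and the fill pattern is arbitrary. Its main limitation is that a read issued straight after the write is likely to be served from the buffer cache rather than the disk, so a clean result does not clear the disk; repeated runs on a small, busy filesystem give the suspect block more chances to come around.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLKSIZE 8192	/* block size from the original post; adjust to taste */

/*
 * Write `nblocks` blocks to `path`, each filled with a nonzero byte
 * derived from its block number, then reopen the file and compare what
 * comes back.  Returns the number of blocks that differ, or -1 on an
 * I/O error.
 */
long verify_readback(const char *path, long nblocks)
{
	unsigned char want[BLKSIZE], got[BLKSIZE];
	long i, bad = 0;
	FILE *fp;

	assert(path != NULL && nblocks >= 0);

	if ((fp = fopen(path, "wb")) == NULL)
		return -1;
	for (i = 0; i < nblocks; i++) {
		memset(want, (int)(i % 251) + 1, sizeof want);
		if (fwrite(want, 1, sizeof want, fp) != sizeof want) {
			fclose(fp);
			return -1;
		}
	}
	if (fclose(fp) != 0)
		return -1;

	if ((fp = fopen(path, "rb")) == NULL)
		return -1;
	for (i = 0; i < nblocks; i++) {
		memset(want, (int)(i % 251) + 1, sizeof want);
		if (fread(got, 1, sizeof got, fp) != sizeof got) {
			fclose(fp);
			return -1;
		}
		if (memcmp(want, got, sizeof want) != 0)
			bad++;
	}
	fclose(fp);
	return bad;
}
```

The file is left in place so that it can be re-read later; the caller removes it when done.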
wlm@archet.UUCP (William L. Moran Jr.) (10/10/89)
I've often seen the type of behavior you describe when using two machines,
one of which NFS mounts a partition from the other (say A mounts B's
/usr/foo). On B I compile something in /usr/foo, then on A I try to run
it. Sometimes this results in odd behavior; for example, I've run things
four times in a row getting results of segv, bus error, trap, and works
fine. Sometimes it just continues to act strangely. On a more stable NFS,
usually it lets you know when this is a problem, but not all NFS
implementations are this good.

Bill
-- 
arpa: moran-william@cs.yale.edu or wlm@ibm.com
uucp: uunet!bywater!acheron!archet!wlm or decvax!yale!moran-william
-------------------------------------------------------------------------------
``There is Jackson standing like a stone wall. Let us determine to die,
and we will conquer. Follow me.'' - General Barnard E. Bee (CSA)
idallen@watcgl.waterloo.edu (Ian! D. Allen [CGL]) (10/16/89)
Indeed, we seem to have found the nondeterministic a.out problem.
It's a long-standing bug in 4.3bsd pagein() code dealing with incomplete
blkflush() behaviour. I expect Chris Torek and friends will post an
official fix eventually.
-- 
-IAN! (Ian! D. Allen) idallen@watcgl.uwaterloo.ca idallen@watcgl.waterloo.edu
 129.97.128.64 Computer Graphics Lab/University of Waterloo/Ontario/Canada