scott@zorch.SF-Bay.ORG (Scott Hazen Mueller) (10/03/89)
Chalk up YA site being bitten by the System 5 lost inode bug. My vendor
doesn't do Unix anymore (that I know of), and it's not a terribly standard
system, so I have little hope of seeing the bug fixed. Since it is of
course hitting my news spool filesystem hardest, I would like to mitigate
the effects by hacking [ir]news to spool the entire batch on a low inodes
condition. The necessary code changes were entirely trivial; however, I
have no idea of whether they are really meaningful. Does anybody know how
the out-of-inodes condition progresses? that is, does a stricken system
decay steadily from the "real" inode count down toward 0 or 1, or do things
look normal until blammo! ifree is (<100|=1) or whatever? Perhaps a periodic
repost of one of the analyses of the problem would be useful.
--
Scott Hazen Mueller| scott@zorch.SF-Bay.ORG (ames|pyramid|vsi1)!zorch!scott
685 Balfour Drive  | (408) 298-6213   |Mail to fusion-request@zorch.SF-Bay.ORG
San Jose, CA 95111 |No room for quote.|for sci.physics.fusion digests via email
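A minimal sketch of the kind of low-inode check described above, assuming df prints System V style lines such as "/news (/dev/dsk/0s4 ): 85064 blocks 27015 i-nodes". The function names free_inodes and low_on_inodes and the default threshold of 1000 are hypothetical, not part of any news software. A threshold test like this can only catch a gradual decline in the free count.

```sh
# Extract the free i-node count from one line of System V style df output,
# e.g. "/news (/dev/dsk/0s4 ): 85064 blocks 27015 i-nodes".
# Assumption: the count is the field immediately before the word "i-nodes".
free_inodes() {
    awk '{ for (i = 2; i <= NF; i++) if ($i == "i-nodes") { print $(i - 1); exit } }'
}

# Succeed (exit status 0) when the given count is below the threshold.
# The default threshold of 1000 is an arbitrary, hypothetical value.
low_on_inodes() {
    count=$1
    threshold=${2:-1000}
    [ -n "$count" ] && [ "$count" -lt "$threshold" ]
}
```

It might be used as: if low_on_inodes "`df /news | free_inodes`" 1000; then (spool the batch instead of unpacking it); fi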
bill@twwells.com (T. William Wells) (10/04/89)
In article <906@zorch.SF-Bay.ORG> scott@zorch.SF-Bay.ORG (Scott Hazen Mueller) writes:
: Chalk up YA site being bitten by the System 5 lost inode bug. My vendor
: doesn't do Unix anymore (that I know of), and it's not a terribly standard
: system, so I have little hope of seeing the bug fixed. Since it is of
: course hitting my news spool filesystem hardest, I would like to mitigate
: the effects by hacking [ir]news to spool the entire batch on a low inodes
: condition. The necessary code changes were entirely trivial; however, I
: have no idea of whether they are really meaningful. Does anybody know how
: the out-of-inodes condition progresses? that is, does a stricken system
: decay steadily from the "real" inode count down toward 0 or 1, or do things
: look normal until blammo! ifree is (<100|=1) or whatever? Perhaps a periodic
: repost of one of the analyses of the problem would be useful.
The inode bug can pop up right out of the blue, from full inodes to
none at all. I've even seen it happen.
What kind of system do you have? I have a bug fix for Microport
SysV/386 3.0e and maybe Interactive 2.0.2. Making the bug harmless is
fairly easy. And I'm willing to lend a hand in swatting this bug.
Drop me a note if you are interested.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
jgd@rsiatl.UUCP (John G. De Armond) (10/05/89)
In article <906@zorch.SF-Bay.ORG> scott@zorch.SF-Bay.ORG (Scott Hazen Mueller) writes:
>Chalk up YA site being bitten by the System 5 lost inode bug. My vendor
>doesn't do Unix anymore (that I know of), and it's not a terribly standard
>system, so I have little hope of seeing the bug fixed. Since it is of
>course hitting my news spool filesystem hardest, I would like to mitigate
>the effects by hacking [ir]news to spool the entire batch on a low inodes
>condition. The necessary code changes were entirely trivial; however, I
>have no idea of whether they are really meaningful. Does anybody know how
>the out-of-inodes condition progresses? that is, does a stricken system
>decay steadily from the "real" inode count down toward 0 or 1, or do things
>look normal until blammo! ifree is (<100|=1) or whatever? Perhaps a periodic
>repost of one of the analyses of the problem would be useful.

This problem is trivially easy to fix regarding news, the most common
perpetrator of the problem. Following are a couple of scripts that
completely solve the problem. A prerequisite is that your news spool
system must reside on a separate, unmountable partition.

These scripts were written by Paul Anderson (stiatl!pda or rsiatl!pda) and
modified as needed by me. Simply install these scripts in a convenient
place, usually your news/bin directory, edit the paths, and put the cron
file in. Then sit back and enjoy an "armchair" news system

John

------------------------- script "Inodes" ------------------------------------
# this shell script monitors free disk and runs the Clean script
# if the free inode count should fall below critical levels

PANICLEVEL=1000
LIBD=/usr/lib/news/bin
SPOOLD=/news/spool
LOCK=/usr/spool/locks/LCK..news
NEWSSLICE=/dev/dsk/0s4
CLEAN=$LIBD/Clean

# check to see if anyone reading news, if so, then exit
# format of /etc/fuser input record
# /dev/dsk/0s4: 9999 9999 9999 9999

set `/etc/fuser $NEWSSLICE 2>&1 `
if [ $# -gt 1 -o -f $LOCK ] ; then exit 0 ; fi

# format of df input record:
# $1    $2            $3  $4    $5     $6    $7
# /news (/dev/dsk/0s4 ):  85064 blocks 27015 i-nodes

df /news |\
while read partition device junk free junk inodes junk
do
    if [ $inodes -lt $PANICLEVEL ]
    then
        echo Cleaning /news: $inodes free ---- `date` >>$LIBD/cleanlog
        $CLEAN >/dev/null
    fi
done
---------------------------------------------------------------------------

-------------------- script "Clean" ---------------------------------------
# Free up the inodes that are lost due to a system V bug

LIBD=/usr/lib/news/bin
SPOOLD=/news/spool
LOCK=/usr/spool/locks/LCK..news
NEWSSLICE=/dev/dsk/0s4

# wait until there are no processes running on the news disk
# and wait for uuxqt to finish. Then set a lock
# file so nothing else will happen. This locks out Uuxqt

while {
    if [ ! -r $NEWSSLICE ] ; then exit 1 ; fi
    set `/etc/fuser $NEWSSLICE 2>&1 `
    [ $# -gt 1 -o -f $LOCK ]
}
do
    sleep 15
done
echo >$LOCK

# then run the fsck
$LIBD/fixnewsfs

rm $LOCK
------------------------------------------------------------------------------

---------------------------- script "Nightly" ---------------------------------
set -xv
# %Z% %M% %I%
# This script runs nightly. It does the news expiration. It also
# runs an fsck on the /news partition to recover lost inodes.

LIBD=/usr/lib/news/bin
SPOOLD=/news/spool
NEWSSLICE=/dev/dsk/0s4
EXPIRE=$LIBD/expire
LOCK=/usr/spool/locks/LCK..news

trap "exit" 0 1 2 3 15

# Before anything begins, wait for uuxqt to finish. Then set a lock
# file so nothing else will happen. This locks out Uuxqt
date
while [ -f $LOCK ]
do
    sleep 15
done
echo >$LOCK
date

trap "rm -f $LOCK; exit" 0 1 2 3 15

# Save the old error log files
$LIBD/Savelog >/dev/null 2>&1

# Expire old news
# never,never,never expire RSI groups

# blow away some stuff daily, since of little interest
$EXPIRE -e 1 -n control

# blow these away quickly since they are of time-based merit
$EXPIRE -e 2 -n \
misc,\
!misc.jobs.offered,\
rec,\
!rec.humor,\
!rec.guns,\
!rec.arts.poems,\
!rec.ham-radio,\
!rec.ham-radio.all,\
soc,\
soc.motss,\
talk,\
!talk.bizarre,\
!talk.politics.guns,\
alt,\
!alt.sources

$EXPIRE -e 7 -n \
alt.sources,\
!soc.motss,\
talk.bizarre,\
misc.jobs.offered,\
rec.arts.poems,\
rec.humor,\
rec.ham-radio,\
rec.ham-radio.all

$EXPIRE -e 4 -n \
!misc.jobs.offered,\
rec.guns,\
talk.politics.guns,\
junk

# computer groups will hang around for a bit longer. xwindows (my baby)
# will hang around for a week.
$EXPIRE -e 2 -n \
sci

$EXPIRE -e 7 -n \
comp,gnu

$EXPIRE -e 3 -n news

# finally, if I forgot any newsgroups, blow them away after 1 week...
$EXPIRE -e 7 -n all,\
!rsi.technical,\
!rsi.std,\
!rsi.std.rfc,\
!rsi.std.pfc

# On the first of the month, run find to clean up the disk of all files
# that were not purged by expire, then rebuild the history files...
if [ `date +%d` -eq 1 ]
then
    $LIBD/Monthly
fi

# When all done, run fsck to recover lost inodes (Sys V bug)
# first wait until there are no processes running on the news disk

#while {
##    if [ ! -r $NEWSSLICE ]
##    then
##        echo *** WARNING: the news partition is not readable!
##        rm -f $LOCK
##        exit 1
##    fi
#    set `/etc/fuser $NEWSSLICE 2>&1 `
#    [ $# -gt 1 ]
#} do
#    sleep 15
#done

# then run the fsck
$LIBD/fixnewsfs

# Now release the disk to Uuxqt
rm -f $LOCK

$LIBD/Useage | $LIBD/recnews general
-------------------------------------------------------------------------------

--------------------------- script "Monthly" ------------------------------------
# Things to be done by news monthly

LIBD=/usr/lib/news/bin
SPOOLD=/news/spool
cd $LIBD
EXPIRE=$LIBD/expire

# Find all junk files in the news spool directory and blow them away.
# When done, rebuild the history files.

find $SPOOLD \( -size 0 -o -mtime +15 \) -type f -print -exec rm -f '{}' \;
$EXPIRE -r -e 99999 -E 99999
------------------------------------------------------------------------------

------------------------------- script "Sendbatch" -----------------------------
: '@(#)sendbatch.sh 1.10 9/23/86'
# pda sendbatch: @(#) Sendbatch 1.2
#
# 3/24/89 pda
# added stuff to check [on sysv] for the number of batches that were
# queued for this site. if the total data is larger than the specified
# amount, then no more will be queued for this site...
# MAXDATA is the variable for this. can be set using -m...
#
# also put in hooks for max disk space utilization...based on
# results from df...
# /usr (/dev/dsk/0s3 ):  289872 blocks 42251 i-nodes
# $1   $2            $3  $4     $5     $6    $7

tmpfile=/tmp/news$$
cflags=
LIM=50000
CMD='/usr/lib/news/bin/batch /news/batch/$rmt $BLIM'
ECHO=
COMP=
C7=
DOIHAVE=
RNEWS=rnews
SUMMER=/usr/lib/news/bin/Sumtosite
MAXDATA=200000
MINFREE=50000
MININODES=10000

for rmt in $*
do
    # Check for enough disk space on spool partition. if not enough,
    # then exit...
    df /usr >$tmpfile # changed to /news after system restore 09/25
    read junk junk junk blocks junk inodes junk <$tmpfile
    rm $tmpfile

    if test \( $blocks -lt $MINFREE \) -o \( $inodes -lt $MININODES \)
    then
        echo $0: Out of Inodes or Free Disk. Not Run
        df /usr
        exit 1
    fi

    case $rmt in
    -[bBC]*) cflags="$cflags $rmt"; continue;;
    -m*) MAXDATA=`expr "$rmt" : '-m\(.*\)'`
        continue;;
    -s*) LIM=`expr "$rmt" : '-s\(.*\)'`
        continue;;
    -c7) COMP='| /usr/lib/news/bin/compress $cflags'
        C7='| /usr/lib/news/bin/encode'
        ECHO='echo "#! c7unbatch"'
        continue;;
    -c) COMP='| /usr/lib/news/bin/compress $cflags'
        ECHO='echo "#! cunbatch"'
        continue;;
    -o*) ECHO=`expr "$rmt" : '-o\(.*\)'`
        RNEWS='cunbatch'
        continue;;
    -i*) DOIHAVE=`expr "$rmt" : '-i\(.*\)'`
        if test -z "$DOIHAVE"
        then
            DOIHAVE=`uuname -l`
        fi
        continue;;
    esac

    if test -n "$COMP"
    then
        BLIM=`expr $LIM \* 2`
    else
        BLIM=$LIM
    fi

    : make sure we have processed all switches before going on...
    if test $? -eq 0
    then
        if test `$SUMMER $rmt` -gt $MAXDATA
        then
            echo Too much data queued for $rmt, not sending a batch...
            exit 1
        fi
    fi

    : make sure $? is zero
    while test $? -eq 0 -a \( -s /news/batch/$rmt -o -s /news/batch/$rmt.work -o \
        \( -n "$DOIHAVE" -a -s /news/batch/$rmt.ihave \) \)
    do
        if test -n "$DOIHAVE" -a -s /news/batch/$rmt.ihave
        then
            mv /news/batch/$rmt.ihave /news/batch/$rmt.$$
            /usr/lib/news/bin/inews -t "cmsg ihave $DOIHAVE" -n to.$rmt.ctl < \
                /news/batch/$rmt.$$
            rm /news/batch/$rmt.$$
        else
            (eval $ECHO; eval $CMD $COMP $C7) |
            if test -s /news/batch/$rmt.cmd
            then
                /news/batch/$rmt.cmd
            else
                uux - -r -z $rmt!$RNEWS
            fi
        fi
    done
done
-------------------------------------------------------------------------------

---------------------------- crontab for News ---------------------------------
# News' crontab
#
# run the nightly expiration of news
#
30 6 * * * /bin/nice -15 /usr/lib/news/bin/Nightly
#
# Make sure there are enough inodes every 10 minutes
#
7,17,27,37,47,57 2-23 * * * /bin/nice /usr/lib/news/bin/Inodes
#
#
# Feed news to other sites
20 * * * * /bin/nice /usr/lib/news/bin/Sendbatch -c stiatl
# ( you need an entry here for each site you talk to )
#
---------------------------------- end ----------------------------------------
--
John De Armond, WD4OQC                    | Manual? ... What manual ?!?
Radiation Systems, Inc. Atlanta, GA       | This is Unix, My son, You
gatech!stiatl!rsiatl!jgd **I am the NRA** | just GOTTA Know!!!
bill@twwells.com (T. William Wells) (10/08/89)
In article <231@rsiatl.UUCP> jgd@rsiatl.UUCP (John G. De Armond) writes:
: This problem is trivially easy to fix regarding news, the most common
: perpetrator of the problem. Following are a couple of scripts that
: completely solve the problem.
Apologies, but your scripts do little or nothing to solve the problem.
First: the inode crash can occur regardless of the number of available
inodes on your system.
Second: the crash can occur almost instantly after you have done a
fsck, so doing it frequently does not eliminate the problem.
Unless you have the bug fixed, you are always at risk of the inode
crash.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
jgd@rsiatl.UUCP (John G. De Armond) (10/08/89)
In article <1989Oct7.213314.17921@twwells.com> bill@twwells.com (T. William Wells) writes:
>Apologies, but your scripts do little or nothing to solve the problem.
>
>First: the inode crash can occur regardless of the number of available
>inodes on your system.
>
>Second: the crash can occur almost instantly after you have done a
>fsck, so doing it frequently does not eliminate the problem.
>

Please cite an example of how a crash can occur right after an fsck.
Assuming the file system is not genuinely full or out of inodes. If the bug
works as per what has been posted, my scripts address the problem rather well.

From a practical perspective, these scripts have reduced the inode problem
on stiatl at my former employer from perhaps twice daily on the news partition
to NEVER. (We were forced to run a tight disk so we are close to the
hairy edge most of the time). On rsiatl here, the scripts have also
totally eliminated the problem. That's kinda the bottom line.

John
--
John De Armond, WD4OQC                    | Manual? ... What manual ?!?
Radiation Systems, Inc. Atlanta, GA       | This is Unix, My son, You
gatech!stiatl!rsiatl!jgd **I am the NRA** | just GOTTA Know!!!
bill@twwells.com (T. William Wells) (10/09/89)
In article <282@rsiatl.UUCP> jgd@rsiatl.UUCP (John G. De Armond) writes:
: In article <1989Oct7.213314.17921@twwells.com> bill@twwells.com (T. William Wells) writes:
: >Apologies, but your scripts do little or nothing to solve the problem.
: >
: >First: the inode crash can occur regardless of the number of available
: >inodes on your system.
: >
: >Second: the crash can occur almost instantly after you have done a
: >fsck, so doing it frequently does not eliminate the problem.
: >
:
: Please cite an example of how a crash can occur right after an fsck.

I had the following experience:

    run out of inodes
    fsck the file system
    restart the news job
    run out of inodes
    fsck the file system
    restart the news job
    run out of inodes
    fsck the file system
    restart the news job
    run out of inodes
    fsck the file system
    do some random file creates and deletes to change the inode cache
    restart the news job
    phew! it worked

So the answer is: it has happened.

: From a practical perspective, these scripts have reduced the inode problem
: on stiatl at my former employer from perhaps twice daily on the news partition
: to NEVER. (We were forced to run a tight disk so we are close to the
: hairy edge most of the time). On rsiatl here, the scripts have also
: totally eliminated the problem. That's kinda the bottom line.

Well, you've been lucky. I suppose that running fsck frequently would
reduce the probability of the crash; the bug *can* result in gradual
loss of inodes. But the bug can also result in catastrophic loss of
inodes and your scripts won't protect you against that.

Do yourself a favor and get one of the real fixes. Among other
things, they don't cost anywhere near as much CPU time or disk
activity or downed file system time. And they *completely* eliminate
the possibility of this particular crash.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
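The "random file creates and deletes" step in the sequence above could be sketched roughly as follows. The function name churn_inodes, the file name pattern, and the default count of 50 are all hypothetical. Whether this perturbs the kernel's free-inode cache in a useful way depends on the kernel; the post reports only that it worked once.

```sh
# Hypothetical helper: create and then delete a batch of empty files in a
# directory on the affected filesystem, so that inodes are allocated and freed.
# This is a workaround sketch, not a fix for the underlying kernel bug.
churn_inodes() {
    dir=$1
    count=${2:-50}
    i=0
    while [ "$i" -lt "$count" ]; do
        # ":" with a redirection creates an empty file, consuming one inode.
        : > "$dir/.churn.$$.$i" || return 1
        i=$((i + 1))
    done
    # Remove only the files this run created.
    rm -f "$dir"/.churn.$$.*
}
```

It might be run as "churn_inodes /news/spool 50" after the fsck and remount, before restarting the news job.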
mhw@wittsend.lbp.harris.com (Michael H. Warfield (Mike)) (10/10/89)
In article <1989Oct9.140605.21949@twwells.com> bill@twwells.com (T. William Wells) writes:
>I had the following experience:
>    run out of inodes
>    fsck the file system
.
.
>So the answer is: it has happened.

Ok so I've got a REAL STUPID question, but I've been there TOO!

DID YOU UNMOUNT THE FILE SYSTEM FIRST ??????

If you don't, chances are you are right, fsck will do little or nothing
useful. Until I made sure all news activity was finished; the file system
was not busy; and I had unmounted the file system, frequent fsck's did not
cure the problem. Doing all of the above finished the job. I haven't seen
the scripts in question and I developed my own procedures through painful
experience (the inode problem is only one of several that I have had to
work around).

If this is your root file system that you're running fsck on in multi
user mode, may the saints look kindly upon you and pray for your safety.
It shouldn't do any harm until it finds something it needs to fix right in
the middle of something you need to use or have in use.

Michael H. Warfield (The Mad Wizard) | gatech.edu!galbp!wittsend!mhw
(404) 270-2123 / 270-2098            | mhw@wittsend.LBP.HARRIS.COM
An optimist believes we live in the best of all possible worlds.
A pessimist is sure of it!
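The order of operations described above (news quiet, file system unmounted, then fsck) might look like the following sketch. The function name fsck_unmounted and the UMOUNT, FSCK and MOUNT override variables are hypothetical; the device and mount point are taken from the scripts posted earlier in the thread, and argument conventions for umount differ between systems, so treat this as an outline only.

```sh
# Hypothetical helper: run fsck on a filesystem only after unmounting it.
# UMOUNT, FSCK and MOUNT may be overridden (for example with "echo umount")
# to do a dry run without touching any filesystem.
fsck_unmounted() {
    device=$1
    mountpoint=$2
    # If the unmount fails the filesystem is still busy: do not fsck it.
    ${UMOUNT:-umount} "$device" || return 1
    # Leave the filesystem unmounted if fsck reports a failure.
    ${FSCK:-fsck -y} "$device" || return 2
    ${MOUNT:-mount} "$device" "$mountpoint" || return 3
}
```

A dry run could be done with UMOUNT="echo umount" FSCK="echo fsck" MOUNT="echo mount" set before calling "fsck_unmounted /dev/dsk/0s4 /news". Refusing to continue when the unmount fails is the point of the sketch: it avoids checking a file system that is still in use.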
bill@twwells.com (T. William Wells) (10/11/89)
In article <8883@galbp.LBP.HARRIS.COM> mhw@wittsend.UUCP (Michael H. Warfield (Mike)) writes:
: In article <1989Oct9.140605.21949@twwells.com> bill@twwells.com (T. William Wells) writes:
:
: >I had the following experience:
:
: >    run out of inodes
: >    fsck the file system
: .
: .
:
: >So the answer is: it has happened.
:
: Ok so I've got a REAL STUPID question, but I've been there TOO!
:
: DID YOU UNMOUNT THE FILE SYSTEM FIRST ??????

Of course. Do you think I'm as much a turkey as you seem?

ASSuming that someone is stupid is just a good way to get flames.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com
rock@rancho.uucp (Rock Kent) (10/11/89)
John:  John De Armond
Mike:  Michael H. Warfield
TBill: T. William Wells

OK, let's see . . .

John>  [recommends periodically running some scripts]
TBill> [says they don't address the problem]
John>  [says cite examples, my scripts have eliminated the problem]
TBill> [cites an anecdotal example]
Mike>  [asks whether TBill was correctly using fsck]

I don't understand the gruesome details of the inode bug and am not
competent to judge whether regular fsck's or other magic can be
expected to compensate for it. It seems to me, however, that such an
approach is, at best, a band-aid solution.

Six months ago I was encountering the inode problem one or two times a
week and was fortunate enough to run across an article by T. William
Wells in which he said:

TBill> S5ialloc (aka ialloc) has a bug. It seems that the code is
TBill> dependent on the condition that the inode cache always
TBill> contains the lowest free inode. This is a condition that just
TBill> can't be met. . . . . My fix is to ignore a failure to read
TBill> inodes and try again. This has the advantage of not requiring
TBill> a rescan except when the inode pointer gets screwed up.

I got in touch with Mr. Wells and, with his help, incorporated his fix
in my kernel. The problem has not recurred. No muss, no fuss, no
wasted cycles, no lost news.

Disclaimer: I know T. William Wells only over the net and am merely an
appreciative acquaintance.

***************************************************************************
*Rock Kent rock@rancho.uucp POB 8964, Rancho Sante Fe, CA. 92067*
***************************************************************************
mhw@wittsend.lbp.harris.com (Michael H. Warfield (Mike)) (10/12/89)
In article <1989Oct10.232641.314@twwells.com> bill@twwells.com (T. William Wells) writes:
>In article <8883@galbp.LBP.HARRIS.COM> mhw@wittsend.UUCP (Michael H. Warfield (Mike)) writes:
>: Ok so I've got a REAL STUPID question, but I've been there TOO!
>: DID YOU UNMOUNT THE FILE SYSTEM FIRST ??????
>Of course. Do you think I'm as much a turkey as you seem?
>ASSuming that someone is stupid is just a good way to get flames.

1) I have seen highly intelligent people do much stupider things just
because they happened to overlook the obvious. It's no reflection on them
or their intelligence. That's why I said up front that it was a stupid
question. But stupid questions occasionally strike pay dirt and ASSuming
someone would not be so foolish as to overlook something would be even
more foolish.

2) I'VE DONE STUPIDER THINGS. I think I'm reasonably intelligent and I
sometimes have to ask myself if I could make some such mistake in haste or
without thinking. WHEN THE ANSWER IS "NO", I KNOW I'M IN TROUBLE!

I didn't mean to imply that it was likely that you would overlook
something like that. I didn't think it was probable but it certainly was
possible. My profoundest apologies if you took offense.

Michael H. Warfield (The Mad Wizard) | gatech.edu!galbp!wittsend!mhw
(404) 270-2123 / 270-2098            | mhw@wittsend.LBP.HARRIS.COM
An optimist believes we live in the best of all possible worlds.
A pessimist is sure of it!
gentry@kcdev.UUCP (Art Gentry) (10/13/89)
For any of you experiencing the inode bug on an AT&T 3B running either
3.1, 3.1.1, 3.2 or 3.2.1, there is an official patch available from
the AT&T hotline.
I have no idea what the cost is if you do not have a support contract.
The number for the hotline is 1-800-922-0354.

--
| R. Arthur Gentry AT&T Communications Kansas City, MO 64106 |
| Email: attctc!kcdev!gentry ATTMail: attmail!kc4rtm!gentry |
| The UNIX BBS: 816-221-0475 The Bedroom BBS: 816-637-4183 |
| $include {std_disclaimer.h} "I will make a guess" - Spock - STIV |
root@attctc.Dallas.TX.US (Admin) (10/14/89)
In article <902@kcdev.UUCP>, gentry@kcdev.UUCP (Art Gentry) writes:
|
| For any of you experiencing the inode bug on an AT&T 3B running either
| 3.1, 3.1.1, 3.2 or 3.2.1, there is an official patch available from
| the AT&T hotline.
| I have no idea what the cost is if you do not have a support contract.
| The number for the hotline is 1-800-922-0354.
|
| --
| | R. Arthur Gentry AT&T Communications Kansas City, MO 64106 |
| | Email: attctc!kcdev!gentry ATTMail: attmail!kc4rtm!gentry |
| | The UNIX BBS: 816-221-0475 The Bedroom BBS: 816-637-4183 |
| | $include {std_disclaimer.h} "I will make a guess" - Spock - STIV |

The fix is available and has been for quite some time. It is a replacement
for the S5 disk driver. I have been using it for well over a year
with no problem - and no loss of inode problem either.

Charlie
bill@twwells.com (T. William Wells) (10/16/89)
In article <8892@galbp.LBP.HARRIS.COM> mhw@wittsend.UUCP (Michael H. Warfield (Mike)) writes:
: I didn't mean to imply that it was likely that you would overlook
: something like that. I didn't think it was probable but it certainly was
: possible. My profoundest apologies if you took offense.
Apology accepted.
Might I offer a couple of communication suggestions? First, you might
want to put more effort into getting the grammar of your messages
right. I didn't fully understand your original message because of its
grammatical problems. (Partly due to my unwillingness to untangle poor
grammar, and partly due to the grammar itself.) Second, a series of
fully capitalized words is often interpreted as a shout. You might
want to avoid them unless you want the recipient to feel shouted at.
---
Bill { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com