jc@minya.UUCP (John Chambers) (12/25/87)
Here's another file-system puzzle to entertain y'all over the holiday
season.  About once every two months, I've seen all the inodes disappear
in the news filesystem here, with similar symptoms.  This is one of the
machines where /bin/rnews is a script that just stuffs the news file into
~news/spool/.rnews, and a couple of times a day cron runs a script that
does "rnews -U" to unpack it all.  This usually works just fine, and
solves the problem that rnews, compress, and uucico don't seem to be able
to coexist with vi or cc here.  But occasionally all the inodes disappear,
and when I investigate, I find something curious:

| drwxrwxrwx 2 news news 40656 Dec 20 07:44 .rnews

This is rather large for a directory; what it contains is:

| total 0
|  352 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249103d
|  118 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491045
| 2845 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491047
|  170 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491049
|  389 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249104b
|  126 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249104f

and so on for pages and pages.  There are maybe half a dozen such files
created per minute.  I did a "rm .rnews/*", and checked again:

| total 0
|  170 -rw-rw-rw- 1 news news 0 Dec 20 07:50 87122012501096
|  352 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a2
|  389 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a4
|  126 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a6
| 5239 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a8
|  118 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010ac

This continues as long as rnews is running.  Note the "total 0".  There
are no news files in the directory; there are just these funny files
whose names obviously start with a timestamp, plus something to make them
unique.  This did happen on Dec 20 at about 12:50.  The process number at
the time was about 4000, so the '5010*' isn't based on the process number.
Anyhow, I killed off rnews, and discovered that the last batch of news
had not been unpacked.  I restarted it (with no news) just out of
curiosity, and it did the same silly thing.  I killed that one, too.

Luckily (actually, it wasn't luck; it was experience :-), I had told
/bin/rnews to also link the news files into another directory, so I could
just do an ln to put them back.  When I started up rnews again, it
proceeded to unpack the news as if nothing was wrong, and the problem
didn't recur for several more months.

Does anyone out there have any idea what might be going on in rnews'
feeble little mind when it creates these funny files?  I've spent a
little time trying to spot the code that creates them, but so far I've
had no luck.  Is there something that I've done wrong, for which these
files are evidence?

BTW, there's also what I consider a bug in rnews here: when it fails to
unpack the news (in this case, because there are no inodes), it throws
away the news file.  I've done an end run around this by, as I said
above, linking incoming news files into a secret directory, and running a
separate cleanup script that deletes them after a couple of days.  But it
would have been easier if there were some way I could tell rnews "If you
have trouble unpacking a file, leave it in directory <X>, send me mail,
and I'll take care of it."
--
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)
pete@octopus.UUCP (Pete Holzmann) (12/26/87)
In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
<[Discussion of how rnews sometimes eats all inodes in a filesystem...]
<[rnews seems to create gobs of temp files:]
<There are maybe half a dozen such files created per minute.
<[example ls output:]
<
<| total 0
<| 170 -rw-rw-rw- 1 news news 0 Dec 20 07:50 87122012501096
<| 352 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a2
<| 389 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a4
<| 126 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a6
<| 5239 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a8
<| 118 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010ac
<
<This continues as long as rnews is running. Note the "total 0".
<There are no news files in the directory; there are just these
<funny files whose names obviously start with a timestamp, plus
<something to make them unique. This did happen on Dec 20
<at about 12:50. The process number at the time was about 4000,
<so the '5010*' isn't based on the process number.
<
<Anyhow, I killed off rnews, and discovered that the last batch
<of news had not been unpacked. I restarted it (with no news)
<just out of curiosity, and it did the same silly thing. I
<killed that one, too.
<
<Luckily (actually, it wasn't luck; it was experience :-) I had
<told /bin/rnews to also link the news files into another directory,
<too, so I could just do a ln to put them back, and when I started
<up rnews, it proceeded to unpack the news as if nothing was wrong,
<and it doesn't happen for several more months.
It started happening to me recently as well. I wasn't watching the system
closely, so I didn't get to it until things were in real bad shape: 12000
inodes get eaten up, and I'm getting complaints all over the console screen.
I reboot, fsck tells me that the inode count is wrong in my /usr/spool
partition (where news gets stored and unpacked), and we're back to normal
for another few days to few weeks. I've never been around when things started
going bad, so I've never found the culprit. Maybe this is it? Something is
very strange, since I can't imagine rnews having a valid need for so many
inodes no matter *what*! We don't get more than around 6-7000 articles in a
whole week!
<Is there something that
<I've done wrong, for which these files are evidence?
Same question here. I also agree with his complaint about rnews:
<BTW, there's also what I consider a bug in rnews here: when
<it fails to unpack the news (in this case, because there are
<no inodes), it throws away the news file.... It would
<have been easier if there were some way I could tell rnews "If
<you have trouble with unpacking a file, leave it in directory
<<X>, send me mail, and I'll take care of it."
<--
<John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)
Do any of you news-hackers have any ideas? HELP!
--
OOO __| ___ Peter Holzmann, Octopus Enterprises
OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014
OOOOO \___/ UUCP: {hpda,pyramid}!octopus!pete
___| \_____ Phone: 408/996-7746
david@ms.uky.edu (David Herron -- Resident E-mail Hack) (12/28/87)
Guys ... that /usr/spool/news/.rnews directory is controlled by the
SPOOLNEWS code.  That is, if you have SPOOLNEWS defined, then rnews will
only put its input into a file in the ".rnews" directory and exit
quickly.  The idea is to let uuxqt finish quickly, with little chance of
having lots of uuxqts running at once.  Later you run another program (in
this case "rnews -U") to pick up the stuff in ".rnews" ... there's
locking involved to make sure that only one "rnews -U" is running at a
time.

The *ONLY* time I've seen that method make zero-length files is when the
filesystem is out of free blocks.

Oh ... another thing.  If there are problems with a particular file, the
SPOOLNEWS code moves the file to "/usr/spool/news" (er ... without making
any kind of notification except for an errorlog entry).

These comments are based on 2.11.8 ... I apologize if any details have
changed since patch #8, but this last semester has been very very
looooong.
--
<---- David Herron -- The E-Mail guy <david@ms.uky.edu>
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- Winter health warning: Remember, don't eat the yellow snow!
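The spool-and-exit path David describes can be sketched roughly as
follows.  This is a hypothetical reconstruction, not the actual 2.11
source: the function name `spool_article` is mine, and the pid-derived
hex suffix is only a guess at how the uniquifier works (John's listing
shows it is evidently not the plain process number).

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch of the SPOOLNEWS "spool and exit" path (not the
 * actual 2.11 source): copy an incoming batch into a uniquely named
 * file under the spool directory so uuxqt can finish quickly; a later
 * "rnews -U" unpacks everything.  The name mimics the YYMMDDHHMM+hex
 * pattern visible in the listings above. */
int spool_article(const char *spooldir, FILE *in)
{
    char path[512];
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);

    /* Timestamp plus a pid-derived hex suffix for uniqueness -- an
     * assumption; the real uniquifier is evidently something else. */
    snprintf(path, sizeof path, "%s/%02d%02d%02d%02d%02d%04lx",
             spooldir, tm->tm_year % 100, tm->tm_mon + 1, tm->tm_mday,
             tm->tm_hour, tm->tm_min, (long)getpid() & 0xffffL);

    FILE *out = fopen(path, "w");
    if (out == NULL)
        return -1;

    int c;
    while ((c = getc(in)) != EOF)
        if (putc(c, out) == EOF)
            break;                  /* write failed (e.g. no free blocks) */

    /* fclose() flushes the buffered data; if the filesystem has no free
     * blocks, the file stays behind at length zero -- the symptom seen
     * in this thread.  Nothing in this path cleans it up. */
    int err = ferror(out);
    if (fclose(out) != 0)
        err = 1;
    return err ? -1 : 0;
}
```

Note how a write failure leaves the zero-length file in place, which
matches David's observation that these files appear exactly when the
filesystem runs out of free blocks.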
pete@octopus.UUCP (Pete Holzmann) (12/28/87)
In article <7958@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes: [regarding massive inode disappearance while news software is running] >that /usr/spool/news/.rnews directory is controlled by the SPOOLNEWS >code ... that is, if you have SPOOLNEWS defined then rnews will only >put it's input into a file in the ".rnews" directory and exit quickly. >..... >The *ONLY* time I've seen that method making zero length files is when >the filesystem is out of free blocks. ><---- David Herron -- The E-Mail guy <david@ms.uky.edu> ><---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET In our case, we are not running SPOOLNEWS. And I've never gotten as far as seeing lots of zero length files. All I know is that the system is dead, and when it is rebooted, fsck fixes the inode count in the spool partition. I've since received an email response to my message from another sysadmin on a system V system. He's got the same problem, has had it for many patchlevels of news 2.11... he's got an hourly script running under cron. It checks the inode count; if it goes to zero, it kills any news process, dismounts and remounts the spool partition, and the inodes get restored. What a kludge! Any suggestions on how to find and kill this bug? Unfortunately (?! :-)) it hasn't hit us here now for a few weeks.... I wonder if it is related to any of the other recent bugs (% in article-id, etc)... we haven't upgraded beyond patchlevel 13 yet. Pete -- OOO __| ___ Peter Holzmann, Octopus Enterprises OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014 OOOOO \___/ UUCP: {hpda,pyramid}!octopus!pete ___| \_____ Phone: 408/996-7746
smb@ulysses.homer.nj.att.com (Steven Bellovin) (12/28/87)
We've seen the zero-length file bug, too (with SPOOLNEWS), but we've never run out of i-nodes or disk space. I finally patched around it by checking the length of the newly-created file, and unlinking it if necessary....
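The workaround Steven describes -- check the just-created spool file and
unlink it if it came out empty -- might look something like this.  A
sketch only: the function name and return conventions are mine, not the
actual patch.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of the workaround: after creating a spool file, stat it and
 * unlink it if it ended up zero-length (e.g. the write failed for lack
 * of free blocks), so empty junk files don't accumulate in .rnews. */
int check_spool_file(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;            /* file vanished or is unreadable */
    if (sb.st_size == 0) {
        unlink(path);         /* discard the empty spool file */
        return -1;            /* caller should complain and retry */
    }
    return 0;                 /* file looks sane */
}
```

A caller would invoke this right after closing each newly written spool
file, before letting rnews exit.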
geoff@desint.UUCP (Geoff Kuenning) (12/29/87)
In article <375@octopus.UUCP> pete@octopus.UUCP (Pete Holzmann) writes:
> In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
> <[Discussion of how rnews sometimes eats all inodes in a filesystem...]
> It started happening to me recently as well. I wasn't watching the system
> closely, so I didn't get to it until things were in real bad shape: 12000
> inodes get eaten up, and I'm getting complaints all over the console screen.
> I reboot, fsck tells me that the inode count is wrong in my /usr/spool
> partition (where news gets stored and unpacked), and we're back to normal
> for another few days to few weeks. I've never been around when things started
> going bad, so I've never found the culprit. Maybe this is it? Something is
> very strange, since I can't imagine rnews having a valid need for so many
> inodes no matter *what*! We don't get more than around 6-7000 articles in a
> whole week!

Absolutely FASCINATING.  I've suffered this problem for at least a year,
maybe two, and there's a possibility it goes back as much as five years.
But I blamed it on a hardware problem.  I run a special crontab script
that watches the news i-nodes and kills incoming news if they get too
low.  Then I hand-unmount and fsck the disk.  Since inews leaves "No
space on device" errors in the errlog, I can then use sendme's to
re-request the articles that were dropped before the incoming news got
killed.

An interesting point here is that the number of i-nodes remains too small
even after rnews exits.  Only fsck will correct the problem.  This leads
me to wonder if maybe rnews is triggering a kernel bug or deficiency of
some sort.

Since I now have evidence it's a bug and not a hardware problem, I'll
modify the script to collect a little more data and see if I can trap
something.  It'll probably take a while, since the occurrence rate varies
from several times a week to once every few months.
--
Geoff Kuenning   geoff@ITcorp.com   {uunet,trwrb}!desint!geoff
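The core of a watchdog along the lines Geoff describes could be sketched
like this.  It uses the modern statvfs() interface purely for
illustration (a SysV system of the era would have used ustat() or read
the superblock directly), and the threshold is an arbitrary placeholder.

```c
#include <sys/statvfs.h>

/* Sketch of an i-node watchdog for the news partition: report whether
 * the count of free inodes has dropped below a threshold, so a cron
 * job can kill incoming news before the filesystem is exhausted.
 * Returns 1 if low, 0 if fine, -1 if the filesystem can't be queried. */
int inodes_low(const char *fs, unsigned long threshold)
{
    struct statvfs vfs;

    if (statvfs(fs, &vfs) != 0)
        return -1;                   /* can't tell; caller plays safe */
    return vfs.f_ffree < threshold;  /* 1: time to stop rnews */
}
```

A cron entry would run this (say, hourly) against the spool partition
and kill rnews, or at least mail the admin, when it returns 1.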
scl@virginia.acc.virginia.edu (Steve Losen) (12/30/87)
In case y'all missed it, there was a massive discussion of this
news-eats-inodes problem in comp.sys.att about 2 months ago.  The
problem is a bug in the Sys V kernel.  The kernel keeps a cache of free
inode numbers in the superblock.  When the cache empties, the kernel
searches the ilist for more free inodes to put in the cache, starting at
the last inode number that was taken from the cache.  If the end of the
ilist is encountered without finding any free inodes, the kernel assumes
there are no free inodes and sets the free inode count to zero.  (The
kernel should wrap its search back to the top of the ilist here.)

The following scenario will cause a Sys V file system to lose all its
free inodes:

1. System takes the last inode from the cache.  There are probably
   plenty of free inodes still in the ilist, though.
2. System frees an inode whose number is higher than any other free
   inode in the ilist.  This inode goes into the previously empty cache.
3. System needs an inode, taking the last inode from the cache.
4. System needs another inode.  The cache is empty, so the system tries
   to refill it, starting its ilist search at the number of the last
   inode taken from the cache (which was beyond any free inodes in the
   ilist).
5. System hits the end of the ilist without finding any free inodes, so
   the free inode count is set to 0.  :-(

All Sys V file systems are susceptible to this bug, but it only appears
in news file systems because they create and delete so many files.
Since the cache holds 200 inumbers, most "normal" file systems rarely
empty it.  News file systems get their caches drained often, and it's
never clear where the search to refill the cache will start.  Thus news
file systems eventually allocate their highest-numbered inodes even
though they never use more than about half of their available inodes.
This spells disaster further down the road.

For those of you with kernel source, you can fix the problem in the
ialloc() routine.  And for those with rel 3, it is in 5ialloc() or
somesuch.
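Steve's scenario can be reproduced in a toy model of the superblock
cache.  This is a sketch, not kernel code: the ilist and cache sizes
are shrunk for illustration, and the names (busy, lasti, refill) are
mine, chosen to echo the description above.

```c
/* Toy model of the Sys V free-inode cache bug.  busy[i] models the
 * ilist (1 = allocated); cache[] is the superblock's free-inode cache;
 * lasti is where a cache refill starts scanning.  The bug: the scan
 * runs from lasti to the END of the ilist and never wraps around, so
 * free inodes below lasti become invisible. */
#define NINODE  100          /* toy ilist size (a real fs has many more) */
#define NICACHE 4            /* toy cache size (real Sys V holds 200)    */

int busy[NINODE + 1];
int cache[NICACHE], ncached;
int lasti = 1;

int refill(void)             /* scan the ilist for free inodes */
{
    ncached = 0;
    for (int i = lasti; i <= NINODE && ncached < NICACHE; i++)
        if (!busy[i])
            cache[ncached++] = i;
    return ncached;          /* 0 == "no free inodes" (wrongly!) */
}

int ialloc(void)
{
    if (ncached == 0 && refill() == 0)
        return -1;           /* free count goes to 0: spurious ENOSPC */
    int ino = cache[--ncached];
    busy[ino] = 1;
    lasti = ino;             /* next refill starts scanning here */
    return ino;
}

void ifree(int ino)
{
    busy[ino] = 0;
    if (ncached < NICACHE)   /* cache full? inode lives only in ilist */
        cache[ncached++] = ino;
}
```

Walking through Steve's five steps with this model: free a few
low-numbered inodes while the cache is full, then free and reallocate
inode 100.  That leaves lasti at the top of the ilist, so the next
ialloc() scans from there, finds nothing, and reports the filesystem
out of inodes even though the low-numbered ones are free -- the state
that only fsck (or a wrap-around in refill()) repairs.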
Many thanks to the persons who posted the solution to this problem
earlier.  I am only passing on their information and regret that I don't
have their names to give them due credit.
--
Steve Losen
University of Virginia Academic Computing Center