jc@minya.UUCP (John Chambers) (12/25/87)
Here's another file-system puzzle to entertain y'all over the holiday
season.  About once every two months, I've seen all the inodes disappear
in the news filesystem here, with similar symptoms.  This is one of the
machines where /bin/rnews is a script that just stuffs the news file into
~news/spool/.rnews, and a couple of times a day cron runs a script that
does "rnews -U" to unpack it all.  This usually works just fine, and
solves the problem that rnews, compress, and uucico don't seem to be able
to coexist with vi or cc here.  But occasionally all the inodes disappear,
and when I investigate, I find something curious:

| drwxrwxrwx 2 news news 40656 Dec 20 07:44 .rnews

This is rather large for a directory; what it contains is:

| total 0
|  352 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249103d
|  118 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491045
| 2845 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491047
|  170 -rw-rw-rw- 1 news news 0 Dec 20 07:49 87122012491049
|  389 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249104b
|  126 -rw-rw-rw- 1 news news 0 Dec 20 07:49 8712201249104f

and so on for pages and pages.  There are maybe half a dozen such files
created per minute.  I did a "rm .rnews/*", and checked again:

| total 0
|  170 -rw-rw-rw- 1 news news 0 Dec 20 07:50 87122012501096
|  352 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a2
|  389 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a4
|  126 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a6
| 5239 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a8
|  118 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010ac

This continues as long as rnews is running.  Note the "total 0".  There
are no news files in the directory; there are just these funny files
whose names obviously start with a timestamp, plus something to make them
unique.  This did happen on Dec 20 at about 12:50.  The process number at
the time was about 4000, so the '5010*' isn't based on the process number.
Anyhow, I killed off rnews, and discovered that the last batch of news
had not been unpacked.  I restarted it (with no news) just out of
curiosity, and it did the same silly thing.  I killed that one, too.

Luckily (actually, it wasn't luck; it was experience :-), I had told
/bin/rnews to also link the news files into another directory, so I could
just do an ln to put them back.  When I started up rnews again, it
proceeded to unpack the news as if nothing was wrong, and the problem
didn't recur for several more months.

Does anyone out there have any idea what might be going on in rnews'
feeble little mind when it creates these funny files?  I've spent a
little time trying to spot the code that creates them, but so far I've
had no luck.  Is there something that I've done wrong, for which these
files are evidence?

BTW, there's also what I consider a bug in rnews here: when it fails to
unpack the news (in this case, because there are no inodes), it throws
away the news file.  I've done an end run around this by, as I said
above, linking incoming news files into a secret directory, and running a
separate cleanup script that deletes them after a couple of days.  But it
would have been easier if there were some way I could tell rnews "If you
have trouble unpacking a file, leave it in directory <X>, send me mail,
and I'll take care of it."
--
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)
pete@octopus.UUCP (Pete Holzmann) (12/26/87)
In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
<[Discussion of how rnews sometimes eats all inodes in a filesystem...]
<[rnews seems to create gobs of temp files:]
<There are maybe half a dozen such files created per minute.
<[example ls output:]
<
<| total 0
<| 170 -rw-rw-rw- 1 news news 0 Dec 20 07:50 87122012501096
<| 352 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a2
<| 389 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a4
<| 126 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a6
<| 5239 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010a8
<| 118 -rw-rw-rw- 1 news news 0 Dec 20 07:50 871220125010ac
<
<This continues as long as rnews is running. Note the "total 0".
<There are no news files in the directory; there are just these
<funny files whose names obviously start with a timestamp, plus
<something to make them unique. This did happen on Dec 20
<at about 12:50. The process number at the time was about 4000,
<so the '5010*' isn't based on the process number.
<
<Anyhow, I killed off rnews, and discovered that the last batch
<of news had not been unpacked. I restarted it (with no news)
<just out of curiosity, and it did the same silly thing. I
<killed that one, too.
<
<Luckily (actually, it wasn't luck; it was experience :-) I had
<told /bin/rnews to also link the news files into another directory,
<too, so I could just do a ln to put them back, and when I started
<up rnews, it proceeded to unpack the news as if nothing was wrong,
<and it doesn't happen for several more months.
It started happening to me recently as well. I wasn't watching the system
closely, so I didn't get to it until things were in real bad shape: 12000
inodes get eaten up, and I'm getting complaints all over the console screen.
I reboot, fsck tells me that the inode count is wrong in my /usr/spool
partition (where news gets stored and unpacked), and we're back to normal
for another few days to few weeks. I've never been around when things started
going bad, so I've never found the culprit. Maybe this is it? Something is
very strange, since I can't imagine rnews having a valid need for so many
inodes no matter *what*! We don't get more than around 6-7000 articles in a
whole week!
<Is there something that
<I've done wrong, for which these files are evidence?
Same question here. I also agree with his complaint about rnews:
<BTW, there's also what I consider a bug in rnews here: when
<it fails to unpack the news (in this case, because there are
<no inodes), it throws away the news file.... It would
<have been easier if there were some way I could tell rnews "If
<you have trouble with unpacking a file, leave it in directory
<<X>, send me mail, and I'll take care of it."
<--
<John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)
Do any of you news-hackers have any ideas? HELP!
--
OOO __| ___ Peter Holzmann, Octopus Enterprises
OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014
OOOOO \___/ UUCP: {hpda,pyramid}!octopus!pete
___| \_____ Phone: 408/996-7746
david@ms.uky.edu (David Herron -- Resident E-mail Hack) (12/28/87)
Guys ... that /usr/spool/news/.rnews directory is controlled by the
SPOOLNEWS code.  That is, if you have SPOOLNEWS defined, then rnews will
only put its input into a file in the ".rnews" directory and exit
quickly.  The idea is to let uuxqt finish quickly, with little chance of
having lots of uuxqts running at once.  Later you run another program (in
this case "rnews -U") to pick up the stuff in ".rnews" ... there's
locking involved to make sure that only one "rnews -U" is running at a
time.

The *ONLY* time I've seen that method make zero-length files is when the
filesystem is out of free blocks.

Oh ... another thing.  If there are problems with a particular file, the
SPOOLNEWS code moves the file to "/usr/spool/news" (er ... without making
any kind of notification except for an errorlog entry).

These comments are based on 2.11.8 ... I apologize if any details have
changed since patch #8, but this last semester has been very very
looooong.
--
<---- David Herron -- The E-Mail guy <david@ms.uky.edu>
<---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- Winter health warning: Remember, don't eat the yellow snow!
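The spool-and-exit path David describes can be sketched roughly as
follows.  This is a hypothetical reconstruction, not the actual 2.11
source: the function name `spool_article` is mine, and the pid-derived
hex suffix is only a guess at how the uniquifier works (John's listing
shows it is evidently not the plain process number).

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical sketch of the SPOOLNEWS "spool and exit" path (not the
 * actual 2.11 source): copy an incoming batch into a uniquely named
 * file under the spool directory so uuxqt can finish quickly; a later
 * "rnews -U" unpacks everything.  The name mimics the YYMMDDHHMM+hex
 * pattern visible in the listings above. */
int spool_article(const char *spooldir, FILE *in)
{
    char path[512];
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);

    /* Timestamp plus a pid-derived hex suffix for uniqueness -- an
     * assumption; the real uniquifier is evidently something else. */
    snprintf(path, sizeof path, "%s/%02d%02d%02d%02d%02d%04lx",
             spooldir, tm->tm_year % 100, tm->tm_mon + 1, tm->tm_mday,
             tm->tm_hour, tm->tm_min, (long)getpid() & 0xffffL);

    FILE *out = fopen(path, "w");
    if (out == NULL)
        return -1;

    int c;
    while ((c = getc(in)) != EOF)
        if (putc(c, out) == EOF)
            break;                  /* write failed (e.g. no free blocks) */

    /* fclose() flushes the buffered data; if the filesystem has no free
     * blocks, the file stays behind at length zero -- the symptom seen
     * in this thread.  Nothing in this path cleans it up. */
    int err = ferror(out);
    if (fclose(out) != 0)
        err = 1;
    return err ? -1 : 0;
}
```

Note how a write failure leaves the zero-length file in place, which
matches David's observation that these files appear exactly when the
filesystem runs out of free blocks.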
pete@octopus.UUCP (Pete Holzmann) (12/28/87)
In article <7958@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes: [regarding massive inode disappearance while news software is running] >that /usr/spool/news/.rnews directory is controlled by the SPOOLNEWS >code ... that is, if you have SPOOLNEWS defined then rnews will only >put it's input into a file in the ".rnews" directory and exit quickly. >..... >The *ONLY* time I've seen that method making zero length files is when >the filesystem is out of free blocks. ><---- David Herron -- The E-Mail guy <david@ms.uky.edu> ><---- or: {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET In our case, we are not running SPOOLNEWS. And I've never gotten as far as seeing lots of zero length files. All I know is that the system is dead, and when it is rebooted, fsck fixes the inode count in the spool partition. I've since received an email response to my message from another sysadmin on a system V system. He's got the same problem, has had it for many patchlevels of news 2.11... he's got an hourly script running under cron. It checks the inode count; if it goes to zero, it kills any news process, dismounts and remounts the spool partition, and the inodes get restored. What a kludge! Any suggestions on how to find and kill this bug? Unfortunately (?! :-)) it hasn't hit us here now for a few weeks.... I wonder if it is related to any of the other recent bugs (% in article-id, etc)... we haven't upgraded beyond patchlevel 13 yet. Pete -- OOO __| ___ Peter Holzmann, Octopus Enterprises OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014 OOOOO \___/ UUCP: {hpda,pyramid}!octopus!pete ___| \_____ Phone: 408/996-7746
smb@ulysses.homer.nj.att.com (Steven Bellovin) (12/28/87)
We've seen the zero-length file bug, too (with SPOOLNEWS), but we've never run out of i-nodes or disk space. I finally patched around it by checking the length of the newly-created file, and unlinking it if necessary....
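The workaround Steven describes -- check the just-created spool file and
unlink it if it came out empty -- might look something like this.  A
sketch only: the function name and return conventions are mine, not the
actual patch.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch of the workaround: after creating a spool file, stat it and
 * unlink it if it ended up zero-length (e.g. the write failed for lack
 * of free blocks), so empty junk files don't accumulate in .rnews. */
int check_spool_file(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return -1;            /* file vanished or is unreadable */
    if (sb.st_size == 0) {
        unlink(path);         /* discard the empty spool file */
        return -1;            /* caller should complain and retry */
    }
    return 0;                 /* file looks sane */
}
```

A caller would invoke this right after closing each newly written spool
file, before letting rnews exit.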
geoff@desint.UUCP (Geoff Kuenning) (12/29/87)
In article <375@octopus.UUCP> pete@octopus.UUCP (Pete Holzmann) writes:
> In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
> <[Discussion of how rnews sometimes eats all inodes in a filesystem...]
> It started happening to me recently as well. I wasn't watching the system
> closely, so I didn't get to it until things were in real bad shape: 12000
> inodes get eaten up, and I'm getting complaints all over the console screen.
> I reboot, fsck tells me that the inode count is wrong in my /usr/spool
> partition (where news gets stored and unpacked), and we're back to normal
> for another few days to few weeks. I've never been around when things started
> going bad, so I've never found the culprit. Maybe this is it? Something is
> very strange, since I can't imagine rnews having a valid need for so many
> inodes no matter *what*! We don't get more than around 6-7000 articles in a
> whole week!

Absolutely FASCINATING.  I've suffered this problem for at least a year,
maybe two, and there's a possibility it goes back as much as five years.
But I blamed it on a hardware problem.  I run a special crontab script
that watches the news i-nodes and kills incoming news if they get too
low.  Then I hand-unmount and fsck the disk.  Since inews leaves "No
space on device" errors in the errlog, I can then use sendme's to
re-request the articles that were dropped before the incoming news got
killed.

An interesting point here is that the number of i-nodes remains too small
even after rnews exits.  Only fsck will correct the problem.  This leads
me to wonder if maybe rnews is triggering a kernel bug or deficiency of
some sort.

Since I now have evidence it's a bug and not a hardware problem, I'll
modify the script to collect a little more data and see if I can trap
something.  It'll probably take a while, since the occurrence rate varies
from several times a week to once every few months.
--
Geoff Kuenning   geoff@ITcorp.com   {uunet,trwrb}!desint!geoff
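The core of a watchdog along the lines Geoff describes could be sketched
like this.  It uses the modern statvfs() interface purely for
illustration (a SysV system of the era would have used ustat() or read
the superblock directly), and the threshold is an arbitrary placeholder.

```c
#include <sys/statvfs.h>

/* Sketch of an i-node watchdog for the news partition: report whether
 * the count of free inodes has dropped below a threshold, so a cron
 * job can kill incoming news before the filesystem is exhausted.
 * Returns 1 if low, 0 if fine, -1 if the filesystem can't be queried. */
int inodes_low(const char *fs, unsigned long threshold)
{
    struct statvfs vfs;

    if (statvfs(fs, &vfs) != 0)
        return -1;                   /* can't tell; caller plays safe */
    return vfs.f_ffree < threshold;  /* 1: time to stop rnews */
}
```

A cron entry would run this (say, hourly) against the spool partition
and kill rnews, or at least mail the admin, when it returns 1.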
scl@virginia.acc.virginia.edu (Steve Losen) (12/30/87)
In case y'all missed it, there was a massive discussion of this
news-eats-inodes problem in comp.sys.att about 2 months ago.  The
problem is a bug in the Sys V kernel.  The kernel keeps a cache of free
inode numbers in the superblock.  When the cache empties, the kernel
searches the ilist for more free inodes to put in the cache, starting at
the last inode number that was taken from the cache.  If the end of the
ilist is encountered without finding any free inodes, the kernel assumes
there are no free inodes and sets the free inode count to zero.  (The
kernel should wrap its search back to the top of the ilist here.)

The following scenario will cause a Sys V file system to lose all its
free inodes:

1. System takes the last inode from the cache.  There are probably
   plenty of free inodes still in the ilist, though.
2. System frees an inode whose number is higher than any other free
   inode in the ilist.  This inode goes into the previously empty cache.
3. System needs an inode, taking the last inode from the cache.
4. System needs another inode.  The cache is empty, so the system tries
   to refill it, starting its ilist search at the number of the last
   inode taken from the cache (which was beyond any free inodes in the
   ilist).
5. System hits the end of the ilist without finding any free inodes, so
   the free inode count is set to 0.  :-(

All Sys V file systems are susceptible to this bug, but it only appears
in news file systems because they create and delete so many files.
Since the cache holds 200 inumbers, most "normal" file systems rarely
empty it.  News file systems get their caches drained often, and it's
never clear where the search to refill the cache will start.  Thus news
file systems eventually allocate their highest-numbered inodes even
though they never use more than about half of their available inodes.
This spells disaster further down the road.

For those of you with kernel source, you can fix the problem in the
ialloc() routine.  And for those with rel 3, it is in 5ialloc() or
somesuch.
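Steve's scenario can be reproduced in a toy model of the superblock
cache.  This is a sketch, not kernel code: the ilist and cache sizes
are shrunk for illustration, and the names (busy, lasti, refill) are
mine, chosen to echo the description above.

```c
/* Toy model of the Sys V free-inode cache bug.  busy[i] models the
 * ilist (1 = allocated); cache[] is the superblock's free-inode cache;
 * lasti is where a cache refill starts scanning.  The bug: the scan
 * runs from lasti to the END of the ilist and never wraps around, so
 * free inodes below lasti become invisible. */
#define NINODE  100          /* toy ilist size (a real fs has many more) */
#define NICACHE 4            /* toy cache size (real Sys V holds 200)    */

int busy[NINODE + 1];
int cache[NICACHE], ncached;
int lasti = 1;

int refill(void)             /* scan the ilist for free inodes */
{
    ncached = 0;
    for (int i = lasti; i <= NINODE && ncached < NICACHE; i++)
        if (!busy[i])
            cache[ncached++] = i;
    return ncached;          /* 0 == "no free inodes" (wrongly!) */
}

int ialloc(void)
{
    if (ncached == 0 && refill() == 0)
        return -1;           /* free count goes to 0: spurious ENOSPC */
    int ino = cache[--ncached];
    busy[ino] = 1;
    lasti = ino;             /* next refill starts scanning here */
    return ino;
}

void ifree(int ino)
{
    busy[ino] = 0;
    if (ncached < NICACHE)   /* cache full? inode lives only in ilist */
        cache[ncached++] = ino;
}
```

Walking through Steve's five steps with this model: free a few
low-numbered inodes while the cache is full, then free and reallocate
inode 100.  That leaves lasti at the top of the ilist, so the next
ialloc() scans from there, finds nothing, and reports the filesystem
out of inodes even though the low-numbered ones are free -- the state
that only fsck (or a wrap-around in refill()) repairs.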
Many thanks to the persons who posted the solution to this problem
earlier.  I am only passing on their information and regret that I don't
have their names to give them due credit.
--
Steve Losen
University of Virginia Academic Computing Center