[news.software.b] Another way that news eats inodes

pete@octopus.UUCP (Pete Holzmann) (12/26/87)

In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
<[Discussion of how rnews sometimes eats all inodes in a filesystem...]
<[rnews seems to create gobs of temp files:]
<There are maybe half a dozen such files created per minute.
<[example ls output:]
<
<| total 0
<|   170 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 87122012501096
<|   352 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 871220125010a2
<|   389 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 871220125010a4
<|   126 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 871220125010a6
<|  5239 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 871220125010a8
<|   118 -rw-rw-rw-   1 news     news           0 Dec 20 07:50 871220125010ac
<
<This continues as long as rnews is running.  Note the "total 0".
<There are no news files in the directory; there are just these
<funny files whose names obviously start with a timestamp, plus
<something to make them unique.  This did happen on Dec 20
<at about 12:50.  The process number at this time was about 4000,
<so the '5010*' isn't based on the process number.
<
<Anyhow, I killed off rnews, and discovered that the last batch
<of mail had not been unpacked.  I restarted it (with no news)
<just out of curiosity, and it did the same silly thing.  I
<killed that one, too.
<
<Luckily (actually, it wasn't luck; it was experience :-) I had
<told /bin/rnews to also link the news files into another
<directory, so I could just ln them back, and when I started
<up rnews, it proceeded to unpack the news as if nothing was
<wrong, and it didn't happen again for several more months.

It started happening to me recently as well. I wasn't watching the system
closely, so I didn't get to it until things were in really bad shape: 12000
inodes get eaten up, and I'm getting complaints all over the console screen.
I reboot, fsck tells me that the inode count is wrong in my /usr/spool
partition (where news gets stored and unpacked), and we're back to normal
for another few days to a few weeks. I've never been around when things started
going bad, so I've never found the culprit. Maybe this is it? Something is
very strange, since I can't imagine rnews having a valid need for so many
inodes no matter *what*! We don't get more than around 6-7000 articles in a
whole week!

<Is there something that
<I've done wrong, for which these files are evidence?

Same question here. I also agree with his complaint about rnews:

<BTW, there's also what I consider a bug in rnews here:  when
<it fails to unpack the news (in this case, because there are
<no inodes), it throws away the news file.... It would
<have been easier if there were some way I could tell rnews "If 
<you have trouble with unpacking a file, leave it in directory
<<X>, send me mail, and I'll take care of it."
<-- 
<John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

Do any of you news-hackers have any ideas? HELP!
-- 
  OOO   __| ___      Peter Holzmann, Octopus Enterprises
 OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014
  OOOOO \___/        UUCP: {hpda,pyramid}!octopus!pete
___| \_____          Phone: 408/996-7746

david@ms.uky.edu (David Herron -- Resident E-mail Hack) (12/28/87)

guys ...

that /usr/spool/news/.rnews directory is controlled by the SPOOLNEWS
code ... that is, if you have SPOOLNEWS defined then rnews will only
put its input into a file in the ".rnews" directory and exit quickly.
The idea is to let uuxqt finish quickly with little chance of having
lots of uuxqt's running at once.  Later you run another program (in
this case "rnews -U") to pick up the stuff in ".rnews" ... there's
locking involved to make sure that only one "rnews -U" is running at a
time...
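The spool-and-exit fast path David describes can be sketched as follows. This is an illustrative model, not the actual B News source: the function name, path handling, and layout are mine. The key idea is that rnews does nothing but copy its input into a uniquely named file and return, so uuxqt finishes almost immediately.

```c
/* Sketch (not actual B News code) of the SPOOLNEWS fast path:
 * copy an incoming batch into a uniquely named file under the
 * spool directory and exit, deferring real unbatching to a later
 * "rnews -U" pass run under a lock. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Spool `len` bytes into a unique file in `dir`.  The final file
 * name is returned in `pathbuf`.  Returns 0 on success, -1 on error. */
int spool_batch(const char *dir, const char *data, size_t len,
                char *pathbuf, size_t pathsz)
{
    if (snprintf(pathbuf, pathsz, "%s/batchXXXXXX", dir) >= (int)pathsz)
        return -1;

    int fd = mkstemp(pathbuf);       /* unique name, no races */
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, data, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}
```

Because the expensive processing happens later in a single locked "rnews -U" pass, there is little chance of many heavyweight uuxqt jobs running at once.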

The *ONLY* time I've seen that method making zero length files is when
the filesystem is out of free blocks.

Oh ... another thing.  If there are problems with a particular file,
the SPOOLNEWS code moves the file to "/usr/spool/news" (er..  without
making any kind of notification except for an errorlog entry).

These comments are based on 2.11.8 ... I apologize if any details have
changed since patch#8 but this last semester has been very very looooong.
-- 
<---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
<---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET
<----
<---- Winter health warning:  Remember, don't eat the yellow snow!

pete@octopus.UUCP (Pete Holzmann) (12/28/87)

In article <7958@e.ms.uky.edu> david@ms.uky.edu (David Herron -- Resident E-mail Hack) writes:
[regarding massive inode disappearance while news software is running]
>that /usr/spool/news/.rnews directory is controlled by the SPOOLNEWS
>code ... that is, if you have SPOOLNEWS defined then rnews will only
>put its input into a file in the ".rnews" directory and exit quickly.
>.....
>The *ONLY* time I've seen that method making zero length files is when
>the filesystem is out of free blocks.
><---- David Herron -- The E-Mail guy            <david@ms.uky.edu>
><---- or:                {rutgers,uunet,cbosgd}!ukma!david, david@UKMA.BITNET

In our case, we are not running SPOOLNEWS. And I've never gotten as far as
seeing lots of zero length files. All I know is that the system is dead,
and when it is rebooted, fsck fixes the inode count in the spool partition.
I've since received an email response to my message from another sysadmin
on a System V system. He's got the same problem, and has had it through many
patchlevels of news 2.11... he's got an hourly script running under cron.
It checks the inode count; if it goes to zero, it kills any news process,
dismounts and remounts the spool partition, and the inodes get restored.
What a kludge!

Any suggestions on how to find and kill this bug? Unfortunately (?! :-)) it
hasn't hit us here now for a few weeks.... I wonder if it is related to
any of the other recent bugs (% in article-id, etc)... we haven't upgraded
beyond patchlevel 13 yet.

Pete

-- 
  OOO   __| ___      Peter Holzmann, Octopus Enterprises
 OOOOOOO___/ _______ USPS: 19611 La Mar Court, Cupertino, CA 95014
  OOOOO \___/        UUCP: {hpda,pyramid}!octopus!pete
___| \_____          Phone: 408/996-7746

smb@ulysses.homer.nj.att.com (Steven Bellovin) (12/28/87)

We've seen the zero-length file bug, too (with SPOOLNEWS), but we've never
run out of i-nodes or disk space.  I finally patched around it by
checking the length of the newly-created file, and unlinking it if
necessary....
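The workaround Steven describes amounts to a post-spool check along these lines. A hedged sketch, with a function name of my choosing, not his actual patch:

```c
/* Sketch of the workaround: after rnews spools an incoming batch,
 * stat the new file and unlink it if it came out zero-length, so
 * empty files do not pile up and eat inodes. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the file was empty and has been removed, 0 otherwise. */
int reap_if_empty(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;             /* can't stat it; leave it alone */
    if (st.st_size == 0) {
        unlink(path);         /* zero-length spool file: discard */
        return 1;
    }
    return 0;
}
```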

geoff@desint.UUCP (Geoff Kuenning) (12/29/87)

In article <375@octopus.UUCP> pete@octopus.UUCP (Pete Holzmann) writes:

> In article <440@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
> <[Discussion of how rnews sometimes eats all inodes in a filesystem...]

> It started happening to me recently as well. I wasn't watching the system
> closely, so I didn't get to it until things were in real bad shape: 12000
> inodes get eaten up, and I'm getting complaints all over the console screen.
> I reboot, fsck tells me that the inode count is wrong in my /usr/spool
> partition (where news gets stored and unpacked), and we're back to normal
> for another few days to few weeks. I've never been around when things started
> going bad, so I've never found the culprit. Maybe this is it? Something is
> very strange, since I can't imagine rnews having a valid need for so many
> inodes no matter *what*! We don't get more than around 6-7000 articles in a
> whole week!

Absolutely FASCINATING.  I've suffered this problem for at least a year,
maybe two, and there's a possibility it goes back as much as five years.
But I blamed it on a hardware problem.  I run a special crontab script
that watches the news i-nodes and kills incoming news if they get too
low.  Then I hand-unmount and fsck the disk.  Since inews leaves
"No space on device" errors in the errlog, I can then use sendme's
to re-request the articles that were dropped before the incoming
news got killed.

An interesting point here is that the number of i-nodes remains too
small even after rnews exits.  Only fsck will correct the problem.
This leads me to wonder if maybe rnews is triggering a kernel bug or
deficiency of some sort.

Since I now have evidence it's a bug and not a hardware problem, I'll
modify the script to collect a little more data and see if I can
trap something.  It'll probably take a while, since the occurrence
rate varies from several times a week to once every few months.
-- 
	Geoff Kuenning   geoff@ITcorp.com   {uunet,trwrb}!desint!geoff

scl@virginia.acc.virginia.edu (Steve Losen) (12/30/87)

In case y'all missed it, there was a massive discussion of this
news-eats-inodes problem in comp.sys.att about 2 months ago.

The problem is a bug in the Sys V kernel.  The kernel keeps a cache
of free inode numbers in the superblock.  When the cache empties, the
kernel searches the ilist for more free inodes to put in the cache.
The kernel searches the ilist starting at the last inode number
that was taken from the cache.  If the end of the ilist is encountered without
finding any free inodes, the kernel assumes there are no free inodes and
sets the inode count to zero.  (The kernel should wrap its search
back to the top of the ilist here.)

The following scenario will cause a SysV file system to lose all its free inodes.

1. System takes last inode from cache.  There are probably plenty of free inodes
   still in the ilist, though.

2. System frees an inode whose number is higher than any other free inode in ilist.
   This inode goes into the previously empty cache.

3. System needs an inode, taking the last inode from the cache.

4. System needs another inode.  The cache is empty, so the system tries to
   refill it, searching the ilist starting at the number of the last inode
   taken from the cache (which was beyond any free inodes in the ilist).

5. System hits the end of the ilist without finding any free inodes, so the
   free inode count is set to 0. :-(

All SysV file systems are susceptible to this bug, but it only appears in news
file systems because they create and delete so many files.  Since the cache
holds 200 inumbers, most "normal" file systems rarely empty it.  News
file systems get their caches drained often and it's never clear
where the search to refill the cache will start.  Thus news file systems
eventually allocate their highest-numbered inodes even though they never
use more than about half of their available inodes.  This spells disaster
further down the road.

For those of you with kernel source, you can fix the problem in the ialloc()
routine; for those with Release 3, it is in 5ialloc() or somesuch.

Many thanks to the persons who posted the solution to this problem earlier.
I am only passing on their information and regret that I don't have their
names to give them due credit.
-- 
Steve Losen
University of Virginia Academic Computing Center

steve@mahendo.Jpl.Nasa.Gov (3171) (12/30/87)

I have changed the followup to news.software.b:

One case where I think you can make this happen is to fire up expire while
news is being unpacked.  expire will lock the news system and rnews will then
start saving files in .rnews.  When expire is done, it will fire up rnews -U
and sometimes the problem will show itself.  I think it will happen if there
is still news coming in when rnews -U is started.  My recollections are vague
because I thought the problem had to do with the non-standard System V
derivative running on the Ridge 32 here where I have seen this problem.

Since others are having this problem as well, I guess I will have to start
paying it some attention.  Maybe one of us will trap this beast.

		--Steve Schlaifer (steve@mahendo.Jpl.Nasa.Gov or jplgodo!steve)