[comp.unix.aux] Revised Bug Report- rmail/mail

alexis@panix.uucp (Alexis Rosen) (12/12/90)

Bottom Line: mail is broken. Read on...

Several days ago I posted a detailed description of a very serious problem
I was having with uuxqt, cunbatch, and rmail (well at least it was serious
to me). Shortly thereafter I posted a note explaining that it wasn't a bug,
but that our kernel needed more file table space. That left the curious
question of why the kernel hadn't notified us the way it was supposed to
until several days after the problem started.

Since then I've done a lot of exploring and some thinking (...always helps)
about this issue. I wrote some info-gathering code into the daily news and
uucp scripts, and I've discovered that my initial assessment- that there was
no bug- was _wrong_. There is a bug, and it's not in the kernel error-message
routines, either.

I wrote that our problem was that the kernel was running out of file table
entries, and that we fixed things by bumping up NFILE and NINODE. (I said
the kernel was running out of file descriptors in that article, but I meant
file table entries.) In fact, our running out of file table space was a
complete coincidence, and had nothing to do with the problems with mail or
news. The news problems are now fixed (I'll describe that in another
article), and the mail problem is a genuine bug. When we did in fact run
out of file table room, the kernel correctly notified us.

The problem was, believe it or not, exactly what mail said it was: it
couldn't create the /usr/mail/user.lock file after ten tries. (Wow. An
error message that's precise and concise.) This was caused by uuxqt unspooling
several messages at once to the same user. One rmail would find the lockfile
owned by another rmail; by its next try, yet another rmail had grabbed the
lockfile. And so on, until the waiting rmail used up its ten tries and gave up.
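
A minimal sketch of the retry pattern described above, in Python for brevity.
This is not the A/UX /bin/mail source; the function names, the exclusive-create
call, and the one-second default delay are assumptions made for illustration.
Only the ten tries come from the error message.

```python
import os
import time

def acquire_lock(lock_path, tries=10, delay=1.0):
    """Try to create lock_path exclusively; give up after `tries` attempts.

    Hypothetical helper: the tries/delay defaults are assumptions, not
    values taken from any real mailer.
    """
    for attempt in range(tries):
        try:
            # O_EXCL makes creation fail if the lockfile already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            # Someone else holds the lock; wait, unless this was the last try.
            if attempt < tries - 1:
                time.sleep(delay)
            continue
        os.close(fd)
        return True
    return False

def release_lock(lock_path):
    """Drop the lock by removing the lockfile."""
    os.remove(lock_path)
```

Nothing in this pattern queues the waiters. Whoever asks first after a release
wins, so a process that keeps losing that race can run out of tries even though
the lock was free several times while it slept. Raising `tries` or `delay`
widens the window.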

While some may call this a matter of judgement, I consider this to be a
nasty bug in rmail. It's clearly a matter of two constants (the number of
times to retry locking, and the delay factor between two attempts). I think
it qualifies as a bug because it fails the Turing rmail test. (In other
words, if your rmail talks to my rmail, it'll know it's not a real rmail. :-)
To be more precise, this stuff works on other unixes, and it ought to work
here.

So rmail (just /bin/mail, really) needs to be fixed. While this is the sort
of thing that should be fixed in 2.0.1, we _need_ a fix, and we need it _now_.
It's not the sort of thing I could easily fix with a shell script, like I did
with uuxqt (but, see my next message about that...), although I can see one
_very_ ugly way. Since the whole problem could be fixed (I think) by patching
one or two constants (retry count and/or delay), it should be a simple patch.

I have never done any serious coding on unix, so this is beyond me. If nobody
out there can help me, I'll have to learn about the debugger, and there goes
the weekend. So I'm hoping that someone out there who also wants to get all
his/her mail will make a fix and then post it. (Perhaps my friend Jim will
do this, in which case he or I will post his patch.) Otherwise I will, but
it'll take a lot longer...

BTW, you may be wondering why you haven't seen this problem yourself. First
of all, it may occur every once in a while, but you may not notice, or you
may not be able to trace it. After all, the only notification you'll get is a
mysterious line in your logfile. (This was true of our system until we started
putting new users on.) Secondly, you'll need to receive a number of messages
to the same mailbox, all at once, with the system loaded enough so that one
or more of the mails times out on the lockfile creation. It's my guess that
I own one of the _very_ few A/UX boxes that's used heavily for mail and news
by more than just a few users. Still, this problem can strike _anyone_ who
does uucp. (uucp is the likely trigger because it delivers mail in bursts.)

If anyone doubts my analysis, there's a simple way to prove it. Write a
little shell script that invokes ten or twelve copies of mail, all in the
background, all to the same person.
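
A sketch of such a test, written in Python rather than sh. The helper name
`launch_burst`, the message text, and the `mail <user>` command line in the
comment are assumptions; substitute whatever your system uses.

```python
import subprocess

def launch_burst(cmd, count, body="lock test\n"):
    """Start `count` copies of cmd at once, feed each one body on stdin,
    and return their exit codes in launch order."""
    procs = []
    for _ in range(count):
        # Popen returns immediately, so the copies run concurrently.
        p = subprocess.Popen(cmd, stdin=subprocess.PIPE, text=True)
        try:
            # Hand over the message and close stdin right away, so every
            # copy can start delivering without waiting for the others.
            p.stdin.write(body)
            p.stdin.close()
        except BrokenPipeError:
            pass  # the child exited before reading its input
        procs.append(p)
    return [p.wait() for p in procs]

# Example (assumed command line; the recipient name is a placeholder):
#   launch_burst(["mail", "someuser"], 12)
```

Afterwards, count the messages that actually arrived in the mailbox and check
the logfile. Whether /bin/mail reports a lock failure through its exit status
is not something established here, so don't rely on the exit codes alone.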

---
Alexis Rosen
Owner/Sysadmin, PANIX Public Access Unix, NY
{cmcl2,apple}!panix!alexis