[comp.bugs.sys5] Sys V r2 v2 cron keeps dying

phil@qfdts.OZ (Phil Chadwick) (09/17/87)

We run UNIX System V release 2 version 2 on a VAX.
Cron is core dumping fairly regularly - 5 or 6 times
in the last 4 days.  The problems first started when
I added the `t' queue (for troff).  The `h' queue
(for high priority batch jobs) was added at the same
time.  /usr/lib/cron/queuedefs looks like this:

    a.2j14n
    b.3j14n90w
    t.2j13n90w
    h.1j0n90w
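
For readers unfamiliar with the file, the entries above can be decoded
with a short sketch like the one below.  It assumes the layout given in
queuedefs(4), a queue letter followed by optional job-limit (j), nice (n)
and retry-wait (w) fields, and it assumes the documented defaults of 100
jobs, nice 2 and 60 seconds; neither has been checked against this
particular release.  The names parse_queuedef and DEFAULTS are
illustrative only.

```python
import re

# Assumed layout, per queuedefs(4): q.[njobs]j[nice]n[wait]w
# Assumed defaults when a field is omitted (not verified for SVR2 v2).
DEFAULTS = {"njobs": 100, "nice": 2, "wait": 60}

ENTRY = re.compile(r"^([a-z])\.(?:(\d+)j)?(?:(\d+)n)?(?:(\d+)w)?$")


def parse_queuedef(line):
    """Split one queuedefs entry into its queue letter and three limits."""
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError("not a queuedefs entry: %r" % line)
    queue, njobs, nice, wait = m.groups()
    return {
        "queue": queue,
        # Maximum number of jobs run at once from this queue.
        "njobs": int(njobs) if njobs is not None else DEFAULTS["njobs"],
        # Nice increment applied to jobs in this queue.
        "nice": int(nice) if nice is not None else DEFAULTS["nice"],
        # Seconds to wait before retrying a job that could not be started.
        "wait": int(wait) if wait is not None else DEFAULTS["wait"],
    }


if __name__ == "__main__":
    for entry in ("a.2j14n", "b.3j14n90w", "t.2j13n90w", "h.1j0n90w"):
        print(parse_queuedef(entry))
```

Read this way, the `h' entry is the only one that limits its queue to
a single running job.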

Before I dive in and start hacking, has anyone done
it before?

Phil.

greg@ncr-sd.SanDiego.NCR.COM (Greg Noel) (09/20/87)

In article <1305@qfdts.OZ> phil@qfdts.OZ (Phil Chadwick) writes:
>Cron is core dumping fairly regularly - 5 or 6 times
>in the last 4 days.  .....

He also includes the following interesting line from his queuedefs file:

>    h.1j0n90w

I can't tell if you are having the same problem I had, but this line makes
me suspicious.  I once tried to set up a single-server queue like this one
and it wouldn't work.  The symptoms included dropping cores and "infinite"
loops.  The latter would eventually work, but as long as there was an active
job and a queued job, cron would loop saying that it was requeueing the job.
When the active job finished, after a few minutes (but not immediately for
some reason), cron would notice it and schedule the queued job, then return
to normal.  Needless to say, no other queues were being serviced while this
was going on.  Two jobs queued in this class seemed to cause the core dump,
but not all the time.

I noticed this because my UUCP traffic was getting stalled.  After some
poking around, I found that my cron log had grown over 50 megabytes in a
single day -- hundreds of thousands of "requeueing" messages.  You might
check your cron log to see if it grew explosively just before the cron died.
If you have per-process file size limitations or a file system without enough
free space for the log (I was lucky that I had no outgoing news backed up;
all the sites I feed were up that day), that may be causing a problem as well.
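
One way to run that check is sketched below.  It assumes the log is a
plain text file with one message per line and that the looping messages
contain the word "requeue"; the exact wording and the log's location
(commonly /usr/lib/cron/log on System V, but that is an assumption here)
may differ on your system.  The function name count_requeues is
illustrative.

```python
def count_requeues(lines, needle="requeue"):
    """Count log lines that mention the (assumed) requeueing message."""
    count = 0
    for line in lines:
        # Case-insensitive match, since the exact wording is an assumption.
        if needle in line.lower():
            count += 1
    return count


if __name__ == "__main__":
    import sys

    # Usage: python count_requeues.py /path/to/cron/log
    with open(sys.argv[1], "r", errors="replace") as log:
        print(count_requeues(log))
```

A count in the hundreds of thousands for a single day would point to the
same loop.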

I looked at the code briefly, but nothing seemed obviously wrong about the
loop where the message was being generated; it didn't seem to be a fencepost
error, anyway.  I didn't have a chance to look at it long; I set the number
of parallel jobs to two as a temporary fix (this causes the problem to go away)
and got involved with some other fire drills.  That temporary fix is still in
place, almost a year later.

If you can set the number of jobs to two, that might serve for the time being;
at least it should tell you if it's the same problem.  I'd be curious to know
if this problem occurs, and how, and on what other hardware, as I suspect that
a problem this obvious could only have gotten out if it is a null-pointer
dereference bug.  It happened to me on a Pyramid under OSx2.5; I haven't
checked to see if the problem still occurs in the newer releases.
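
For the record, the workaround amounts to changing the job-limit field
of the affected entry, e.g. h.1j0n90w becomes h.2j0n90w.  A minimal
sketch of that edit follows; it assumes the entry carries an explicit
job-limit field right after the queue letter, and the helper name
set_njobs is hypothetical.

```python
import re


def set_njobs(entry, njobs):
    """Return the queuedefs entry with its job limit replaced."""
    new, changed = re.subn(
        r"^([a-z]\.)\d+j",
        # Keep the queue letter and dot, swap in the new job limit.
        lambda m: "%s%dj" % (m.group(1), njobs),
        entry,
        count=1,
    )
    if changed == 0:
        raise ValueError("no job-limit field in entry: %r" % entry)
    return new


if __name__ == "__main__":
    print(set_njobs("h.1j0n90w", 2))
```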

Oh, and if you find a cure, please let me know; I'd like to get rid of my
"temporary" fix.....
-- 
-- Greg Noel, NCR Rancho Bernardo     Greg.Noel@SanDiego.NCR.COM

chris@softway.oz (Chris Maltby) (09/21/87)

I have seen cron produce mysterious core dumps also. Our
version on an NCR Tower always core dumps when the crontab
command is used. A feature of that Unix version is that
dereferencing a null pointer causes a bus error. As we have no source on hand
I have been unable to track it down, and we just don't use
the crontab command...
-- 
Chris Maltby - Softway Pty Ltd	(chris@softway.oz)

PHONE:	+61-2-698-2322		UUCP:		uunet!softway.oz!chris
FAX:	+61-2-699-9174		INTERNET:	chris@softway.oz.au