phil@qfdts.OZ (Phil Chadwick) (09/17/87)
We run UNIX System V release 2 version 2 on a VAX. Cron is core dumping fairly regularly - 5 or 6 times in the last 4 days. The problems first started when I added the `t' queue (for troff). The `h' queue (for high priority batch jobs) was added at the same time. /usr/lib/cron/queuedefs looks like this:

    a.2j14n
    b.3j14n90w
    t.2j13n90w
    h.1j0n90w

Before I dive in and start hacking, has anyone done it before?

Phil.
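[Editorial sketch] The queuedefs entries above pack several settings into one token each. Below is a minimal Python sketch for splitting them apart. It assumes the field layout described in later System V-derived queuedefs documentation: a queue letter, then optional `j` (maximum concurrent jobs), `n` (nice value) and `w` (seconds to wait before retrying) fields. The defaults of 100, 2 and 60 come from that later documentation and may not match SVR2. The names `QueueDef` and `parse_queuedef` are hypothetical and not part of cron.

```python
import re
from typing import NamedTuple

# Assumed defaults for omitted fields (from later System V-derived
# documentation; SVR2 itself may use different values).
DEFAULT_NJOBS = 100
DEFAULT_NICE = 2
DEFAULT_NWAIT = 60


class QueueDef(NamedTuple):
    queue: str   # single-letter queue name, e.g. "h"
    njobs: int   # maximum jobs run concurrently from this queue
    nice: int    # nice value applied to jobs in this queue
    nwait: int   # seconds to wait before retrying a job that could not start


# Assumed layout: q.[<njobs>j][<nice>n][<nwait>w]
_ENTRY = re.compile(r"^([a-z])\.(?:(\d+)j)?(?:(\d+)n)?(?:(\d+)w)?$")


def parse_queuedef(line: str) -> QueueDef:
    """Parse one queuedefs entry such as 'h.1j0n90w'."""
    match = _ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised queuedefs entry: {line!r}")
    queue, njobs, nice, nwait = match.groups()
    return QueueDef(
        queue,
        int(njobs) if njobs is not None else DEFAULT_NJOBS,
        int(nice) if nice is not None else DEFAULT_NICE,
        int(nwait) if nwait is not None else DEFAULT_NWAIT,
    )


if __name__ == "__main__":
    # The four entries quoted in the post.
    for entry in ("a.2j14n", "b.3j14n90w", "t.2j13n90w", "h.1j0n90w"):
        print(parse_queuedef(entry))
```

Read this way, the `h` entry allows one job at a time, at nice value 0, with a 90 second retry wait.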
greg@ncr-sd.SanDiego.NCR.COM (Greg Noel) (09/20/87)
In article <1305@qfdts.OZ> phil@qfdts.OZ (Phil Chadwick) writes:
>Cron is core dumping fairly regularly - 5 or 6 times
>in the last 4 days. .....

He also includes the following interesting line from his queuedefs file:
> h.1j0n90w

I can't tell if you are having the same problem I had, but this line makes me suspicious. I once tried to set up a single-server queue like this one and it wouldn't work. The symptoms included dropping cores and "infinite" loops. The loops would eventually clear up, but as long as there was an active job and a queued job, cron would loop saying that it was requeueing the job. When the active job finished, after a few minutes (but not immediately for some reason), cron would notice it and schedule the queued job, then return to normal. Needless to say, no other queues were being serviced while this was going on. Two jobs queued in this class seemed to cause the core dump, but not all the time.

I noticed this because my UUCP traffic was getting stalled. After some poking around, I found that my cron log had grown over 50 megabytes in a single day -- hundreds of thousands of "requeueing" messages. You might check your cron log to see if it grew explosively just before cron died. If you have per-process file size limitations or a file system without enough free space for the log (I was lucky that I had no outgoing news backed up; all the sites I feed were up that day), that may be causing a problem as well.

I looked at the code briefly, but nothing seemed obviously wrong about the loop where the message was being generated; it didn't seem to be a fencepost error, anyway. I didn't have a chance to look at it long; I set the number of parallel jobs to two as a temporary fix (this causes the problem to go away) and got involved with some other fire drills. That temporary fix is still in place, almost a year later. If you can set the number of jobs to two, that might serve for the time being; at least it should tell you if it's the same problem.
I'd be curious to know if this problem occurs, and how, and on what other hardware, as I suspect that a problem this obvious could only have gotten out if it is a dereferencing-a-null-pointer bug. It happened to me on a Pyramid under OSx2.5; I haven't checked to see if the problem still occurs in the newer releases.

Oh, and if you find a cure, please let me know; I'd like to get rid of my "temporary" fix.....
--
-- Greg Noel, NCR Rancho Bernardo    Greg.Noel@SanDiego.NCR.COM
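[Editorial sketch] The log check suggested in the post above can be approximated with a short Python script that streams the cron log and counts lines mentioning requeueing. This is only a sketch under assumptions: the exact wording of cron's message is not given in the thread, so the search string "requeue" is a guess, and the path /usr/lib/cron/log is the conventional System V location and may differ on a given system. The function name `count_requeue_lines` is hypothetical.

```python
import os
import sys
from typing import Iterable

# Assumed location of the System V cron log; adjust for the local system.
CRON_LOG = "/usr/lib/cron/log"


def count_requeue_lines(lines: Iterable[str], needle: str = "requeue") -> int:
    """Count lines containing `needle`, ignoring case.

    `needle` is a guess at the message text; the thread only says that
    cron logged "requeueing" messages.
    """
    needle = needle.lower()
    return sum(1 for line in lines if needle in line.lower())


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else CRON_LOG
    if not os.path.exists(path):
        sys.exit(f"no cron log found at {path}")
    size = os.path.getsize(path)
    # Iterate line by line so a very large log is never read into memory.
    with open(path, "r", errors="replace") as log:
        hits = count_requeue_lines(log)
    print(f"{path}: {size} bytes, {hits} lines mentioning requeueing")
```

A very large byte count together with a very large hit count would point towards the same looping behaviour described in the post.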
chris@softway.oz (Chris Maltby) (09/21/87)
I have seen cron produce mysterious core dumps also. Our version on an NCR Tower always core dumps when the crontab command is used. A feature of this Unix version is that null pointer dereferences cause bus errors. As we have no source on hand I have been unable to track it down, and we just don't use the crontab command...
--
Chris Maltby - Softway Pty Ltd (chris@softway.oz)
PHONE: +61-2-698-2322    UUCP: uunet!softway.oz!chris
FAX: +61-2-699-9174      INTERNET: chris@softway.oz.au