[net.bugs.4bsd] The correct clock fix in 4bsd

mjg@raybed2.UUCP (11/30/83)

4.1BSD	CLOCK FIX	CLOCK FIX	CLOCK FIX	CLOCK FIX  4.1BSD

     The following commentary and C code lines have been extracted from
our sys/clock.c file.  Please note the one added line, marked below,
which was missing from the original code.

     We, like a lot of others, noticed our clock was losing time, and we
had our operators reset the date every day.  This bothered me a lot, so I
started looking into it.  The first thing we tried was calling DEC, who
suggested we run the system under VMS.  We did so over a weekend and the
clock was fine, so it had to be UNIX (4.1BSD).  I then put debugging code
into the hardclock routine to find out whether we were missing clock
interrupts, and we were.

     The next test was to make a table 128 entries long with a pointer at
the top.  Each time we entered the hardclock routine and an interrupt had
been missed (I also gave lbolt an extra tick), I incremented the pointer
and then stuffed the PC (from the stack) into the table.  A user program
was then written to continuously read the table from /dev/kmem.  This gave
no answer to the clock problem, but it did tell us where the kernel spends
most of its time (in open, read, and write).
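
     A minimal sketch of that instrumentation follows.  This is not the
code we actually ran; the names (NCLKMISS, clkmiss_pc, clkmiss_ptr) and
the kernel addresses are made up for illustration.  Kernel side, inside
hardclock(pc, ps):

	#define	NCLKMISS	128

	caddr_t	clkmiss_pc[NCLKMISS];	/* ring of interrupted PCs */
	int	clkmiss_ptr;		/* index of the newest entry */

	/* when a lost tick is detected, remember where we were */
	clkmiss_ptr = (clkmiss_ptr + 1) % NCLKMISS;
	clkmiss_pc[clkmiss_ptr] = pc;

User side, polling the table out of /dev/kmem (the addresses would come
from nm on /vmunix; the ones here are hypothetical):

	#include <stdio.h>

	#define	PC_ADDR		0x1234L		/* nm /vmunix: _clkmiss_pc */
	#define	PTR_ADDR	0x1434L		/* nm /vmunix: _clkmiss_ptr */
	#define	NCLKMISS	128

	main()
	{
		int kmem, ptr;
		long pcs[NCLKMISS];

		kmem = open("/dev/kmem", 0);	/* read-only */
		for (;;) {
			lseek(kmem, PTR_ADDR, 0);
			read(kmem, (char *)&ptr, sizeof ptr);
			lseek(kmem, PC_ADDR, 0);
			read(kmem, (char *)pcs, sizeof pcs);
			printf("newest pc = 0x%lx\n", pcs[ptr]);
			sleep(1);
		}
	}
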
     A few months later I attacked the kernel again, once I had figured
out that it only happened when the system was very busy.  We would lose
one clock tick every 5 seconds when the system was busy (load average
> 30) and one tick every 60 seconds when the load average was between 10
and 30.  I then did a very close examination of all the kernel code, C
and assembler, looking for someone who raised the IPL and somehow
bypassed the code lowering it.  After 2-3 days I found it.
In the softclock routine, while running through the callouts, the
priority did not get lowered after the last call.  This left the rest of
the softclock routine running at hardclock priority, which blocks further
hardclock interrupts.  The softclock routine always calls the vmmeter
routine, which does not take too long, except that when (time % 5 == 0)
vmmeter calls vmtotal, and the three together take a very long time to
run.  By the time softclock is finally done we have missed a clock
interrupt and the next one has already arrived.  After softclock and
hardclock finish, we return to whatever was running before, with the IPL
restored from the original stack frame, which in most cases returns the
IPL to zero.

		Martin Grossman        allegra!rayssd!raybed2!mjg
		617-274-7100
		ext 3395 or 4793

===========================================================================
/*
 * Software clock interrupt.
 * This routine runs at lower priority than device interrupts.
 */
/*ARGSUSED*/
softclock(pc, ps)
	caddr_t pc;
{
	register struct callout *p1;
	register struct proc *pp;
	register int a, s;
	caddr_t arg;
	int (*func)();

	/*
	 * Perform callouts (but not after panic's!)
	 */
	if (panicstr == 0) {
		for (;;) {
			s = spl7();	/* block clock interrupts while
					 * the callout queue is touched */
			if ((p1 = calltodo.c_next) == 0 || p1->c_time > 0) {
/* this line is missing */	(void) splx(s);
				break;
			}
			/* unlink the expired callout and return it to
			 * the free list before dropping the IPL */
			calltodo.c_next = p1->c_next;
			arg = p1->c_arg;
			func = p1->c_func;
			p1->c_next = callfree;
			callfree = p1;
			(void) splx(s);
			(*func)(arg);	/* the handler itself runs at
					 * softclock priority */
		}
	}
===========================================================================
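
For context, the callouts that softclock drains above are queued with the
standard 4BSD timeout() call.  A minimal sketch of that pattern (the
driver names here are hypothetical):

	int	mydev_poll();

	mydev_attach()
	{
		/* run mydev_poll(0) about hz/10 ticks from now; softclock
		 * unlinks the entry from calltodo and makes the call */
		timeout(mydev_poll, (caddr_t)0, hz/10);
	}

	mydev_poll(arg)
		caddr_t arg;
	{
		/* ... sample the hardware ... */
		timeout(mydev_poll, arg, hz/10);	/* re-arm */
	}

Note that the (*func)(arg) call was already made at the lowered priority
even before the fix; what the added splx buys is the exit path, so that
vmmeter and the rest of softclock no longer run at clock priority.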

dmmartindale@watcgl.UUCP (Dave Martindale) (12/04/83)

Ah, nostalgia.  I found that bug when first bringing up 4.1BSD a bit
over two years ago.  I had written a DUP11 driver, and to avoid getting
an overrun or underrun, interrupt requests had to be handled within 1
millisecond, since at 9600 baud a new character arrives roughly every
millisecond.  Even on an otherwise-unused machine, overruns happened
consistently.  Looking at the data going by with a serial analyzer
showed that successful transmissions never lasted more than a second.
So, what happens once a second?  Softclock has a lot more work to do.
But how could softclock be running at a priority which locks out
interrupts from the device?  Aha!  There it is.  It only took an hour or
so to find, because the symptom was "spending too long at high priority"
rather than "clock loses time".

Somebody else must have found this very early too;  I think it was mentioned
in a printed list of bug fixes Berkeley sent out, and I know it's been
posted to USENET several times over the years.