mjg@raybed2.UUCP (11/30/83)
4.1bsd CLOCK FIX  CLOCK FIX  CLOCK FIX  CLOCK FIX  4.1bsd

The following commentary and C code have been extracted from our sys/clock.c
file.  Please note the one line added that was missing from the original code
(it is marked with a comment below).

We, like a lot of others, noticed our clock was losing time, and we had our
operators reset the date every day.  This bothered me a lot, so I started
looking into it.  The first thing we tried was calling DEC, who suggested we
run the system under VMS.  We did over a weekend and the clock was fine, so it
had to be UNIX (4.1bsd).

I then put debugging code into the hardclock routine to find out whether we
were missing clock interrupts, and we were.  The next test was to make a table
128 entries long with a pointer at the top.  Each time we entered the hardclock
routine and an interrupt had been missed (I also gave lbolt an extra tick), I
incremented the pointer and stuffed the PC (from the stack) into the table.  A
user program was then written to continuously read the table from /dev/kmem (a
rough sketch of such a reader appears after the code excerpt below).  This gave
no answer to the clock problem, but it did tell us where the kernel spends most
of its time (in open, read, and write).

A few months later I attacked the kernel again, when I figured out the problem
only happened when the system was very busy.  We would lose one clock tick
every 5 seconds when the system was busy (load average > 30) and one tick every
60 seconds when the load average was between 10 and 30.  I then did a very
close examination of all the kernel code, C and assembler, looking for
something that raised the IPL and somehow bypassed the code that lowers it.
After 2-3 days I found it.

When the softclock routine runs the callouts, the priority does not get lowered
after the last call.  This causes the rest of the softclock routine to run at
hardclock priority, which blocks further hardclock interrupts.  Softclock
always calls the vmmeter routine, which does not take too long unless
(time % 5 == 0); then it also calls vmtotal, which, added to vmmeter and
softclock, takes a very long time to run.  By the time softclock is finally
done we have missed a clock interrupt and the next one has already arrived.
After softclock and hardclock finish, we return to whoever was running before,
with the IPL taken from the original stack, which in most cases returns the
IPL to zero.

Martin Grossman
allegra!rayssd!raybed2!mjg
617-274-7100 ext 3395 or 4793

===========================================================================
/*
 * Software clock interrupt.
 * This routine runs at lower priority than device interrupts.
 */
/*ARGSUSED*/
softclock(pc, ps)
	caddr_t pc;
{
	register struct callout *p1;
	register struct proc *pp;
	register int a, s;
	caddr_t arg;
	int (*func)();

	/*
	 * Perform callouts (but not after panic's!)
	 */
	if (panicstr == 0) {
		for (;;) {
			s = spl7();
			if ((p1 = calltodo.c_next) == 0 || p1->c_time > 0) {
				(void) splx(s);		/* this line is missing */
				break;
			}
			calltodo.c_next = p1->c_next;
			arg = p1->c_arg;
			func = p1->c_func;
			p1->c_next = callfree;
			callfree = p1;
			(void) splx(s);
			(*func)(arg);
		}
	}
===========================================================================
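For what it's worth, the user program that watched the table through /dev/kmem
was nothing fancy.  Something along the lines of the sketch below does it; the
table address, symbol name, and polling interval here are made up for
illustration (pull the real address out of nm /vmunix), and error checking is
left out.

===========================================================================
/*
 * pcwatch: continuously dump a 128-entry table of saved PCs out of
 * kernel memory.  PCTAB_ADDR is an illustrative address only; get the
 * real one from "nm /vmunix".  No error checking.
 */
#include <stdio.h>

#define	PCTAB_ADDR	0x800127f0L	/* address of the PC table (made up) */
#define	PCTAB_SIZE	128

main()
{
	int kmem, i;
	long buf[PCTAB_SIZE];

	kmem = open("/dev/kmem", 0);		/* kernel memory, read-only */
	for (;;) {
		lseek(kmem, PCTAB_ADDR, 0);	/* seek to the table */
		read(kmem, (char *)buf, sizeof buf);	/* snapshot all entries */
		for (i = 0; i < PCTAB_SIZE; i++)	/* print 8 PCs per line */
			printf("%08lx%c", buf[i], (i % 8 == 7) ? '\n' : ' ');
		putchar('\n');
		sleep(5);			/* poll every few seconds */
	}
}
===========================================================================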
dmmartindale@watcgl.UUCP (Dave Martindale) (12/04/83)
Ah, nostalgia.  I found that bug when first bringing up 4.1BSD, a bit more than
two years ago.  I had written a DUP11 driver, and to avoid getting an overrun
or underrun, interrupt requests had to be handled within 1 millisecond (at 9600
baud a character goes by roughly every millisecond).  Even on an
otherwise-unused machine, overruns and underruns happened consistently.

Looking at the data going by with a serial analyzer showed that successful
transmissions never lasted more than a second.  So, what happens once a second?
Softclock has a lot more work to do.  But how can softclock be running at a
priority that locks out interrupts from the device?  Aha!  There it is.  It
only took an hour or so to find when the symptom was "spending too long at high
priority" rather than "clock loses time".

Somebody else must have found this very early too; I think it was mentioned in
a printed list of bug fixes Berkeley sent out, and I know it's been posted to
USENET several times over the years.