[comp.protocols.tcp-ip] Interrupts & Polling [was: Re: Super Cheap IP router]

rpw3@amdcad.AMD.COM (Rob Warnock) (04/14/89)

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) writes:
+---------------
| Interrupts kill...
| An interrupt is very useful when you don't expect much activity and don't
| want the overhead of polling.  On the other hand, if you DO expect heavy
| activity, polling is quite nice...
| Many, occasional sources of activity warrant interrupts.
| A few, active sources warrant polling.
+---------------

Which is why the two-level interrupt service structure I wrote a
"tutorial" about in comp.arch (circa 3/20/89?) does exactly this,
although that was not one of the points I stressed in that article.

In the absence of interrupt activity, you are "interrupt driven".
But once you get an interrupt, additional interrupts are queued on
a lightweight task queue, and [this is the part I left unstressed]
every second-level interrupt routine checks for more work of its
own type before exiting, and if there is any, requeues itself on
the tail of the task queue. [This promotes fairness. And of course,
you can have multiple 2nd-level interrupt queues/priorities if you
like.] Thus, once fired up by a 1st-level interrupt, the 2nd-level
interrupt service layer is *polled* -- exactly as you requested!
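
For concreteness, here's the shape of that 2nd-level layer in rough C.
All the names are mine and the real thing was assembler, so treat this
as a sketch of the queue discipline, not code from any actual kernel:

#include <stddef.h>

struct task {
    struct task *next;
    void       (*run)(struct task *);
};

static struct task *q_head, *q_tail;

/* Called with interrupts briefly masked, in the real thing. */
static void enqueue(struct task *t)
{
    t->next = NULL;
    if (q_tail)
        q_tail->next = t;
    else
        q_head = t;
    q_tail = t;
}

/* Stand-ins for a real device: pretend three characters are waiting. */
static int  chars_pending = 3;
static int  device_has_work(void)    { return chars_pending > 0; }
static void device_service_one(void) { if (chars_pending) --chars_pending; }

/* A 2nd-level routine: one unit of work, then -- the part I left
 * unstressed -- check for more work of its own type and, if there is
 * any, requeue at the TAIL of the queue. */
static void tty_task(struct task *self)
{
    device_service_one();
    if (device_has_work())
        enqueue(self);
}

/* Entered once per burst from the 1st-level interrupt, after the one
 * full context save.  From here on, the 2nd level is polled. */
static void run_task_queue(void)
{
    struct task *t;

    while ((t = q_head) != NULL) {
        q_head = t->next;
        if (q_head == NULL)
            q_tail = NULL;
        t->run(t);                 /* may enqueue() more work */
    }
    /* queue empty: restore context and dismiss */
}

int main(void)                     /* trivial demo of the discipline */
{
    static struct task tty = { NULL, tty_task };

    enqueue(&tty);
    run_task_queue();              /* tty_task runs three times */
    return 0;
}

Requeuing at the tail, rather than looping in place, is what keeps one
busy device from starving the rest.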

This has the nice behavior that a full interrupt context save/restore
only has to be done once for each "burst" of interrupts (and in
timesharing systems, interrupts *are* "bursty").

I once wrote a PDP-8/e communications kernel which, using this
approach, was able to handle 10,000 characters a second *through*
the node. Interrupt per character, on each side, or 20k ints/sec.

If you have a *large* number of sources of data to service (such as
*many* async terminal ports, say more than 100, as on that PDP-8/e),
and you have some very-light-weight 1st-level hardware interrupt
mechanism (such as the Am29000's "freeze mode"), then using interrupts
as "asynchronous polls" can save you some polling overhead while still
not requiring a full context save. That is, hardware interrupts remain
enabled even while the 2nd-level service routines work. This is a useful
tradeoff even on some CISCs [such as the MC68000 -- though not the '020],
where the 1st-level catch-the-data/queue-the-2nd-level-task function
can be done without very much save/restore overhead. Matching Dave's
experience, I have seen this technique result in a factor of *12*
improvement in the TTY-handling capacity of a 68000-based Unix.

Note that this "interrupt/polling" switchoff has a direct analogy to
those link-level access protocols which attempt to provide low latency
under light load and good efficiency under heavy load. I'm talking about
things like the "Urn" protocol and some forms of "reservation/TDMA" or
"random-access/TDMA" which have a contention phase which degenerates
into round-robin under load. There are also some forms of "virtual token
bus" which deliberately drop the token under light load, reverting to
a contention mode. [Sorry I don't have access to formal references on
these handy, as I hear a flood of "Tell me more"s already. Anybody else?]

Anyway, the point is that any good design provides a smooth *dynamic*
tradeoff between latency and efficiency.

p.s.
I personally feel that the FDDI standard is oriented too far to the
"efficiency" side; on maximal configurations of FDDI, the "packet
exchange" time (say, an NFS disk block request/response) on a lightly
loaded net can be over 3 milliseconds -- that's worse than Ethernet!


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

pcg@aber-cs.UUCP (Piercarlo Grandi) (04/15/89)

In article <25223@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
    dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) writes:
    +---------------
    | Interrupts kill...
    | An interrupt is very useful when you don't expect much activity and don't
    | want the overhead of polling.  On the other hand, if you DO expect heavy
    | activity, polling is quite nice...
    | Many, occasional sources of activity warrant interrupts.
    | A few, active sources warrant polling.
    +---------------
    
    Which is why the two-level interrupt service structure I wrote a
    "tutorial" about in comp.arch (circa 3/20/89?) does exactly this,
    although that was not one of the points I stressed in that article.
    
    In the absence of interrupt activity, you are "interrupt driven".
    But once you get an interrupt, additional interrupts are queued on
    a lightweight task queue, and [this is the part I left unstressed]
    every second-level interrupt routine checks for more work of its
    own type before exiting, and if there is any, requeues itself on
    the tail of the task queue.

While I agree that the technique is useful, it requires implementing
a lightweight process system in your kernel, which may be major
surgery. In a sense, you are using any interrupt as though it were
the clock interrupt to start polling.

The simple version, used e.g. in many UNIX kernels, is to have any
and every interrupt processing procedure always check at the end for
further pending interrupts for ITSELF and, if there are any, loop back
to process them. Even a little busy waiting, if it is known that there
will be a next interrupt shortly, may be worthwhile, e.g. when reading
packets/bursts from a line on which you are running a protocol.
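
In outline (C-ish, with register names invented just for illustration),
the simple version is no more than:

#include <stdint.h>

#define RX_READY 0x01
static volatile uint8_t sio_status;     /* stand-ins for device registers */
static volatile uint8_t sio_data;
static void give_to_protocol(uint8_t c) { (void)c; }

void sio_rx_intr(void)
{
    int spins;

    do {
        while (sio_status & RX_READY)   /* check again, for ITSELF */
            give_to_protocol(sio_data);
        /* a little bounded busy waiting, when the next interrupt is
         * known to be coming shortly (mid-burst on a protocol line) */
        for (spins = 0; spins < 100 && !(sio_status & RX_READY); spins++)
            ;
    } while (sio_status & RX_READY);
    /* now dismiss the interrupt */
}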

I do not much like the idea of having an interrupt routine, at the end,
fire up polling in other drivers (if I understood correctly what you
are thinking about).
-- 
Piercarlo "Peter" Grandi            |  ARPA: pcg%cs.aber.ac.uk@nss.cs.ucl.ac.uk
Dept of CS, UCW Aberystwyth         |  UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK  |  INET: pcg@cs.aber.ac.uk

rpw3@amdcad.AMD.COM (Rob Warnock) (04/16/89)

[Excuse me if I edited previous comments to the bone, but it's getting deep...]

By the way, let me preface this by saying that I actually completely agree
with Dave Crocker [below] that polling is the way to go when you have a
few very active devices... *if* you don't need to do something else with
your "idle" time. Pure polling is often the best (and sometimes the only)
strategy in an embedded controller. My remarks in the past few postings
have been assuming that there is general-purpose timesharing going on
in the same CPU, and thus you are trying to balance those famous three --
latency, efficiency, and throughput -- *and* give useful time to the "user".

+--------------- In article <820@aber-cs.UUCP> (Piercarlo Grandi) writes:
| +--------------- In article <25223@amdcad.AMD.COM> (Rob Warnock) writes:
| | +--------------- dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) writes:
| | | Interrupts kill... Many, occasional sources of activity warrant
| | | interrupts. A few, active sources warrant polling.
| | Which is why the two-level interrupt service structure I wrote a
| | "tutorial" about in comp.arch (circa 3/20/89?) does exactly this,...
| | But once you get an interrupt, additional interrupts are queued...
| | every second-level interrupt routine checks for more work...
| | if there is any, requeues itself on the tail of the task queue.
| While I agree that the technique is useful, it requires implementing
| a lightweight process system in your kernel, which may be major surgery.
+---------------

The only real surgery, and I'll admit it's a mass of painful detail, is
to go through the kernel (especially the disk cache stuff) taking out all
the unneeded "spl7()" (a.k.a. "splhigh()") calls, and replacing them with
"splsched()". But in fact, for a quick hack, you can simply map *all* the
old calls to "splsched()", and then just put back the few you really need.
[You have to do this, because in any "standard" kernel, there are places
that hold "splhigh" for many *milliseconds*. Any needed mutual exclusion
can always be done in a few microseconds, if only by setting a lock.]
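
Schematically, the change at any one such site looks like this (the spl
spellings vary by kernel, the lock word is mine, and this is only a
sketch of the idea, not code from a real kernel):

/* Stubs standing in for the real spl primitives. */
static int  splhigh(void)  { return 0; }   /* mask everything, 1st level too */
static int  splsched(void) { return 0; }   /* mask 2nd-level work only       */
static void splx(int s)    { (void)s; }    /* restore the previous level     */

static volatile int cache_locked;   /* honored (e.g. by requeuing) by any
                                       handler that shares this data */

void update_cache_entry(void)
{
    int s, hi;

    /* Before:  s = splhigh();  ...possibly milliseconds...  splx(s);  */

    /* After: the long work runs at splsched (2nd-level handlers kept
     * out, device interrupts still live); splhigh is held only for the
     * few microseconds it takes to flip a lock. */
    s = splsched();

    hi = splhigh();
    cache_locked = 1;
    splx(hi);

    /* ... the long-running cache manipulation ... */

    hi = splhigh();
    cache_locked = 0;
    splx(hi);

    splx(s);
}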

As far as "lightweight processes", I think you misunderstood me. There is
a "lightweight task queue", but the "processes" that are run are actually
all interrupt completion routines, all of which were already there. There
is no new "context"; you're still in "interrupt context".

+---------------
| In a sense, you are using any interrupt as though it were
| the clock interrupt to start polling.
+---------------

Precisely! In fact, one version of this actually just used the kernel's
callout queue (the one that "timeout()" uses), and queued interrupt
completions with a time-to-run of "zero ticks", then let the next tick
of the normal "softclock" handler (which runs at the *lowest* interrupt
priority) run the 2nd-level tasks *just as if* they were timeouts which
had expired. [This assumes you can get a fast enough "hardclock". You can
have "urgent" interrupts start the queue running directly, if you like,
but you lose a lot of the efficiency benefit.]

Any new interrupts that occur while one of those is running get queued
after all the "zero" callouts but before any that are really waiting for
time to elapse. (You should add one more pointer to be maintained, so
you don't have to scan the queue.) And of course, the "softclock" handler
always checks the queue again before dismissing, just to see if any new
interrupts have arrived. (*And* if the clock ticks, it adjusts the time left
of the first non-zero task, and if that goes to zero, promotes the task into
the zero group. That way your "real" timeouts stay accurate, too.
Also, the "hardclock" 1st-level handler should bump a count of "ticks
seen while softclock busy" that "softclock" can use to keep time straight.)
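
In outline it comes out something like the following. (The data structure
and names here are mine; the real timeout()/softclock() machinery differs
in detail, and the list manipulation would be done with interrupts briefly
masked.)

#include <stddef.h>

/* Each entry is either a real timeout (ticks > 0, delta-encoded as in
 * the usual callout list) or a queued interrupt completion (ticks == 0). */
struct callout {
    struct callout *next;
    int             ticks;
    void          (*fn)(void *);
    void           *arg;
};

static struct callout *co_head;
static struct callout *co_zero_tail;    /* last zero-tick entry: the extra
                                           pointer, so insertion needn't
                                           scan the queue */
static volatile int    ticks_missed;    /* bumped by hardclock while
                                           softclock is busy */

/* Called by 1st-level handlers: queue a 2nd-level completion with a
 * time-to-run of zero ticks. */
void queue_completion(struct callout *c, void (*fn)(void *), void *arg)
{
    c->fn = fn; c->arg = arg; c->ticks = 0;
    if (co_zero_tail) {                 /* after the other zeroes,       */
        c->next = co_zero_tail->next;   /* before anything still waiting */
        co_zero_tail->next = c;
    } else {
        c->next = co_head;
        co_head = c;
    }
    co_zero_tail = c;
}

/* softclock: runs at the LOWEST interrupt priority. */
void softclock(void)
{
    do {
        /* run everything with zero ticks, just as if it had timed out */
        while (co_head && co_head->ticks == 0) {
            struct callout *c = co_head;
            co_head = c->next;
            if (co_zero_tail == c)
                co_zero_tail = NULL;
            c->fn(c->arg);              /* the 2nd-level completion routine */
        }
        /* account for ticks that arrived while we were busy, and promote
         * a real timeout into the zero group when it expires */
        while (ticks_missed > 0) {
            ticks_missed--;
            if (co_head && co_head->ticks > 0 && --co_head->ticks == 0
                && co_zero_tail == NULL)
                co_zero_tail = co_head;
        }
    } while (co_head && co_head->ticks == 0);   /* check once more
                                                   before dismissing */
}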

+---------------
| The simple version, used e.g. in many UNIX kernels, is to have any
| and every interrupt processing procedure always check at the end for
| further pending interrupts for ITSELF and, if there are any, loop back
| to process them. Even a little busy waiting, if it is known that there
| will be a next interrupt shortly, may be worthwhile, e.g. when reading
| packets/bursts from a line on which you are running a protocol.
+---------------

That's fine, and should be used when appropriate. It mixes well with
the two-level style... as long as you don't leave everybody else's
interrupts turned off during that loop. That's the fundamental problem
the two-level scheme is trying to solve: to decouple the low-latency
needs of simple hardware (e.g., SIOs at 38400 baud) from the efficiency
concerns of not taking/dismissing a bunch of [heavyweight] interrupts
when you know there's still more work to do.

By keeping the 1st-level interrupts *very* lightweight (*don't* save any
context [except maybe a working reg or two], just grab the data, queue it,
then queue your 2nd-level handler), you can afford a *lot* of interrupts,
many more than you would think. And by leaving hardware interrupts *enabled*
during [almost] all 2nd-level processing, you don't lose data due to
latency problems.

And by doing as much 2nd-level work [normal C-language "interrupt handlers"]
as possible before dismissing -- that is, by letting 1st-level interrupts
continue to add to the 2nd-level work queue and not dismissing until the
work queue is empty -- you only do one full save/restore for the lot.
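
Stripped to its shape in rough C (all names mine; the 1st level is
really a handful of assembler instructions, not a C function), the two
levels look about like this:

#include <stdint.h>

/* stand-ins for the machine-dependent pieces */
static volatile uint8_t sio_data;               /* device data register     */
static void hw_enable_interrupts(void)  {}
static void full_context_save(void)     {}      /* done ONCE per burst      */
static void full_context_restore(void)  {}

static uint8_t  rx_ring[256];                   /* data queued by 1st level */
static unsigned rx_head, rx_tail;
static volatile int draining;                   /* 2nd level already active? */

static void second_level(void);

/* 1st level: save a working register or two (implicit here), grab the
 * data, queue it, queue the 2nd-level work.  No full context save. */
void sio_rx_intr_1st(void)
{
    rx_ring[rx_head++ % 256] = sio_data;
    if (!draining)
        second_level();        /* otherwise the active 2nd level finds it */
}

/* 2nd level: one full save/restore for the whole burst; hardware
 * interrupts re-enabled while the work queue is drained. */
static void second_level(void)
{
    draining = 1;
    full_context_save();
    hw_enable_interrupts();                     /* 1st level keeps adding   */
    while (rx_tail != rx_head) {
        uint8_t c = rx_ring[rx_tail++ % 256];
        (void)c;                                /* normal C "handler" work  */
    }
    draining = 0;
    full_context_restore();                     /* dismiss: once per burst  */
}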

+---------------
| I do not much like the idea of having an interrupt routine, at the end,
| fire up polling in other drivers (if I understood correctly what you
| are thinking about).
+---------------

Why not? Shouldn't they get the same benefits as "your" driver?  ;-}  ;-}

Or maybe you don't like the idea of every interrupt triggering a lot
of "polling". Well, that doesn't really happen. You only "poll" those
other handlers for which interrupts occurred (and thus got queued)
while you were in your "dally" loop... and that's why you don't shut
off system-wide interrupts in your dally loop, just the device you
are polling. And then when you decide nothing else is going to happen
[typically after about 1.5 to 3 times the expected next event], you
re-enable your interrupt, return to the common interrupt "scheduler"
[see above], and Lo & Behold!, if while somebody *else* is handling
a burst your device interrupts, you'll get control again before the
ultimate "dismiss" occurs.

This saves *lots* of context save/restore CPU time. Lots.


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

backman@interlan.UUCP (Larry Backman) (04/25/89)

>By keeping the 1st-level interrupts *very* lightweight (*don't* save any
>context [except maybe a working reg or two], just grab the data, queue it,
>then queue your 2nd-level handler), you can afford a *lot* of interrupts,
>many more than you would think. And by leaving hardware interrupts *enabled*
>during [almost] all 2nd-level processing, you don't lose data due to
>latency problems.
>

	[]

	I've been following the "interrupts kill" discussion with interest.
	It's been a few years since I did SDLC/HDLC with the SIO, but as I
	remember...

	The approach of capturing an interrupt, grabbing the appropriate
	SIO registers, and queuing them for later processing worked fine
	except for one nasty case.  That case was the situation where a
	large data frame without the final bit was followed immediately
	by a Receive Ready (3 or 4 bytes).  The burst of interrupts
	generated by end of frame, CRC, start of frame, end of frame, CRC
	was too fast for 4.77 MHz PCs.  If I remember correctly, the only
	approach that worked was a modified interrupt queue/handler.
	Each interrupt was examined as it was processed: if it was
	"lightweight", i.e. a data interrupt or one of the less critical
	start-of-frame or end-of-frame ones, it was queued for later
	processing.  If it was a "heavyweight" interrupt, i.e., the CRC
	for frame two in the situation mentioned above, it was processed
	immediately.  The catch of course was that before the
	"heavyweight" interrupt could be processed, the queue of
	"lightweight" interrupts had to be completely serviced to keep
	things in their proper order.  Now, here we are at interrupt
	time, and a very critical one, doing complete servicing of not
	one, but a whole string of interrupts.
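
	Roughly, and reconstructing from memory with invented names, the
	handler did something like this:

#include <stdint.h>

/* What the capture step saves per interrupt (fields invented). */
struct sio_event {
    uint8_t kind;          /* EV_DATA, EV_START_FRAME, EV_END_FRAME, EV_CRC */
    uint8_t status;
    uint8_t data;
};

enum { EV_DATA, EV_START_FRAME, EV_END_FRAME, EV_CRC };

static struct sio_event evq[64];        /* queued "lightweight" events */
static unsigned evq_head, evq_tail;

static void process_event(const struct sio_event *e) { (void)e; }

static int is_heavyweight(const struct sio_event *e)
{
    /* simplified: in practice only certain CRC interrupts (the CRC of
     * frame two in the case above) had to be handled on the spot */
    return e->kind == EV_CRC;
}

void sio_intr(const struct sio_event *e)
{
    if (!is_heavyweight(e)) {
        evq[evq_head++ % 64] = *e;      /* queue for later processing */
        return;
    }
    /* Heavyweight: everything queued so far must be serviced first to
     * keep things in order -- so here we are, at interrupt time, doing
     * complete servicing of a whole string of interrupts. */
    while (evq_tail != evq_head)
        process_event(&evq[evq_tail++ % 64]);
    process_event(e);
}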

	This approach worked, but as has been previously mentioned, the rest
	of the system suffered.  As I recall, the SIO only had a 2 or 3 byte
	FIFO and did not latch interrupts.  Has a better serial chip come
	out since?


						Larry Backman
						Interlan

rpw3@amdcad.AMD.COM (Rob Warnock) (04/26/89)

In article <606@interlan.UUCP> backman@interlan.UUCP (Larry Backman) writes:
+---------------
| 	The approach of capturing an interrupt, grabbing the appropriate
| 	SIO registers, and queuing them for later processing worked fine
| 	except for one nasty case...  The burst of interrupts generated by
| 	end of frame, CRC, start of frame, end of frame, CRC was
| 	too fast for 4.77 MHz PCs...  [...then discusses fix/hack...]
| 	This approach worked, but as has been previously mentioned, the rest
| 	of the system suffered.  As I recall, the SIO only had a 2 or 3 byte
| 	FIFO and did not latch interrupts.  Has a better serial chip come
| 	out since? | Larry Backman | Interlan
+---------------

No, but faster CPUs have... ;-}  ;-}

Sorry, that was too flip. What we did back at Dig.Comm.Assoc. to handle
similar issues with our poor little 4 MHz Z80's (DCA System-350, and the like)
was to build an interrupt state machine which fielded the hardware interrupts
from the serial chips and stuck the results of the interrupt into a linked
list (via DMA) in the Z-80's memory.

Why did we need all that, given that we had my fancy two-level interrupt
scheme in the first place? Well, yes, of course, there *are* limits to
the two-level scheme. All I've been trying to say was that the limits
are typically much higher than the common perception of interrupt handling
ability for a given CPU. So a 68000 *can* handle a dozen 19200-baud TTYs...

But the "DCA System 350" could have 128 idle ports all starting up from idle
at once, and one poor 4 MHz Z-80 having to deal with it. [Similar to the
problem mentioned by an earlier poster with dial tone in telephone systems.]

This system was an upgrade from an earlier PDP-8 based net node, for which
the PDP-8 was handling all the interrupts from a bus with the ports on it.
[Western Digital "ASTRO" chips, but might as well have been SIOs.] When
the PDP-8 finally ran out of steam, we wanted to use multiple Z-80's to
share the traffic, but without inventing a new bus. So we built a little
state machine to go where the PDP-8 used to go, to handle all the interrupts
from the serial ports. And the Z-80's went on boards that looked sort of
like another serial port to the bus. And there was a little RAM that
told the state machine which Z-80 should get the status for an interrupt
from a given serial port. The state machine wrote a 4-byte "packet" to the
correct Z-80, consisting of {intr-type, src-port, int-status, [int-data]}.
(That is, if the interrupt status on the port said "received data available",
the state machine went ahead and read the data for us.)

Now a Z-80 wasn't all that blindingly fast at handling interrupts, and a
*lot* of people could type <RETURN> at once, and all "idle" ports were
mapped to a common node controller Z-80 (which would assign a port to
the least busy Z-80 when the port went non-idle). So we needed to be able
to handle 128 interrupts at once, and without interfering with full-speed
traffic that might already be going on on one of the ports (say).

The solution was to make the bus interface DMA the 4-byte interrupt
packet into Z-80 RAM, and to allocate enough space for those interrupts
that a given Z-80 could handle the worst-case peak interrupt load.
(It was really quite cute, just a couple of chips.)

When the first-level handler for "bus interrupts" saw that the incoming queue
was non-empty, it fired up the second-level handler, which trundled down the
DMA queue processing transmit & receive interrupts as fast as it could.
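
Sketched in C for readability (the real code was Z-80 assembler, and
the field and routine names here are mine), the Z-80 side amounted to:

#include <stdint.h>

/* One 4-byte "interrupt packet", DMAed into Z-80 RAM by the bus
 * interface state machine: {intr-type, src-port, int-status, [int-data]}. */
struct intr_pkt {
    uint8_t type;
    uint8_t port;
    uint8_t status;
    uint8_t data;        /* already read for us when status said "rx data" */
};

#define PKT_SLOTS 128    /* room for every port interrupting at once */

static struct intr_pkt   pktq[PKT_SLOTS];     /* the DMA target area  */
static volatile unsigned pkt_in;              /* advanced by the DMA  */
static unsigned          pkt_out;             /* advanced by the Z-80 */

enum { INTR_RX, INTR_TX };

static void handle_rx(uint8_t port, uint8_t c) { (void)port; (void)c; }
static void handle_tx_ready(uint8_t port)      { (void)port; }

/* 2nd-level handler, fired when the 1st level sees the queue non-empty:
 * trundle down the DMA queue as fast as possible. */
void bus_intr_2nd(void)
{
    while (pkt_out != pkt_in) {
        struct intr_pkt *p = &pktq[pkt_out++ % PKT_SLOTS];

        switch (p->type) {
        case INTR_RX:
            handle_rx(p->port, p->data);
            break;
        case INTR_TX:
            handle_tx_ready(p->port);
            break;
        }
    }
}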

So I would say if your 4.77 MHz PC wasn't fast enough to handle the
SDLC inter-frame timing, you could either get a faster CPU, or make
the serial port "smarter". This may mean putting a small state machine
on the port card to handle the inter-frame state changes, and/or putting a
FIFO on board to queue the interrupt activity. There are many solutions.


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403