[comp.sys.isis] Congestion: how to recognize it, what causes it

ken@gvax.cs.cornell.edu (Ken Birman) (07/11/90)
A few people have recently sent me email roughly along the lines
of the following:

... from perplexed user A
> I have a job scheduler that for some reason always hangs after scheduling
> 17 tasks; the 18th never gets run.  Is 17 a special number for some reason?

... from perplexed user B
> Is 15 a magic number when it comes to group sizes or bypass comms?  I have
> been trying to establish a process group of size 28, and have problems
> after fifteen members have joined...

The significance of the numbers 15 and 17 here is that protos becomes
congested when it has more than about 15 tasks active at the same time.

A protos task is needed for initiating each broadcast, including a join
request.  The task remains active until the broadcast replies are collected
(except in the BYPASS version of ISIS, where this work is done in the client
if the broadcast was done in BYPASS mode).  

Thus, if your code does a
	bcast(gaddr, MY_ENTRY, "...", ..., ALL, "%d", ...)
and some member of the group gaddr fails to do a reply, nullreply or
abortreply, a protos task will be created for this bcast and will
remain active indefinitely.

When protos congests, it tells clients to stop initiating new broadcasts,
although they can still send reply, nullreply and abortreply messages.  
This is typically when you notice that things have jammed up.
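
To make the arithmetic concrete, here is a toy model in C of the behavior
described above.  It is only a sketch and not ISIS code: the struct, the
toy_* functions and the exact cutoff (TASK_LIMIT = 15, where the real
threshold is only "about 15") are assumptions made up for illustration, and
a refused broadcast stands in for a client that has been told to stop
initiating new ones.

```c
#include <assert.h>

#define TASK_LIMIT 15   /* assumed cutoff; the real figure is "about 15" */

/* Toy stand-in for the protocol server's bookkeeping (hypothetical). */
struct toy_protos {
    int active;    /* broadcasts whose replies are still outstanding */
    int refused;   /* broadcasts turned away while congested */
};

/* Congested once the number of active tasks reaches the limit. */
int toy_congested(const struct toy_protos *p)
{
    return p->active >= TASK_LIMIT;
}

/* Try to start a broadcast: returns 1 if a task was created, 0 if not. */
int toy_bcast_start(struct toy_protos *p)
{
    if (toy_congested(p)) {
        p->refused++;
        return 0;
    }
    p->active++;
    return 1;
}

/* All expected replies arrived, so the task for this broadcast goes away. */
void toy_replies_collected(struct toy_protos *p)
{
    assert(p->active > 0);
    p->active--;
}

/*
 * Issue nbcasts broadcasts one after another.  If members_reply is nonzero,
 * each broadcast completes before the next one starts; otherwise every
 * task stays active forever.  Returns how many broadcasts got started.
 */
int toy_run(int nbcasts, int members_reply)
{
    struct toy_protos p = {0, 0};
    int started = 0;

    for (int i = 0; i < nbcasts; i++) {
        if (!toy_bcast_start(&p))
            continue;
        started++;
        if (members_reply)
            toy_replies_collected(&p);
    }
    assert(started + p.refused == nbcasts);
    return started;
}
```

In this model toy_run(28, 1) starts all 28 broadcasts, while toy_run(28, 0)
starts only 15 before jamming, which is the pattern user B describes.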

To detect this, either use a "cmd snap" and interpret it as per the manual
pages (the protos log will show the blocked tasks and what they are
waiting for), or use kill -USR2 to the same effect, or run "prstat"
(version V2.0) and look at the column that shows congestion (CO) and the
one that shows why (CFLAGS).  To fix the problem, make sure that your code
replies when it is supposed to, and sends a nullreply when it has nothing
to say but the caller might still be waiting (as in the case of a BCAST
with ALL specified).
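
One way to structure an entry handler so that it cannot forget to answer is
sketched below.  This is not written against the real ISIS calls: the
toy_request struct and the toy_reply, toy_nullreply and toy_entry functions
are hypothetical stand-ins, and the "only the owner of a key answers" rule
is just an example.  The point is that every path out of the handler ends in
exactly one answer.

```c
#include <assert.h>

/* What, if anything, has been sent back for a request (hypothetical). */
enum toy_answer { TOY_NONE, TOY_REPLY, TOY_NULLREPLY };

struct toy_request {
    int key;                   /* which item the caller is asking about */
    enum toy_answer answered;  /* set once the handler has answered */
    int value;                 /* payload, valid only for TOY_REPLY */
};

/* Stand-in for a real reply carrying data; answering twice is a bug. */
static void toy_reply(struct toy_request *r, int value)
{
    assert(r->answered == TOY_NONE);
    r->answered = TOY_REPLY;
    r->value = value;
}

/* Stand-in for a null reply: "nothing from me, do not wait for me". */
static void toy_nullreply(struct toy_request *r)
{
    assert(r->answered == TOY_NONE);
    r->answered = TOY_NULLREPLY;
}

/*
 * Hypothetical entry handler.  Only the member that owns the key has data
 * to send; every other member still answers, with a null reply, so that a
 * caller waiting for ALL members is released.
 */
void toy_entry(struct toy_request *r, int my_key, int my_value)
{
    if (r->key != my_key) {
        toy_nullreply(r);
        return;
    }
    toy_reply(r, my_value);
}
```

The early return is where real code tends to go wrong: it is easy to write
a bare "return" there and leave the caller's task hanging.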

The manual discusses this in more detail.

Ken