[comp.sys.sun] ie1 problems on Sun 4/280 solved

sater@uunet.uu.net (Hans van Staveren) (12/09/88)

About two months ago we had big problems with the ie1 board on a Sun
4/280: it lost large numbers of packets, and the IP queue filled up and
never emptied again. We ran Sys4-3.2EXPORT and Sun Netherlands was
supposed to figure it out. Well, they didn't; we did.

The first thing I thought of when I saw the symptoms was a race.  I asked
Sun whether the interrupt priority of the board was right, and they
claimed it was. Now, two months and a lot of pain later, I have found out
that the interrupt priority is wrong, although the problem is more subtle
than I originally suspected.

Bear with me while I go technical for the next three paragraphs:

In the SunOS kernel all networking is supposed to be done at CPU priority
splimp() or higher to prevent devices interrupting critical queue
manipulations. On Sun 3 workstations splimp() is level 3 and ie0 and ie1
also interrupt at level 3, so all is well.  The SPARC chip in the Sun4 has
twice as many interrupt levels as the MC68020 in the Sun3, and Sun
made up a way to map the VMEbus interrupt request levels to SPARC
interrupt levels.

It *seems* that all offboard interrupts come in at odd levels (1,3,5,7,...)
and all onboard interrupts at even levels (2,4,6,8,...).  This means that
the onboard ie0 and the offboard ie1 *cannot* interrupt at the same level:
ie0 comes in at level 6, and ie1 at level 5.  On the Sun4 splimp() is
level 6.
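
To make the level arithmetic concrete, here is a minimal sketch of the
masking rule, assuming the usual model in which a request is delivered
only when its level is strictly higher than the current CPU priority.
The constants are the levels quoted above; the names (can_interrupt,
check_sun4_levels, SPLIMP_SUN4 and so on) are made up for illustration
and are not from Sun's source.

```c
#include <assert.h>
#include <stdbool.h>

/* Levels as reported in the post for the Sun-4; the names are illustrative. */
enum { SPLIMP_SUN4 = 6, IE0_LEVEL = 6, IE1_LEVEL = 5 };

/* Simplified model (an assumption): a request gets through only if its
 * level is strictly higher than the priority the CPU is running at. */
bool can_interrupt(int irq_level, int cpu_priority)
{
    return irq_level > cpu_priority;
}

/* The observations from the post restated as checks against the model. */
void check_sun4_levels(void)
{
    assert(!can_interrupt(IE0_LEVEL, SPLIMP_SUN4)); /* splimp() masks ie0 */
    assert(!can_interrupt(IE1_LEVEL, SPLIMP_SUN4)); /* splimp() masks ie1 */
    assert(can_interrupt(IE0_LEVEL, IE1_LEVEL));    /* ie0 preempts code at level 5 */
}
```

Code running at splimp() is safe from both boards; code that merely runs
at ie1's own level 5 is not safe from ie0.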

Now this still would have worked if the ie1 interrupt routine, which runs
at level 5, had made a call to raise the level to 6.  Almost needless to
say, this call is not there.  The effect of all this is that while ie1 is
queuing packets, ie0 can still interrupt, destroying the consistency of
the system.
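
One interleaving that would corrupt the queue can be replayed by hand.
The sketch below is an assumption-laden model, not Sun's code: struct pkt
and struct ifqueue are simplified stand-ins for BSD-style mbuf queues, and
enqueue_interrupted() hard-codes a hypothetical moment at which the
higher-priority handler cuts in, right after the lower-priority one has
looked at the tail pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for a BSD-style packet queue (illustrative only). */
struct pkt { struct pkt *next; };
struct ifqueue { struct pkt *head, *tail; int len; };

/* Ordinary, uninterrupted enqueue at the tail. */
void enqueue(struct ifqueue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail == NULL)
        q->head = p;
    else
        q->tail->next = p;
    q->tail = p;
    q->len++;
}

/* Ordinary dequeue from the head; returns NULL when the list is empty. */
struct pkt *dequeue(struct ifqueue *q)
{
    struct pkt *p = q->head;
    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;
        p->next = NULL;
        q->len--;
    }
    return p;
}

/* Enqueue of `a` by the low-priority handler, interrupted after it has
 * read the tail pointer by a handler that enqueues `b` completely. */
void enqueue_interrupted(struct ifqueue *q, struct pkt *a, struct pkt *b)
{
    assert(a != NULL && b != NULL && a != b);
    a->next = NULL;
    struct pkt *stale_tail = q->tail;  /* low-priority handler reads tail */
    enqueue(q, b);                     /* high-priority interrupt runs here */
    if (stale_tail == NULL)            /* resumes with its stale view */
        q->head = a;
    else
        stale_tail->next = a;
    q->tail = a;
    q->len++;
}

/* Runs the race once on an empty queue, drains the list, and returns
 * whatever the length counter still claims afterwards. */
int leftover_len_after_race(void)
{
    struct ifqueue q = { NULL, NULL, 0 };
    struct pkt a, b;
    enqueue_interrupted(&q, &a, &b);
    while (dequeue(&q) != NULL)
        ;
    return q.len;
}
```

In this model packet b drops out of the list but is still counted, so
after draining the list the counter stays one too high. Repeated often
enough, that would match a queue that looks full while holding nothing.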

End of technical mode.

I am annoyed. I was right within a minute and I had to suffer for two
months and then figure it out myself, without documentation or source.
Does Sun assume all customers are dumb? They could at least have checked
it; I suggested the priority several times as a possible cause.

The strangest thing is that this must have happened to lots of other
people, but a message to this worthy list brought up nothing.

Is there anybody out there who has seen this before?

	Hans van Staveren
	Vrije Universiteit
	Amsterdam, Holland

slevy@uf.msc.umn.edu (Stuart Levy) (12/20/88)

We have seen (and, likewise, had to uncover ourselves) a similar problem.
We don't have a Sun-4, but do have an MCP card in a Sun-3.  This is a
smart serial card (does synchronous mode, DMA, etc) which Sun supports as
a network interface; we're using it as an HDH interface to ARPAnet.

We encountered the same symptoms -- the IP input queue counter would
gradually increase over time until reaching ifq_maxlen (50), at which point
no more input packets would be accepted even though the queues themselves
were empty.
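
Why a stale counter shuts input off completely can be shown with a small
sketch. It assumes the input path refuses a packet whenever the length
counter has reached the limit, in the style of the BSD full-queue test;
the struct and function names here are hypothetical, and only the limit
of 50 comes from what we observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the bookkeeping fields of the queue; the packet list is omitted. */
struct ifq_counters { int len; int maxlen; int drops; };

/* The queue counts as full once the counter reaches the limit, whether
 * or not any packets are actually linked on it. */
bool queue_full(const struct ifq_counters *q)
{
    return q->len >= q->maxlen;
}

/* Returns true if the packet would be accepted, false if it is dropped. */
bool try_input(struct ifq_counters *q)
{
    assert(q != NULL);
    if (queue_full(q)) {
        q->drops++;
        return false;
    }
    q->len++;
    return true;
}
```

With len stuck at 50 and maxlen at 50, every call to try_input() is a
drop, even though nothing is waiting to be processed.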

It turns out the MCP interrupts at priority 4, while all other interfaces
use priority 3, so we get the same effect as you do.  

It's too bad I didn't see your earlier message.

The funny thing was, I actually tried changing the drivers to raise
priority around the IF_ENQUEUE's as well as (effectively) around
IF_DEQUEUE in ipintr() (did that by just making splimp raise to priority 4
rather than 3).  The problem didn't go away, though, and I still don't
understand why.
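
For readers who have not seen the idiom, the change amounts to the usual
save, raise, restore bracket around the queue update. What follows is a
user-space model only: cpu_priority, spl_raise(), spl_restore() and
guarded_enqueue() are hypothetical stand-ins, the real kernel routines
manipulate the processor's interrupt mask, and the rule that spl_raise()
never lowers the priority is an assumption of this sketch.

```c
#include <assert.h>

/* Simulated CPU interrupt priority (a real kernel keeps this in the
 * processor status register). */
int cpu_priority = 0;

/* Simulated queue length and the priority observed during the update. */
int queue_len = 0;
int priority_during_update = -1;

/* Raise the priority to at least `level`; return the previous value. */
int spl_raise(int level)
{
    int old = cpu_priority;
    if (level > cpu_priority)
        cpu_priority = level;
    return old;
}

/* Put back the priority saved by spl_raise(). */
void spl_restore(int old)
{
    cpu_priority = old;
}

/* Queue update bracketed by the raise/restore pair. */
void guarded_enqueue(int level)
{
    int s = spl_raise(level);
    assert(cpu_priority >= level);   /* nothing at or below `level` gets in */
    priority_during_update = cpu_priority;
    queue_len++;
    spl_restore(s);
}
```

Calling guarded_enqueue(4) from code running at priority 3 performs the
update at 4 and then drops back to 3.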

Anyway, maybe Sun will take two such reports more seriously than one,
though it might not be as seriously as we'd like.  They haven't admitted
to us that they've done anything wrong, so far.

	Stuart Levy, Minnesota Supercomputer Center
	slevy@uc.msc.umn.edu