sater@uunet.uu.net (Hans van Staveren) (12/09/88)
About two months ago we had big problems with the ie1 board on a Sun 4/280: it lost large numbers of packets, and the IP queue filled up and never emptied again. We ran Sys4-3.2EXPORT, and Sun Netherlands was supposed to figure it out. Well, they didn't; we did.

The first thing I thought of when I saw the symptoms was a race. I asked Sun whether the interrupt priority of the board was right, and they claimed it was. Now, two months and a lot of pain later, I have found out that the interrupt priority is wrong, although the problem is more subtle than I originally suspected. Bear with me while I go technical for the next three paragraphs:

In the SunOS kernel all networking is supposed to be done at CPU priority splimp() or higher, to prevent devices from interrupting critical queue manipulations. On Sun 3 workstations splimp() is level 3, and ie0 and ie1 also interrupt at level 3, so all is well.

The SPARC chip in the Sun4 has twice as many interrupt levels as the MC68020 in the Sun3, and Sun made up a way to map the VMEbus interrupt request levels to SPARC interrupt levels. It *seems* that all offboard interrupts come in at odd levels (1, 3, 5, 7, ...) and all onboard interrupts at even levels (2, 4, 6, 8, ...). This means that the onboard ie0 and the offboard ie1 *cannot* interrupt at the same level: ie0 comes in at level 6, and ie1 at level 5. On the Sun4 splimp() is level 6.

Now this would still have worked if the ie1 interrupt routine, running at level 5, made a call to raise the level to 6. Almost needless to say, this call is not there. The effect of all this is that while ie1 is queuing packets, ie0 can still interrupt, destroying the consistency of the system.

End of technical mode. I am annoyed. I was right within a minute, yet I had to suffer for two months and then figure it out myself, without documentation or source. Does Sun assume all customers are dumb?
They could at least have checked it; I suggested the priority several times as a possible cause. The strangest thing is that this must have happened to lots of other people, but a message to this worthy list brought up nothing. Is there anybody out there who has seen this before?

Hans van Staveren
Vrije Universiteit Amsterdam, Holland
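
The priority argument in the message above can be checked with a small model. This is only a sketch, not SunOS code: the function and constant names are made up for illustration, and it assumes the usual rule that an interrupt is delivered only when its level is strictly higher than the level the CPU is currently running at. The level numbers are the ones quoted in the message.

```c
#include <assert.h>

/* Levels as quoted in the message; the names are hypothetical. */
enum {
    SUN3_SPLIMP = 3, SUN3_IE0 = 3, SUN3_IE1 = 3,
    SUN4_SPLIMP = 6, SUN4_IE0 = 6, SUN4_IE1 = 5
};

/*
 * Level at which an interrupt handler actually runs.  It is entered at
 * its own interrupt level; if it calls splimp() (raises != 0) the level
 * becomes the higher of the two.
 */
int running_level(int irq_level, int splimp_level, int raises)
{
    if (raises && splimp_level > irq_level)
        return splimp_level;
    return irq_level;
}

/*
 * Assumed delivery rule: a pending interrupt gets through only if its
 * level is strictly above the level the CPU is running at.
 */
int can_preempt(int current_level, int irq_level)
{
    return irq_level > current_level;
}
```

With these definitions, can_preempt(running_level(SUN4_IE1, SUN4_SPLIMP, 0), SUN4_IE0) is true, which is the situation described: the ie1 handler runs at level 5 and ie0 at level 6 can break in. Passing 1 as the last argument of running_level() models the missing call that raises the level to 6, and the same check then comes out false. On the Sun 3 figures both boards sit at level 3, so neither can preempt the other.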
slevy@uf.msc.umn.edu (Stuart Levy) (12/20/88)
We have seen (and, likewise, had to uncover ourselves) a similar problem. We don't have a Sun-4, but do have an MCP card in a Sun-3. This is a smart serial card (does synchronous mode, DMA, etc.) which Sun supports as a network interface; we're using it as an HDH interface to ARPAnet.

We encountered the same symptoms -- the IP input queue counter would gradually increase over time until it reached ifq_maxlen (50), at which point no more input packets would be accepted even though the queues themselves were empty. It turns out the MCP interrupts at priority 4, while all other interfaces use priority 3, so we get the same effect as you do. It's too bad I didn't see your earlier message.

The funny thing was, I actually tried changing the drivers to raise priority around the IF_ENQUEUEs as well as (effectively) around IF_DEQUEUE in ipintr() (I did that by just making splimp raise to priority 4 rather than 3). The problem didn't go away, though; I still don't understand why.

Anyway, maybe Sun will take two such reports more seriously than one, though perhaps not as seriously as we'd like. So far they haven't admitted to us that they've done anything wrong.

Stuart Levy, Minnesota Supercomputer Center
slevy@uc.msc.umn.edu
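
The symptom reported in both messages, a queue counter that climbs to ifq_maxlen while the queue itself is empty, can be reproduced in a toy model. Neither message says exactly which interleaving occurs, so the sketch below shows only one plausible case: a second enqueue arrives after the first has looked at the tail pointer but before it has updated the list. The struct and function names are hypothetical and the enqueue/dequeue logic is only loosely modelled on the BSD IF_ENQUEUE and IF_DEQUEUE macros. The value 50 is the ifq_maxlen quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the mbuf chain and the interface queue. */
struct pkt { struct pkt *next; };
struct ifq_model {
    struct pkt *head, *tail;
    int len;        /* packet counter */
    int maxlen;     /* queue counts as full when len >= maxlen */
};

/*
 * Enqueue m.  If intr is non-NULL, it is enqueued "by an interrupt"
 * after the tail has been inspected but before the list is updated.
 */
void enqueue_racy(struct ifq_model *q, struct pkt *m, struct pkt *intr)
{
    struct pkt *old_tail = q->tail;      /* step 1: look at the tail */
    m->next = NULL;
    if (intr != NULL)
        enqueue_racy(q, intr, NULL);     /* simulated interrupt */
    if (old_tail == NULL)                /* step 2: act on the stale view */
        q->head = m;
    else
        old_tail->next = m;
    q->tail = m;
    q->len++;
}

/* Remove and return the first packet, or NULL if the list is empty. */
struct pkt *dequeue(struct ifq_model *q)
{
    struct pkt *m = q->head;
    if (m != NULL) {
        q->head = m->next;
        if (q->head == NULL)
            q->tail = NULL;
        m->next = NULL;
        q->len--;
    }
    return m;
}

/* Do `rounds` colliding enqueues, draining the list after each one. */
static void run(struct ifq_model *q, int rounds)
{
    struct pkt a, b;
    int i;
    for (i = 0; i < rounds; i++) {
        enqueue_racy(q, &a, &b);         /* b is dropped from the list */
        while (dequeue(q) != NULL)
            ;
    }
}

/* Counter value left over once the list has been drained. */
int leaked_count(int rounds)
{
    struct ifq_model q = { NULL, NULL, 0, 50 };
    run(&q, rounds);
    return q.len;
}

/* 1 if the list is empty yet the counter says the queue is full. */
int stuck_after(int rounds)
{
    struct ifq_model q = { NULL, NULL, 0, 50 };
    run(&q, rounds);
    return q.head == NULL && q.len >= q.maxlen;
}
```

In this model every collision counts two packets but links only one, so one unit of the counter is never given back. After 50 such collisions the list is empty and the counter sits at 50, and stuck_after(50) returns 1. The sketch does not explain why raising splimp to priority 4 failed to cure the problem in the MCP case.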