[comp.sys.apollo] DN4500 arbitrarily overloads itself

dpassage@soda.berkeley.edu (David G. Paschich) (06/21/91)

In article <0677436884@INESCN.RCCN.PT>, 
	JCF%INESCN.RCCN.PT@CUNYVM.CUNY.EDU (Joao Canas Ferreira) writes:

	   One of our Apollos (a DN4500, Sr10.2) has developed the habit
   of slowing down so much that it becomes unusable. This behaviour seems
   to start out of the blue. Sometimes the users manage to logout, but
   it is usually impossible to login again: when you type the username
   the station locks up. Hitting the return key seems to have no effect.
   Using the backspace key causes some black squares to appear on
   the DM window, but no backspacing.

	   During one of the last fits, the user tried to logout. After some
   time, he got the (almost) usual question (Blast ? (Y/N)). After answering
   Yes and waiting some time, the message  'Unable to obtain scfb hash table mutex
   lock from (stream manager/ scfb)' appeared.

Sounds as if maybe your machine is hitting the 64 process limit that
existed under SR 10.2.  The machine we use as a mail gateway was
continually having this problem.  One work-around we used was to write
a daemon which would watch for the existence of a certain file in
`node_data, and when it appeared, kill all the sendmail processes on
the machine.  That way we didn't have to start a new process in order
to get the machine usable again.
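
If anyone wants it, the watcher amounted to something like this (a
from-memory sketch against the SR10 BSD environment; the flag-file name
and the pid hunting are made up for illustration):

    /* Poll for a flag file in `node_data (spelled here as its usual
     * /sys/node_data resolution); when it appears, kill the sendmail
     * processes and remove the flag.  Sketch only. */
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>

    #define FLAG "/sys/node_data/kill_sendmail"   /* hypothetical name */

    int main(void)
    {
        struct stat sb;
        for (;;) {
            if (stat(FLAG, &sb) == 0) {
                /* BSD ps prints the pid in column 1; the [s] keeps
                 * the pipeline from matching itself */
                system("kill `ps ax | awk '/[s]endmail/ {print $1}'`");
                unlink(FLAG);
            }
            sleep(30);                /* don't burn cycles polling */
        }
    }

The point of the flag file is that the already-running daemon does the
killing, so nothing new has to be forked on a machine that's out of
process slots.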

The 64 process limit was thankfully raised in 10.3 to something more
reasonable (32 processes/megabyte RAM, I believe).
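(By that arithmetic even an 8-megabyte machine would get 256 processes,
four times the old ceiling, if I have the figure right.)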

--
David G. Paschich	Open Computing Facility		UC Berkeley
dpassage@ocf.berkeley.edu
"Can Spam increase sexual potency?  `No!' say scientists!" -- Trygve Lode

wjw@ebh.eb.ele.tue.nl (Willem Jan Withagen) (06/21/91)

In article <DPASSAGE.91Jun20122344@soda.berkeley.edu>, dpassage@soda.berkeley.edu (David G. Paschich) writes:
=> In article <0677436884@INESCN.RCCN.PT>, 
=> 	JCF%INESCN.RCCN.PT@CUNYVM.CUNY.EDU (Joao Canas Ferreira) writes:
=> 
=> 	   During one of the last fits, the user tried to logout. After some
=>    time, he got the (almost) usual question (Blast ? (Y/N)). After answering
=>    Yes and waiting some time, the message  'Unable to obtain scfb hash table mutex
=>    lock from (stream manager/ scfb)' appeared.
=> 
=> Sounds as if maybe your machine is hitting the 64 process limit that
=> existed under SR 10.2.  The machine we use as a mail gateway was
=> continually having this problem.  One work-around we used was to write
=> a daemon which would watch for the existence of a certain file in
=> `node_data, and when it appeared, kill all the sendmail processes on
=> the machine.  That way we didn't have to start a new process in order
=> to get the machine usable again.
=> 
=> The 64 process limit was thankfully raised in 10.3 to something more
=> reasonable (32 processes/megabyte RAM, I believe).
=> 

There's also a patch which addresses this problem:
take a look at the 9012 patch info for patch 139.

Patchinfo is available for ftp at
	ftp.eb.ele.tue.nl
	in /pub/apollo
Note that there was a series of sr10.2 patches, the last info for which is on
the 9012 tape. Then a new series started, of which the last one is 9106.

Ciao,
	Willem Jan
-- 
Eindhoven University of Technology   DomainName:  wjw@eb.ele.tue.nl    
Digital Systems Group, Room EH 10.10 
P.O. 513                             Tel: +31-40-473401
5600 MB Eindhoven                    The Netherlands

wilsonj@gtephx.UUCP (Jay Wilson) (06/24/91)

In article <DPASSAGE.91Jun20122344@soda.berkeley.edu>, dpassage@soda.berkeley.edu (David G. Paschich) writes:
> In article <0677436884@INESCN.RCCN.PT>, 
> 	JCF%INESCN.RCCN.PT@CUNYVM.CUNY.EDU (Joao Canas Ferreira) writes:
> 

......

> 	   During one of the last fits, the user tried to logout. After some
>    time, he got the (almost) usual question (Blast ? (Y/N)). After answering
>    Yes and waiting some time, the message  'Unable to obtain scfb hash table mutex
>    lock from (stream manager/ scfb)' appeared.
> 

......


I saw this posting and I could not resist having one of my partners
in crime (there are 6 of us Sys_admins) respond to it.  He has been tracking
the Mutex Lock problem for over a year now and this is what he had to say.

(FLAME ON)

Dear Mr. Ferreira,

     Your message (0677436884@INESCN.RCCN.PT) concerning the 
"mutex lock/sfcb hash table" error struck a nerve right into
the core of my spine. We have around 530 workstations at our
site and we have been attempting to combat this virus for many 
months now.

(I like to call it a "virus" as there is no way to control it and
 no one at Apollo can tell us what REALLY causes it or how to 
 stop it. Just by having sys_admins from various sites throw
 the word "virus" around when referring to something in the Apollo
 operating system should strike terror into Apollo/HP sales staff,
 and maybe someone with pull will prime the Apollo R&D engineering
 pump and get a resolution.)

The error will rear its ugly head with no warning or pattern,
and once you get it you MUST reboot and run the long SALVOL
to appease its appetite for disaster.

Please note a few items:

1). When you run the long SALVOL be sure to give the options
    separately, as follows:  1  -f -a -s -t
    We determined that, as of 10.2, SALVOL no longer allows you to
    string the options together as "1 -fast".

2). The long SALVOL will clear up the problem for varying periods of
    time, but there are no guarantees. It does seem to do more good
    than harm.

3). Once a node gets "infected" with this error it seems to get it again
    and again. "Uninfected" nodes seem to be o.k. until ...

4). INVOL and reloading software did not help the nodes that had the
    errors frequently, but for some strange reason replacing the CPU
    made the errors less frequent. (We had one machine getting this error
    daily - replaced the CPU - now it only gets it weekly.)

5). There are a few patches from Apollo to correct this, but none
    of them have put a dent in our problems.

    Patches: 139 and 196 for example

6). The problem does not seem to be machine type (3000/4000/3500) 
    specific, and we get it on all types. We even had Apollo check
    CPU rev levels on the machines.  The virus is much more active
    at SR10.2 though !!!

     We thought that maybe our users were using some tool that was causing
the problems, so I asked a user who saw the hang daily on his node
to work on another node for a while.  He saw the hang only once in the two
weeks he was there.  Also, I could log into his workstation and BOOM -
"sfcb/mutex".  We also noticed the case of a user who got the problem on his
node going to another node and "infecting" it.  (That user has now been
labeled, and must shout "unclean, unclean!" when coming in contact with
other users.  A punishment I would not wish on anyone.)

Another oddity is most of our people use the same tools, have the same 
load of o/s, and work on the same data yet no one can explain (SURPRISE!)
why some nodes hang and other nodes never see the virus. (I think it's
the will of Zeus as a punishment for Apollo starting his own business
outside of the Mount Olympus tax jurisdiction.)

If you would like more information on exactly what "sfcb hash table" and
"mutex lock" are, please refer to a copy of the "Domain/OS Design Principles"
014962-A00, pages 9-14 and 9-15.  I found this to be a better explanation than
the one I got from the response center: "It's a table that controls everything."
Apollo claims the problem has been fixed at SR10.3 (I won't even get started on
the extreme, orgasmic joy experienced when hearing this phrase),
but that is yet to be determined. We have SR10.3 onsite, but it will be a
while before we can get all 530 machines up on it. In Apollo's defense I can say
that none of the 6 machines we have running SR10.3 have seen this error ...
maybe it's because we don't use them yet?

I do have an open APR/SR and an open call A2047527, but it's probably been 
closed because I was out for an afternoon and missed the "call me by the end
of the day or I'm closing it" call.

My latest efforts to get Apollo going on this problem seem to be working better,
and after a few calls with some of the upper Apollo support personnel I feel
they are actually looking into the virus. I will keep you posted, as there are
probably many items I forgot here; the portion of my brain that deals
with the mutex lock seems to get fuzzier each day as I burst various blood
vessels in dismay.  But please - on behalf of myself, your family, and your
Apollo sys_admin brethren everywhere ...
don't hold your breath.

-- 

  Matt Ferris
  Systems Programmer
  AG Communication Systems
  2500 West Utopia Road
  Phoenix, AZ     85027
  Phone 1-(602)-582-7634
  Fax   1-(602)-581-4967                                 

(FLAME OFF)

I am the only one in the group who monitors what is going on on the net, which
is why Matt fed his reply back via me.  If you have any replies for him, please
send them to him directly at:

   UUCP    : {ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!ferrism
   INTERNET: gtephx!ferrism@asuvax.eas.asu.edu



Thanks
--
Jay Wilson (wilsonj@gtephx)
SR Systems Programmer
UUCP    : {ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!wilsonj
INTERNET: gtephx!wilsonj@asuvax.eas.asu.edu

AG Communication Systems, Phoenix, AZ
voice (602) 581-4496
fax   (602) 581-4967

"A river that overflows its banks is never a problem until a road is
 built across it."

boylel@gtephx.UUCP (Lee Boyle x4528) (06/25/91)

In article <1991Jun24.002623.18899@gtephx.UUCP>, wilsonj@gtephx.UUCP (Jay Wilson) writes:
> Apollo claims the problem has been fixed at SR10.3, (I won't even get started on
> the extreme, orgasmic joy experienced when hearing this phrase)

  I've seen a couple of references to the fact that this problem has gone
  away under SR10.3.  That's almost true.  At least the one cause I know
  about seems to be preventable.  

  I was bitten by code which used pgm_$invoke to create a process running 
  /com/sh without bothering to connect a valid stream to errout.  Under 
  SR9.7, the error output just went into the bit bucket.  Under SR10.3
  (quoting the APR response), "This caused other problems that compounded
  as execution progressed from that point."  These problems eventually led
  to the "unable to obtain SFCB mutex lock" error.
  I interpret this to mean that my bit bucket now leaks into the table.
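
  In Unix terms the shape of the fix is something like this (a sketch only --
  the real repair is made when you build the stream connections handed to
  pgm_$invoke, whose exact declarations I won't try to reproduce from
  memory):

      /* Make sure the child's error output is connected to something
       * valid before it runs, instead of being left dangling.  Here
       * the sink is /dev/null; under Domain/OS you would hand
       * pgm_$invoke a real stream for errout instead. */
      #include <fcntl.h>
      #include <unistd.h>

      static void run_sh_quietly(void)
      {
          if (fork() == 0) {
              int fd = open("/dev/null", O_WRONLY);
              if (fd >= 0) {
                  dup2(fd, 2);       /* give errout a real sink */
                  close(fd);
              }
              execl("/com/sh", "sh", (char *)0);
              _exit(127);            /* exec failed */
          }
      }

  The lesson either way: never hand a child an errout that points nowhere.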

  I know of at least one other site that used the same technique under 9.7,
  and has been similarly bitten under SR10.3.  Caveat Hacker.

  Apollo says that it will be fixed in a future release.


-- 
Lee Boyle (boylel@gtephx)
UUCP    : {ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!boylel
INTERNET: gtephx!boylel@asuvax.eas.asu.edu
AG Communication Systems, Phoenix, AZ
(602) 581-4528

rees@pisa.citi.umich.edu (Jim Rees) (06/25/91)

In article <1991Jun24.002623.18899@gtephx.UUCP>, wilsonj@gtephx.UUCP (Jay Wilson) writes:

  I saw this posting and I could not resist having one of my partners
  in crime (there are 6 of us Sys_admins) respond to it.  He has been tracking
  the Mutex Lock problem for over a year now and this is what he had to say.

  ...

  The error will rear its ugly head with no warning or pattern,
  and once you get it you MUST reboot and run the long SALVOL
  to appease its appetite for disaster.

I find this hard to believe.  I've never had to run a long salvol after
getting a stuck sfcb mutex, and I can't think of anything that a long salvol
might do that would fix it.

  If you would like more information on exactly what "sfcb hash table" and
  "mutex lock" are, please refer to a copy of the "Domain/OS Design Principles"
  014962-A00 pages 9-14,9-15.

That paper was also published in the Atlanta Usenix Proceedings, which I
think was summer 1986.  It's also available by ftp from pisa.citi.umich.edu.
I think it's an excellent paper and everyone should read it.

The basic problem with putting mutex locks in shared memory is that any old
program can go and trash them, and then you're stuck.  What's needed is a
true object-oriented architecture with tagged storage, like the old Intel
432 or the IBM System/38.  But the trend seems to be in the opposite
direction, and operating systems seem to be getting more primitive, yet
bloated, every year.  Multitasking was common on all computers in the mid
1960s, then pretty much disappeared in the 80s when everyone started running
MS-DOS.  I'm waiting for the days when we all have to start using batch
again.  All this is enough to make an old systems guy like me want to retire
to a small midwestern town and spend his summers in places like Tanjung
Pinang.

Anyway, where was I?  Oh yes, mutex locks.  These problems are nearly
impossible to track down.  Since the sfcbs are central to ios, it's very
hard to do anything in the debugger if, for example, you've set a breakpoint
in mutex_$lock.  Everybody and his mother calls mutex_$lock, so you might
have to hit it 10000000 times before catching the one time that has no
matching unlock.  And then, how do you know when that happens?  It's
essentially the halting problem.  How do you tell the debugger to breakpoint on
something that isn't going to happen?  "Please stop the next time
mutex_$unlock *isn't* called."  And if you do manage to catch it, there you
are with the sfcbs locked, and you can't do any IO.  And remember that the
missing unlock can happen in any process.
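
The only thing that has ever helped me is brute-force instrumentation:
wrap the lock and unlock calls so the lock records who took it and from
where.  A sketch (the traced_* names and fields are mine, not anything in
the kernel; the real mutex_$ calls go where the comments are):

    /* Record the last owner and call site on every lock, and squawk on
     * an unlock that has no matching lock.  When the lock wedges, the
     * post-mortem at least names the last taker. */
    #include <stdio.h>

    typedef struct {
        int         held;
        int         owner_pid;    /* last process to take the lock */
        const char *where;        /* last call site, for post-mortem */
    } traced_lock_t;

    void traced_lock(traced_lock_t *l, int pid, const char *where)
    {
        /* mutex_$lock(...) would go here */
        l->held = 1;
        l->owner_pid = pid;
        l->where = where;
    }

    void traced_unlock(traced_lock_t *l)
    {
        if (!l->held)
            fprintf(stderr, "unlock with no matching lock (last: %s)\n",
                    l->where ? l->where : "?");
        l->held = 0;
        /* mutex_$unlock(...) would go here */
    }

It doesn't catch the missing unlock as it happens, but when the sfcb wedges
you can at least read the owner out of the dump instead of guessing.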

Even worse is the case where it isn't an unmatched lock, it's some random
trashing of memory that happens to scribble on the sfcb.  It can happen at
any time, in any process.

Last time I wrote a type manager (for AFS), I had a few of these problems.
I couldn't debug them.  I fixed them by tenacious examination of source
code.  I suspect that's how Apollo engineers fix them too, the few who are
left who even know what an sfcb is.

To Apollo's credit, I have to say that I haven't seen a single stuck mutex
since I installed sr10.3.  I suspect there were some problems with TCP
before this but I wouldn't swear to it -- it may have been my screwy type
manager.

Enough ranting.