[comp.sys.apollo] sfcb hash table mutex lock

dbfunk@ICAEN.UIOWA.EDU (David B. Funk) (01/25/89)

> pabong@gonzo.eta.com writes:
> has anyone ever seen the error "Unable to obtain sfcb hash table mutex lock"

The sfcb hash table is a table, in global memory, that holds shared file
control blocks (sfcbs). Type managers that support multiple I/O streams to
an object use an sfcb for each open object. This includes most files, IPC
sockets, mbx mailboxes, pipes, etc. (i.e., most streams I/O managers). The
sfcbs are used to manage concurrent stream access. Each time a stream to
one of these types of objects is opened or closed, the sfcb hash table is
searched to see if an entry needs to be allocated, deallocated, or reused.
(I.e., this table gets used a lot, and it's in global memory where all
processes and type managers can get to it.) It has a mutual exclusion lock
(mutex lock) on it to prevent corruption from simultaneous updates. If
this lock gets lost, all kinds of chaos can result. Ordinarily a process
(type manager) obtains the lock, does the updating, and releases the lock.
If the process faults, its clean-up handlers should release all acquired
resources. If a process is blasted or dies a violent enough death that its
stack is wiped out, then its clean-up handlers may not get a chance to do
their work. This can result in a lost mutex lock. A bug in the streams
library (/lib/streams) or a type manager (/sys/mgrs/*) could also cause
this problem. Different pieces of system software have different revision
levels and are dependent upon other pieces being compatible. E.g., the
tcp/ip upgrade was dependent upon the correct revision of the streams
library for correct operation. Mismatched software can cause problems.
Third party software can try to pull some fancy stunts that may get it
into trouble. A mess-up in the sfcb hash table can be a ticking time bomb
that won't show up until long after the culprit did its dirty deed.
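
As a rough illustration of the lock discipline described above, here is a
minimal sketch. It is not Apollo's code: the SharedTable class, its method
names, and the object names are hypothetical, and Python's threading.Lock
merely stands in for the table's mutex lock. It shows the normal
obtain/update/release cycle, a fault whose clean-up still releases the
lock, and a simulated blast that leaves the lock lost for everyone else.

```python
import threading


class SharedTable:
    """Toy stand-in for a lock-protected table of shared control blocks."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for the table's mutex lock
        self.entries = {}              # object name -> count of open streams

    def _locked_update(self, update, timeout):
        # Bounded wait: a lost lock surfaces as an error instead of a hang.
        if not self.lock.acquire(timeout=timeout):
            raise RuntimeError("unable to obtain table mutex lock")
        try:
            update()
        finally:
            # Plays the role of a clean-up handler: runs even if update() faults.
            self.lock.release()

    def open_stream(self, name, timeout=0.1):
        def update():
            # Reuse the entry if it exists, otherwise allocate one.
            self.entries[name] = self.entries.get(name, 0) + 1
        self._locked_update(update, timeout)

    def close_stream(self, name, timeout=0.1):
        def update():
            # Deallocate the entry when the last stream goes away.
            self.entries[name] -= 1
            if self.entries[name] == 0:
                del self.entries[name]
        self._locked_update(update, timeout)


table = SharedTable()
table.open_stream("some_file")
table.close_stream("some_file")         # normal case: lock is free again

try:
    table.close_stream("never_opened")  # faults inside the locked region
except KeyError:
    pass                                # the finally clause released the lock

table.lock.acquire()                    # simulate a blasted holder: never released
try:
    table.open_stream("another_object")
except RuntimeError as err:
    print(err)                          # every later caller now fails here
```

The bounded wait is a choice made for the sketch; it is what turns a lost
lock into a visible error rather than a silent hang.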

To summarize, when you have sfcb hash table problems:
    Check for processes not exiting cleanly
    Check revision levels of system software
    Check for bug reports on software that you use

When in doubt, reboot before things come to a grinding halt.

Dave Funk

giebelhaus@hi-csc.UUCP (Timothy R. Giebelhaus) (01/29/89)

In article <8901250632.AA00797@icaen.uiowa.edu> dbfunk@ICAEN.UIOWA.EDU (David B. Funk) writes:
>To summarize, when you have sfcb hash table problems:
>    Check for processes not exiting cleanly
>    Check revision levels of system software
>    Check for bug reports on software that you use
>
>When in doubt, reboot before things come to a grinding halt.

Very impressive, Dave!  The details are much more than I know about the
system.  But I do want to be sure to say that I believe the summary to
be accurate.

I have not seen it spelled out specifically that if you BLAST a process
with "sigp -b", "lo -f", or some other method that sends a BLAST to
a process, you are running on borrowed time.  A reboot quite likely
will be necessary.  A blast will cause a process to not exit
cleanly which, as Dave points out, is one of the things to watch for.

I have not seen problems with the DN4000 machines having hash table
problems yet, but if you believe that things are not getting blasted,
you don't have revision mixes, and your in-house and third party
software does not have related bugs, please do call it in to the 
hot line 800 number (provided of course that you have a service
contract) or file an APR with crucr or mkapr.
-- 
UUCP: uunet!hi-csc!giebelhaus         UUCP: tim@apollo.uucp
ARPA: hi-csc!giebelhaus@umn-cs.arpa   ARPA: tim@apollo.com
Tim Giebelhaus, Apollo Computer, Regional Software Support Specialist.
My comments and opinions have nothing to do with work.

jwright@atanasoff.cs.iastate.edu (Jim Wright) (01/30/89)

In article <4123bc79.1032a@hi-csc.UUCP> giebelhaus@hi-csc.UUCP (Timothy R. Giebelhaus) writes:
>I have not seen problems with the DN4000 machines having hash table
>problems yet,

My 4000 had this pop up VERY regularly.  It seems to have gone away for
the time being.  Who knows why.