[comp.sys.apollo] Why do I need a "sfcb hash table mutex lock"?

hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) (12/28/90)

We are having serious problems with a DN3550 under SR10.2 and the
service technicians from HP are stymied themselves, so I am hoping
for help from the GURUS out there:

	At the press of a key the system (display AND processes) suddenly
	freezes and crawls along at about 0.1 % of the usual speed (with luck I
	am able to shut down the system, but that takes 20-30 minutes). Most of the time
	we have to reset the system, after which it runs normally for some time,
	BUT disk salvage every time, lots of diskless stations hanging around ...
	
	We are unable to CRP from another node, though file service continues
	normally and I can get process lists with ps, pst etc.
	
	ps, pst from another node show some strangeness: the DM is shown
	without an owner and child processes (say from an aborted make) hang
	around after the parent is gone.
	
	Occasionally tcpd crashes during this period with an illegal reference
	in tcp_$ucblookup, causing rgyd and other daemons to spin.
	
	AND, (here comes...) at some stage the following message appears in the
	DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
	manager/sfcb)" The same message also generally appears in the proc_dump for
	the vtserver.
	
	Nothing untoward appears in the netmain_srvr logs and DEX merely comes
	up with a strange error in the display system (we have a 40-plane DVS).
	The system_error_log has some disk-errors, but they are not recent.
	
Still, based on that, the service technicians exchanged the display
controller, but the same problem has occurred again, though less
frequently. ALSO, DEX still reports the same error.

Any ideas, PLEASE?

Martin Anantharaman

FB7, FG7 (Mechanik)		Work:	+49 (203) 379-3336
Universitaet -GH- Duisburg	Home:	+49 (203) 37 65 89
Lotharstr. 1			FAX:	+49 (203) 379-3052
4100 Duisburg 1			E-Mail: hj412fr@duc220.uni-duisburg.de
West Germany    

csfst1@unix.cis.pitt.edu (Charles S. Fuller) (12/29/90)

In article <9012280914.AA03633@duc220.uni-duisburg.de>, hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:

> 	At the press of a key the system (display AND processes) suddenly
> 	freezes  [...]
>
> 	Occasionally tcpd crashes [...] in tcp_$ucblookup [...]
> 	
> 	AND, [...] the following message appears [...] 
> 	"unable to obtain sfcb hash table mutex lock (stream manager/sfcb)"
>	The same message also generally appears in the procdump 
> 	for the vtserver.

Just a hunch, but you may want to verify that you've got the latest
vtserver.  I seem to recall problems with the version released with
10.2.  

Hope this helps.
Chuck Fuller

[Sorry about the cryptic editing, but postnews dutifully refused to post
my suggestion otherwise.]

nazgul@alphalpha.com (Kee Hinckley) (12/31/90)

In article <9012280914.AA03633@duc220.uni-duisburg.de> hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
>	AND, (here comes...) at some stage the following message appears in the
>	DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
>	manager/sfcb)" The same message also generally appears in the proc_dump for
>	the vtserver.

This used to happen to me quite often when I was running out of
processes, but I don't know why it would occur when you press a
key (I have trouble imagining that it has anything to do with the
display controller though).  Maybe someone wired the keypress to
invoke dozens of processes????

Anyway, at SR10.2, at least on my 3500, the following code clears
the lock.  If you run it you should promptly shut down (whether it
works or not :-).  I'm not sure if it will work on any other machines or
any other revs - it seems unlikely.  At best it will save you that 20
minute shutdown time.

I didn't write it, no one will support it, and God knows what it will do
if it doesn't clear the lock.

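/* Clears the stuck sfcb lock by writing zero to it.  The address below is
   only known to be right at SR10.2 on a 3500. */
main()
{
    char *p;
    p = (char *) 0x3B6F8072;    /* where the lock lives on that release/model */
    *p = 0;                     /* zero marks the lock as not held */
}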

-- 
Alphalpha Software, Inc.	|	motif-request@alphalpha.com
nazgul@alphalpha.com		|-----------------------------------
617/646-7703 (voice/fax)	|	Proline BBS: 617/641-3722

I'm not sure which upsets me more; that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.

rees@pisa.ifs.umich.edu (Jim Rees) (01/01/91)

In article <9012280914.AA03633@duc220.uni-duisburg.de>, hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:

  We are having serious problems with a DN3550 under SR10.2 and the
  service technicians from HP are stymied themselves, so I am hoping
  for help from the GURUS out there:
  
  	At the press of a key the system (display AND processes) suddenly
  	freezes and crawls along at about 0.1 % of the usual speed (with luck I
  	am able to shut down the system, but that takes 20-30 minutes). Most of the time
  	we have to reset the system, after which it runs normally for some time,
  	BUT disk salvage every time, lots of diskless stations hanging around ...

The sfcb hash table is at the heart of the IO system.  It's sort of like
the file table in regular Unix.  It is protected by a mutex lock.  The
behavior you see happens when some process fails to unlock the table,
usually as a result of an unexpected fault with the lock held.
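
As a rough illustration of that failure mode, here is a small sketch of a
table guarded by a lock.  It is not Domain/OS code: the struct, the function
names, the use of a C11 atomic flag as the lock, and the retry count standing
in for a timeout are all assumptions made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock-protected table; NOT the real sfcb layout. */
struct table {
    atomic_flag lock;   /* set = held, clear = free (assumption for this sketch) */
    int entries;
};

/* Try to take the lock at most max_tries times.  A real implementation would
   sleep between attempts and give up after a timeout instead. */
static bool table_lock(struct table *t, int max_tries)
{
    assert(t != NULL);
    for (int i = 0; i < max_tries; i++) {
        if (!atomic_flag_test_and_set(&t->lock))
            return true;            /* flag was clear: we now hold the lock */
    }
    return false;                   /* still held by someone else */
}

static void table_unlock(struct table *t)
{
    assert(t != NULL);
    atomic_flag_clear(&t->lock);
}

/* Well-behaved update: lock, modify, unlock. */
static bool table_add(struct table *t, int max_tries)
{
    if (!table_lock(t, max_tries))
        return false;               /* the "unable to obtain ... lock" case */
    t->entries++;
    table_unlock(t);
    return true;
}

/* Simulated fault: the caller gets the lock and never releases it. */
static bool table_add_faulting(struct table *t, int max_tries)
{
    return table_lock(t, max_tries);
}
```

Once table_add_faulting() has run, every later table_add() on the same table
returns false until something calls table_unlock(), which is roughly the
state the node is left in.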

Look for a process dump with a traceback somewhere inside of ios.  It
probably won't be the most recent traceback, since things will have died
when they couldn't get the lock.  Also try to correlate this behavior with
something you did immediately before.

Things to watch for include out-of-sync software (/lib/streams not
compatible with /lib/clib, for example) and poorly-behaved type managers.

austin@METO.UMD.EDU (Austin L. Conaty) (01/01/91)

What is the status of your disk? Is it full?

-----------------------------------------------------------------------

Austin L. Conaty                           austin@meto.umd.edu
Department of Meteorology                  301-405-5357
University of Maryland                     or: HEY YOU!!!!!
College Park, Md. 20742-2425

-----------------------------------------------------------------------

mishkin@jrst.ch.apollo.hp.com (Nathaniel Mishkin) (01/02/91)

In article <4eed8f39.1bc5b@pisa.ifs.umich.edu> rees@pisa.ifs.umich.edu (Jim Rees) writes:
   The sfcb hash table is at the heart of the IO system.  It's sort of like
   the file table in regular Unix.  It is protected by a mutex lock.  The
   behavior you see happens when some process fails to unlock the table,
   usually as a result of an unexpected fault with the lock held.

Correct as usual.  Of course, the only code that runs with the lock
held is part of the IO system itself (and a small part at that).
Since this is "inner loop" code, it's written a bit more trickily than
one might expect, hence the problems.  However, this code has gotten
(another) once-over at sr10.3, and yet more problem cases (and I
think disk full is one of them) have been dealt with.
--
                    -- Nat Mishkin
                       Cooperative Object Computing Operation
                       Hewlett-Packard Company
                       mishkin@apollo.hp.com

wjw@eba.eb.ele.tue.nl (Willem Jan Withagen) (01/03/91)

In article <9012280914.AA03633@duc220.uni-duisburg.de> hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
>We are having serious problems with a DN3550 under SR10.2 and the
>	
>	AND, (here comes...) at some stage the following message appears in the
>	DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
>	manager/sfcb)" 

This is a known bug, and it has a patch, no. 139.
It worked for us.

Ciao,
	Willem Jan Withagen

Eindhoven University of Technology   DomainName:  wjw@eb.ele.tue.nl    
Digital Systems Group, Room EH 10.10 BITNET: ELEBWJ@HEITUE5.BITNET
P.O. 513                             Tel: +31-40-473401
5600 MB Eindhoven                    The Netherlands

rees@pisa.ifs.umich.edu (Jim Rees) (01/03/91)

In article <1990Dec30.194240.17416@alphalpha.com>, nazgul@alphalpha.com (Kee Hinckley) writes:

  Anyway, at SR10.2, at least on my 3500, the following code clears
  the lock...

  main()
  {
      char *p;
      p = (char *) 0x3B6F8072;
      *p = 0;
  }

Now that's what I call magic.  That number will change with every release of
the system and model of hardware.  You can find the lock by following
pointers down from sfcb_$puid_ptr, but I don't recommend this.

Part of the fun is getting your unlock program to run.  If the lock is
stuck, then every attempt to acquire it results in a 90 second timeout.

I used to have a daemon that would check the sfcb lock every minute, and if
it found the lock stuck (held and ec not advancing) it would shut down the
node.  This won't work any more because reboot() no longer does a shutdown,
it signals process 1 to do the shutdown (why was this changed?).
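
The stuck test described above (held, and ec not advancing) can be sketched
as a comparison of two consecutive samples.  This is not the original daemon:
the struct and function names are hypothetical, ec is assumed to be a counter
that advances whenever there is activity on the lock, and how the samples are
read from the system is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the lock's state; field names are illustrative. */
struct lock_sample {
    bool held;          /* was the lock held when sampled? */
    unsigned long ec;   /* counter assumed to advance on lock activity */
};

/* Remembers the previous sample between checks. */
struct watchdog {
    bool have_prev;
    struct lock_sample prev;
};

/* Feed one sample per check interval.  Reports "stuck" only if the lock was
   held at both this sample and the previous one and ec did not move. */
static bool watchdog_check(struct watchdog *w, struct lock_sample cur)
{
    assert(w != NULL);
    bool stuck = w->have_prev && w->prev.held && cur.held
                 && w->prev.ec == cur.ec;
    w->prev = cur;
    w->have_prev = true;
    return stuck;
}
```

A lock that is held on two consecutive checks but whose counter has moved is
treated as busy rather than stuck, so a heavily loaded node is not shut down
by mistake.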

nazgul@alphalpha.com (Kee Hinckley) (01/04/91)

In article <4ef7cdb8.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>I used to have a daemon that would check the sfcb lock every minute, and if
>it found the lock stuck (held and ec not advancing) it would shut down the
>node.  This won't work any more because reboot() no longer does a shutdown,
>it signals process 1 to do the shutdown (why was this changed?).

You might try calling reboot(RB_AUTOBOOT).  Unless that was changed too.

-- 
Alfalfa Software, Inc.		|	motif-request@alfalfa.com
nazgul@alfalfa.com		|-----------------------------------
617/646-7703 (voice/fax)	|	Proline BBS: 617/641-3722

rees@pisa.ifs.umich.edu (Jim Rees) (01/09/91)

In article <1991Jan3.202230.12597@alphalpha.com>, nazgul@alphalpha.com (Kee Hinckley) writes:
  In article <4ef7cdb8.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
  >...  This won't work any more because reboot() no longer does a shutdown,
  >it signals process 1 to do the shutdown (why was this changed?).
  
  You might try calling reboot(RB_AUTOBOOT).  Unless that was changed too.

That was changed too.  All flavors of reboot() now signal process 1.  If it
doesn't exist (!) or doesn't respond in two minutes, reboot() goes ahead and
tries to shut down in the current process by calling os_$shutdown().

reboot() also sends all processes a SIGTERM, and, if that doesn't work, a
SIGKILL.
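
The SIGTERM-then-SIGKILL escalation is a common pattern.  A minimal POSIX
sketch for a single child process follows.  It is not the Domain/OS reboot()
implementation; the function name, the 10 ms polling interval, and the
grace_polls parameter are assumptions made for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask child `pid` to exit with SIGTERM, poll up to
   grace_polls times (10 ms apart), then fall back to SIGKILL.
   Returns the signal that ended the child, 0 if it exited normally,
   or -1 on error. */
static int stop_child(pid_t pid, int grace_polls)
{
    const struct timespec pause_10ms = { 0, 10 * 1000 * 1000L };
    int status;

    assert(pid > 0);
    if (kill(pid, SIGTERM) != 0)
        return -1;

    for (int i = 0; i < grace_polls; i++) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
        if (r < 0)
            return -1;
        nanosleep(&pause_10ms, NULL);
    }

    /* The polite request was ignored: force termination. */
    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A child that ignores SIGTERM ends up reported as killed by SIGKILL, while one
with the default disposition is reported as killed by SIGTERM.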