hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) (12/28/90)
We are having serious problems with a DN3550 under SR10.2 and the
service technicians from HP are stymied themselves, so I am hoping
for help from the GURUS out there:

At the press of a key the system (display AND processes) suddenly
freezes and crawls along at about 0.1% of the usual speed (with luck I
am able to shut the system down, but that takes 20-30 minutes). Most of
the time we have to reset the system, after which it runs normally for
some time, BUT disk salvage every time, lots of diskless stations hanging
around ...

We are unable to CRP from another node, though file service continues
normally and I can get process lists with ps, pst etc. ps and pst from
another node show some strangeness: the DM is shown without an owner, and
child processes (say from an aborted make) hang around after the parent is
gone. Occasionally tcpd crashes during this period with an illegal
reference in tcp_$ucblookup, causing rgyd and other daemons to spin.

AND, (here comes...) at some stage the following message appears in the
DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
manager/sfcb)" The same message also generally appears in the proc_dump for
the vtserver.

Nothing untoward appears in the netmain_srvr logs and DEX merely comes up
with a strange error in the display system (we have a 40-plane DVS). The
system_error_log has some disk errors, but they are not recent. Still,
based on that, the service technicians exchanged the display controller,
but the same problem has occurred again, though less frequently. ALSO, DEX
still reports the same error.

Any ideas, PLEASE?

Martin Anantharaman
FB7, FG7 (Mechanik)            Work:   +49 (203) 379-3336
Universitaet -GH- Duisburg     Home:   +49 (203) 37 65 89
Lotharstr. 1                   FAX:    +49 (203) 379-3052
4100 Duisburg 1                E-Mail: hj412fr@duc220.uni-duisburg.de
West Germany
csfst1@unix.cis.pitt.edu (Charles S. Fuller) (12/29/90)
In article <9012280914.AA03633@duc220.uni-duisburg.de>, hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
> At the press of a key the system (display AND processes) suddenly
> freezes [...]
>
> Occasionally tcpd crashes [...] in tcp_$ucblookup [...]
>
> AND, [...] the following message appears [...]
> "unable to obtain sfcb hash table mutex lock (stream manager/sfcb)"
> The same message also generally appears in the procdump
> for the vtserver.

Just a hunch, but you may want to verify that you've got the latest
vtserver. I seem to recall problems with the version released with 10.2.

Hope this helps.

Chuck Fuller

[Sorry about the cryptic editing, but postnews dutifully refused to post
my suggestion otherwise.]
nazgul@alphalpha.com (Kee Hinckley) (12/31/90)
In article <9012280914.AA03633@duc220.uni-duisburg.de> hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
> AND, (here comes...) at some stage the following message appears in the
> DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
> manager/sfcb)" The same message also generally appears in the proc_dump for
> the vtserver.

This used to happen to me quite often when I was running out of processes,
but I don't know why it would occur when you press a key (I have trouble
imagining that it has anything to do with the display controller though).
Maybe someone wired the keypress to invoke dozens of processes????

Anyway, at SR10.2, at least on my 3500, the following code clears the lock.
If you run it you should promptly shut down (whether it works or not :-).
I'm not sure if it will work on any other machines or any other revs - it
seems unlikely. At best it will save you that 20 minute shutdown time. I
didn't write it, no one will support it, and God knows what it will do if
it doesn't clear the lock.

main()
{
    char *p;

    /* Hard-coded address of the lock byte: SR10.2 on a DN3500 only. */
    p = (char *) 0x3B6F8072;
    *p = 0;
}

--
Alphalpha Software, Inc.  | motif-request@alphalpha.com
nazgul@alphalpha.com      |-----------------------------------
617/646-7703 (voice/fax)  | Proline BBS: 617/641-3722

I'm not sure which upsets me more; that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.
rees@pisa.ifs.umich.edu (Jim Rees) (01/01/91)
In article <9012280914.AA03633@duc220.uni-duisburg.de>, hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
> We are having serious problems with a DN3550 under SR10.2 and the
> service technicians from HP are stymied themselves, so I am hoping
> for help from the GURUS out there:
> At the press of a key the system (display AND processes) suddenly
> freezes and crawls along at about 0.1% of the usual speed (with luck I
> am able to shut the system down, but that takes 20-30 minutes). Most of the time
> we have to reset the system, after which it runs normally for some time,
> BUT disk salvage every time, lots of diskless stations hanging around ...

The sfcb hash table is at the heart of the IO system. It's sort of like
the file table in regular Unix. It is protected by a mutex lock. The
behavior you see happens when some process fails to unlock the table,
usually as a result of an unexpected fault with the lock held.

Look for a process dump with a traceback somewhere inside of ios. It
probably won't be the most recent traceback, since things will have died
when they couldn't get the lock. Also try to correlate this behavior with
something you did immediately before.

Things to watch for include out-of-sync software (/lib/streams not
compatible with /lib/clib, for example) and poorly-behaved type managers.
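
For readers unfamiliar with this failure mode, here is a toy sketch of how a fault inside a critical section leaves a lock stuck for everyone else. It is not Domain/OS code: table_lock, table_try_lock, table_unlock, table_update and the fault flag are all invented for the illustration, and a C11 atomic_flag stands in for the real mutex.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a table lock; NOT the real sfcb lock. */
atomic_flag table_lock = ATOMIC_FLAG_INIT;

/* Try once to take the lock; returns true if we got it. */
bool table_try_lock(void)
{
    return !atomic_flag_test_and_set(&table_lock);
}

void table_unlock(void)
{
    atomic_flag_clear(&table_lock);
}

/* Hypothetical table operation. 'fault' simulates an unexpected
 * error in the middle of the critical section. */
int table_update(bool fault)
{
    if (!table_try_lock())
        return -1;      /* lock is held by someone else */
    if (fault)
        return -2;      /* BUG: leaves with the lock still held */
    table_unlock();
    return 0;
}
```

After one call with fault set, every later table_update() returns -1 until something clears the flag, which is roughly the situation described above: the faulting process is gone, and everyone else waits on a lock that nobody will release.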
austin@METO.UMD.EDU (Austin L. Conaty) (01/01/91)
What is the status of your disk? Is it full?

-----------------------------------------------------------------------
Austin L. Conaty                    austin@meto.umd.edu
Department of Meteorology           301-405-5357
University of Maryland              or: HEY YOU!!!!!
College Park, Md. 20742-2425
-----------------------------------------------------------------------
mishkin@jrst.ch.apollo.hp.com (Nathaniel Mishkin) (01/02/91)
In article <4eed8f39.1bc5b@pisa.ifs.umich.edu> rees@pisa.ifs.umich.edu (Jim Rees) writes:
> The sfcb hash table is at the heart of the IO system. It's sort of like
> the file table in regular Unix. It is protected by a mutex lock. The
> behavior you see happens when some process fails to unlock the table,
> usually as a result of an unexpected fault with the lock held.

Correct as usual. Of course, the only code that runs with the lock
held is part of the IO system itself (and a small part at that).
Since this is "inner loop" code, it's written a bit more trickily than
one might expect, hence the problems. However, this code has gotten
(another) once-over at SR10.3, and some more problem cases (I think
disk full is one of them) have been dealt with.
--
-- Nat Mishkin
Cooperative Object Computing Operation
Hewlett-Packard Company
mishkin@apollo.hp.com
wjw@eba.eb.ele.tue.nl (Willem Jan Withagen) (01/03/91)
In article <9012280914.AA03633@duc220.uni-duisburg.de> hj412fr@duc220.uni-duisburg.de (Martin Anantharaman) writes:
>We are having serious problems with a DN3550 under SR10.2 and the
>
> AND, (here comes...) at some stage the following message appears in the
> DM output pad: "unable to obtain sfcb hash table mutex lock from (stream
> manager/sfcb)"

This is a known bug, and it has a patch, no. 139. It worked for us.

Ciao,
Willem Jan Withagen

Eindhoven University of Technology    DomainName: wjw@eb.ele.tue.nl
Digital Systems Group, Room EH 10.10  BITNET: ELEBWJ@HEITUE5.BITNET
P.O. 513                              Tel: +31-40-473401
5600 MB Eindhoven
The Netherlands
rees@pisa.ifs.umich.edu (Jim Rees) (01/03/91)
In article <1990Dec30.194240.17416@alphalpha.com>, nazgul@alphalpha.com (Kee Hinckley) writes:
> Anyway, at SR10.2, at least on my 3500, the following code clears the lock...
>
> main()
> {
>     char *p;
>     p = (char *) 0x3B6F8072;
>     *p = 0;
> }

Now that's what I call magic. That number will change with every release of
the system and model of hardware. You can find the lock by following
pointers down from sfcb_$puid_ptr, but I don't recommend this.

Part of the fun is getting your unlock program to run. If the lock is
stuck, then every attempt to acquire it results in a 90-second timeout.

I used to have a daemon that would check the sfcb lock every minute, and if
it found the lock stuck (held and ec not advancing) it would shut down the
node. This won't work any more because reboot() no longer does a shutdown,
it signals process 1 to do the shutdown (why was this changed?).
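
The "held and ec not advancing" test can be sketched as a comparison of two samples taken a minute apart. This is only a guess at the shape of such a check, not the daemon itself: it assumes "ec" is a counter associated with the lock that advances whenever the lock changes hands, and struct lock_sample and lock_is_stuck are hypothetical names. How the samples would actually be read from the system is left out entirely.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the lock state; field names are illustrative. */
struct lock_sample {
    bool          held;  /* was the lock held when sampled? */
    unsigned long ec;    /* counter assumed to advance on lock activity */
};

/* The lock counts as stuck if it was held at both samples and the
 * counter did not move in between. */
bool lock_is_stuck(const struct lock_sample *prev,
                   const struct lock_sample *now)
{
    return prev->held && now->held && prev->ec == now->ec;
}
```

Requiring both conditions avoids a false alarm on a busy system, where the lock may well be held at both sampling instants but by different holders, with the counter advancing in between.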
nazgul@alphalpha.com (Kee Hinckley) (01/04/91)
In article <4ef7cdb8.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
>I used to have a daemon that would check the sfcb lock every minute, and if
>it found the lock stuck (held and ec not advancing) it would shut down the
>node. This won't work any more because reboot() no longer does a shutdown,
>it signals process 1 to do the shutdown (why was this changed?).

You might try calling reboot(RB_AUTOBOOT). Unless that was changed too.

--
Alfalfa Software, Inc.    | motif-request@alfalfa.com
nazgul@alfalfa.com        |-----------------------------------
617/646-7703 (voice/fax)  | Proline BBS: 617/641-3722

I'm not sure which upsets me more; that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.
rees@pisa.ifs.umich.edu (Jim Rees) (01/09/91)
In article <1991Jan3.202230.12597@alphalpha.com>, nazgul@alphalpha.com (Kee Hinckley) writes:
> In article <4ef7cdb8.1bc5b@pisa.ifs.umich.edu> rees@citi.umich.edu (Jim Rees) writes:
> >... This won't work any more because reboot() no longer does a shutdown,
> >it signals process 1 to do the shutdown (why was this changed?).
>
> You might try calling reboot(RB_AUTOBOOT). Unless that was changed too.

That was changed too. All flavors of reboot() now signal process 1. If it
doesn't exist (!) or doesn't respond in two minutes, reboot() goes ahead and
tries to shut down in the current process by calling os_$shutdown().

reboot() also sends all processes a SIGTERM, and, if that doesn't work, a
SIGKILL.
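
The SIGTERM-then-SIGKILL pattern can be shown in miniature with ordinary POSIX calls. This is a sketch of the general escalation idea applied to a single child process, not the Domain/OS reboot() implementation; terminate_child and its grace_secs parameter are made-up names for the example.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask a child process to exit with SIGTERM; if it has not exited after
 * roughly grace_secs seconds, force it with SIGKILL.
 * Stores the wait status in *status. Returns 0 on success, -1 on error. */
int terminate_child(pid_t pid, unsigned grace_secs, int *status)
{
    unsigned waited;

    if (kill(pid, SIGTERM) == -1)
        return -1;

    /* Poll once a second to see whether the child has gone away. */
    for (waited = 0; waited < grace_secs; waited++) {
        pid_t r = waitpid(pid, status, WNOHANG);
        if (r == pid)
            return 0;       /* exited during the grace period */
        if (r == -1)
            return -1;
        sleep(1);
    }

    /* Grace period is over: SIGKILL cannot be caught or ignored. */
    if (kill(pid, SIGKILL) == -1)
        return -1;
    if (waitpid(pid, status, 0) != pid)
        return -1;
    return 0;
}
```

A caller can tell which signal ended the child by checking WIFSIGNALED(status) and WTERMSIG(status) on the returned status.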