3.2 UNIX system getting hung

mrb1@homxc.ATT.COM (M.BAKER) (04/06/89)

Hi ---

Since the net was so helpful on my last query, I'd like to
give it another try:

	We have an AT&T 6386E system running UNIX SysV/3.2.
	
	While running our application, it has been observed to
	'hang'.  Specifically, the application stops in the
	middle of things.  More importantly, all the terminal I/O
	stops.......including the system console.  You can't log
	in on a free getty.  Anything you type gets echoed back
	to the screen, but nothing gets done with it.  If you hit
	"Ctrl-Alt-Del", the screen displays a message saying "You
	must run shutdown before using Ctrl-Alt-Del" or something
	very similar to that.
	There is no "Fatal Error - Parity Check at ...." message
	or anything abnormal on the console.
	The only thing to do then (that seems to work for me) is
	to hit RESET.

	Well, rebooting kind of destroys all the clues.  Since
	the kernel apparently never did a panic(), there's no
	dump available to look at with crash.
	If the hang occurred in the middle of the night, and
	time elapses before you reset the system, sar shows
	nothing past the last recorded 'checkpoint' before the
	system 'died'.

I will furnish more details of our hardware configuration/software
application upon request....for now, I think that these basic clues
should be able to get us aimed in the right direction. 

My first suspicion:

The 3.1 & 3.2 software notes state that if you "run out of 
free clists, all input/output activity from/to terminal ports and
the console will cease.  No warning message is printed by the
system to show that it is out of clists".  Sounded good at first,
so we raised the NCLIST tunable parameter from 120 to 170 (the
recommended value for a 4M machine) and then to 200 (the max. in mtune).
Still had the problem, though.  Which leads to a couple of quick
questions:

	1.) Can you check the number of free clists while the
		system is running?  sar doesn't seem to be any
		help here, and I'm sure crash can reveal it but
		I'm not sure how to get to it.  

	2.) Is there any circumstance in which clists can get slowly
		used up (i.e., occasionally not returned to the
	        free pool)?

Also, could this problem be symptomatic of the time slicer
interrupt going away (not being generated, or recognized), which
robs UNIX of knowing that time is passing us by?  Or are we just
in some kind of major deadlock?

I think that the processor is still alive, since console characters
echo to the screen and it responds to the Ctrl-Alt-Del keyin.  Plus
this is a protected mode machine, so it's a little tougher for an
application to clobber the OS by writing in the wrong area, or
whatever.

Any clues/suggestions/tips/criticisms/flames/whatever would be
really appreciated.

Thanks
M. Baker
homxc!mrb1          201-949-3455

vause@cs-col.Columbia.NCR.COM (Sam Vause) (04/07/89)

In article <6226@homxc.ATT.COM> mrb1@homxc.ATT.COM (M.BAKER) writes:
>	We have an AT&T 6386E system running UNIX SysV/3.2.
>	While running our application, it has been observed to
>	'hang'.  Specifically, the application stops in the
>	middle of things.  More importantly, all the terminal I/O
>	stops.......including the system console.  You can't log
>        in on a free getty.  Anything you
>	type gets echoed back to the screen, but nothing gets 
>	done with it...

Well, it's possible that the clist increment mentioned later
in the original posting is actually *hurting* the situation, rather
than helping.

My experience indicates that this symptom is possibly from a variety
of situations, but personal observation leads me to believe that the
kernel logical address space is being exhausted.

Perhaps the best method of identifying the actual problem (in the
absence of a memory dump) is to use the crash(1m) command on the
running kernel to examine the status of the System Page Table Map.

Although I am not personally familiar with the way this command executes
on other machines, I have used it during kernel debug enough to give you
the general expectations:

	# crash <CR>
	> stat
	sysname: UnixV
	nodename: cs-col
	release: 020001
	version: config
	machine: 68020
	time of crash: Fri Apr  7 09:11:05 1989
	age of system: 21 day, 23 hr., 
	> map sptmap
	sptmap
	address  size
	00000000    97
	00001f99    71
	2 segments, 168 units
	>od maxspace
	00e67e18: 00100000

I've included this example from my machine (NCR TOWER 32/600) for your
reference.  For this system, there are only two segments and a total of
168 units (each a 2K click) of System Page Table (SPT) space left.  The
first segment is reserved for the actual kernel code itself, and is not
generally available to the user.  The second segment (and any possible
following ones) are available to user processes (but not until the fork(2)
system call returns...).

Since the MAXSPACE kernel configuration parameter is 0x100000, each active
process will dynamically sptalloc() 4K of kernel SPT space.  (Your mileage
may vary...)  For this machine, each 1MB (0x100000) increase to the
MAXSPACE parameter will also place an additional 2K burden on each process's
SPT requirements.

For this machine, I can realistically create only 35 additional processes
(71 clicks * 2K / 4K).
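
The arithmetic above can be sketched in a few lines of Python.  This is
only an illustration: the function and constant names are hypothetical,
the figures are the ones quoted for the TOWER 32/600, and the per-process
cost is assumed to grow linearly (4K at MAXSPACE = 0x100000, plus 2K for
each additional 1MB), which is one reading of the description and may not
hold on other machines.

```python
# Sketch: estimate how many more processes fit in the free SPT space.
# All names are hypothetical; figures come from the TOWER 32/600 example.

UNIT_BYTES = 2 * 1024          # one sptmap unit (click), taken as 2K
BASE_MAXSPACE = 0x100000       # MAXSPACE value at which a process costs 4K
BASE_COST = 4 * 1024           # SPT bytes per process at BASE_MAXSPACE
COST_PER_EXTRA_MB = 2 * 1024   # assumed extra SPT bytes per added 1MB


def spt_cost_per_process(maxspace):
    """SPT bytes one process needs, assuming linear growth above 1MB."""
    extra_mb = max(0, (maxspace - BASE_MAXSPACE) // 0x100000)
    return BASE_COST + extra_mb * COST_PER_EXTRA_MB


def process_headroom(free_units, maxspace=BASE_MAXSPACE):
    """Whole processes that fit in free_units of user-available SPT space."""
    return (free_units * UNIT_BYTES) // spt_cost_per_process(maxspace)


if __name__ == "__main__":
    print(process_headroom(71))            # 35, matching the figure above
    print(process_headroom(71, 0x200000))  # 23 under the linear assumption
```

Only the second segment (71 units) is counted here, since the first is
described as reserved for the kernel itself.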

What this all means is that systems where SPT space is tight will exhibit
the symptoms you've described:  character echo at the terminal is okay, 
but no processes appear to be in execution.  System degradation appears
to occur slowly, rather than "all at once".  Generally, no error messages
are written to the console.  On such systems, crash(1m) generally shows
fewer than 4 SPT segments, with a *total* number of units less than 120.
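
As a rough illustration of that rule of thumb, the following Python
sketch reads the summary line of saved "map sptmap" output and applies
the two thresholds.  The function names are hypothetical, the sample text
is copied from the TOWER example above, and the output format on other
systems (including the 6386E) may well differ, so treat the parsing as an
assumption rather than a description of crash itself.

```python
import re

# Sample text copied from the "map sptmap" output shown earlier.
SAMPLE = """\
sptmap
address  size
00000000    97
00001f99    71
2 segments, 168 units
"""


def parse_sptmap(text):
    """Return (segments, units) from the summary line of the map output."""
    match = re.search(r"(\d+) segments?, (\d+) units?", text)
    if match is None:
        raise ValueError("no 'N segments, M units' summary line found")
    return int(match.group(1)), int(match.group(2))


def spt_looks_tight(segments, units, max_segments=4, max_units=120):
    """Apply the rule of thumb: under 4 segments and under 120 units."""
    return segments < max_segments and units < max_units


if __name__ == "__main__":
    segments, units = parse_sptmap(SAMPLE)
    print(segments, units, spt_looks_tight(segments, units))
```

By these thresholds the sample machine (2 segments, 168 units) is not yet
flagged, although its headroom for new processes is already limited.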

The cure?  Well, if possible, increase your Kernel Address Space size.  If
there is not already a configuration parameter for this purpose, your only
alternatives are to reduce the number of buffers and clists, in order to
furnish more kernel logical address space for SPT usage, and delete any
kernel features and drivers you do not need.  Failing this, you get to buy
another machine...

Perhaps this is not your actual situation, but it sure sounds *PAINFULLY*
similar to situations that I've recently encountered....

+------------------------------------------------------------------+
|Sam Vause, NCR Corporation, Customer Services - TOWER Support	   |
|3325 Platt Springs Road, West Columbia, SC 29169 (803) 791-6953   |
|                                vause@cs-col.Columbia.NCR.COM     |
|			 ...!uunet!ncrlnk!ncrcae!cs-col!vause	   |
|		...!ucbvax!sdcsvax!ncr-sd!ncrcae!cs-col!vause      |
+------------------------------------------------------------------+

nusip@maccs.McMaster.CA (Mike Borza) (04/09/89)

In article <228@cs-col.Columbia.NCR.COM> vause@cs-col.Columbia.NCR.COM (Sam Vause) writes:
>In article <6226@homxc.ATT.COM> mrb1@homxc.ATT.COM (M.BAKER) writes:
>>	We have an AT&T 6386E system running UNIX SysV/3.2.
>>	While running our application, it has been observed to
>>	'hang'.  Specifically, the application stops in the
>>       [more info about hangs deleted...]
>
>Well, it's possible that the clist increment mentioned later
>in the original posting is actually *hurting* the situation, rather
>than helping.
>
>My experience indicates that this symptom is possibly from a variety
>of situations, but personal observation leads me to believe that the
>kernel logical address space is being exhausted.
>
This also concurs with my experience.  Our 386 with 4 MB of memory
is used to develop X-Windows applications under ISC 386/ix (1.0.6).
Based on analysis of sar output, we decided to increase, among
other things, the number of disk buffers.  Under unpredictable
circumstances, the system would slow to a crawl, sometimes dying and
sometimes recovering without any intervention.  At other times, the
system would just die, sometimes echoing characters typed at the
console and serial terminals, sometimes not.  Reducing the amount of
space allocated to some of the bigger kernel resources always restored
reliable functionality for our application mix.  Prior to acquiring X,
we were able to use substantially more disk buffers before
encountering this problem.  Most interesting (and annoying).
 
>Perhaps this is not your actual situation, but it sure sounds *PAINFULLY*
>similar to situations that I've recently encountered....
 
Sure does. Thanks.
 
>|Sam Vause, NCR Corporation, Customer Services - TOWER Support	   |

Mike Borza              <antel!mike@maccs.uucp>
Antel Optronics Inc.