mrb1@homxc.ATT.COM (M.BAKER) (04/06/89)
Hi --- Since the net was so helpful on my last query, I'd like to
give it another try:

We have an AT&T 6386E system running UNIX SysV/3.2. While running
our application, it has been observed to 'hang'. Specifically, the
application stops in the middle of things. More importantly, all
the terminal I/O stops.......including the system console. You
can't log in on a free getty. Anything you type gets echoed back
to the screen, but nothing gets done with it. If you hit
"Ctrl-Alt-Del", the screen displays a message saying "You must run
shutdown before using Ctrl-Alt-Del" or something very similar to
that. There is no "Fatal Error - Parity Check at ...." message or
anything abnormal on the console. The only thing to do then (that
seems to work for me) is to hit RESET.

Well, rebooting kind of destroys all the clues. Since the kernel
apparently never did a panic(), there's no dump available to look
at with crash. If the hang occurred in the middle of the night,
and time elapses before you reset the system, sar shows nothing
past the last recorded 'checkpoint' before the system 'died'.

I will furnish more details of our hardware configuration/software
application upon request....for now, I think that these basic
clues should be able to get us aimed in the right direction.

My first suspicion: The 3.1 & 3.2 software notes state that if you
"run out of free clists, all input/output activity from/to
terminal ports and the console will cease. No warning message is
printed by the system to show that it is out of clists". Sounded
good at first, so we raised the NCLIST tunable parameter from 120
to 170 (recommended value for 4M machine) and then to 200 (the
max. in mtune). Still had the problem, though. Which leads to a
couple of quick questions:

1.) Can you check the number of free clists while the system is
    running? sar doesn't seem to be any help here, and I'm sure
    crash can reveal it but I'm not sure how to get to it.

2.) Is there any circumstance in which clists can get slowly used
    up (i.e., occasionally not returned to the free pool)?

Also, could this problem be symptomatic of the time slicer
interrupt going away (not being generated, or recognized) which
robs UNIX of knowing that time is passing us by? Or are we just in
some kind of major deadlock? I think that the processor is still
alive, since console characters echo to the screen and it responds
to the Ctrl-Alt-Del keyin. Plus this is a protected mode machine,
so it's a little tougher for an application to clobber the OS by
writing in the wrong area, or whatever.

Any clues/suggestions/tips/criticisms/flames/whatever would be
really appreciated.

Thanks
M. Baker
homxc!mrb1
201-949-3455
vause@cs-col.Columbia.NCR.COM (Sam Vause) (04/07/89)
In article <6226@homxc.ATT.COM> mrb1@homxc.ATT.COM (M.BAKER) writes:
> We have an AT&T 6386E system running UNIX SysV/3.2.
> While running our application, it has been observed to
> 'hang'. Specifically, the application stops in the
> middle of things. More importantly, all the terminal I/O
> stops.......including the system console. You can't log
> in on a free getty. Anything you
> type gets echoed back to the screen, but nothing gets
> done with it...

Well, it's possible that the clist increment mentioned later
in the original posting is actually *hurting* the situation, rather
than helping.

My experience indicates that this symptom is possibly from a variety
of situations, but personal observation leads me to believe that the
kernel logical address space is being exhausted.

Perhaps the best method of identifying the actual problem symptoms
(in the absence of the memory dump), is to use the crash(1m) command
on the running kernel to examine the status of the System Page Table
Map. Although I am not personally familiar with the way this command
executes on other machines, I have used it during kernel debug enough
to give you the general expectations:

    # crash <CR>
    > stat
    sysname: UnixV
    nodename: cs-col
    release: 020001
    version: config
    machine: 68020
    time of crash: Fri Apr 7 09:11:05 1989
    age of system: 21 day, 23 hr.,
    > map sptmap
    sptmap
    address   size
    00000000    97
    00001f99    71
    2 segments, 168 units
    >od maxspace
    00e67e18: 00100000

I've included this example from my machine (NCR TOWER 32/600) for
your reference. For this system, there are only two segments and a
total of 168 units (each is 2K clicks) of System Page Table (SPT)
space left. The first segment is reserved for the actual kernel code
itself, and is not generally available to the user. The second
segment (and any possible following ones) are available to user
processes (but not until the fork(2) system call returns...).
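As a rough illustration of reading the `map sptmap` output quoted above, the following Python sketch totals the free SPT units from a saved copy of that output. The two-column address/size layout is an assumption based on the sample in the post, and `parse_sptmap` is a hypothetical helper, not part of crash(1m). Output on other systems may differ.

```python
import re

def parse_sptmap(text):
    """Return (address, size) pairs from 'map sptmap'-style output.

    Assumes each segment line is an 8-digit hex address followed by a
    decimal size in units, as in the sample from the post.
    """
    segments = []
    for line in text.splitlines():
        m = re.fullmatch(r"\s*([0-9a-fA-F]{8})\s+(\d+)\s*", line)
        if m:
            segments.append((int(m.group(1), 16), int(m.group(2))))
    return segments

# Sample copied from the post (NCR TOWER 32/600).
SAMPLE = """\
sptmap
address   size
00000000    97
00001f99    71
2 segments, 168 units
"""

segments = parse_sptmap(SAMPLE)
total_units = sum(size for _, size in segments)
print(len(segments), "segments,", total_units, "units")  # 2 segments, 168 units
```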
Since the MAXSPACE kernel configuration parameter is 0x100000, each
active process will dynamically sptalloc() 4K of kernel SPT space.
(Your mileage may vary...) For this machine, each 1MB (0x100000)
increase to the MAXSPACE parameter will also place an additional 2K
burden on each process's SPT requirements. For this machine, I can
realistically create only 35 additional processes (71 clicks * 2K /
4K).

What this all means is that systems where SPT space is tight will
exhibit the symptoms you've described: character echo at the terminal
is okay, but no processes appear to be in execution. System
degradation appears to occur slowly, rather than "all at once".
Generally, no error messages are written to the console. Crash(1m)
shows the SPT space to be generally less than 4 segments, with a
*total* number of units less than 120.

The cure? Well, if possible, increase your Kernel Address Space size.
If there is not already a configuration parameter for this purpose,
your only alternatives are to reduce the number of buffers and clists,
in order to furnish more kernel logical address space for SPT usage,
and delete any kernel features and drivers you do not need. Failing
this, you get to buy another machine...

Perhaps this is not your actual situation, but it sure sounds *PAINFULLY*
similar to situations that I've recently encountered....

+------------------------------------------------------------------+
|Sam Vause, NCR Corporation, Customer Services - TOWER Support     |
|3325 Platt Springs Road, West Columbia, SC 29169   (803) 791-6953 |
|                  vause@cs-col.Columbia.NCR.COM                   |
|               ...!uunet!ncrlnk!ncrcae!cs-col!vause               |
|          ...!ucbvax!sdcsvax!ncr-sd!ncrcae!cs-col!vause           |
+------------------------------------------------------------------+
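The headroom arithmetic in the post above (71 units * 2K / 4K) can be sketched in a few lines of Python. The function names are hypothetical. The linear formula for per-process SPT cost is an assumption extrapolated from the two figures given (4K at MAXSPACE = 1MB, plus 2K per additional 1MB). The "tight" check simply encodes the rule of thumb from the post and is not a documented kernel limit.

```python
UNIT_KB = 2  # each SPT unit is 2K, per the post

def per_process_spt_kb(maxspace_mb):
    """Per-process SPT cost in KB.

    Assumption: 4K at MAXSPACE = 1MB, plus 2K for each further 1MB.
    """
    return 4 + 2 * (maxspace_mb - 1)

def process_headroom(free_units, maxspace_mb=1):
    """How many more processes fit in the free SPT units."""
    return (free_units * UNIT_KB) // per_process_spt_kb(maxspace_mb)

def looks_tight(num_segments, total_units):
    """Rule of thumb from the post: under 4 segments and under 120 units."""
    return num_segments < 4 and total_units < 120

print(process_headroom(71))     # 35, matching the figure in the post
print(process_headroom(71, 2))  # 23 if MAXSPACE were raised to 2MB
print(looks_tight(2, 168))      # False for the sample machine
```

Only the second segment (71 units) is counted here, since the post describes the first segment as reserved for the kernel itself.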
nusip@maccs.McMaster.CA (Mike Borza) (04/09/89)
In article <228@cs-col.Columbia.NCR.COM> vause@cs-col.Columbia.NCR.COM (Sam Vause) writes:
>In article <6226@homxc.ATT.COM> mrb1@homxc.ATT.COM (M.BAKER) writes:
>> We have an AT&T 6386E system running UNIX SysV/3.2.
>> While running our application, it has been observed to
>> 'hang'. Specifically, the application stops in the
>> [more info about hangs deleted...]
>
>Well, it's possible that the clist increment mentioned later
>in the original posting is actually *hurting* the situation, rather
>than helping.
>
>My experience indicates that this symptom is possibly from a variety
>of situations, but personal observation leads me to believe that the
>kernel logical address space is being exhausted.
>

This also concurs with my experience. Our 386 with 4 MB of memory is
used to develop X-Windows applications under ISC 386/ix (1.0.6).
Based on analysis of sar output, we decided to increase, among other
things, the number of disk buffers. Under unpredictable
circumstances, the system would slow to a crawl, sometimes dying and
sometimes recovering without any intervention. At other times, the
system would just die, sometimes echoing characters typed at the
console and serial terminals, sometimes not.

Reducing the amount of space allocated to some of the bigger kernel
resources always restored reliable functionality for our application
mix. Prior to acquiring X, we were able to use substantially more
disk buffers before encountering this problem. Most interesting (and
annoying).

>Perhaps this is not your actual situation, but it sure sounds *PAINFULLY*
>similar to situations that I've recently encountered....

Sure does. Thanks.

>|Sam Vause, NCR Corporation, Customer Services - TOWER Support |

Mike Borza <antel!mike@maccs.uucp>
Antel Optronics Inc.