[mod.computers.vax] system hangs

BRUCE@UC780.BITNET.UUCP (02/11/87)

HELP!!!
 
We have a problem....
 
Last night our VAX just hung, no terminal response, no system console
response, nothing....
 
We run an EMULEX SC41 controller on our Fujitsu Eagle drive (RA81 look-alike)
as our system disk, EMULEX SC750 controllers for our data Fujitsu Eagle drives,
and have just upgraded to VMS 4.5 and Microcode level 99 on our 750.
 
We were running a database backup from one disk to another on the MASSBUS
(sc750) and a BACKUP from MASSBUS disk to UNIBUS tape and possibly some other
UNIBUS activity.....
 
 
HELP!   Does anyone have any clues?  DEC et al. have no ideas.
 
 
THANKS in advance for any responses!

chris@MIMSY.UMD.EDU.UUCP (02/12/87)

>Last night our VAX just hung, no terminal response, no system console
>response, nothing....

Not even control P.  Only the RESET button works.

>We run an EMULEX SC41 controller

There is the problem.  Call Emulex and get the latest ROMs.

(I had this happen constantly under very high Unibus loads with a
local hack to 4.3BSD.  We moved the SC41/MS to its own Unibus and
the problems went away; we later got new ROMs, moved it back, and
the problem has not recurred.  I told Emulex about the bug long
ago; they did not seem to listen, but apparently newer versions of
VMS can work the Unibus hard enough to trigger it, and so they
found it.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu

DCOTTLER@rca.com.UUCP (03/01/87)

>...We are experiencing a strange problem on one of the Vaxen in our
>cluster (consisting of 2 11/750s, 1 11/785, 1 8650).  We recently
>increased the number of logins allowed from 64 (default) to 120 on our
>8650. Shortly thereafter the system would start to hang with about 70
>or 80 users. The system would slow down and processes would be placed in
>a RWSWP state. These processes would hang completely.  Eventually the
>system would completely hang and no one could do anything, including the
>console... ...processes in the RWSWP state... The last time the system
>started doing this I was able to do a 'SHOW MEM' from DCL and our swap
>file was more than 90% used, but the page file had plenty of free space.

We also went thru this problem when we first expanded our VAXcluster.  
It took the TSC several weeks to get back to us.  By that time, via the 
VMS internals manual and the doc set, we solved it ourselves.  The TSC 
answer simply confirmed it...

The RWxxx, Resource Wait, process states are simply breakdowns of the 
MWAIT, Misc Wait, state.

RWSWP -- Resource Wait for Swap File Space -- indicates that those 
processes were hung waiting for space in the swap file.

Under VMS V4, swap file and page file are allocated a bit differently
than in VMS V3.  See the tuning guide for details -- the bottom line is
that each process must have a minimum area in the swap file. This space
is used for process header, page tables, etc. If that area isn't
available when needed, the process hangs.  After a while, this situation
will either clear itself when processes are deleted, thus freeing space,
or the problem will snowball and hang the entire system.

In a VAXcluster, it is often/usually the case that:
 you have NO swap file on the system disk;
 you have a MINIMUM page file on the system disk;
 you have large secondary page and swap files on other disks.
This organization is done for performance purposes.  In a VAXcluster,
you can easily swamp a system disk if you are doing extensive paging or
swapping on it.  The minimum page file is just big enough to get VMS booted;
you then mount the other disks and enable their secondary page and swap
files from systartup.com. Thus all real paging/swapping gets done to
the secondary files.  This is also a BIG disk space savings in large
VAXclusters -- when you have a lot of layered software, and large memory
VAXes (so you need large sysdump files) space on the system disk is
critical. 

Before VMS V4, the minimum size of the page file on the system disk was 
4K blocks.  In VMS V4, due to size changes in the executive, the minimum 
is now 8K blocks; i.e., if your page file on the system disk is smaller 
than 8K blocks, you may hang during boot.

Actual sizing of these files will vary depending on your application.  
AUTOGEN looks for a pagefile of at least 2*VIRTUALPAGECNT.  This may or 
may not be acceptable.
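
To put a number on that AUTOGEN rule of thumb, here is a minimal Python
sketch.  It is only arithmetic, not anything AUTOGEN itself runs.  It
assumes a VAX page and a disk block are both 512 bytes, so VIRTUALPAGECNT
(in pages) converts to blocks one for one; the function names are made up
for illustration, and 128000 is the VIRTUALPAGECNT quoted for the 8650
in this post.

```python
# Rough arithmetic for the "pagefile >= 2 * VIRTUALPAGECNT" rule of thumb.
# Assumption: one VAX page = one disk block = 512 bytes.

BYTES_PER_BLOCK = 512
BYTES_PER_MB = 1024 * 1024


def min_pagefile_blocks(virtualpagecnt):
    """Smallest page file (in blocks) that satisfies the 2x rule."""
    return 2 * virtualpagecnt


def blocks_to_mb(blocks):
    """Convert 512-byte blocks to megabytes."""
    return blocks * BYTES_PER_BLOCK / BYTES_PER_MB


needed = min_pagefile_blocks(128000)
print(needed, "blocks, about", blocks_to_mb(needed), "MB")
# -> 256000 blocks, about 125.0 MB
```

Whether twice VIRTUALPAGECNT is enough still depends on how many large
processes run at once, which is why the rule may or may not be acceptable.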


Our VAXcluster has a VAX-8650 with 52MB of real memory. Our main (and
largest) memory applications are VLSI design and AI.  We need to supply
our users with 40 to 60MB virtual memory. 

In our case, our 8650 was hanging because the secondary swap file was 
MUCH too small.  This has been adjusted and we are now running with:

 VIRTUALPAGECNT = 128000
 WSMAX          = 33000
 IJOBLIM        = 90
 BJOBLIM        = 5
 (for the memory files listed below, S: is the system disk,
  and U: is a user disk.)
 S:[SYS7.SYSEXE]SWAPFILE.SYS  - Doesn't exist.
 S:[SYS7.SYSEXE]PAGEFILE.SYS  - 8192 blocks.
 S:[SYS7.SYSEXE]SYSDUMP.DMP   - 106500 blocks (52MBish+dump header)
 U:[VAXVMS]SWAPFILE.SYS       - 100000 blocks.
 U:[VAXVMS]PAGEFILE.SYS       - 300000 blocks.
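
The figures above can be sanity-checked with a little arithmetic.  The
Python sketch below is an illustration only; it assumes 512-byte pages and
512-byte disk blocks, and the helper names are invented.  The 4 blocks of
dump overhead is simply what is left over after subtracting 52MB of memory
from the 106500 blocks listed, not a documented constant.

```python
# Sanity check of the listed parameters, assuming 512-byte pages and blocks.

BYTES_PER_PAGE = 512
BYTES_PER_MB = 1024 * 1024


def mb_to_blocks(megabytes):
    """Whole megabytes expressed as 512-byte blocks."""
    return megabytes * BYTES_PER_MB // BYTES_PER_PAGE


def pages_to_mb(pages):
    """512-byte pages expressed as megabytes."""
    return pages * BYTES_PER_PAGE / BYTES_PER_MB


memory_blocks = mb_to_blocks(52)          # 52MB of real memory
dump_overhead = 106500 - memory_blocks    # what SYSDUMP.DMP adds on top

print("memory image:  ", memory_blocks, "blocks")      # 106496
print("dump overhead: ", dump_overhead, "blocks")      # 4
print("VIRTUALPAGECNT:", pages_to_mb(128000), "MB")    # 62.5
print("WSMAX:         ", round(pages_to_mb(33000), 1), "MB")
```

So VIRTUALPAGECNT = 128000 gives each process a ceiling of about 62.5MB of
virtual memory, in line with the 40 to 60MB we need to supply.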

During the day, the 8650's secondary swapfile runs about 60 to 70% full. 
The secondary page file runs around 80% full, but has been known to 
completely fill if too many chip designers get too ambitious at once.  
This is a limitation we've decided to live with because we can't afford 
the disk storage to size the page file to the 450,000 blocks it actually 
requires.
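
For a feel of how much headroom those percentages leave, here is a small
Python sketch.  It is plain arithmetic on the numbers in this post, again
assuming 512-byte blocks; the function names are hypothetical and nothing
here queries a real system.

```python
# Headroom and shortfall for the secondary page and swap files.
# Assumption: 512-byte disk blocks.

BYTES_PER_BLOCK = 512
BYTES_PER_MB = 1024 * 1024


def free_blocks(size_blocks, percent_used):
    """Blocks still free in a file that is percent_used percent full."""
    return size_blocks * (100 - percent_used) // 100


def shortfall_blocks(required_blocks, actual_blocks):
    """How many blocks short the file is of the required size (0 if none)."""
    return max(0, required_blocks - actual_blocks)


# Secondary page file: 300000 blocks, around 80% full during the day.
print("page file free:", free_blocks(300000, 80), "blocks")     # 60000

# Secondary swap file: 100000 blocks, 60 to 70% full during the day.
print("swap file free:", free_blocks(100000, 70), "to",
      free_blocks(100000, 60), "blocks")                        # 30000 to 40000

# Gap between the page file we have and the 450,000 blocks really needed.
short = shortfall_blocks(450000, 300000)
print("page file shortfall:", short, "blocks, about",
      round(short * BYTES_PER_BLOCK / BYTES_PER_MB, 1), "MB")
```

The shortfall works out to 150000 blocks, which is the disk space we
decided we could not afford.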

I hope this helps.  The full explanations can be found in the books 
distributed in DEC's VAX Performance Seminar.


					Dan Cottler
					<dcottler@rca.com>
					RCA Advanced Technology Laboratories
					Moorestown, NJ