[mod.computers.vax] Thanks and system hangs.

BITNET@UTHSCSA.BITNET.UUCP (02/28/87)

I have been a computer programmer for a little over two years and I have
subscribed to info-vax for a little over a year now.  I was recently
promoted to a position in our systems group.  I would like to say thank you
to all the people out there (especially Jerry) for all the great info.  I can
honestly say that if I didn't have the knowledge acquired from reading
info-vax, I probably would not have gotten my current position.
 
Now for my question.  We are experiencing a strange problem on one of the Vaxen
in our cluster (consisting of 2 11/750s, 1 11/785, 1 8650).  We recently
increased the number of logins allowed from 64 (default) to 120 on our 8650.
Shortly thereafter the system would start to hang with about 70 or 80 users.
The system would slow down and processes would be placed in a RWSWP state.
These processes would hang completely.  Eventually the system would hang
completely and no one could do anything, even at the console.  We did a ^P and
@CRASH at the console. I am still learning how to use SDA and analyze a crash
dump but here is what I found.  If I did a 'SHOW SUM/IMAGE' in SDA, some of the
processes would have 'No Image Name Available'.  This is not the same as a
process that is not currently executing an image (DCL); these processes would
just not have any image name.  Most of the processes in the RWSWP state were
running ALL-IN-1, a DEC office automation package.  When I set 'RMS=ALL' in SDA
and did 'SHOW PROC/RMS/INDEX=' for the processes in the RWSWP state, they were
all accessing DUA6:.  We use DUA6: for the ALL-IN-1 shared areas and our
secondary page and swap files.  The bitmask 'BKPBITS' had the following bits
set for the majority of the processes in the RWSWP state:
 
BUSY, ACCESSED, RMS_STALL, STALL_LOCK
 
The last time the system started doing this I was able to do a 'SHOW MEM'
from DCL and our swap file was more than 90% used, but the page file had
plenty of free space.
 
The best we could come up with is that our swap files don't have enough
space.  We increased our secondary swap file from 50,000 blocks to 100,000
blocks; so far, so good.  Could it be something else?
 
1. What do the BKPBITS bits mean?  Are these values normal?
 
2. What is RWSWP?  (My guess is Resource Wait SWaPped.)
 
3. Where can I learn more about how to interpret what I get from SDA?
 
4. If this was the problem, how can I determine how much space per process
   should be allocated in the page and swap files?  And why are so many
   processes being swapped out?
 
Mark Moore
(Green Assistant-System-Manager who will gladly accept all the info he can get)
MOORE@UTHSCSA.BITNET
 
P.S.  I was going to call TSC, but no one answers.  I think they are having some
severe weather problems.

LEICHTER-JERRY@YALE.ARPA.UUCP (03/03/87)

    We recently increased the number of logins allowed from 64 (default) to
    120 on our 8650.  Shortly thereafter the system would start to hang with
    about 70 or 80 users....

    The last time the system started doing this I was able to do a 'SHOW MEM'
    from DCL and our swap file was more than 90% used, but the page file had
    plenty of free space.
     
    The best we could come up with is that our Swap files don't have enough
    space.  We increased our secondary Swap file from 50,000 blocks to 100,000
    blocks so far so good.  Could it be something else?

It's difficult to diagnose this at a distance, so I'll have to throw out a
couple of ideas and hope they turn out to be useful:

a)  After increasing the number of users, did you run AUTOGEN?  A lot of
things depend on the number of users.

b)  You mention checking and increasing the amount of swap file space.
Did you also check for enough pagefile space this time?

c)  When you did the SHOW MEM, did you notice how full the primary pagefile
looked?  Unfortunately, not all page files are completely equal.  Some things
can go ONLY to the primary page file.  In particular, page file sections will
always live in the primary.  RMS global buffers are page file sections, and
ALL-IN-1 uses them....

    2. What is RWSWP (my guess is Resource Wait SWaPped)

Well, more or less.  The swap space allocated for a process grows and shrinks
as the process's working set grows and shrinks.  RWSWP means the process is
trying to increase its working set, so it needs more swapfile space, and is
waiting for it to become available.  The unavailability arises from lack of
total swapfile space, or from internal fragmentation of the swapfile space.
(This has nothing to do with disk or file fragmentation - it just means that
the free space in the swapfile is split up by space allocated to other
processes.)
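
To make the fragmentation point concrete, here is a toy model in Python.  It
is only a sketch, not the actual VMS allocator: it assumes each process needs
one contiguous run of blocks for its swap slot, and the file size, slot
positions and request size are all hypothetical numbers.

```python
# Toy model of swapfile fragmentation.  NOT the real VMS allocator.
# Assumption: a swap slot must be a single contiguous run of blocks.

def free_extents(total_blocks, allocations):
    """Return the free (start, length) runs left around allocated slots."""
    extents = []
    cursor = 0
    for start, length in sorted(allocations):
        if start > cursor:
            extents.append((cursor, start - cursor))
        cursor = max(cursor, start + length)
    if cursor < total_blocks:
        extents.append((cursor, total_blocks - cursor))
    return extents

def can_allocate(extents, request):
    """A request succeeds only if some single free run is large enough."""
    return any(length >= request for _, length in extents)

# Hypothetical 1000-block file with four 200-block slots scattered through it.
allocations = [(50, 200), (300, 200), (550, 200), (800, 200)]
extents = free_extents(1000, allocations)
total_free = sum(length for _, length in extents)
largest_free = max(length for _, length in extents)

# A 120-block request fits in the total free space but in no single free run.
request = 120
print(total_free, largest_free, can_allocate(extents, request))
```

In this model the file has 200 free blocks in total, yet the largest free run
is only 50 blocks, so a process asking for 120 contiguous blocks would have to
wait even though the file looks only 80% full.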

Again, make sure you have adequate pagefile space.  VMS can page to the
swapfile (or swap to the pagefile? - I forget which) if it must in order to
keep running, but the allocation strategies for the two kinds of files
conflict, and the result will be a badly fragmented swapfile.  (Actually, this
probably has nothing to do with your problem, since you would probably have to
install the same file as both a pagefile and a swapfile for it to happen.)

    3. Where can I learn more about how to interpret what I get from SDA?

As a start, get hold of the VMS Internals and Data Structures book, by Kenah
and Bates.  It's published by Digital Press.  Unfortunately, it still
corresponds to V3 - a V4 version is due out "any time now".

There are various courses on internals that are available through DEC, and
others through DECUS.  Much of the information available in those courses and
their handouts is available nowhere else.

    4. If this was the problem, how can I determine how much space per process
       should be allocated in the page and swap files.  And why are so many
       processes being swapped out?

If many processes are swapped out, the value of the SYSGEN BALSETCNT is too
low.

AUTOGEN does a fairly good job of determining reasonable numbers for this and
related parameters.  You might also look into VPA (VMS Performance Analyser?),
a DEC product that watches your system running for a while, then recommends
things you should do to improve performance.

							-- Jerry
-------