BITNET@UTHSCSA.BITNET.UUCP (02/28/87)
I have been a computer programmer for a little over two years and have subscribed to info-vax for a little over a year now. I was recently promoted to a position in our systems group. I would like to say thank you to all the people out there (especially Jerry) for all the great info. I can honestly say that if I didn't have the knowledge acquired from reading info-vax, I probably would not have gotten my current position.

Now for my question. We are experiencing a strange problem on one of the Vaxen in our cluster (consisting of two 11/750s, one 11/785, and one 8650). We recently increased the number of logins allowed from 64 (the default) to 120 on our 8650. Shortly thereafter the system started to hang with about 70 or 80 users. The system would slow down and processes would be placed in the RWSWP state. These processes would hang completely. Eventually the whole system would hang and no one could do anything, including at the console. We did a ^P and @CRASH at the console.

I am still learning how to use SDA and analyze a crash dump, but here is what I found. When I did a 'SHOW SUM/IMAGE' in SDA, some of the processes showed 'No Image Name Available'. This is not the same as a process that is not currently executing an image (DCL); these processes simply had no image name. Most of the processes in the RWSWP state were running ALL-IN-1, a DEC office automation package. When I set 'RMS=ALL' in SDA and did 'SHOW PROC/RMS/INDEX=' for the processes in the RWSWP state, they were all accessing DUA6:. We use DUA6: for the ALL-IN-1 shared areas and for our secondary page and swap files. The bitmask 'BKPBITS' had the following bits set for the majority of the processes in the RWSWP state: BUSY, ACCESSED, RMS_STALL, STALL_LOCK.

The last time the system started doing this I was able to do a 'SHOW MEM' from DCL. Our swap file was more than 90% used, but the page file had plenty of free space. The best we could come up with is that our swap files don't have enough space. We increased our secondary swap file from 50,000 blocks to 100,000 blocks; so far so good. Could it be something else?

1. What do the BKPBITS bits mean? Are these values normal?
2. What is RWSWP? (My guess is Resource Wait SWaPped.)
3. Where can I learn more about how to interpret what I get from SDA?
4. If this was the problem, how can I determine how much space per process should be allocated in the page and swap files? And why are so many processes being swapped out?

Mark Moore
(Green Assistant-System-Manager who will gladly accept all the info he can get)
MOORE@UTHSCSA.BITNET

P.S. I was going to call TSC, but no one answers. I think they are having some severe weather problems.
LEICHTER-JERRY@YALE.ARPA.UUCP (03/03/87)
    We recently increased the number of logins allowed from 64 (default) to 120 on our 8650. Shortly there after the system would start to hang with about 70 or 80 users.... The last time the system started doing this I was able to do a 'SHOW MEM' from DCL and our swap file was more that 90% used, but the page file had plenty of free space. The best we could come up with is that our Swap files don't have enough space. We increased our secondary Swap file from 50,000 blocks to 100,000 blocks so far so good. Could it be something else?

It's difficult to diagnose this at a distance, so I'll have to throw out a couple of ideas and hope they turn out to be useful:

a) After increasing the number of users, did you run AUTOGEN? A lot of things depend on the number of users.

b) You mention checking and increasing the amount of swap file space. Did you also check for enough pagefile space this time?

c) When you did the SHOW MEM, did you notice how full the primary pagefile looked? Unfortunately, not all page files are completely equal. Some things can go ONLY to the primary page file. In particular, page file sections will always live in the primary. RMS global buffers are page file sections, and ALL-IN-1 uses them....

    2. What is RWSWP (my guess is Resource Wait SWaPped)

Well, more or less. The swap space allocated for a process grows and shrinks as the process's working set grows and shrinks. RWSWP means the process is trying to increase its working set, so it needs more swapfile space and is waiting for that space to become available. The unavailability arises from a lack of total swapfile space, or from internal fragmentation of the swapfile space. (This has nothing to do with disk or file fragmentation - it just means that the free space in the swapfile is split up by space allocated to other processes.)

Again, make sure you have adequate pagefile space. VMS can page to the swapfile (or swap to the pagefile? - I forget which) if it has to in order to keep running, but the allocation strategies for the two kinds of files conflict, and the result will be a badly fragmented swapfile. (Actually, this probably has nothing to do with your problem, since you would probably have to install the same file as both a pagefile and a swapfile for that to happen.)

    3. Where can I learn more about how to interpret what I get from SDA?

As a start, get hold of the VMS Internals and Data Structures book, by Kenah and Bate. It's published by Digital Press. Unfortunately, it still corresponds to V3 - a V4 version is due out "any time now". There are various courses on internals that are available through DEC, and others through DECUS. Much of the information available in those courses and their handouts is available nowhere else.

    4. If this was the problem, how can I determine how much space per process should be allocated in the page and swap files. And why are so many processes being swapped out?

If many processes are swapped out, the value of the SYSGEN parameter BALSETCNT is too low. AUTOGEN does a fairly good job of determining reasonable numbers for this and related parameters. You might also look into VPA (VMS Performance Analyser?), a DEC product that watches your system running for a while, then recommends things you should do to improve performance.

-- Jerry
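The point about internal fragmentation of the swapfile can be illustrated with a toy model. The Python sketch below is only an illustration, not VMS's actual allocator: it assumes a request must be satisfied from a single contiguous run of free blocks (which is what the fragmentation remark implies), and the function names and the 1,000-block layout are hypothetical.

```python
def largest_free_extent(total_blocks, allocations):
    """Return (total_free, largest_contiguous_free) for a toy swap file.

    allocations is a list of non-overlapping (start_block, length) tuples.
    This is a simplified model, not the real VMS swap-slot allocator.
    """
    free_total = 0
    largest = 0
    cursor = 0  # first block not yet accounted for
    for start, length in sorted(allocations):
        gap = start - cursor          # free hole before this allocation
        free_total += gap
        largest = max(largest, gap)
        cursor = start + length
    tail = total_blocks - cursor      # free space after the last allocation
    free_total += tail
    largest = max(largest, tail)
    return free_total, largest


def can_grow(total_blocks, allocations, request):
    """True if some contiguous free run is at least `request` blocks.

    Assumption: a growing process needs one contiguous extent.
    """
    _, largest = largest_free_extent(total_blocks, allocations)
    return request <= largest


# Hypothetical layout: four 200-block slots separated by 50-block holes.
allocs = [(0, 200), (250, 200), (500, 200), (750, 200)]
free_total, largest = largest_free_extent(1000, allocs)
print(free_total, largest)           # 200 blocks free, largest hole is 50
print(can_grow(1000, allocs, 150))   # False: enough total space, no hole big enough
```

In this model a process asking for 150 blocks would wait even though 200 blocks are free in total, which is the situation described above: the free space exists but is split up by space allocated to other processes.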