richk@june.cs.washington.edu (Richard Korry) (10/13/89)
behavior: jobs that are swapped (ps stat = W) get placed on the run queue (ps stat = RW) but never seem to run. the load average shoots up but nothing is actually executing. The only fix has been to reboot. Anyone else ever seen this?

rich
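The symptom above (processes whose ps state reads RW, runnable but swapped out) can be picked out of a ps listing mechanically. The following is only a minimal sketch: the function name `swapped_runnable`, the column layout, and the sample listing are all made up for illustration and are not taken from any poster's system. It assumes a header row containing PID and STAT columns and whitespace-separated fields.

```python
def swapped_runnable(ps_text):
    """Return (pid, stat) pairs for processes whose STAT begins with 'RW'.

    Assumes the first non-blank line is a header naming the PID and STAT
    columns, and that no field to the left of STAT is blank or contains spaces.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    hits = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) <= max(pid_col, stat_col):
            continue  # malformed or truncated line, skip it
        stat = fields[stat_col]
        if stat.startswith("RW"):
            hits.append((int(fields[pid_col]), stat))
    return hits


# Hypothetical listing, invented for this example only.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
   71 ?  S     0:03 named
  412 p0 RW    0:00 emacs
  455 p1 RW    0:00 csh
  460 p1 W     0:01 vi
"""

print(swapped_runnable(SAMPLE))
```

A count of such entries that keeps growing while the load average climbs would match the behavior described, though that says nothing about the cause.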
litwack@operations.dccs.upenn.edu (Mark Litwack) (10/13/89)
In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:
>behavior: jobs that are swapped (ps stat = W) get placed on the
>run queue (ps stat = RW) but never seem to run. the load average shoots
>up but nothing is actually executing. The only fix has been to reboot.
>Anyone else ever seen this?
> rich

Yes, I'm glad (in a way) that we aren't the only ones. DEC claims that they've never heard of a problem like this.

Our system hangs repeatedly, either every 3-5 days or after only a few hours. Once the problem starts happening it locks up in two flavors: quickly in a matter of 5 minutes, or slowly over a period of 2-3 hours. In both cases pstat always shows processes runnable but swapped. Eventually, I guess the wrong process gets swapped out and the system dies completely.

We've noted that some processes that are already running and don't need to start any new processes work ok. named and nfsd both work fine when the system is in this state. inetd works fine until it has to fork off a telnetd or other server for you. More often than not, large programs on the order of 1-6meg have been running when the lockup occurs.

We've had a series of other very strange problems and I don't know if they're related or not:

- Large programs, like emacs, sometimes get a segmentation fault when they start. If you run something else then try again it works. Usually no core file is produced. Once it did, and it was the core file of a csh.

- Someone running emacs on a small plain text file crashed (not hung) our system twice with a "tlbmod on invalid pte" error. I have no idea what this error means.

- The crash dump procedure included with Ultrix 3.1 failed the first two times we tried it. The dump code appears to have been corrupted. It worked on subsequent tries.

We've tried everything that we can think of to stop this from happening. We've already tried a different system unit, altering configuration parameters, and disabling any suspect local or third party programs. In desperation, we even tried rebooting the system once a day from crontab. Nothing helped.

I've got about 5 people at DEC looking at this now. They have several guesses:

- A manufacturing problem caused memory modules to not be seated correctly and the diagnostics won't pick it up (this was a DEC internal report). So, they were here today to reseat my memory modules. I am skeptical of this being the problem.

- They say our system is "memory starved". They recommended doubling our memory from 12meg to 24meg. Granted, 12meg is not much for a system with a lot of users, let alone a RISC architecture, but it still shouldn't cause this type of problem. Our swap space has never been over half full with 64meg configured.

- They also recommended that we drop our maxuprc from 100 back to the default 50. Ok, I'll try it, but it isn't a large savings.

It looks to me like this is a kernel problem and not a parameter tuning or configuration mismatch.

-mark
paul@speedmetal.engin.umich.edu (Paul Killey) (10/13/89)
In article <15444@netnews.upenn.edu> litwack@operations.dccs.upenn.edu (Mark Litwack) writes:
>In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:

About weird swapping/paging behavior ...

Same here with both of the above!

--paul
"Don't call me baby when she's waiting in the car."
swilson@pprg.unm.edu (Scott Wilson [CHTM]) (10/13/89)
In article <15444@netnews.upenn.edu> litwack@operations.dccs.upenn.edu (Mark Litwack) writes:
>In article <9475@june.cs.washington.edu> richk@june.cs.washington.edu (Richard Korry) writes:
>>behavior: jobs that are swapped (ps stat = W) get placed on the
>>run queue (ps stat = RW) but never seem to run. the load average shoots
>>up but nothing is actually executing. The only fix has been to reboot.
>>Anyone else ever seen this?
>> rich
>
> Yes, I'm glad (in a way) that we aren't the only ones.

We have seen something like this too. I happened to catch a "sysmon" of what it was doing. I had a program running that had dynamically allocated large amounts of memory, in little bits at a time. At some point the program wanted "just a little more memory", say 200 bytes. The kernel would SWAP the entire working set out to disk, and then fault it all back in a little shred at a time until we finally got back to asking for the SAME "little bit of memory", and then this would all happen again, ad infinitum.

Load average goes nuts. May or may not be able to suspend or kill the job. May lock up other system resources too. Oftentimes, the amount of free memory was still a Meg or so...

This was apparently "fixed" when we upgraded to the latest 3100 rev of Ultrix, but the behavior persists. If the memory allocation is done all in big chunks (totaling the same amount as before, but not in little chunks), then everything seems to be OK.

Note that the behavior I describe will occur even if I am the ONLY user, and (other than a few of the background processes) am the only runnable process. Working off of an NFS-mounted partition seems to make it worse, sort-of.

Scott Wilson
University of New Mexico
Center for High Technology Materials
Albuquerque, NM 87131
(505)277-0780
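The two allocation patterns contrasted in the post above (the same total requested in many small pieces versus in one large chunk) can be sketched as follows. This is an illustration of the request pattern only: the original program is not shown in the thread, the function names and the 2 MB total are hypothetical, and Python's memory manager is not the C allocator on Ultrix, so this sketch should not be expected to reproduce the swapping behavior.

```python
def allocate_in_pieces(total_bytes, piece_bytes):
    """Request total_bytes as a series of small buffers, keeping them all live."""
    pieces = []
    remaining = total_bytes
    while remaining > 0:
        size = min(piece_bytes, remaining)  # last piece may be smaller
        pieces.append(bytearray(size))
        remaining -= size
    return pieces


def allocate_at_once(total_bytes):
    """Request the same total as a single large buffer."""
    return [bytearray(total_bytes)]


TOTAL = 2 * 1024 * 1024  # hypothetical total size, not a figure from the post
PIECE = 200              # the "say 200 bytes" request size mentioned in the post

small = allocate_in_pieces(TOTAL, PIECE)
big = allocate_at_once(TOTAL)

# Both patterns hold the same number of bytes; only the request count differs.
print(len(small), "requests versus", len(big))
```

Both lists end up holding the same number of bytes, which is the point of the comparison in the post: only the size and number of individual requests changes.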
te07@edrc.cmu.edu (Thomas Epperly) (10/13/89)
I have had very similar problems. I have gotten the panic invalid pte crash many times. I also get the RW problem, judging from the way you describe it.

It has happened on several different main CPU boards and with a completely replaced set of RAM. The segmentation faults when Emacs starts also appear often.

DEC please fix this bug/these bugs!

Tom Epperly
mikem+@andrew.cmu.edu (Michael Meyer) (10/14/89)
> Excerpts from netnews.comp.sys.dec: 13-Oct-89 Re: swap problem on DS3100
> Thomas Epperly@edrc.cmu. (354)

> I have had very similar problems. I have gotten the panic invalid pte
> crash many times. I also get the RW problem from the way you describe
> it.
> It has happened on several different main CPU boards and with a
> completely replaced set of RAM. The segmentation faults when Emacs
> starts also appear often.

Me too.

--Mike