braun@drivax.UUCP (02/04/87)
We managed to crash our Vax 11/780 this week. I was wondering if anybody could help me understand what went on. I have some strong suspicions, but if they are right, they point out some pretty weak places in Unix; the kind that my co-workers point at and say "See, Unix isn't a REAL operating system! Real Operating Systems wouldn't do that", then go back to the VMS machine, snickering.

Anyway, the symptoms were as follows: A user called and said that some of his processes were hung, and although he had a prompt, he couldn't kill -9 any of them. Another user was having the same problem. It turns out that no one could kill any of these processes. Other processes could still be killed; only those that were in STAT 'D' (or 'DW') could NOT be killed. This makes sense to me, as I assume that a process has to come off of an event list before it can be killed. If this assumption is correct, it seems like a weak point, but I can appreciate the difficulty of killing processes waiting on events.

Anyway, the system finally ground to a halt, although not for some time (about 15 minutes after the first report came in). It turns out that no one had been writing to the file system that was being used by the processes in question.

One of the processes was a compile in the assembly phase. This was not the native Unix compiler, but a cross compiler for another architecture. It is only slightly suspected of being the culprit, and only because it is a 3rd party product with a relatively short history.

A stronger candidate was a combination of 'make' processes and a .logout process which, by a strong coincidence, happened to be executing at the same time. The combination of processes produced the following tasks:

    1: compile, creating file x.o
    2: mv x.o /work/user/trashcan/x.o
    3: rm /work/user/trashcan/x.o

(1) was the result of 'make'ing x. (2) is the result of the user's "rm" command being aliased to "mv \!* ~/trashcan". (3) is the result of the user's .logout containing "/bin/rm /work/user/trashcan/*", and the user logging out while (1) and (2) were running.

This makes me think that the file system either bogged down or got confused trying to chase its own tail. And although I don't have a solution to the problem offhand, I think that this type of thing shouldn't bring a system to its knees.

Do you think I have made an adequate assessment of the problem? Do you agree or disagree with my opinion of it? Mail or Followup as appropriate.

--
kral 408/647-6112 ...!{amdahl,ihnp4}!drivax!braun
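
To make the ordering question in the post above concrete, here is a minimal Python sketch that runs the three tasks (compile, mv, rm) in a chosen order inside a scratch directory and reports what is left in the trashcan. It is only an illustration of how the outcome depends on interleaving; it does not reproduce the hang, and on a healthy system every ordering completes. The function name `run_sequence`, the temporary directory standing in for /work/user, and the dummy file contents are all assumptions made for this example.

```python
import os
import tempfile


def run_sequence(order):
    """Run the tasks named in `order` inside a scratch directory.

    Returns (results, leftover): results maps each task name to "ok" or
    to the name of the OSError subclass it raised; leftover is the sorted
    list of files still in the trashcan afterwards.
    """
    with tempfile.TemporaryDirectory() as work:  # stand-in for /work/user
        trash = os.path.join(work, "trashcan")
        os.mkdir(trash)
        obj = os.path.join(work, "x.o")
        trashed = os.path.join(trash, "x.o")

        def compile_task():
            # Task 1: the compile creates x.o (dummy bytes here).
            with open(obj, "wb") as f:
                f.write(b"\0" * 16)

        def mv_task():
            # Task 2: the aliased "rm" moves x.o into the trashcan.
            os.rename(obj, trashed)

        def rm_task():
            # Task 3: .logout removes whatever is in the trashcan right now.
            for name in os.listdir(trash):
                os.remove(os.path.join(trash, name))

        tasks = {"compile": compile_task, "mv": mv_task, "rm": rm_task}
        results = {}
        for name in order:
            try:
                tasks[name]()
                results[name] = "ok"
            except OSError as exc:
                results[name] = type(exc).__name__
        leftover = sorted(os.listdir(trash))
    return results, leftover


if __name__ == "__main__":
    for order in (["compile", "mv", "rm"],
                  ["compile", "rm", "mv"],
                  ["mv", "compile", "rm"]):
        print(order, run_sequence(order))
```

The sketch runs the tasks one after another, so it only shows which final states are possible; the crash described in the post would require the operations to overlap inside the kernel, which user-level code like this cannot force.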
amos@instable.UUCP (02/04/87)
If you want help from this group please *always* state the type of system you're on, and preferably the entire hardware configuration.

Now, down to business: the 'D' status is DELAYED, which usually means the process is waiting for a device at a high priority; this should last for only a fraction of a second. The symptoms indicate one of the following:

- A known problem in BSD4.2's 'close' routine, which crashes the system if a process tries to remove a file while another one is creating it. This also appeared in Ultrix 1.1; as far as I know it has been fixed in BSD4.3.

- The Fujitsu Eagle disk drive has been known to put itself off-line spontaneously when used with a certain type of controller. This would leave processes hanging, waiting for an interrupt.

- Some types of terminal multiplexing devices also tend to hang.

There are a few other possibilities, but since I know nothing about your sw/hw configuration, I can't give a better answer.

BTW: Does anyone out there wish to contribute horror stories about the behaviour of Real Systems? :-)

--
Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel
(011-972) 52-522261
amos%nsta@nsc.com
34.48'E 32.10'N
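
Both posts hinge on spotting processes whose STAT begins with 'D'. As a rough sketch of that check, the Python function below picks such rows out of a ps-style listing. It assumes the listing has a header line containing PID and STAT columns and that the command is the last column; the function name `stuck_processes` and the sample rows are invented for illustration and are not taken from the machine in question.

```python
def stuck_processes(ps_output):
    """Return (pid, stat, command) for rows whose STAT starts with 'D'."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    stuck = []
    for line in lines[1:]:
        # The command is the last column and may contain spaces, so split
        # only as many times as there are columns before it.
        fields = line.split(None, len(header) - 1)
        if len(fields) <= stat_col:
            continue  # malformed or truncated row
        if fields[stat_col].startswith("D"):
            stuck.append((int(fields[pid_col]), fields[stat_col], fields[-1]))
    return stuck


# Hypothetical listing, made up for this example.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
 1201 p0 S     0:02 -csh (csh)
 1374 p0 D     0:11 as -o x.o x.s
 1380 p1 DW    0:00 mv x.o /work/user/trashcan/x.o
 1391 p2 R     0:00 ps
"""

if __name__ == "__main__":
    for pid, stat, command in stuck_processes(SAMPLE):
        print(pid, stat, command)
```

Parsing captured text rather than invoking ps directly keeps the sketch independent of which ps flags a given system accepts. A single snapshot is not enough to diagnose a hang, since a healthy process can pass through 'D' briefly; comparing listings taken a few seconds apart would show which ones stay there.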