shaw@paralogics.UUCP (Guy Shaw) (09/05/89)
In article <712@skye.ed.ac.uk>, richard@aiai.UUCP (Richard Tobin) writes:
> [. . .]
> Running under SunOS 4, we occasionally encounter an annoying problem:
> a pipeline (eg cat /etc/passwd | more) will stop, with the message
>
>     Stopped (tty output)
>
> [. . .]
> Presumably using vfork() forces things to happen in the right order.

In article <1127@tukki.jyu.fi>, eloranta@tukki.jyu.fi (Jussi Eloranta) writes:
> We have the same problem here.... (with tcsh from tut.cis.ohio-state.edu)
> Does anyone have a fixed version? We don't (yet) have the sources so I would
> need binaries...

I have the same problem. I am using the binary version for Sun 3's
running SunOS 3.4, from tut.cis.ohio-state.edu, but am running SunOS 4.0.
It seemed like I got away with it until I noticed this problem when I
piped something to `less'. At first, I thought maybe it was a problem
with `less', but it is just that I didn't notice `more' exhibiting that
behavior, because I use `less' a great deal more than `more'.

The idea that using vfork() would cure this problem sounds reasonable
to me. Can anyone verify that this is all that is needed? Can anyone
direct me to a binary of tcsh for Sun 3, which has a fix for this
problem?

Thanks in advance.
--
Guy Shaw
Paralogics
paralogics!shaw@uunet.uu.net or uunet!paralogics!shaw
gwyn@smoke.BRL.MIL (Doug Gwyn) (09/06/89)
In article <243@paralogics.UUCP> shaw@paralogics.UUCP (Guy Shaw) writes:
>>     Stopped (tty output)
>> Presumably using vfork() forces things to happen in the right order.
>The idea that using vfork() would cure this problem sounds reasonable to me.

NO! All that using vfork() instead of fork() does in this case is to
change the multiprocess timing so that the real problem, a race condition
involving process groups, is less evident. Chris Torek recently posted
the explanation and suggested fix (set the process group N+1 times in
an N-process pipeline).
shaw@paralogics.UUCP (Guy Shaw) (09/09/89)
A short while ago I asked if there was some place that had a version
of tcsh that uses vfork(). Maybe I should rephrase that. I would like
to know if there is some site which has a version of tcsh that solves
the "csh pgrp problem", one way or another. There really are two
issues which I should keep separate:

  1) I want to know how things work, just because;
  2) I want a fixed tcsh.

I did get one reply.

In article <10941@smoke.BRL.MIL>, gwyn@brl.arpa (Doug Gwyn) writes:
> In article <243@paralogics.UUCP> shaw@paralogics.UUCP (Guy Shaw) writes:
> >>     Stopped (tty output)
> >> Presumably using vfork() forces things to happen in the right order.
> >The idea that using vfork() would cure this problem sounds reasonable to me.
>
> NO! All that using vfork() instead of fork() does in this case is to
> change the multiprocess timing so that the real problem, a race condition
> involving process groups, is less evident. Chris Torek recently posted
> the explanation and suggested fix (set the process group N+1 times in
> an N-process pipeline).

Thank you. I did read Chris Torek's article. I have read these
articles on the "csh pgrp problem" subject:

  <712@skye.ed.ac.uk>, richard@aiai.ed.ac.uk (Richard Tobin), 9 Aug 89
  <19000@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek), 11 Aug 89
  <1127@tukki.jyu.fi>, eloranta@tukki.jyu.fi (Jussi Eloranta), 11 Aug 89
  <920@legato.LEGATO.COM>, mojo@legato (Joseph Moran), 12 Aug 89
  <184@sunquest.UUCP>, terry@sunquest.UUCP (Terry Friedrichsen), 17 Aug 89
  <19143@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek), 18 Aug 89

and your reply has prompted me to go back and read them all again,
to see if I interpret them differently the second time.

  Blimey! This redistribution of knowledge is trickier than I thought.
  [Dennis Moore, mangled a bit]

In article <19000@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> >Presumably using vfork() forces things to happen in the right order.
>
> This analysis is correct (congratulations: discovering this bug is
> rather tricky---the POSIX folks noticed it eventually, but it took
> quite a while).

Chris Torek didn't seem to be saying that vfork() caused incorrect
behavior, only that there is something better.

> The accepted solution is to set the terminal's process group k+1 times
> when there are k children in a pipeline (or k times with the current
> system): once in each child and once in the parent. Setting the pgroup
> to whatever it is already is harmless, and this ensures that the pgroup
> is set by the time it needs to be.

Do you mean do a right-to-left series of TIOCSPGRP ioctl calls,
as well as setpgrp calls?

If I understand correctly, the basic idea is that if you start up a
pipeline, say "a | b | c", then you should proceed from right to left.
So, starting with "c", you should set up EVERYTHING as if "c" were the
only thing you were going to run, without trying to get too clever and
take advantage of the fact that some of the setup of "c" is going to be
overridden in the next stage, right away. You should not short-stroke
any part of it, no matter how short-lived some aspect of the setup of
"c" will be. This includes the process group, and the terminal process
group. Then, proceed to establish the pipeline, "b | c", in the same
way. Then, finally build "a | b | c". This way, the shell never leaves
a pipeline in a state that isn't completely set up to run on its own,
except for reading from a pipe with no producer.

I take it that, the way things are now, process "a" is the only one
that bothers with a TIOCSPGRP. Sorry if I misunderstand this, I have
no source.

> (Most of the mess would go away if process groups were allocated
> by the system, rather than by user code.

Yeah, what he said!

In article <920@legato.LEGATO.COM>, mojo@legato (Joseph Moran) writes:
> Unfortunately, the `simple' fix I know of is to continue to use vfork
> with csh...
> [ . . . ]
> >Presumably using vfork() forces things to happen in the right order.
>
> Exactly - when using vfork the child process gets to run first and
> "borrow the address space" of the parent until the child exec's or
> exit's. After the child exec's or exit's, the parent gets to run after
> it gets its address space back from the child process.
>
> I think that the general lesson to be learned here is to not introduce
> "temporary hack system calls" because it can be hard to later get rid
> of them because some important program(s) either accidentally or
> consciencely depending on the (subtle effects of that) hack.

Well, what I got from these articles is that, although there is an
"accepted solution" *and* there is a "simple fix", which uses a
"temporary hack system call", the "simple fix" would work correctly.

When you (Doug Gwyn) say " ... the real problem, a race condition
involving process groups, is less evident", do you mean that there is
still a chance that vanilla csh from Sun will give me a
"Stopped (tty output)" message, but it just happens less often?

I was left with the impression that I had my choice between two
correct solutions: one that is "the right thing", but for some reason,
I shouldn't expect to see this solution implemented, soon; and a
"simple fix" which nobody likes, but that is somehow simpler, in the
short run, than "the right thing". So, it would be more realistic to
expect to see a version of tcsh available with the "simple fix".

But wait. Why is the "simple fix" simpler than the "accepted
solution"? Now that I reread these articles, I am guessing that, when
Joseph Moran says "the `simple' fix I know of is to continue to use
vfork", he is referring to how much more complicated it is to fix ALL
programs that rely on vfork() semantics in some way. He later says,
"As time went on, we found more places that depended on the subtle
effects of vfork." But I started getting the notion that vfork() was
the simpler fix, even when confining the discussion to fixing tcsh.

So much for trying to understand what is going on; I would like a
tcsh that has this problem fixed and I don't care how. My personal
experience is that tcsh on a Sun 3 runs into this problem frequently,
so this is not just a problem for armchair shell writers. While
running csh, I have not been able to cause this problem to manifest
itself, *even once*. This is the ONLY thing that I have noticed about
tcsh that detracts from its record as an interactive shell which is
superior in every way to vanilla csh. I DO NOT want to go back to
using csh.

If the world must be divided between "scruffies" and "neats", then I
am a split personality. As a "neat", I do prefer correct and
satisfying solutions; but a hack will do, in an emergency. For
instance, I prefer constructivist mathematics, but I don't just
dismiss existence proofs and indirect proofs, especially when that is
all there is, for now.

  "I'll admit, it's not the most satisfying way to conquer the world,
  but I'll take what I can get." -- Dr. Destructo
--
Guy Shaw
Paralogics
paralogics!shaw@uunet.uu.net or uunet!paralogics!shaw
chris@mimsy.UUCP (Chris Torek) (09/09/89)
In article <246@paralogics.UUCP> shaw@paralogics.UUCP (Guy Shaw) writes:
>Chris Torek didn't seem to be saying that vfork() caused incorrect
>behavior, only that there is something better.

Actually, it causes (indirectly) correct behaviour. The C shell runs
a pipeline such as

	a | b | c

by doing the sequence

	<<make pipe 1>>
	if ((pgroup = vfork()) == 0) {
		<<move pipe 1 to stdout>>
		<<set tty pgroup and process group>>
		<<exec a>>
	}
	<<make pipe 2>>
	if (vfork() == 0) {
		<<move pipe 1 to stdin>>
		<<move pipe 2 to stdout>>
		<<set process group>>
		<<exec b>>
	}
	<<close pipe 1>>
	if (vfork() == 0) {
		<<move pipe 2 to stdin>>
		<<set process group>>
		<<exec c>>
	}
	<<close pipe 2>>

Since vfork() suspends the execution of the parent process until the
child process either exec()s or exit()s, the `set tty pgroup' happens
before any of the three child processes actually start running.
Fork(), however, does *not* suspend the parent process, and suddenly
we have a race to see whether the tty pgroup gets set in time.

>>The accepted solution is to set the terminal's process group k+1 times
>>when there are k children in a pipeline (or k times with the current
>>system): once in each child and once in the parent. Setting the pgroup
>>to whatever it is already is harmless, and this ensures that the pgroup
>>is set by the time it needs to be.

>Do you mean do a right-to-left series of TIOCSPGRP ioctl calls,
>as well as setpgrp calls?

Since there is a race, the order is (and must be) irrelevant. The
Bourne shell happens to fork and exec in such an order that, in
`a | b | c', the shell is the parent of processes `a' and `c', but
process c is the parent of process b. The C shell (since it wants to
do job control) makes sure that all three are direct descendents of
the shell itself, and happens to fork `a' first, `b' second, and `c'
third.

>I take it that, the way things are now, process "a"
>is the only one that bothers with a TIOCSPGRP. Sorry if I misunderstand
>this, I have no source.

This is correct, but the source itself is unnecessary in this case.
We can deduce this from the `jobs' command:

	% sleep 1 | sleep 10 | sleep 30 &
	[1] 300 301 302
	% jobs -l
	[1]  - 300 Done                 sleep 1 |
	       301 Running              sleep 10 |
	       302                      sleep 30
	%
	% sleep 10; jobs
	[1]  - Done                 sleep 1 | sleep 10 |
	       Running              sleep 30
	% sleep 20; jobs
	[1]  - Done                 sleep 1 | sleep 10 | sleep 30
	%

This tells us that csh is the direct parent of each element of the
pipeline (since it gets status updates when each child exits). If we
assume that csh uses vfork() (and is deterministic), we can deduce
that csh vfork()s once, the child runs the first command, the parent
unsticks and vfork()s again, the child runs the second command, etc.
The child of the first vfork() must exec the first command, or the
`sleep 1' above would not have the first process ID. Since the first
child of vfork is guaranteed to run first, it must be `a' that sets
the tty pgroup (given that the tty pgroup is set only once).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
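
[Editorial sketch, not from any poster.] The "set it in the child and again in the parent" fix described above might look roughly like the following with fork() and the POSIX calls setpgid() and tcsetpgrp(), which postdate the TIOCSPGRP/setpgrp interfaces named in the thread. The helper names join_job() and spawn_member() are hypothetical, and this is not the csh or tcsh source. For simplicity the parent repeats the calls after every fork, which is more than the k+1 minimum; as noted above, setting the pgroup to what it already is does no harm.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: put `pid` into process group `pgid` and, when
 * tty_fd is a valid descriptor, make that group the terminal's
 * foreground group.  Errors are deliberately ignored: whichever side
 * loses the race may find the work already done (or the child already
 * exec'd), and that is fine.
 */
static void join_job(pid_t pid, pid_t pgid, int tty_fd)
{
    sigset_t ttou, old;

    (void)setpgid(pid, pgid);

    /* A process in a background group that calls tcsetpgrp() is sent
     * SIGTTOU unless it blocks or ignores that signal, so block it. */
    sigemptyset(&ttou);
    sigaddset(&ttou, SIGTTOU);
    sigprocmask(SIG_BLOCK, &ttou, &old);
    if (tty_fd >= 0)
        (void)tcsetpgrp(tty_fd, pgid);
    sigprocmask(SIG_SETMASK, &old, NULL);
}

/*
 * Fork one pipeline member.  *pgid is 0 for the first member, which
 * becomes the group leader; later members join the leader's group.
 * Both the child and the parent call join_job(), so the group is set
 * no matter which of them the scheduler runs first.
 */
static pid_t spawn_member(pid_t *pgid, int tty_fd, char *const argv[])
{
    pid_t pid;

    assert(pgid != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: do not rely on the parent having run yet. */
        join_job(0, *pgid == 0 ? getpid() : *pgid, tty_fd);
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    /* Parent: do not rely on the child having run yet. */
    if (*pgid == 0)
        *pgid = pid;
    join_job(pid, *pgid, tty_fd);
    return pid;
}
```

A shell would call spawn_member() once per command in the pipeline, passing the same pgid variable each time; pipe plumbing is omitted here. Passing -1 as tty_fd skips the terminal handoff, as one would for a background job.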
das@eplunix.UUCP (David Steffens) (09/15/89)
In article <10941@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>In article <243@paralogics.UUCP> shaw@paralogics.UUCP (Guy Shaw) writes:
>>>     Stopped (tty output)
>>> Presumably using vfork() forces things to happen in the right order.
>>The idea that using vfork() would cure this problem sounds reasonable to me.
>NO! All that using vfork() instead of fork() does in this case is to
>change the multiprocess timing so that the real problem, a race condition
>involving process groups, is less evident. Chris Torek recently posted
>the explanation and suggested fix (set the process group N+1 times in
>an N-process pipeline).

First of all, let me say that I haven't read any of the original
articles. Not because I didn't want to, but because I joined this
discussion late and the original articles had already been expunged
from our news machine.

But some experimentation on my Sun4 running tcsh w/o vfork under
SunOS4.0.3 leads me to believe that the above explanation is only
partly correct. The correct part is that vfork gets things to happen
in the right order. The incorrect part is that the race involves
setting of process groups. While it is true that the tty ends up in
the wrong process group, the _real_ race is over which process gets
to run (and possibly finish) first.

Take a simple pipe: ``ls | more''. If the 1st process (ls) finishes
_before_ the 2nd process (more) is completely set up, then one gets
the "Stopped (tty output)" message, otherwise not.

In terms of the source, the palloc() needed to associate the ``more''
process with the job has not been done by the time that the ``ls''
process finishes. Then the wait loop in pwait() thinks the _whole_ job
is done and steals the tty out from under the ``more'' process.
Now the ``more'' process gets to run and set itself up, but too late!
This isn't possible w/ vfork because the child _always_ gets to run 1st.

As of this writing, I don't yet have a fix. Since I might be barking
up the wrong tree, I thought I'd check in with the wizards before
spending a lot of time on it. Is my analysis correct? Am I missing
something?
--
{harvard,mit-eddie,think}!eplunix!das	David Allan Steffens
243 Charles St., Boston, MA 02114	Eaton-Peabody Laboratory
(617) 573-3748				Mass. Eye & Ear Infirmary
chris@mimsy.UUCP (Chris Torek) (09/17/89)
In article <783@eplunix.UUCP> das@eplunix.UUCP (David Steffens) writes:
>Take a simple pipe: ``ls | more''. If the 1st process (ls) finishes
>_before_ the 2nd process (more) is completely set up, then one gets
>the "Stopped (tty output)" message, otherwise not.
>
>In terms of the source, the palloc() needed to associate the ``more''
>process with the job has not been done by the time that the ``ls''
>process finishes. Then the wait loop in pwait() thinks the _whole_ job
>is done and steals the tty out from under the ``more'' process.
>Now the ``more'' process gets to run and set itself up, but too late!
>This isn't possible w/ vfork because the child _always_ gets to run 1st.

This explanation does not hold up, because vfork() does not change
this. Csh (and therefore tcsh) must, and presumably does (since it
works now), hold SIGCHLD during job creation. If it did not, the
following sequence would be possible:

	csh:	vfork
	child:	set up tty pgroup, etc
		execl("ls")
	csh:	resume; stash process for pwait()
		<here csh runs out of its scheduled cpu and gets bumped>
	ls:	read dir, print, exit
		<now csh gets resumed>
	csh:	take signal; pwait() `takes' the whole job

so vfork() would not prevent the same problem. As long as SIGCHLD is
held, however, we have

	csh: fork
	csh: stash process for pwait    | child: set up tty pgroup
	                                | execl("ls")
	csh: fork                       | ls: read, etc
	                                | ls: <exit>
	csh: (parent) signal held       | csh: (child) execl("more")
	stash process for pwait         |
	release signal                  | more: <run>
	take signal, consume ls exit
	status; continue waiting for
	rest of pipeline

Here the actual order of things happening in the child process is
irrelevant (provided, of course, that one fixes the tty pgroup race).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
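
[Editorial sketch, not from any poster.] The "hold SIGCHLD during job creation" idea above can be illustrated with POSIX sigprocmask() and sigsuspend(). Everything named here (job_begin, job_add, job_end_and_wait, the fixed-size job_pids table) is hypothetical and much simpler than csh's palloc()/pwait() bookkeeping; the point is only that no child exit is processed until every member of the job has been recorded.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 16

static pid_t job_pids[MAX_PROCS];           /* hypothetical job table */
static volatile sig_atomic_t job_len;       /* members recorded so far */
static volatile sig_atomic_t reaped_known;  /* exits found in the table */
static volatile sig_atomic_t reaped_stray;  /* exits nobody had recorded */
static sigset_t saved_mask;                 /* mask in force before the hold */

/* Reap every exited child and check whether the job table knew it. */
static void on_sigchld(int sig)
{
    int saved_errno = errno;
    pid_t pid;

    (void)sig;
    while ((pid = waitpid(-1, NULL, WNOHANG)) > 0) {
        int i, found = 0;
        for (i = 0; i < job_len; i++)
            if (job_pids[i] == pid)
                found = 1;
        if (found)
            reaped_known++;
        else
            reaped_stray++;
    }
    errno = saved_errno;
}

/* Install the handler, reset the table, and start holding SIGCHLD. */
static int job_begin(void)
{
    struct sigaction sa;
    sigset_t chld;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    job_len = 0;
    reaped_known = 0;
    reaped_stray = 0;
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    return sigprocmask(SIG_BLOCK, &chld, &saved_mask);
}

/* Fork one member and record it; SIGCHLD is still held, so the
 * handler cannot run before the table entry exists. */
static pid_t job_add(char *const argv[])
{
    pid_t pid;

    assert(job_len < MAX_PROCS);
    pid = fork();
    if (pid == 0) {
        /* The signal mask survives exec, so undo the hold in the child. */
        sigprocmask(SIG_SETMASK, &saved_mask, NULL);
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    if (pid > 0)
        job_pids[job_len++] = pid;
    return pid;
}

/* Release the hold and wait until every recorded member is reaped. */
static void job_end_and_wait(void)
{
    sigset_t wait_mask = saved_mask;

    sigdelset(&wait_mask, SIGCHLD);
    while (reaped_known + reaped_stray < job_len)
        sigsuspend(&wait_mask); /* atomically unblock SIGCHLD and sleep */
    sigprocmask(SIG_SETMASK, &saved_mask, NULL);
}
```

With this arrangement a first member that exits before the second is forked simply leaves a pending SIGCHLD; the handler runs only after job_end_and_wait() releases the hold, by which time the whole job is in the table. This addresses only the bookkeeping race; the tty pgroup race discussed earlier in the thread still needs its own fix.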