karl.kleinpaste@osc.edu (04/23/91)
A couple of people have asked me for some info on how I emulate job control under UNIX System V using ptrace(2). Rather than get involved in a bunch of individual discussions, I thought I'd post it once here and be done with it. Probably forever, in fact, assuming that SVr4 with full job control finally takes over from all other SVr[0-3]. But given that large AT&T-internal installations are still running SVr2, that may take quite some time.

"Ptrace(2) is a disease from which I have recovered."
	--Peter Honeyman in net.unix-wizards, summer 1985.
"The SysV csh has not recovered."
	--me, in man page jobs(1) written for this csh, January 1987.

First, no, I can't distribute source. Csh, especially an old one such as I have used under SysV, is AT&T-restricted, and the work I did on it for job control was done while working for AT&T as well, which leaves it in AT&T's control, too. I once even tried to get this stuff released, along with the line editor I built for it; alas, I ran into what at the time was believed to be an administrative brick wall, that there existed no procedure by which a Bell Labs person could give away such stuff. I tried, really I did.

Job control can be emulated reasonably successfully using the ptrace(2) system call. Every non-set[ug]id process can be initialized for job control by having csh act as though it's going to debug the process, invoking ptrace(0,0,0,0) before execv'ing the child. Because ptrace(2) defeats set[ug]id, you have to check for this first; set[ug]id processes are not job-controllable.

Once having started a job, csh must now deal with signal stoppages on the part of the running child(ren) by examining the reason for the stop. If the stop is due to SIGTRAP (the typical "debugger trace trap"), the child should be restarted; note that every process which is ptrace'd stops on SIGTRAP before it enters main(), so csh must first execv() and then help the process continue thereafter.
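
The startup sequence described above can be sketched roughly as follows. This is not the csh source, which can't be distributed; it is a minimal illustration written against Linux, where PTRACE_TRACEME and PTRACE_CONT are the modern names for ptrace requests 0 and 7. The helper names (mode_is_setid, stop_signal, start_job, restart_child, handle_trap) are all hypothetical. The status decoding uses the traditional wait(2) encoding, in which a low byte of 0177 means "stopped" and the next byte holds the signal; current code would normally use WIFSTOPPED() and WSTOPSIG() instead.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Nonzero if the mode bits mark a set-uid or set-gid file.  Such programs
 * are started untraced and therefore get no job control. */
int mode_is_setid(mode_t mode)
{
    return (mode & (S_ISUID | S_ISGID)) != 0;
}

/* Decode a wait(2) status the traditional way: low byte 0177 means
 * "stopped", the next byte is the stopping signal.  0 if not stopped. */
int stop_signal(int status)
{
    if ((status & 0377) != 0177)
        return 0;
    return (status >> 8) & 0377;
}

/* Hypothetical helper: fork and exec argv[0], traced unless it is set-id.
 * Returns the child's pid, or -1 if fork() fails. */
pid_t start_job(char *const argv[])
{
    struct stat st;
    int traced;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    traced = stat(argv[0], &st) == 0 && !mode_is_setid(st.st_mode);
    pid = fork();
    if (pid == 0) {
        if (traced)
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);  /* request 0 */
        execv(argv[0], argv);
        _exit(127);                                 /* exec failed */
    }
    return pid;
}

/* Restart a stopped child; sig == 0 means "no signal pending". */
long restart_child(pid_t pid, int sig)
{
    return ptrace(PTRACE_CONT, pid, NULL, (void *)(long)sig);
}

/* Call after wait(2) reports a stop.  Swallows the exec-time SIGTRAP and
 * returns 1; returns 0 for any other status, leaving it to the caller. */
int handle_trap(pid_t pid, int status)
{
    if (stop_signal(status) != SIGTRAP)
        return 0;
    restart_child(pid, 0);
    return 1;
}
```

A shell built this way would call start_job() for each command, wait(2) for status changes, and pass each reported status through handle_trap() before looking at any other reason for the stop.
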

I emulate SIGTSTP on SIGQUIT. If the child stops on SIGQUIT and it was the fg job, we just allow it to remain in that state, with appropriate bookkeeping updates on the stopped state and a return to a command prompt. If it was a bg job, we restart it with no signals pending; this is needed because even bg jobs will be hit by the SIGQUIT, due to the nature of the SysV tty driver. Any other signal stop is "normal," so the process is restarted with that signal pending, and the process {ignores, catches, dies due to} it.

My internal structures for keeping track of jobs are unrelated to later BSD csh. I maintain a doubly-linked ring of jobs, each of which contains in turn a doubly-linked ring of specific children which are part of the job. Each child has state associated with it, and the job as a whole has some state as well, regarding number of children, whether any of the job's N children has stopped/exited, whether the user has been notified of job state change, that sort of thing.

A side benefit of having to introduce job control into an environment where it is not expected is that I taught csh to be responsible for saving/restoring tty state. There is no good reason why every single program needs to catch SIGTSTP, put tty state back to normal, and hit itself again with SIGTSTP, then go back to "raw" mode again when restarted. Under SysV, programs don't even know to try (and couldn't prevent it if they did know), so csh takes care of it for them -- if you hit "vi" with SIGQUIT, which I learned to attach to ^Z because it felt right, csh itself will deal with the tty save/restore when you fg it.

You needn't fret that the capability to core-dump a process with SIGQUIT has been lost. It's just that now you require 2 stages. Stop the job once with SIGQUIT, then use the new builtin "core" against it to restart it with SIGQUIT pending, and it'll die, assuming it wasn't catching/ignoring SIGQUIT in the first place, of course.
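
The job/child rings and the stop-handling rules (SIGTRAP, SIGQUIT on a fg or bg job, anything else) might look something like the sketch below. Every name and field here is hypothetical and chosen only for illustration; the real structures were never published and certainly carried more state than this.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

enum action { STAY_STOPPED, RESTART_CLEAN, RESTART_WITH_SIGNAL };

struct child {                  /* one process within a job */
    struct child *next, *prev;  /* ring of this job's children */
    int pid;
    int stopped;
};

struct job {                    /* one command or pipeline */
    struct job *next, *prev;    /* ring of all jobs */
    struct child *children;     /* some member of the child ring, or NULL */
    int nchildren;
    int nstopped;
    int foreground;
    int notified;               /* has the user been told of the change? */
};

/* Link a child into its job's ring, just before the current head. */
void job_add_child(struct job *j, struct child *c)
{
    assert(j != NULL && c != NULL);
    if (j->children == NULL) {
        c->next = c->prev = c;
        j->children = c;
    } else {
        c->next = j->children;
        c->prev = j->children->prev;
        c->prev->next = c;
        c->next->prev = c;
    }
    c->stopped = 0;
    j->nchildren++;
}

/* Decide what to do with child c of job j, which stopped on signal sig. */
enum action on_stop(struct job *j, struct child *c, int sig)
{
    if (sig == SIGTRAP)
        return RESTART_CLEAN;       /* exec-time trace trap */
    if (sig == SIGQUIT) {
        if (!j->foreground)
            return RESTART_CLEAN;   /* bg job caught the tty-wide SIGQUIT */
        if (!c->stopped) {          /* bookkeeping for the emulated SIGTSTP */
            c->stopped = 1;
            j->nstopped++;
            j->notified = 0;
        }
        return STAY_STOPPED;
    }
    return RESTART_WITH_SIGNAL;     /* "normal" stop: deliver the signal */
}
```

In these terms, the "core" builtin would amount to restarting an already stopped child with SIGQUIT pending rather than with no signal.
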
Genuine core dumps via SIGQUIT were infrequent enough that trading easy SIGQUIT death for job control seemed fair.

There are limitations to the emulation. Lacking the BSD tty driver, hitting SIGQUIT affects all tty-attached processes. For processes which create subprocesses, this is Bad, as the subprocs actually die. For this reason, job control can be disabled, and so I would typically issue

	jobs - ; make ; jobs +

when invoking make, since any use of ^Z would nuke the underlying cc, for example. If any die, all die.

Also, there are 2 UNIX kernel bugs in SysV, and one common programming error in many applications, which limit the success of the emulation.

[1] ptrace(2) ought to be able to be shut off. There isn't any way to undo the fact of being a traced process. There ought to be, and cbrma.att.com has (or had) it, by allowing "ptrace(-1,0,0,0)" as a trace-disable.

[2] Parents aren't properly informed of the child's stoppage. In the kernel routine stop(), when a traced process is being stopped, it merely issues wakeup(parent). This is fine for debuggers, which typically start one process and then stare intently at it via wait(2), waiting for it to do something untoward. But it's not general enough for something that wants, e.g., to start an arbitrary number of children with tracing enabled, and go on to some other more interesting or more important task until one of the children gets stuck somehow. What stop() ought to do is to call psignal(parent, SIGCLD), and again cbrma.att.com can do it right, but probably no other SysV box in the world does.

These two bugs I reported a long long time ago, but I don't think the fixes ever made their way into the SysV kernel. A pity, that. The 1st one is fixed in 5 lines; the 2nd, 1 line.

[3] Many processes, including such fundamental things as cat(1), do not deal with read(2) returning -1/EINTR properly. cat just sees the -1 and departs.
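
What item [3] asks of applications is small: treat -1 with errno set to EINTR as "try again" rather than as a fatal error. A minimal sketch, with hypothetical function names read_retry and copy_fd, and no claim that any particular cat(1) is written this way:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Like read(2), but retries when a signal interrupts the call. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Copy everything from fd `in` to fd `out`, as a careful cat(1) might.
 * Returns 0 at end of file, -1 on a real error. */
int copy_fd(int in, int out)
{
    char buf[4096];
    ssize_t n, off, w;

    while ((n = read_retry(in, buf, sizeof buf)) > 0) {
        for (off = 0; off < n; off += w) {
            w = write(out, buf + off, (size_t)(n - off));
            if (w == -1) {
                if (errno != EINTR)
                    return -1;
                w = 0;      /* interrupted: retry the same bytes */
            }
        }
    }
    return n == 0 ? 0 : -1;
}
```

A program whose main loop is copy_fd(0, 1) survives being stopped and restarted, because the interrupted read(2) is simply reissued.
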
When job-controlling a cat command, and hitting it with SIGQUIT, the read(2) is broken even if the job is restarted with no signals pending ("fg %1"). Boo-hiss, poor programming practice.

Item 3 is something you just have to remember and put up with. Item 2 has to be worked around in some cases; for example, just saying

	% foo &

will result in a SIGTRAP-stopped "foo" process while csh sits at its next prompt. Running any other command, such as sync(1), will cause csh to take care of the problem by restarting the "foo" process. With the kernel fix in place, csh gets the SIGCLD notification right away, and restarts it on its own.

Item 1 isn't really necessary, but it's nice to reduce the number of process interactions when you have a csh under a csh (e.g., you have su'd), and the sub-csh should be able to turn off tracing, since it is smart enough to deal with matters itself.

There's some other stuff about this csh that I like, but those are the high points about job control emulation. It's doable, and it's just weird to me that no one else ever tried to do it this way.

--karl