[net.unix-wizards] 4.1BSD ^Z/tty mode problem

wales@ucla-cs.UUCP (02/09/85)

Here at UCLA, we have noted a problem with ^Z and processes that change
the terminal modes.  We are running LOCUS (a distributed system built on
top of 4.1BSD), but the problem in question seems to exist in vanilla
4.1 as well.  I would like to know whether anyone else has seen this
problem and can propose a fix for it.

The problem happens when you fork a process from /bin/csh, and that
process in turn forks a second process which sets non-standard TTY modes.
(For example, a user runs a mail-sending program, which in turn forks a
full-screen text editor.)

The symptom of the problem is that, when you type a ^Z to suspend the
pair of processes and go back to your shell, the TTY modes are
occasionally not restored properly.  Instead, you sometimes end up back in the
shell with the same "non-standard" modes as were set by the "grandchild"
process (e.g., "cbreak" mode with echo off).  If you then put the
suspended process group back into the foreground (via "%" or "fg"), it will
immediately stop again, and this time you are returned to the shell with
the correct TTY modes set.

Here is my theory as to what is happening.  (For convenience' sake, I
will call the process directly forked from the shell "process A", and
the second process -- forked from process A -- I will call "process B".)

(1) Process "A" does not have a signal-catcher for the SIGTSTP signal
    (which is what ^Z generates).  Process "B", on the other hand, has
    a signal-catcher for SIGTSTP which restores the original TTY modes
    before stopping.

    Presumably, process "B" uses code similar to that shown in the
    "jobs(3)" manual article.

(2) When the user types a ^Z, both process "A" and process "B" receive
    a SIGTSTP signal.

    (a) When "B" gets a SIGTSTP, it restores the TTY modes and throws
	another SIGTSTP to both itself and "A" (via a "kill(0,SIGTSTP)"
	call).

	This means that "A" is getting SIGTSTP'ed twice -- once from the
	^Z, and once from "B" -- but I don't THINK this should be
	causing any problems here, since both signals should get recorded
	before "A" gets scheduled again.

    (b) When "A" gets a SIGTSTP, it stops right away (since it has no
	SIGTSTP signal-catcher).

    (c) The shell catches the SIGCHLD signal which is generated when "A"
	stops, notes that "A" has stopped (the shell, of course, neither
	knows nor cares anything about "B"), and reclaims the TTY via a
	TIOCSPGRP "ioctl" call.

    Ideally, process "B" should restore the TTY modes before anything
    else happens.  HOWEVER, if the scheduler starts up "A" before "B" --
    AND subsequently starts up the shell to process the SIGCHLD signal
    before "B" can change the modes -- then the shell is left with B's
    funny TTY modes.  ("B" in this case will get slapped with a SIGTTOU
    signal when it tries to change the TTY modes, because it is no
    longer in the TTY's controlling process group once the shell starts
    up again.  This SIGTTOU, by the way, appears to get sent
    irrespective of whether "stty tostop" is in effect on the TTY in question.)

(3) When the user puts "A" back into the foreground (via "%" or "fg"):

    (a) Process "A", at the time the ^Z was typed, was waiting for "B"
	to finish -- and once restarted, it will resume said waiting.

    (b) Process "B" will go ahead and change the TTY modes back to their
	original state, and then send a SIGTSTP to its process group --
	stopping both itself and "A".

    (c) The shell will get a SIGCHLD signal, discover that its child "A"
	has stopped, and start up again.  This time, however, the TTY
	modes will be correct.
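For reference, the step (1) catcher might look roughly like the sketch below.  It is a minimal sketch assuming a modern POSIX system, not the jobs(3) code itself: termios and sigaction stand in for the 4.1 sgtty ioctls and sigset(), and on_tstp, set_tstp, enter_funny_mode, saved_modes, and funny_modes are hypothetical names.  Note that it does nothing about the race in step (2): "A" still receives the ^Z-generated SIGTSTP directly from the kernel, so the shell can still reclaim the TTY before the tcsetattr() call at the top of on_tstp runs.

```c
/* Sketch only: POSIX termios/sigaction stand in for 4.1BSD sgtty/sigset. */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical state for this sketch: the terminal, the user's own
   modes, and the "funny" (cbreak, no-echo) modes the program runs in. */
static int tty_fd = STDIN_FILENO;
static struct termios saved_modes;
static struct termios funny_modes;

/* Install a SIGTSTP disposition (sigaction is async-signal-safe). */
static int set_tstp(void (*handler)(int))
{
    struct sigaction sa;
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGTSTP, &sa, NULL);
}

/* jobs(3)-style catcher for process "B": put the user's modes back,
   stop the whole process group, and re-enter funny mode on resume. */
static void on_tstp(int sig)
{
    int saved_errno = errno;
    sigset_t tstp;
    (void)sig;

    tcsetattr(tty_fd, TCSADRAIN, &saved_modes);  /* restore user's modes */

    set_tstp(SIG_DFL);                      /* next SIGTSTP just stops us */
    sigemptyset(&tstp);
    sigaddset(&tstp, SIGTSTP);
    sigprocmask(SIG_UNBLOCK, &tstp, NULL);  /* blocked inside its own handler */
    kill(0, SIGTSTP);                       /* stop every process in the pgrp */

    /* Execution continues here after SIGCONT ("%" or "fg"). */
    set_tstp(on_tstp);
    tcsetattr(tty_fd, TCSADRAIN, &funny_modes);
    errno = saved_errno;
}

/* Save the user's modes, install the catcher, and switch to
   cbreak/no-echo.  Returns -1 if fd is not a terminal or a call fails. */
int enter_funny_mode(int fd)
{
    if (tcgetattr(fd, &saved_modes) == -1)
        return -1;
    funny_modes = saved_modes;
    funny_modes.c_lflag &= ~(tcflag_t)(ICANON | ECHO);
    funny_modes.c_cc[VMIN] = 1;
    funny_modes.c_cc[VTIME] = 0;
    tty_fd = fd;

    if (set_tstp(on_tstp) == -1)
        return -1;
    return tcsetattr(fd, TCSADRAIN, &funny_modes);
}
```

A program following this pattern would call enter_funny_mode(STDIN_FILENO) once at startup; from then on the catcher restores the user's modes before stopping and re-enters the funny modes after "fg".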

MY QUESTIONS FOR THE GROUP:

(1) Has anyone else (on either a 4.1 or a 4.2 system) ever seen this
    problem?

(2) Is the above analysis of what is causing the problem correct?  If
    not, what in fact is happening?

(3) Has anyone ever fixed their kernel to get around the problem?  If
    there is a fix in 4.2, could someone please describe the basic idea
    behind the fix and point me to the appropriate part of the kernel?
    (I do have access to the 4.2 sources, by the way.)

(4) Does anyone know of a way to fix the application program(s) in
    question (process "B" in my example above -- the program that plays with
    the TTY modes) so that this problem will not occur, even assuming no
    changes to the 4.1 kernel?

Please, by the way -- no flames about why we aren't running 4.2.  When
the LOCUS project was started, 4.2 was naught but a gleam in Bill Joy's
eye :-} -- and when 4.2 did finally come out, LOCUS was already in
production, and the effort required to reimplement it on top of 4.2 would
have been infeasibly monumental.  As a result, comments of the form

	    This was fixed in 4.2, so what's the problem?

are useful only to the extent that the fix can be back-ported into our
existing, 4.1-based LOCUS system.
-- 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                                                             Rich Wales
                           University of California, Los Angeles (UCLA)
                                            Computer Science Department
                                                      3531 Boelter Hall
                                   Los Angeles, California 90024 // USA
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Phone:    (213) 825-5683 // +1 213 825 5683
ARPANET:  wales@UCLA-LOCUS.ARPA
UUCP:     ...!{cepu,ihnp4,trwspp,ucbvax}!ucla-cs!wales
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

lepreau@utah-cs.UUCP (Jay Lepreau) (02/12/85)

1. Yes, this problem exists on both 4.1 and 4.2 and I've seen it (in "ded").
2. Your analysis is pretty much what I came up with (but not quite so
clearly!).
3. I wouldn't really call it a kernel problem, though, because:
4. I solved it in my application by disabling t_*suspc (that is, t_suspc
and t_dsuspc) when in a funny tty mode in the child (B).  This allows the
child to read the ^Z itself, reset its tty modes, and then issue the TSTP
to the pgrp, avoiding having the kernel issue the TSTP to the whole pgrp,
with the attendant problems.  (Note that this problem doesn't occur in raw
mode for the same reason my solution works, so all those raw-mode screen
applications don't see it.  Must be cbreak.)  Anyone have other ways?
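A rough rendering of this approach in modern terms follows.  It is a sketch assuming POSIX termios, where c_cc[VSUSP] plays the role of t_suspc and c_cc[VDSUSP] (on systems that have it) the role of t_dsuspc; the function names are hypothetical, and this is not the code from "ded".  With the suspend character disabled, the kernel never sends SIGTSTP on ^Z; the program sees the character as ordinary input, restores the user's modes, and only then stops the whole process group.

```c
/* Sketch only: POSIX termios stands in for 4.x BSD struct ltchars
   (t_suspc/t_dsuspc).  Function names are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <termios.h>
#include <unistd.h>

/* Derive cbreak/no-echo modes from the user's modes, with the suspend
   character disabled so that ^Z arrives as ordinary input instead of
   making the kernel send SIGTSTP to the whole process group.  ISIG is
   left on, so ^C and friends still work, as in cbreak mode. */
void make_funny_modes(const struct termios *orig, struct termios *out)
{
    *out = *orig;
    out->c_lflag &= ~(tcflag_t)(ICANON | ECHO);
    out->c_cc[VMIN] = 1;
    out->c_cc[VTIME] = 0;
    out->c_cc[VSUSP] = _POSIX_VDISABLE;   /* roughly t_suspc */
#ifdef VDSUSP
    out->c_cc[VDSUSP] = _POSIX_VDISABLE;  /* roughly t_dsuspc (BSD only) */
#endif
}

/* True if c is the user's normal suspend character (usually ^Z). */
int is_suspend_char(unsigned char c, const struct termios *orig)
{
    cc_t susp = orig->c_cc[VSUSP];
    return susp != (cc_t)_POSIX_VDISABLE && c == susp;
}

/* Called when the program reads the suspend character itself: restore
   the user's modes first, then stop the whole process group ("A" and
   "B"), then re-enter funny mode once continued. */
void suspend_self(int fd, const struct termios *orig,
                  const struct termios *funny)
{
    tcsetattr(fd, TCSADRAIN, orig);
    signal(SIGTSTP, SIG_DFL);   /* make sure the default action (stop) applies */
    kill(0, SIGTSTP);
    /* Execution continues here after "%" or "fg". */
    tcsetattr(fd, TCSADRAIN, funny);
}
```

In the program's input loop, something like `if (is_suspend_char(c, &orig)) suspend_self(fd, &orig, &funny);` would take over the job the kernel normally does for ^Z.  Because "A" is stopped only by B's own kill(0, SIGTSTP), and that call comes after tcsetattr() has returned, the user's modes are already back in place by the time the shell sees "A" stop.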

Jay Lepreau, lepreau@utah-cs, {ihnp4,decvax}!utah-cs!lepreau