[net.unix-wizards] problems with BPT/TRACE traps

rws@mit-bold.ARPA (04/28/84)
From:  "Robert W. Scheifler" <rws@mit-bold.ARPA>

Description:
	We have been developing an in-process debugger, using the VAX
	BPT instruction and the trace bit mechanism, with the signals
	handled in the same (user) process.  This works fine by itself.
	We are also using keyboard generated SIGQUITs to interrupt the
	program and get the debugger's attention, so the user can poke
	around and then ultimately continue execution.  This also works
	fine by itself.

	However, combining the two mechanisms got us into trouble.  The
	basic problem is that SIGTRAP needs to be handled synchronously,
	but SIGQUIT (and others) can preempt it.  Signals are not handled
	on a first-come-first-served basis, but on a "find-first-set" basis,
	which means lowest signal number first.
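
	A minimal sketch of that selection rule, not the kernel's actual
	code: first_pending() and SIGBIT() are hypothetical names, and the
	signal numbers are the usual 4.2BSD values (SIGQUIT 3, SIGTRAP 5).
	With both signals pending, the lower-numbered SIGQUIT is picked
	even though SIGTRAP was posted first.

```c
#include <stdint.h>

/* Signal numbers as found in 4.2BSD <signal.h>. */
#define SIGQUIT 3
#define SIGTRAP 5

/* Bit for signal n in a pending mask: signal 1 lives in bit 0. */
#define SIGBIT(n) ((uint32_t)1 << ((n) - 1))

/*
 * Return the lowest-numbered pending signal, or 0 if none is pending.
 * Hypothetical stand-in for a find-first-set scan; arrival order is
 * not recorded anywhere, so it cannot influence the choice.
 */
int
first_pending(uint32_t pending)
{
	int n;

	for (n = 1; n <= 32; n++)
		if (pending & SIGBIT(n))
			return n;
	return 0;
}
```

	Calling first_pending(SIGBIT(SIGTRAP) | SIGBIT(SIGQUIT)) yields
	SIGQUIT regardless of which bit was set first.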

	So the scenario is this.  A BPT trap takes place, and a
	psignal(SIGTRAP) takes place in trap().  Just then the user
	types the quit character, and you eventually get to ttyinput(),
	which does a gsignal(SIGQUIT).  Now we continue on inside
	trap(), doing "if (ISSIG(p)) psig()", and the signal chosen
	is SIGQUIT, surprise.  So we hack the stack for SIGQUIT,
	and go off to the first instruction of the signal trampoline code.
	However, the earlier psignal() did an aston(), so at this point
	we take the AST, and we are back in trap() doing another
	"if (ISSIG(p)) psig()", and so we hack the stack for SIGTRAP,
	only now, lo and behold, the PC is no longer at the BPT instruction,
	but at the start of the signal trampoline code instead,
	which is mighty confusing.

	But, you say, the solution is of course to mask out SIGTRAP inside
	of SIGQUIT.  But, I say, there are two problems with this.  The
	first, which I can live with, is that then the SIGQUIT handler
	can't be debugged.  The bigger problem is that it still doesn't work.
	There is an "extraneous" REI in the signal trampoline code (that I
	have complained about before for a different reason).  This REI is
	executed on the way out of a handler, and is a one instruction
	bridge back to user code that gets executed WITHOUT the signal mask
	defined by the handler.  So even if you mask SIGTRAP inside SIGQUIT,
	you simply change the PC at the time of the SIGTRAP to be at the
	REI rather than the CALLS in the trampoline code.

	Our solution to this problem was to notice that, if the SIGTRAP
	handler does nothing, the BPT instruction will be executed again
	and we will get another trap.  So, we don't mask SIGTRAP inside
	SIGQUIT, and in the SIGTRAP handler we check the PC, and if it's
	in the trampoline code, we just return and let the BPT execute
	again.
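
	The check in the SIGTRAP handler might look roughly like the
	sketch below.  Everything here is illustrative: struct tramp_range,
	pc_in_trampoline() and classify_bpt_trap() are hypothetical names,
	and how the debugger learns the trampoline's address and length is
	assumed rather than shown.

```c
#include <stdint.h>

/* Hypothetical description of where the signal trampoline code lives. */
struct tramp_range {
	uintptr_t start;	/* address of first trampoline byte */
	uintptr_t len;		/* length of trampoline in bytes */
};

/* What the SIGTRAP handler should do with a breakpoint trap. */
enum trap_action { TRAP_IGNORE, TRAP_ENTER_DEBUGGER };

/* Nonzero if pc falls inside the trampoline code. */
int
pc_in_trampoline(const struct tramp_range *t, uintptr_t pc)
{
	return pc >= t->start && pc - t->start < t->len;
}

/*
 * If the reported PC is in the trampoline, the trap was preempted by
 * another signal: return without doing anything, so that the BPT is
 * executed again and traps with a sensible PC.
 */
enum trap_action
classify_bpt_trap(const struct tramp_range *t, uintptr_t pc)
{
	if (pc_in_trampoline(t, pc))
		return TRAP_IGNORE;
	return TRAP_ENTER_DEBUGGER;
}
```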

	Having taken a BPT, we need to reinstall the actual instruction,
	execute it using the T-bit, and then reinsert the BPT instruction.
	Once again, the PC you get in the SIGTRAP handler can be bogus.
	Just returning won't work, however, because the T-bit has been cleared
	and you won't get another trap.  Fortunately, the debugger can know it
	is expecting a T-bit trap, and can save away the correct PC, and ignore
	the PC reported by the kernel.
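
	One way to keep that bookkeeping, sketched under assumptions:
	struct step_state, arm_step() and resolve_trap_pc() are
	hypothetical names, and the saved value is whatever address the
	debugger recorded before setting the T-bit.

```c
#include <stdint.h>

/* Hypothetical per-process single-step state kept by the debugger. */
struct step_state {
	int       expecting_trace;	/* nonzero while a T-bit trap is due */
	uintptr_t saved_pc;		/* PC recorded when the step was armed */
};

/* Record the PC just before resuming with the T-bit set. */
void
arm_step(struct step_state *s, uintptr_t pc)
{
	s->expecting_trace = 1;
	s->saved_pc = pc;
}

/*
 * Called from the SIGTRAP handler.  While a trace trap is expected,
 * trust the saved PC and ignore the one the kernel reports, which may
 * point into the trampoline code.  Otherwise use the reported PC.
 */
uintptr_t
resolve_trap_pc(struct step_state *s, uintptr_t reported_pc)
{
	if (s->expecting_trace) {
		s->expecting_trace = 0;
		return s->saved_pc;
	}
	return reported_pc;
}
```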

	Actually, as it turns out, you CAN get multiple SIGTRAPs from setting
	the T-bit.  I don't think this was intended.  A fix is provided below.
Repeat-By:
	See above.
Fix:
	In trap(), in trap.c, change
		case T_TRCTRAP+USER:	/* trace trap */
			locr0[PS] &= ~PSL_T;
	to
		case T_TRCTRAP+USER:	/* trace trap */
			locr0[PS] &= ~(PSL_T|PSL_TP);

	In sendsig(), in machdep.c, change:
		regs[PS] &= ~(PSL_CM|PSL_FPD);
	to
		regs[PS] &= ~(PSL_CM|PSL_FPD|PSL_T|PSL_TP);

	I didn't bother to figure out if both changes are necessary, but
	they can't hurt.
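
	To illustrate what the wider mask does, here is a small sketch.
	The bit values are my recollection of the VAX PSL layout (T in
	bit 4, TP in bit 30) and should be checked against psl.h;
	clear_trace_bits() is a hypothetical helper, not a kernel
	routine.  Clearing only PSL_T leaves the trace-pending bit set,
	which is presumably how a second trap can still occur.

```c
#include <stdint.h>

/* Assumed VAX PSL bit values; verify against the system's psl.h. */
#define PSL_T	0x00000010u	/* trace enable */
#define PSL_TP	0x40000000u	/* trace pending */

/* Clear both trace enable and trace pending, as in the fix. */
uint32_t
clear_trace_bits(uint32_t ps)
{
	return ps & ~(PSL_T | PSL_TP);
}
```

	With ps = PSL_T | PSL_TP, masking with ~PSL_T alone still leaves
	PSL_TP set, while clear_trace_bits(ps) returns 0.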