[net.micro.pc] Suspected "popf" bug in Intel 80286

tom@vcvax1.UUCP (04/18/86)

While experimenting with an asynchronous communication driver
for VENIX (in protected mode) on the IBM PC/AT, I
encountered some rather strange behavior that I now
attribute to a bug in the Intel 80286 processor.  In
brief, I suspect that the "popf" instruction enables
interrupts under certain circumstances even though the
IF flag is 0 before the instruction is executed and set
to 0 by the "popf" instruction itself. 

Because "popf" is so often used successfully,
I would like to hear from others about whether they
have encountered the same or a similar problem.
If so, I would like to know how they programmed around it.
My concern is that I'm either mistaken and there is no
hardware bug or that I'm correct and the fix that
I found is not applicable to all processor lots.

			The Evidence

The problem expresses itself in the asynchronous driver by
a loss of characters on OUTPUT.  Only single characters are
lost every now and then.  The problem can express itself when
kernel printf's indicate that no interrupts other than transmit
interrupts are occurring.  I have tried 2 different AT's 
with the same results.  

The reason for the character loss turns out to be due
to a completion interrupt occurring while the tty startup
routine is running.  The startup routine disables interrupts 
and stuffs a character into the transmit buffer.  But for
some reason, the 80286 allows the COM port to interrupt
causing the transmit interrupt routine to overwrite the character
just put in the buffer.  Kernel printf's triggered by sanity
checks in the driver indicate that the instruction being
interrupted is a "popf" and that the IF flag before and after the
"popf" is 0. 

A number of obvious checks were done to rule out programmer
error.  For example, after a kernel was demonstrated to show
the bug, I dumped the code segment of the running kernel and
compared it to the object file.  No difference.
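
For readers who want the failure mode spelled out, here is a minimal
sketch in C of the race described above.  It is a toy model, not VENIX
source: the struct, the function names, and the "leak" parameter are
all made up for illustration.  It only shows why a transmit interrupt
arriving inside the startup routine costs exactly one character.

```c
/* Toy model, not VENIX source: every name here is hypothetical. */
struct port {
	char thr;	/* transmit holding register */
	int thr_full;	/* set until the character has been sent */
	int lost;	/* characters overwritten before they were sent */
};

void put_thr(struct port *p, char c)
{
	if (p->thr_full)
		p->lost++;	/* previous character never made it out */
	p->thr = c;
	p->thr_full = 1;
}

/* Transmit interrupt routine: send the next queued character, if any. */
void tx_intr(struct port *p, const char **q)
{
	if (**q != '\0')
		put_thr(p, *(*q)++);
}

/* Startup routine; "leak" models an interrupt recognized despite IF = 0. */
int lost_after_start(int leak)
{
	struct port p = { 0, 0, 0 };
	const char *q = "ab";

	put_thr(&p, *q++);		/* interrupts are supposed to be off here */
	if (leak)
		tx_intr(&p, &q);	/* interrupt sneaks in at the "popf" */
	return p.lost;
}
```

With leak set to 0 nothing is lost; with leak set to 1 the model loses
one character, which matches the "single characters now and then"
symptom.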

			The Fix

Since the source of the interrupt was always a particular "popf"
(in the splx() routine), I concentrated on recoding the kernel
where the "popf" occurs.  To convince myself that the symptoms
were not due to a bug in the loader, I recoded the kernel using
"adb" on the object file and then booted the modified object file.
Therefore, changes in behavior are directly correlated with
code changes (and not with changes in linking the kernel, compilation,
size of kernel, etc.).

The following code sequences for the routine splx() were tried and FAILED.
The first is the original:

1)	pop	cx	| return address
	popf		| new flags
	pushf		| dummy arg
	jmp	cx	| return

2)	pop	cx
	nop, nop, nop, nop	| Padding for timing.
	popf
	pushf
	jmp	cx

3)	pop	cx
	popf
	push	cx		| Could "pushf" be messing up "popf" ?
	jmp	cx

4)	pop	cx
	popf
	push	cx
	push	cx		| Could "jmp cx" be messing up "popf"?
	ret

5)	pop	cx
	xor	ax,ax		| Dummy register access.
	popf
	push	cx
	jmp	cx

6)	pop	cx
	mov	ax,#0		| Dummy memory access.
	popf
	pushf
	jmp	cx

Perhaps the reasons for testing the above code will be made clear
by the following coding of splx() that FIXES the problem:

7)	pop	cx
	pop	ax
	push	ax
	test	ax,#0x200	| Don't use popf!
	bne	Lsplx
	cli
	jmp	cx
Lsplx:	sti
	jmp	cx

8)	pop	cx
	pop	ax
	and	ax,#0x7FD5	| Mask off "don't care" bits.
	push	ax
	popf
	pushf
	jmp	cx

9)	pop	cx
	pop	ax
	and	ax,#0xFFFF	| Does masking really work?
	push	ax
	popf
	pushf
	jmp	cx

10)	pop	cx
	pop	ax		| Is it the "and" that does it?
	push	ax
	popf
	pushf
	jmp	cx 

11)	mov	bx,sp
	push	*2(bx)
	popf
	ret

Remember that the kernel was patched with each of the above
codings of splx().  Those that worked, worked for as long as I
watched them (several minutes at 9600 baud).  Those that failed,
failed every few seconds or so.  The interrupted instruction was
always the "popf".
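
As a reading aid for the constants in sequences 7 and 8, here is a
small sketch in C.  The macro and function names are illustrative
only and do not come from any VENIX header; the bit positions are the
standard 80286 FLAGS layout.  It shows that 0x7FD5 is simply every
defined flag bit OR'ed together, and that 0x200 is the IF bit that
sequence 7 tests.

```c
/* 80286 FLAGS bits; names are illustrative, not from a VENIX header. */
#define FL_CF	0x0001	/* carry */
#define FL_PF	0x0004	/* parity */
#define FL_AF	0x0010	/* auxiliary carry */
#define FL_ZF	0x0040	/* zero */
#define FL_SF	0x0080	/* sign */
#define FL_TF	0x0100	/* trap */
#define FL_IF	0x0200	/* interrupt enable */
#define FL_DF	0x0400	/* direction */
#define FL_OF	0x0800	/* overflow */
#define FL_IOPL	0x3000	/* I/O privilege level, bits 12-13 */
#define FL_NT	0x4000	/* nested task */

/* Every defined bit; the remaining bits are the "don't care" ones. */
unsigned defined_flags(void)
{
	return FL_CF | FL_PF | FL_AF | FL_ZF | FL_SF | FL_TF |
	       FL_IF | FL_DF | FL_OF | FL_IOPL | FL_NT;
}

/* The decision sequence 7 makes: nonzero means "sti", zero means "cli". */
int wants_interrupts(unsigned flags)
{
	return (flags & FL_IF) != 0;
}
```

defined_flags() evaluates to 0x7FD5, the mask used in sequence 8.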

Tom Scott
VenturCom, Inc.
..!seismo!harvard!cybvax0!vcvax1!tom

cck@cucca.UUCP (Charlie C. Kim) (04/21/86)

In article <179@vcvax1.UUCP> tom@vcvax1.UUCP (tom) writes:
> ...
>encountered some rather strange behavior that I now
>attribute to a bug in the Intel 80286 processor.  In
>brief, I suspect that the "popf" instruction enables
>interrupts under certain circumstances even though the
>IF flag is 0 before the instruction is executed and set
>to 0 by the "popf" instruction itself. 
> ...
>
>Tom Scott
>VenturCom, Inc.
>..!seismo!harvard!cybvax0!vcvax1!tom

Believe it or not, this is a documented "feature" or misfeature of the
286 processor (in both protected and real modes).  See page 9-6 of the
IBM PC/AT Technical Reference or (I remember seeing it here) the
appropriate Personal Computer Seminar Proceedings to find out the
exact conditions under which it can occur, but it all has something to
do with the condition "CPL <= IOPL" holding true (I honestly have no
idea what they are talking about here.  I suppose it is some internal
chip condition).

The documented workaround is:

	jmp L1		; jump around iret
L2:	iret		; pop cs,ip, flags
L1:	push cs		; push cs
	call L2		; call near
	....		; program to continue here

which would work for any 80* cpu. They suggest coding this as a macro
which makes sense. (I guess you could replace the call near with a call
far and drop the push cs).  I would probably have programmed it as:

	push cs		; save fake cs
	push L2		; push return point (286 only)
	iret		; really just jump ahead one instr with popped flags
L2:	...		; continuation point

if it wasn't going to have to run on anything but a 286 or 186 (push
immediate was new with these) so things at least look linear!
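
To see why the iret trick stands in for popf, here is a rough sketch
in C that models only the stack traffic.  It is a toy: there is no
segmentation, no privilege checking and no timing, the struct and
function names are invented for the example, and the model stack
grows upward purely for simplicity.  It assumes a same-privilege iret,
which pops IP, then CS, then FLAGS.

```c
/* Toy model only: no segmentation, no privilege checks, no real timing. */
struct cpu {
	unsigned short stack[8];
	int sp;			/* index of the next free slot */
	unsigned short ip, cs, flags;
};

void push16(struct cpu *c, unsigned short v)
{
	c->stack[c->sp++] = v;
}

unsigned short pop16(struct cpu *c)
{
	return c->stack[--c->sp];
}

/* Same-privilege IRET: pops IP, then CS, then FLAGS. */
void iret(struct cpu *c)
{
	c->ip = pop16(c);
	c->cs = pop16(c);
	c->flags = pop16(c);
}

/* Used where popf would be: the new flags image is already on the stack. */
void popf_via_iret(struct cpu *c, unsigned short return_ip)
{
	push16(c, c->cs);	/* push cs */
	push16(c, return_ip);	/* "call L2" pushes the next instruction's address */
	iret(c);		/* L2: iret */
}
```

After popf_via_iret() the flags image that was on top of the stack is
in the flags register, CS is unchanged, and execution continues at the
instruction after the call, which is the net effect of a popf.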

I'm sure you can come up with a dozen different ways to accomplish the
same thing now that you know the problem exists.

Charlie C. Kim
User Services

maddox@renoir.berkeley.edu (William Maddox) (04/22/86)

In article <179@vcvax1.UUCP> tom@vcvax1.UUCP (tom) writes:
>
>While experimenting with an asynchronous communication driver
>for VENIX (in protected mode) on the IBM PC/AT, I
>encountered some rather strange behavior that I now
>attribute to a bug in the Intel 80286 processor.  In
>brief, I suspect that the "popf" instruction enables
>interrupts under certain circumstances even though the
>IF flag is 0 before the instruction is executed and set
>to 0 by the "popf" instruction itself. 

This is a known incompatibility between the 80286 and other members of
the 8086 family.  See pages 9-6 and 9-7 of the AT Technical Reference
for details.  I quote: 

	If the system microprocessor executes a POPF instruction in
	either the real or the virtual address mode with CPL <= IOPL,
	then a pending maskable interrupt (the INTR pin active) may be
	improperly recognized after executing the POPF instruction even
	if maskable interrupts were disabled before the POPF instruction
	and the value popped had IF = 0.

IBM suggests the following code sequence as a compatible
replacement:

	JMP	$+3
	IRET
	PUSH	CS
	CALL	$-2

----------------------------------------------------------
Bill Maddox
ucbvax!renoir!maddox

"Lisp programmers know the value of everything but the cost of nothing."
							- Alan Perlis

john@inthap.UUCP (John Casey) (04/23/86)

> 
> While experimenting with an asynchronous communication driver
> for VENIX (in protected mode) on the IBM PC/AT, I
> encountered some rather strange behavior that I now
> attribute to a bug in the Intel 80286 processor.  In
> brief, I suspect that the "popf" instruction enables
> interrupts under certain circumstances even though the
> IF flag is 0 before the instruction is executed and set
> to 0 by the "popf" instruction itself. 

Early 80286 chips (B step) did contain a problem like the one
described here. These chips can be identified by the copyright
notice marked on the chip. Chips bearing the notice:

(C) INTEL '83

may suffer from the popf problem. Chips bearing a later copyright date
(84 or beyond) do not suffer from this problem. While the B step 80286
has not been produced by Intel for some time, it may still exist in some
older 80286-based machines such as early ATs.

Since there are obviously machines out there with these older chips
let me further describe both the problem and the workarounds.

If a B step 80286 executes a POPF instruction while interrupts are disabled,
a pending maskable interrupt (INTR pin active) may be improperly recognized
after executing the POPF instruction even if maskable interrupts were disabled
before the POPF instruction and the value popped had IF=0. If the interrupt
is improperly recognized, it is processed correctly.

The problem occurs in B step 80286s executing a POPF instruction while
interrupts are disabled in either Real Address Mode, or in Protected Mode
with CPL <= IOPL. Executing in Protected Mode with CPL <= IOPL implies
that the processor's Current Privilege Level (CPL) is numerically less than
or equal to the IO Privilege Level (IOPL), that is, the currently executing
code is privileged enough to enable/disable interrupts and do IO. 
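
The CPL <= IOPL condition can be written out as a short sketch in C.
The function names are hypothetical.  The sketch relies on two facts
about protected mode: the CPL is the low two bits of the current CS
selector, and IOPL occupies bits 12-13 of FLAGS.  It says nothing
about whether a given chip has the problem, only whether the code is
running at a level where the condition holds.

```c
/* Illustrative helpers; the names are not from any Intel or VENIX source. */

/* IOPL lives in bits 12-13 of the FLAGS word. */
unsigned iopl_of(unsigned flags)
{
	return (flags >> 12) & 3;
}

/* In protected mode the CPL is the low two bits of the CS selector. */
unsigned cpl_of(unsigned cs_selector)
{
	return cs_selector & 3;
}

/* Nonzero when CPL <= IOPL, i.e. the code may change IF and do I/O. */
int cpl_le_iopl(unsigned cs_selector, unsigned flags)
{
	return cpl_of(cs_selector) <= iopl_of(flags);
}
```

Kernel code at privilege level 0 always satisfies the condition,
whatever IOPL is; level 3 code satisfies it only when IOPL is 3.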

The occurrence of this erratum may be affected by the number of wait
states during the data read bus cycle of the POPF, and by even or odd
address alignment of the stack. Two wait states added to the memory read
bus cycle will eliminate the problem. 

The problem can be avoided by replacing POPF with an alternate code sequence
in code that may be susceptible to the problem. The original posting
stated that the following code failed:

> 
> 1)	pop	cx	| return address
> 	popf		| new flags
> 	pushf		| dummy arg
> 	jmp	cx	| return

Recoding without using POPF produced the following working code:

> 
> 7)	pop	cx
> 	pop	ax
> 	push	ax
> 	test	ax,#0x200	| Don't use popf!
> 	bne	Lsplx
> 	cli
> 	jmp	cx
> Lsplx:	sti
> 	jmp	cx

This illustrates one possible alternate code sequence for POPF
which will work if AX need not be saved and if IF is the only flag of interest.

Other more general alternatives to POPF are as follows:

		push	cs
		push	#popflags
		iret
	popflags:	

OR
		call	popflags	| must be a far call

	where the routine popflags is defined as follows

	popflags: iret


NOTE:	This problem exists only in older 80286s, and occurs
	only when a POPF is executed with interrupts disabled
	and results in interrupts remaining disabled.

John Casey
Intel Corporation
(516) 231-3300
...!philabs!ron1!polyof!inthap!john

jrc@hpcnof.UUCP (04/24/86)

You're right, the POPF fails.  IBM publishes the fix in their "PC/AT
Technical Reference Manual", Part Number 1502243.

Jim Conrad
ucbvax!hplabs!hpcnof!j_conrad