[comp.arch] How does compiled code use the floating point unit?

daw@houxs.UUCP (D.WOLVERTON) (12/06/86)

In some systems, the hardware floating point (fp) unit is _optional_.
The Itty Bitty Machines (IBM) PC is a good example.  From the
point of view of a compiler writer, how does one deal with
that uncertainty? [<--this one's a rhetorical question]

I know of, or can imagine, several flavors of code generation
in the face of this situation:

1)  Code generation emits calls to a floating point library.  
This library checks for the presence of fp hardware, and uses 
the fp hardware if it is present; otherwise it emulates the operation.

2)  Like (1), but the test for the fp unit is made before the function
call.  The code is larger, but in the case where the fp unit is
present it is faster because no function call is performed.

3)  Code generation pretends that the fp unit will always be present,
so it emits code which uses the fp unit directly into the instruction
stream.  If a fp unit is not present, the hardware arranges for a trap
to occur which transfers control to the OS.  At this point either:

	a)  The OS recognizes that a fp operation was intended,
	and completes the operation by executing its own emulation
	code.  Control is then transferred back to the user code.

	b)  The OS recognizes that a fp operation was intended,
	and calls a special fp emulation entry point in the user code.  
	When the function which emulates the fp operation is finished,
	it transfers control back to the user code.

4)  Code generation emits code which always causes transfer to the
OS, e.g. by illegal opcodes or TRAP instructions.  The OS then
proceeds like (3a) or (3b) above except that the fp unit may be used 
if present.

I like (3) the best.  In the case where a fp unit is present, the
performance is no worse than if it was assumed that the fp unit would
_always_ be present.  If a fp unit is not present, the user's code
will still execute, but more slowly.  Furthermore, the user can
upgrade his floating point performance by adding the fp unit, without
re-compiling his code.

(3a) has the slight additional advantage over (3b) that the user programs
will be smaller because they do not have to carry the baggage of a fp 
emulation library.

However, (3) also requires that the fp unit's architecture be known a
priori.  It also does not account for the need to support more than one
incompatible fp unit.

Now the questions:

	Are there other scenarios in use?

	Anyone have a different choice for "best"?  Why?

	Which is "best" if more than one fp unit must be supported, or
	if the architecture of the fp unit is not known a priori?

===================================================================
David Wolverton
...!ihnp4!houxs!daw		AT&T Information Systems, Holmdel

merlin@hqda-ai.UUCP (David S. Hayes) (12/08/86)

In article <394@houxs.UUCP>, daw@houxs.UUCP (D.WOLVERTON) writes:
> In some systems, the hardware floating point (fp) unit is _optional_.

> 4)  Code generation emits code which always causes transfer to the
> OS, e.g. by illegal opcodes or TRAP instructions.  The OS then
> proceeds like (3a) or (3b) above except that the fp unit may be used 
> if present.
> 
> 	Are there other scenarios in use?

According to my memory (which operates on fuzzy logic :-), the
Sun-2 had several different optional FPU boards.  The compiler
would generate code that always trapped to the OS.  Then:

No FPU:
	OS calls a subroutine to do the work.

FPU:
	OS replaces the user instruction (a 68010 TRAP) with
	the equivalent hardware instruction.  The user program
	is then restarted at the new instruction, which now
	causes the FPU to do the work.

I like this scheme.  The overhead of going into the OS is only
paid once (assuming you actually have a FPU).  Once the OS changes
the TRAP instruction, further FP work goes directly to the hardware,
without software intervention.

Of course, this had to be done once for each different FP instruction.
In a large program, that could take a while.  On the other hand, the
most popular instructions should be replaced fairly quickly.

Anyone (particularly old Sun engineers) care to correct my memory?
-- 
	David S. Hayes, The Merlin of Avalon
	PhoneNet:	(202) 694-6900
	ARPA:		merlin%hqda-ai@brl
	UUCP:		...!seismo!sundc!hqda-ai!merlin

greg@utcsri.UUCP (Gregory Smith) (12/08/86)

In article <394@houxs.UUCP> daw@houxs.UUCP (D.WOLVERTON) writes:
>In some systems, the hardware floating point (fp) unit is _optional_.
>The Itty Bitty Machines (IBM) PC is a good example.  From the
>point of view of a compiler writer, how does one deal with
>that uncertainty?

As a compiler writer, you should provide an option to use the fp unit
directly; run-time subroutines will be used otherwise.
If you are writing distribution software, which must run whether an fp
unit exists or not, and which uses the fpu if it does exist, then
you have this problem.

>
>I know of, or can imagine, several flavors of code generation
>in the face of this situation:
>
>1)  Code generation emits calls to a floating point library.  
>This library checks for the presence of fp hardware, and uses 
>the fp hardware if it is present, otherwise it emulates the operation.
>
>2)  Like (1), but the test for fp unit is made before the function
>call.
>
>3)  Code generation pretends that the fp unit will always be present,
>so it emits code which uses the fp unit directly into the instruction
>stream.  If a fp unit is not present, the hardware arranges for a trap
>to occur which transfers control to the OS [ and the fp op is emulated..]

>4)  Code generation emits code which always causes transfer to the
>OS, e.g. by illegal opcodes or TRAP instructions.  The OS then
>proceeds like (3a) or (3b) above except that the fp unit may be used 
>if present.

5) The program contains a jump table to fp routines. At program start-up,
the presence or absence of the fp unit is determined, and the jump table
is modified to point either to routines which use the fp hardware, or to
routines which do the work in software. The generated code then makes calls
indirectly via this jump table. So no testing is done at run-time once
the table is set up.

Even better, but a little weirder: the code directly calls routines which
use the fp hardware.  If there is no fp unit, the start-up code puts a jump
instruction at the start of each routine, which jumps to the equivalent
software subroutine.  This makes the code a little faster when an fp unit is
present.  When it isn't, the extra jump won't matter much anyway.


-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

johnl@ima.UUCP (John R. Levine) (12/09/86)

In article <394@houxs.UUCP> daw@houxs.UUCP (D.WOLVERTON) writes:
>In some systems, the hardware floating point (fp) unit is _optional_.
> ...
>I know of, or can imagine, several flavors of code generation
>in the face of this situation:
>
>1)  Code generation emits calls to a floating point library.  
>This library checks for the presence of fp hardware, and uses 
>the fp hardware if it is present, otherwise it emulates the operation.
This is the most common in PC languages I've seen.

>2)  Like (1), but the test for fp unit is made before the function
>call.  The code is larger, but in the case where the fp unit is
>present it is faster because no function call is performed.
Never seen it.  PC compilers usually are more concerned with small code
size than fast execution.

>3)  Code generation pretends that the fp unit will always be present,
>so it emits code which uses the fp unit directly into the instruction
>stream.  If a fp unit is not present, the hardware arranges for a trap
>to occur which transfers control to the OS. ...

This is what the PDP-11 versions of Unix always did.  Originally, the
FP emulator was linked into every executable, which caught its own
illegal-instruction faults and then did the emulation.  More recent versions
have moved the FP emulation into the OS, so that you can just emit code
which assumes that the floating point hardware is present.

Unfortunately, this trick does not work on PCs and other 8088 machines because
if you have no 8087, your floating point instructions go into outer space and
hang or return random results.  One clever trick used in the PC/IX version of
Unix is this:  Every floating point instruction has to be preceded by a
one-byte "wait" instruction to make sure that the FP unit has finished the
preceding instruction.  It also turns out that the first byte of all FP
instructions is DC, DD, or DE hex.  When the assembler emits an FP instruction,
rather than emitting a wait instruction, it emits the first byte of an INT
instruction which causes a software trap.  The trap number is determined by
the next byte in the instruction stream, which is the DC, DD, or DE.  When the
OS gets such an interrupt, it checks to see if the system has an 8087.  If so,
it patches the INT instruction to a WAIT and returns to it, so that the hardware
executes the instruction.  Otherwise it emulates the operation and returns.
This means that there is a trap for each instruction the first time it is
encountered in the program, but if there is an 8087, the program runs at full
speed after that.  The 80286 and its successors make this hack unnecessary,
since they have a bit you can set to force traps on execution of unimplemented
instructions.
-- 
John R. Levine, Javelin Software Corp., Cambridge MA +1 617 494 1400
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.EDU
The opinions expressed herein are solely those of a 12-year-old hacker
who has broken into my account and not those of any person or organization.

guy@sun.uucp (Guy Harris) (12/09/86)

> Anyone (particularly old Sun engineers) care to correct my memory?

Yes.  The Sun-2 did, in fact, have an FPU option, namely a Sky floating
point board.  However, unless we did so in release 1.x, Sun NEVER generated
code to trap to the OS.  In 2.0, I think the compiler generated subroutine
calls for floating-point operations.  By default, the subroutines either
jumped to software floating-point routines or to Sky floating-point
routines, depending on whether the program detected that there was a Sky
board on the machine when it started (the C startup code did this check).
An option could tell the compiler to generate direct calls to the Sky
routines.

In 3.0 and later releases, the same scheme was used to support the 3
possibilities for Sun-3 floating point support: no hardware, MC68881, and
FPA.  There were now options to tell the compiler to generate code to call
the "switched" floating-point routines (that used the information on what
hardware was present to choose which routines to jump to), to call the
software routines directly, to call the Sky routines, to use the 68881
instructions, or to use the FPA "instructions" ("move"s, etc. to the FPA
registers).

Since it took *several* 68010 instructions to make the Sky board perform a
floating-point operation (unless you're talking about the 68881 or, in some
cases, the FPA, there aren't any single-instruction floating-point
operations on Suns; there certainly weren't any on the Sun-2), you couldn't
just replace a TRAP with the operation in question.  And, since many Sun-2s
didn't have a Sky board, the overhead of doing floating point in the OS
would have been prohibitive in most cases, so the OS certainly wouldn't have
done floating-point computations if there wasn't a Sky board.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

stuart@bms-at.UUCP (Stuart D. Gathman) (12/10/86)

In article <394@houxs.UUCP>, daw@houxs.UUCP (D.WOLVERTON) writes:

> In some systems, the hardware floating point (fp) unit is _optional_.
> The Itty Bitty Machines (IBM) PC is a good example.  From the

> 3)  Code generation pretends that the fp unit will always be present,
> so it emits code which uses the fp unit directly into the instruction
> stream.  If a fp unit is not present, the hardware arranges for a trap
> to occur which transfers control to the OS.  At this point either:

> 4)  Code generation emits code which always causes transfer to the
> OS, e.g. by illegal opcodes or TRAP instructions.  The OS then

With the *86/*87 chips, there is a very elegant solution combining
3 & 4.  Emulation and real code are easily defined to be
identical except for a constant difference in the first byte.
The OS can then adapt to the presence or absence of the chip at
program load time by using a relocation table to modify the initial
instruction bytes if the *87 is absent.  This way you get optimal
emulator performance and optimal hardware performance.
-- 
Stuart D. Gathman	<..!seismo!dgis!bms-at!stuart>

spain@alliant.UUCP (12/10/86)

In article <394@houxs.UUCP> daw@houxs.UUCP (D.WOLVERTON) writes:
>
>In some systems, the hardware floating point (fp) unit is _optional_.
>...
>	Are there other scenarios in use?

I am familiar with one more mechanism, call it (3.5), which goes
something like this:

3.5)  Code generation pretends that the fp unit will always be present,
so it emits code which uses the fp unit directly in the instruction
stream.  If a fp unit is not present, the "hardware", in the form of the
machine's microcode, emulates the instruction using the machine's integer
hardware.  No OS trap is involved and there is no change of control away
from the user's code.

jc@piaget.UUCP (John Cornelius) (12/11/86)

David Wolverton gives 3 of the most common methods for doing floating
point in an environment where the availability of floating point 
hardware is unknown.  I suggest that the most common method, however,
is to assume that floating point hardware does not exist.  In Unix this
is accomplished by having the cc command map to cc -f which uses library
routines that do not check for the presence of floating point hardware.

If floating point hardware is subsequently installed, the C compiler
invocation routine (/bin/cc usually) is changed to cause pass 2 to emit
actual floating point code of the desired type.

-- 
John Cornelius
(...!sdcsvax!piaget!jc)