[net.arch] Delayed Loads

aglew@ccvaxa.UUCP (09/14/86)

There has been some discussion of delayed branches in this newsgroup;
can anybody say anything useful about delayed load/stores? Ie. memory
access functions that are defined to work the same way as delayed
branches, not to take effect until after a few more instructions.

Eg.:	memaddr: .word 2

	LOAD r0 := #1
	LOAD r0 := [memaddr]
	MOV r1 := r0		-- r1 contains 1, not 2
	MOV r2 := r0		-- r2 contains 2, the load has completed

Or, the contents of r0 might be undefined until the second load completes.
This might be preferable, since it would mean that, if you could eventually
build a faster memory system that completes in one cycle, you could use it.

What systems use these?

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms

mahar@weitek.UUCP (Mike Mahar) (09/16/86)

In article <5100133@ccvaxa>, aglew@ccvaxa.UUCP writes:
> 
> There has been some discussion of delayed branches in this newsgroup;
> can anybody say anything useful about delayed load/stores? Ie. memory
> access functions that are defined to work the same way as delayed
> branches, not to take effect until after a few more instructions.
> 
> Eg.:	memaddr: .word 2
> 
> 	LOAD r0 := #1
> 	LOAD r0 := [memaddr]
> 	MOV r1 := r0		-- r1 contains 1, not 2
> 	MOV r2 := r0		-- r2 contains 2, the load has completed
> 
Microcoded systems have used delayed loads for some time.  Weitek's 7137
and 7136 ALU & sequencer use a delayed load model of the form:
	address is presented on address bus
	data is loaded on a later cycle

Weitek splits presenting the address and the loading into two separate
instructions.
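
A minimal sketch of that two-instruction split, in the notation of the
original posting (the mnemonics here are illustrative, not actual Weitek
ones):

	FETCH [memaddr]		-- present the address on the address bus
	ADD r2 += r3		-- unrelated work covers the memory latency
	LOADREG r0		-- later cycle: capture the returned data in r0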

The example quoted above is not interruptible: if an interrupt happens
between the second load and the first MOV, the load will have completed
by the time execution resumes, so r0 holds the same (new) value for both
moves.
 
-- 

	Mike Mahar
	UUCP: {turtlevax, cae780}!weitek!mahar

	Disclaimer: The above opinions are, in fact, not opinions.
	They are facts.

mash@mips.UUCP (John Mashey) (09/17/86)

In article <486@weitek.UUCP> mahar@weitek.UUCP (Mike Mahar) writes:
>In article <5100133@ccvaxa>, aglew@ccvaxa.UUCP writes:
>> 
>> There has been some discussion of delayed branches in this newsgroup;
>> can anybody say anything useful about delayed load/stores? Ie. memory
>> access functions that are defined to work the same way as delayed
>> branches, not to take effect until after a few more instructions.
>> (example; further discussion by Mike on Weitek 713[67] ALU & sequencer.)

I missed the original of this.  Both the Stanford MIPS and the MIPS Computer
Systems R2000 use non-interlocked load instructions.  A code reorganizer
rearranges instructions to place independent ones in the "load-delay-slots".
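
As a minimal sketch, in the notation of the original posting rather than
actual R2000 assembly, the reorganizer turns

	LOAD r0 := [memaddr]
	NOP			-- load-delay slot, wasted
	ADD  r2 := r0 + r1

into

	LOAD r0 := [memaddr]
	ADD  r4 := r5 + r6	-- independent instruction fills the slot
	ADD  r2 := r0 + r1	-- r0 is ready by the time it is used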

Note that the load latency always exists, whether or not software
fills the slot, leaves a nop, or the hardware provides an interlock.
The main issues are in deciding how much interlocking and optimization
one can expect from the software, and therefore can leave out of the
hardware.

One can observe several distinct design styles in the handling of
load-delay latency, or of other operations that produce results used by
later instructions.  The main styles are:
	a) Hardware interlocks, with some parallelism.
	b) Non-pipelined implementation, i.e., an extreme form of a)
		that makes most interlocks unnecessary! (but slow)
	c) Software scheduling required for correctness everywhere.
	d) Some combination of a) and c) required for correctness.
	e) Basically a), but designed with c) in mind.

Most computers use a), with the complexity of interlock dependent on
the nature of the architecture, and on the aggressiveness of pipelining.
A good example would be a 360/91, and presumably some of the faster 30XX
machines.  Note that a complex architecture may require considerable
hardware to dynamically detect operations that can be done in parallel,
do them that way, and make sure everything is fine when exceptions happen.
[I.e., nothing stops CISCs from being fast, but it takes a lot of gates!]
Branch handling gets exciting, for example.

The "bottom" end of many computer families is often in class b).

I assume that many specialized VLSI parts use c).  I don't know of
any aggressively-pipelined general CPU architecture that does this.
Can anybody post some?

d) Many RISC designs fall in this class.  For example, MIPS R2000 uses software
to fill load and branch delays, while using hardware interlocks for
integer multiply/divide, and for some floating-point operations.
The HP Spectrum (I think) fills branch delays by software, but uses
hardware for load delays.  In either case, at least some of the hardware
design was predicated on the expected nature of compilers, i.e., things
were left out of the hardware based on knowing what the compilers
might be able to do.
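
As a hedged sketch in the same pseudo-assembly notation (one delay slot,
illustrative mnemonics), filling a branch delay slot by software means
turning

	ADD r5 := r6 + r7
	BEQ r1, label		-- delayed branch
	NOP			-- delay slot, wasted

into

	BEQ r1, label		-- delayed branch
	ADD r5 := r6 + r7	-- independent work executes in the slot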

e) The CDC 6600 probably falls in this class: the FORTRAN compiler would
rearrange code to help things go fast, but the hardware could handle
all of the interlocks itself [I think. Anybody know different?]

It's amusing to note that people have done reorganizing compilers for
machines whose architecture provides interlocks, but whose faster
members can run faster given code that has been organized with
more aggressive pipelines in mind. [i.e., big IBM machines]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hank@masscomp.UUCP (Hank Cohen) (09/18/86)

In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>There has been some discussion of delayed branches in this newsgroup;
>can anybody say anything useful about delayed load/stores? Ie. memory
>access functions that are defined to work the same way as delayed
>branches, not to take effect until after a few more instructions.
>
The benefit of such an approach is similar to that of delayed 
branches.  In a pipelined processor the result of an operation is not
available immediately so if the next instruction in the pipe requires the
result then the pipeline must be stopped until the result is ready.  This
interlock logic tends to significantly complicate the design of the CPU
and slows down execution times.  Performance of pipelined processors can be
improved by generating code that does not generate data dependent pipeline
interlocks.  Presumably microprocessors without pipeline interlocks have
delayed stores as well as delayed branches and for the same reason.

An even thornier problem arises if you allow self-modifying code to be run
on your machine, i.e., you build a real von Neumann machine.  The problem of
detecting stores into the instruction stream of a pipelined processor is
even more difficult than detecting data interdependencies.  On the Amdahl
470V/8 (the pipelined processor that I am most familiar with) no attempt
is even made to detect stores into instructions that are already in
execution.  All they try to do is see if a store is "close", in which
case the entire pipeline is flushed and serialized.
	
	Illegitimi non carborundum
		Hank Cohen

stubbs@ncr-sd.UUCP (Jan Stubbs) (09/19/86)

In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>can anybody say anything useful about delayed load/stores? Ie. memory
>access functions that are defined to work the same way as delayed
>branches, not to take effect until after a few more instructions.
>
>What systems use these?
>Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew

The NCR 8500 (circa 1975) and 8600 (circa 1979) are two such machines, built
from 100K ECL with 56 and 38 nanosecond cycle times respectively, one
instruction per cycle.  Each has a fetch instruction that takes the address
to fetch from one of its 64 registers and names another register as the
destination.  You may execute dozens of other instructions, including three
more fetches, while waiting for the contents of memory to show up in the
specified register, but if you reference that register the pipeline hangs
till the word shows up.


The NCR/32 microprocessor (circa 1982) is similar, except you first do a
Fetch, and later name the destination register with a Receive instruction,
which hangs the pipeline if the data isn't ready.  You can put any
instruction except a fetch or store between the Fetch and the Receive.
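
A sketch of that Fetch/Receive pattern, in the notation used earlier in
the thread (illustrative mnemonics, not actual NCR/32 ones):

	FETCH   [memaddr]	-- start the memory access
	ADD     r2 += r3	-- any instructions except fetch/store here
	RECEIVE r0		-- names the destination; hangs if not ready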

I believe most "RISC" machines (Pyramid, MIPS, Motorola 78000) do this
somehow or other.

ed@mtxinu.UUCP (Ed Gould) (09/19/86)

>e) The CDC 6600 probably falls in this class: the FORTRAN compiler would
>rearrange code to help things go fast, but the hardware could handle
>all of the interlocks itself [I think. Anybody know different?]

I suspect it's true of the 6600; it's definitely true of the 6400,
which is the low-end machine of the original 6000 series.  (CDC
later came out with the 6200, but it was really a slowed-down 6400.)
Even on this low-end machine, many of the same reorderings worked
as on the 6600, even though the 6400 had no parallelism.  These
optimizations had to do with where within the 60-bit word the
instructions - which were generally either 15 or 30 bits long -
landed.  Other optimizations - ones which took advantage of the
parallelism of the 6600 - were either meaningless on the 6400,
or, sometimes, a cycle slower than the obvious sequence!  The
compilers that did optimizations needed to know which member of the
family the code was for.  One example of this type of optimization
that I remember was when copying two X registers (the machine's
accumulators, more or less; they're 60-bit registers) into two
other registers.  The obvious sequence is

	BX1	X2+X2	bitwise "or" of X2 with X2 into X1
	BX3	X4+X4	likewise for X4 into X3

Redundant operands could be elided, so that the X2+X2 could be
abbreviated by just using X2.  The 6600 had separate functional
units - operating in parallel with interlocks on using the results -
including a "boolean" unit to do the "B" instructions and a "logical"
unit that did shifts - "L" instructions.

	LX1	X2,B0	copy X2 into X1 using the "logical" unit
	BX3	X4+X4	copy X4 into X3 using the "boolean" unit

The LX1 instruction left-circular-shifts X2 by the number of bits
specified by the value of B0, which is a hard-wired 0, and leaves the
result in X1.  (The other seven B registers were real 18-bit
registers.  They are essentially index registers; addresses are 18
bits.)  The L unit was typically one cycle slower than the B unit,
so the above sequence was optimal on a 6600, where both instructions
would finish at the same time.  On a 6400, however, (if I remember
correctly) the L instructions were also one cycle slower than the
B instructions, so that the optimized sequence would be one cycle
slower than the obvious sequence.

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."

kjm@ut-ngp.UUCP (09/19/86)

>e) The CDC 6600 probably falls in this class: the FORTRAN compiler would
>rearrange code to help things go fast, but the hardware could handle
>all of the interlocks itself [I think. Anybody know different?]
>[John Mashey]

This is correct.  CDC 6600's and their descendants have a reservation
scheme which delays execution of an instruction if a register it uses
is being loaded from memory.

--
Kenneth J. Montgomery  "Shredder-of-hapless-smurfs"
[Charter?] Member, Heathen and Atheist SCUM Alliance, "Heathen" division

...!{ihnp4,allegra,seismo!ut-sally}!ut-ngp!kjm  [Usenet, when working]
kjm@ngp.{ARPA, UTEXAS.EDU, CC.UTEXAS.EDU}  [Old, New, and Very New Internet]

kjm@ut-ngp.UUCP (09/20/86)

> One example of this type of optimization
>that I remember was when copying two X registers (the machine's
>accumulators, more or less; they're 60-bit registers) into two
>other registers.  The obvious sequence is
>
>        BX1     X2+X2   bitwise "or" of X2 with X2 into X1
>        BX3     X4+X4   likewise for X4 into X3
>
>Redundant operands could be elided, so that the X2+X2 could be
>abbreviated by just using X2.
>
>[Ed Gould]

Just as a point of information, "BX1 X2+X2" and "BX1 X2" are not the
same instruction -- they generate different opcodes.  They do have the
same effect, though.

--
The above viewpoints are mine.  They are unrelated to
those of anyone else, including my cat and my employer.

Kenneth J. Montgomery  "Shredder-of-hapless-smurfs"
[Charter?] Member, Heathen and Atheist SCUM Alliance, "Heathen" division

...!{ihnp4,allegra,seismo!ut-sally}!ut-ngp!kjm  [Usenet, when working]
kjm@ngp.{ARPA, UTEXAS.EDU, CC.UTEXAS.EDU}  [Old, New, and Very New Internet]

mash@mips.UUCP (John Mashey) (09/20/86)

In article <1115@masscomp.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>>
>>There has been some discussion of delayed branches in this newsgroup;
>>can anybody say anything useful about delayed load/stores? Ie. memory
>>access functions that are defined to work the same way as delayed
>>branches, not to take effect until after a few more instructions.
>>
>The benefit of such an approach is similar to that of delayed 
>branches.  In a pipelined processor the result of an operation is not
>available immediately so if the next instruction in the pipe requires the
>result then the pipeline must be stopped until the result is ready.  This
>interlock logic tends to significantly complicate the design of the CPU
>and slows down execution times.  Performance of pipelined processors can be
>improved by generating code that does not generate data dependent pipeline
>interlocks.  Presumably microprocessors without pipeline interlocks have
>delayed stores as well as delayed branches and for the same reason.
No.
Delayed branches and delayed loads are the identical problem, one each
for Instruction and Data.  There's no reason to delay stores, since you
already have the data you want.  The problem with stores is having enough
buffering to smooth the flow of data to memory, and not stall the processor
waiting for the write to happen.  Solutions to the problem include:
register windows (which help the subset of writes that would be subroutine
register saves), stack caches (which help the writes that are near the
top of the stack), and either write-back caches (like on an 8600), or
write-thru caches with write buffers [i.e., like the 1-deep write buffer
on the 780, or a MIPS 4-deep write buffer, or (lots of others)].
>
>An even thornier problem arises if you allow self-modifying code to be run
>on your machine, i.e., you build a real von Neumann machine.  The problem of
>detecting stores into the instruction stream of a pipelined processor is
>even more difficult than detecting data interdependencies.  On the Amdahl
>470V/8 (the pipelined processor that I am most familiar with) no attempt
>is even made to detect stores into instructions that are already in
>execution.  All they try to do is see if a store is "close", in which
>case the entire pipeline is flushed and serialized.

A pleasant thing about doing an architecture from scratch is the ability to
forbid the use of stores into the instruction stream. [Obviously, you must
be able to create executable code, but you can require a system call to
indicate weird cache manipulations.]  There appears to be a fair amount of
hardware in many high-end machines dedicated to worrying about this
[relatively rare] event, which is too bad.  Had it been forbidden from
day one, I suspect little performance would be lost; certainly, most
high-level languages don't do this kind of thing anyway.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@ccvaxa.UUCP (09/22/86)

Thank you to those who responded, in net.arch or by e-mail, to my question
about delayed memory accesses.

    All sorts of people know machines that let you execute ahead, but have an
    interlock. Even Goulds have these (:-).

    MIPS, some NCR microprocessors, and microcoded engines like Weitek's have
    explicit delayed loads, without interlock. And, I assume, with the usual
    interrupt restart problems.

Do you mind if I take another stab at expressing my curiosity about delayed
memory accesses?

    Q1: What is the success rate of code reorganization to use the delayed
	slots without conflict? Is it more or less successful than code 
	reorganization for delayed branches? What are the static/dynamic 
	rates? I understand that they are typically 90% for the first slot, 
	80% for the second, and so on, for delayed branches.

    Q2: Does anybody have special knowledge about delayed memory accesses
	in vector machines, particularly machines where the vector startup
	time is high?

    Q3: All of the discussion so far has been about delayed LOADS. What about
	delayed STORES, where you can't touch the data to be stored for a
	few instructions after the store instruction? Does anybody try to
	save a latch?

I wonder if the idea of an architectural family is dead? If not, how do you
reconcile it with delayed loads/branches? Start off with the longest possible
delay factor, and reduce it as machines get faster?

Do machines that have no interlocks on their delayed loads have strict
or relaxed semantics? Ie. in the code sequence

	ADD r1 := r0
	LOAD r0 := [memaddr]
	-- delay slot

do MIPS et cie. permit the rearrangement

	LOAD r0 := [memaddr]
	ADD r1 := r0

where the value put in r1 is not [memaddr], but whatever was there before?

There is a difference between saying "the result is not available for N
cycles" and saying "the destination value is not changed for M cycles".

(Oh, and a more personal self-development question; I wouldn't bother the
net, but I'm not quite sure who else to ask: NCR has built some interesting
machines.  Where can I get information, spec sheets, and data books on them?)

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms

kim@amdahl.UUCP (Kim DeVaughn) (09/24/86)

[ "Send lawyers, guns, and money ..." ]

In article <697@mips.UUCP>, mash@mips.UUCP (John Mashey) writes:
>A pleasant thing about doing an architecture from scratch is the ability to
>forbid the use of stores into the instruction stream. [Obviously, you must
>be able to create executable code, but you can require a system call to
>indicate weird cache manipulations.]  There appears to be a fair amount of
>hardware in many high-end machines dedicated to worrying about this
>[relatively rare] event, which is too bad.  Had it been forbidden from
>day one, I suspect little performance would be lost; certainly, most
>high-level languages don't do this kind of thing anyway.

There is a related problem here with machines that have separate
instruction and data caches, and which run s/w that doesn't distinguish
between "code" and "data" in the object/load file formats.

With large cache line sizes (64 byte lines on Amdahl 5890's), you can
frequently end up with data in the I-cache because the compiler gens
short chunks of code followed by short chunks of data followed by more
code, etc.  This also happens with some styles of coding in assembly
language.  So we end up "shuffling" cache-lines back and forth between
the I-cache and Op-caches.  And yes, you're right ... it does take a lot
of h/w to do this kind of thing without incurring a (big) performance
penalty.

/kim


-- 
UUCP:  {sun,decwrl,hplabs,ihnp4}!amdahl!kim
DDD:   408-746-8462
USPS:  Amdahl Corp.  M/S 249,  1250 E. Arques Av,  Sunnyvale, CA 94086
CIS:   76535,25

[  Any thoughts or opinions which may or may not have been expressed  ]
[  herein are my own.  They are not necessarily those of my employer. ]

johnl@ima.UUCP (John R. Levine) (09/24/86)

In article <1174@ncr-sd.UUCP> stubbs@ncr-sd.UUCP (Jan Stubbs) writes:
>In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>>can anybody say anything useful about delayed load/stores? Ie. memory
>>access functions that are defined to work the same way as delayed
>>branches, not to take effect until after a few more instructions.
>The NCR 8500 (circa 1975) and 8600 (circa 1979) are two such machines ...

The IBM 360/91, circa 1969, had overlapped loads and stores.  The manual
suggested that if you reorder instructions so that the result register
of one instruction is not used as an operand until a few instructions later,
your program will run a lot faster.  But there was also considerable expensive
hardware so that if you did use your results immediately, it interlocked to
make the program work correctly.  The CDC 6600 had similar overlaps and
interlocks even earlier, about 1966.
-- 
John R. Levine, Javelin Software Corp., Cambridge MA +1 617 494 1400
{ ihnp4 | decvax | cbosgd | harvard | yale }!ima!johnl, Levine@YALE.EDU
The opinions expressed herein are solely those of a 12-year-old hacker
who has broken into my account and not those of any person or organization.

jans@stalker.gwd.tek.com (Jan Steinman) (09/24/86)

In article <697@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>In article <1115@masscomp.UUCP> hank@masscomp.UUCP (Hank Cohen) writes:
>>In article <5100133@ccvaxa> aglew@ccvaxa.UUCP writes:
>>>
>>>There has been some discussion of delayed branches in this newsgroup;
>>>can anybody say anything useful about delayed load/stores? Ie. memory
>>>access functions that are defined to work the same way as delayed
>>>branches, not to take effect until after a few more instructions.
>>
>>An even thornier problem arises if you allow self-modifying code to be run
>>on your machine, i.e., you build a real von Neumann machine.  The problem of
>>detecting stores into the instruction stream of a pipelined processor is
>>even more difficult than detecting data interdependencies.
>
>A pleasant thing about doing an architecture from scratch is the ability to
>forbid the use of stores into the instruction stream.

For an example of something useful to do with self-modifying code on a
pipelined machine, see the September Dr. Dobb's Journal, p. 114.  Motorola
gave the capability to forbid stores in the code area, but few people
use it.  (Is anybody out there using the FC lines to write-protect
code memory?)  If Mota had been on time with an MMU that utilized the
FC lines, they would have been useful, but most designers ignored
them.

I think the best policy is "caveat emptor", let the programmer beware!
Note that an explicit cache flush should be provided on heavily
pipelined/cached machines that allow code writes.  The 68020 has this,
the 680[01]0 does not, but neither wastes time *detecting* this
condition, which I agree with wholeheartedly.  The DDJ code cited
depends on the ability to change the opcode of an instruction that
has already been prefetched.
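
A minimal sketch of the safe pattern on a 68020, assuming supervisor
mode and the documented CACR bit layout (bit 0 = E/enable, bit 3 =
C/clear cache); the label and opcode value here are hypothetical:

	lea	patchsite,a0	; address of the instruction to patch
	move.w	#NEWOP,(a0)	; store the new opcode over the old one
	moveq	#9,d0		; set C (clear cache) and E (enable) bits
	movec	d0,cacr		; flush the on-chip I-cache (privileged)
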
:::::: Artificial   Intelligence   Machines   ---   Smalltalk   Project ::::::
:::::: Jan Steinman		Box 1000, MS 60-405	(w)503/685-2956 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

hansen@mips.UUCP (Craig Hansen) (09/25/86)

>     Q3: All of the discussion so far has been about delayed LOADS. What about
> 	delayed STORES, where you can't touch the data to be stored for a
> 	few instructions after the store instruction? Does anybody try to
> 	save a latch?

	This isn't a good idea.  While you can easily determine when
	two registers are identical, it is hard to determine at
	compile time whether two address expressions potentially
	conflict.  Thus it is hard to effectively reorganize code
	to avoid conflicts over delayed stores.
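
	A sketch of the aliasing problem, in the notation used earlier
	in the thread:

		STORE [r1] := r0	-- delayed store
		LOAD  r3 := [r2]	-- r2 may or may not equal r1;
					-- the compiler usually can't tell,
					-- so it can't schedule around the store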

> I wonder if the idea of an architectural family is dead? If not, how do you
> reconcile it with delayed loads/branches? Start off with the longest possible
> delay factor, and reduce it as machines get faster?

	In general, you'd take one or two delay slots, which is all
	you can usually fill in software, and make them architectural.
	If the machine you are building has more than the architecturally
	defined number of delay slots, it had better interlock.
	This tradeoff permits building a fast, simple, machine now,
	and a faster, more complicated machine later.
	However, even the more complicated machine has fewer, less
	stressful interlock cases than a machine without delay slots.

> Do machines that have no interlocks on their delayed loads have strict
> or relaxed semantics? Ie. in the code sequence
> 
> 	ADD r1 := r0
> 	LOAD r0 := [memaddr]
> 	-- delay slot
> 
> do MIPS et cie. permit the rearrangement
> 
> 	LOAD r0 := [memaddr]
> 	ADD r1 := r0
> 
> where the value put in r1 is not [memaddr], but whatever was there before?


	This is explicitly not permitted in MIPS or in any other RISC
	machine I am aware of.  The reason is that the occurrence of
	an interrupt between the two instructions will cause it to fail.
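
	A sketch of the failure, in the notation of the question:

		LOAD r0 := [memaddr]
		-- interrupt taken here: by the time the handler
		-- returns, the load has completed, so r0 already
		-- holds [memaddr]
		ADD  r1 := r0	-- expected the OLD r0; gets the new one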

> There is a difference between saying "the result is not available for N
> cycles" and saying "the destination value is not changed for M cycles".

-- 

Craig Hansen			|	 "Evahthun' tastes
MIPS Computer Systems		|	 bettah when it
...decwrl!mips!hansen		|	 sits on a RISC"

cameron@foxy.UUCP (cameron spitzer) (10/01/86)

<  eat me if you dare  >
>> For an example of something useful to do with self-modifying code on a
Self modifying code?  Masochism!
It's very dangerous on the 68020 because stores don't update the on-chip
instruction cache, and in a protected machine you the user ain't allowed
to flush the cache.

>> pipelined machine, see September Dr. Dobbs Journal, pp 114.  Motorola
>> gave the capability to forbid stores in the code area, but few people
>> use it.  (Is anybody out there using the FC lines to write-protect
>> code memory?)
Of course. Everybody who builds a protected (memory-managed?) machine
builds their own MMU or buys Mot's.
Arete's MMU uses FC2 to decide if you have user or kernel permissions.
We also detect if you're trying to write your code or execute your data
(which takes FC1 and FC0);
even kernels ain't got that permission (but they can change the page map).
Our MMU is made of Schottky and very fast CMOS and takes no wait states,
but the same protection is available in Mot's MMU if you're short of
board space.

>>If Mota had been on time with an MMU that utilized the
>> FC lines, they would have been useful, but most designers ignored
>> them.
Huh?  I've never seen a 680x0 design which left FC dangling.
If nothing else, you need them to detect interrupt acknowledge.
Do I hear a put-down here?  Did Mot let you down for a single-chip MMU?
I was delighted their floating point was so nice;
I'd rather roll my own MMU, which is application specific,
than go into the floating point business.
You'd be surprised where FPUs get used;
take a look at printf(3s) or nroff(1) on your machine.