[comp.arch] Memory-mapped floating point

w-colinp@microsoft.UUCP (Colin Plumb) (11/30/88)

In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Note, of course, that this is a model of the world likely to go away,
>for anybody who is serious about scalar floating point.  The structure
>is only likely to happen when you have a long-cycle-count FP device,
>which is still fast enough to be desirable, but where the addition of a few
>cycles' overhead doesn't clobber the performance.  As micros with
>tight-coupled FPUs (MIPS R3000/R3010, Moto 88K, Cypress SPARC + TI 8847)
>and low-cycle count operations (2-10 cycles for 64-bit add & mult)
>become more common, you just lose too much performance moving data
>around to afford that kind of interface, at least in a scalar unit.

Gee, this is interesting, considering that the MIPS integer multiply/divide
is done basically the same way: move operands to special registers, issue
instruction, if a read is attempted before the results are ready, stall.

The only difference is that the registers are on-chip and that the
instruction that starts the multiply isn't a store.

On, say, a 68020, you simply can't have loads and stores complete in one
cycle, so this wouldn't work, and the address computation on the MIPS might
cost a cycle, but on the 29000, where the address is known by the end of
the decode stage, the load/store can leave the chip during the execute
phase and complete that same cycle, so you're only losing one cycle
transferring operands (less, if you make setup-time assumptions).

This seems pretty tightly coupled to me.

(For the confused: the 29000 has some extra address modifier bits, some of
which are used to address the floating-point unit, and the others are
used to indicate what sort of transaction this is - are you writing
operand 1, operand 2, or an instruction.  You can use both address and
data buses to send 64 bits of data in one cycle.  Reads are only 32
bits at a time, sorry.)
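
To spell out that transaction model, here is a rough C simulation of the
write-operands / write-instruction / read-result protocol just described.
The transaction-type encodings, the FPU-side state, and the 6-cycle latency
are invented for illustration; only the shape of the protocol follows the
description above.

```c
#include <assert.h>

/* Toy model of a memory-mapped FPU: the CPU talks to it only through
 * bus transactions -- writes of operands and an instruction word, then
 * reads of the result.  The transaction types stand in for the 29000's
 * address-modifier bits; the encodings here are made up. */
enum txn { WRITE_OP1, WRITE_OP2, WRITE_INSN, READ_RESULT };
enum fpu_op { FPU_DADD, FPU_DMUL };

struct fpu {
    double op1, op2, result;
    int busy_cycles;            /* cycles until the result is ready */
};

/* One 64-bit write per cycle: both halves of a double at once, using
 * the address+data bus trick described in the text. */
static void fpu_write(struct fpu *f, enum txn t, double value, int op)
{
    switch (t) {
    case WRITE_OP1: f->op1 = value; break;
    case WRITE_OP2: f->op2 = value; break;
    case WRITE_INSN:            /* writing the instruction starts the op */
        f->result = (op == FPU_DADD) ? f->op1 + f->op2
                                     : f->op1 * f->op2;
        f->busy_cycles = 6;     /* assumed latency, for illustration */
        break;
    default: break;
    }
}

/* A read attempted before the result is ready stalls; the return value
 * is the number of cycles spent stalled. */
static int fpu_read(struct fpu *f, double *out)
{
    int stalled = f->busy_cycles;
    f->busy_cycles = 0;
    *out = f->result;
    return stalled;
}
```

A y + z then costs two operand writes, one instruction write, and a
(possibly stalling) result read, which is exactly the cost model argued
over in the rest of the thread.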

Basically, I claim that this model of FPU operation is very close
in speed to one where the CPU has more knowledge of the FPU, if
done properly.

(P.S. Question to MIPS: I think you only need to back up one extra instruction
on an exception if that instruction is a *taken* branch.  Do you do it this
way, or on all branches?)
-- 
	-Colin (microsof!w-colinp@sun.com)

mash@mips.COM (John Mashey) (12/01/88)

In article <1044@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
>In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Note, of course, that this is a model of the world likely to go away,
>>for anybody who is serious about scalar floating point.  The structure
>>is only likely to happen when you have a long-cycle-count FP device,...

>Gee, this is interesting, considering that the MIPS integer multiply/divide
>is done basically the same way: move operands to special registers, issue
>instruction, if a read is attempted before the results are ready, stall.
Actually, it doesn't work this way: the mult instruction picks up the values
from the regular registers, i.e., one instruction both fetches the data and
initiates the instruction.  The only "extra" instruction is the one that
moves the result back.

>On, say, a 68020, you simply can't have loads and stores complete in one
>cycle, so this wouldn't work, and the address computation on the MIPS might
>cost a cycle, but on the 29000, where the address is known by the end of
>the decode stage, the load/store can leave the chip during the execute
>phase and complete that same cycle, so you're only losing one cycle
>transferring operands (less, if you make setup-time assumptions).
>This seems pretty tightly coupled to me.

>(For the confused: the 29000 has some extra address modifier bits, some of
>which are used to address the floating-point unit, and the others are
>used to indicate what sort of transaction this is - are you writing
>operand 1, operand 2, or an instruction.  You can use both address and
>data buses to send 64 bits of data in one cycle.  Reads are only 32
>bits at a time, sorry.)

>Basically, I claim that this model of FPU operation is very close
>in speed to one where the CPU has more knowledge of the FPU, if
>done properly.

Here's an example.  Maybe this is close in speed, or maybe not, or maybe
I don't understand the 29K FPU interface well enough.  Here's a small
example (*always be wary of small examples; I don't have time right now
for anything more substantive):
main() {
	double x, y, z;
	x = y + z;
}

this generates (for an R3000), and assuming data & instrs in-cache:
 #   2		double x, y, z;
 #   3		x = y + z;
	l.d	$f4, 8($sp)
	l.d	$f6, 0($sp)
	add.d	$f8, $f4, $f6
	s.d	$f8, 16($sp)
The l.d and s.d actually turn into pairs of loads & stores, and this sequence
takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away), 2 cycles for
the add.d, and 2 swc1's.

Assume a 29K with cache, so it has loads like the R3000's.
As far as I can tell, a 29K would use 17:
4 load cycles (best case: in some cases, offset calculations, or use of
	load multiple with count-setup would take another one or two)
2 writes (get the data over to the FPU)
1 write (start the add-double)
6 cycles (do the add.double)
2 reads (get the 64-bit result back)
2 stores (put the result back in memory; again assume no offset calculations)

Now, the biggest difference is the 6 cycles versus 2, and if this were
* instead of +, it would be 6 versus 5.  Still, as it stands, making worst
case assumptions about the R3000 (that the nop gets left in), and some
best-case assumptions about the 29K, you get a 9:17 ratio for this case,
a 12:17 for y*z;  if you do the single-precision case with the same assumptions,
you get a 6:12 ratio for x+y, and 8:12 for x*y.
So, we get:
	DP +	DP *	SP +	SP *
R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
29K	17	17	12	12
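
For the record, the double-precision columns of this table are just the sums
of the step costs listed above; a quick C check of that arithmetic (the
per-step cycle costs are the post's stated assumptions, not measurements):

```c
#include <assert.h>

/* R3000: 4 operand loads, an (unscheduled) nop, the FP op, 2 stores.
 * 29K:   4 loads, 2 operand writes, 1 instruction write, the FP op,
 *        2 result reads, 2 stores.
 * FP-op cycles per the post: 2 (DP add), 5 (DP mul) on the R3000,
 * 6 for either operation on the 29K/29027. */
static int r3000_dp_cycles(int fp_op_cycles)
{
    return 4 + 1 + fp_op_cycles + 2;
}

static int am29k_dp_cycles(int fp_op_cycles)
{
    return 4 + 2 + 1 + fp_op_cycles + 2 + 2;
}
```

The single-precision columns work the same way, with 2-word loads and a
1-word store instead of 4 and 2.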

Also, interesting to see what would happen if you compare two non-existent
machines: an R3000* whose FP operation times increase to match the 29K's,
and a 29K whose FP op times decrease to match the R3000s:
R3000*	13	15	10	10	AN R3000 with 29K FP cycle times
29K*	13	16	8	10	A 29K whose FP operations had R3000
					cycle times
What does this mean?
ANS: statistically, not much, except that it means that the extra cycles
overhead getting data in and out would cancel the advantage of decreasing the
FP cycle times.  A better comparison might be between R3000 and 29K*,
which shows that even with the same FP operation times, one is paying the
following % cycle count penalties:  44%, 33%, 33%, 25%, at a minimum.
If you guess that you schedule the nop away 70% of the time, you get:
R3000	8.3	11.3	5.3	7.3
29K*	13	16	8	10
% hit	57%	42%	51%	37%

Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.
This is a microscopic example, and one should never believe in them
too much.  However, the general effect is to add at least 1 cycle
for every FP variable load/store, just by the difference in interface.
Having stood on their heads to get things like 2-cycle adds, MIPS designers
would roll in their graves before adding a cycle to each variable load/store!
(Of course, AMD and we aim at somewhat different markets, and each company
has its own tradeoffs, so this tradeoff is not inherently evil, and there
are certainly classes of programs where this is less of a problem.  Still...)
If I've misunderstood the 29K interface, somebody (Tim?) correct me.

>(P.S. Question to MIPS: I think you only need to back up one extra instruction
>on an exception if that instruction is a *taken* branch.  Do you do it this
>way, or on all branches?)

I'm not sure what you mean.
When there is an exception, the Exception Program Counter EPC points at
the instruction that should be resumed to restart execution.  If the
exception occurred in the branch-delay slot, the cause register's
Branch Delay bit is set, so that the OS can analyze those cases where
the exception is caused by an instruction in the BD-slot, and where you
want to do something different, like emulating an FP operation on a system
that has no FP.

there is normally no difference in handling taken or untaken branches.
the only place you'd need to figure that out is in the case where you
want to go back to the user and skip the instruction in the delay slot.
then, you figure out whether or not the branch is being taken, and either
take it, or not.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

w-colinp@microsoft.UUCP (Colin Plumb) (12/01/88)

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Actually, it doesn't work this way: the mult instruction picks up the values
>from the regular registers, i.e., one instruction both fetches the data and
>initiates the instruction.  The only "extra" instruction is the one that
>moves the result back.

I stand corrected.  Still, if I use the 29027 simply as an integer multiplier,
I can do exactly the same thing.

>Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>I don't understand the 29K FPU interface well enough.  here's a small
>example (*always be wary of small examples; I don't have time right now
>for anything more substantive):
>main() {
>	double x, y, z;
>	x = y + z;
>}

This is a *bit* heavy on the loads and stores - register allocation, anyone?

While people do add up vectors (which is 1 access per add, not 3), the
heaviest thing I can think of that's popular is dot products.  Now if
I'm allowed to put the 29027 into pipeline mode, it can eat 4 words of
data every 3 clocks, which is gonna tax anybody's memory system, but
I'll factor time to do the f.p. op out of the comparison.

You need to get 4 words per step of the dot product from memory to the
floating point chip.  That's 4 loads, either way, and an additional 2
stores for the 29027.  That's only 50% overhead, assuming all cache hits.
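
As a concrete sketch of the traffic being counted, here is an ordinary C
dot-product loop (not 29027 code) with a side count of the per-step bus
transfers: 4 word-loads per step either way, plus 2 extra 64-bit operand
stores if the FPU is reached through memory, giving the 50% figure.

```c
#include <assert.h>
#include <stddef.h>

struct traffic { long loads, extra_stores; };

/* Plain double-precision dot product, counting the memory traffic
 * argued about above as it goes. */
static double dot(const double *a, const double *b, size_t n,
                  struct traffic *t)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
        t->loads += 4;          /* two doubles = four 32-bit words    */
        t->extra_stores += 2;   /* two 64-bit writes to the 29027,    */
                                /* each moving a full operand at once */
    }
    return sum;
}
```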

On the 29027, I can just leave the multiply-accumulate opcode in the
instruction register and keep feeding it operands, while on the MIPS
chip I have to issue add.d and mul.d, which is an extra 2 cycles penalty
it pays.

But sigh, I'm losing track of the point of the argument.  If I try and
code up a dot-product loop, it gets worse, as I can unroll the pointer
incrementing on the MIPS chip by using the base-plus-offset addressing
mode, which the 29000 doesn't have.  Of course, if it's a small array
(up to 30 64-bit words each, or so), I'll just keep the whole thing in
registers on the 29000...

>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
>This is a microscopic example, and one should never believe in them
>too much.  However, the general effect is to add at least 1 cycle
>for every FP variable load/store, just by the difference in interface.
>Having stood on their heads to get things like 2-cycle adds, MIPS designers
>would roll in their graves before adding a cycle to each variable load/store!

Actually, that's half a cycle to loads and a cycle to stores.  But you're
right, this is a silly sort of thing to benchmark.  How about some
figures on the frequencies of the various ops in floating point programs,
O great benchmark oracle? :-)  Just how often do I need to load and store?

And don't forget that on the 29000, if it's just a local variable, I
store it in the register file/stack cache (guaranteed one cycle) and
the actual memory move may be obviated entirely.

Even without all this logic, I think I can safely say that for vector
operations, the memory->fpu->memory speed is essential, thus all the
tricky things they do in Crays, avoiding the two steps of mem->reg
and reg->fpu.  For all-register work, like Mandelbrot kernels, it
doesn't matter.  And in between, I dunno.  I still think it doesn't
hurt *that* bad.  What's the silicon cost for the coprocessor interface
on the R2000/R3000?

>>(P.S. Question to MIPS: I think you only need to back up one extra instruction
>>on an exception if that instruction is a *taken* branch.  Do you do it this
>>way, or on all branches?)
>
>I'm not sure what you mean.

Quote from the R2000 architecture manual:
[The Exception Program Counter register]
This register contains the virtual address of the instruction that caused
the exception.  When that instruction resides in a branch delay slot, the
EPC register contains the virtual address of the immediately preceding
Branch or Jump instruction.

What I'm wondering is, in the instruction sequence

	foo
	bar
	jump (untaken)
	baz
	quux

where the jump is not taken, is "baz" considered to be in the jump's delay
slot?  I.e. if baz faults, will the EPC point to it, or to the jump.
Of course, if the jump *is* taken, then EPC will point to the jump, but
I'm not sure if a "branch delay slot" is the instruction after a change-
flow-of-control instruction, or a change in the flow of control.  I.e.
is the labelling static or dynamic?  If dynamic, an instruction emulator
wouldn't have to recompute the condition; it would know that the branch
should be taken to find the correct return address.

(It's late... er, early.  Sorry if this could use a little polishing.)
-- 
	-Colin (microsof!w-colinp@sun.com)

tim@crackle.amd.com (Tim Olson) (12/02/88)

In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Here's an example.  Maybe this is close in speed, or maybe not, or maybe
| I don't understand the 29K FPU interface well enough.  Here's a small
| example (*always be wary of small examples; I don't have time right now
| for anything more substantive):
| main() {
| 	double x, y, z;
| 	x = y + z;
| }
| 
| this generates (for an R3000), and assuming data & instrs in-cache:
|  #   2		double x, y, z;
|  #   3		x = y + z;
| 	l.d	$f4, 8($sp)
| 	l.d	$f6, 0($sp)
| 	add.d	$f8, $f4, $f6
| 	s.d	$f8, 16($sp)
| The l.d and s.d actually turn into pairs of loads & stores, and this sequence
| takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away), 2 cycles for
| the add.d, and 2 swc1's.
| Assume a 29K with cache, so it has loads like the R3000's.
| As far as I can tell, a 29K would use 17:
| 4 load cycles (best case: in some cases, offset calculations, or use of
| 	load multiple with count-setup would take another one or two)
| 2 writes (get the data over to the FPU)
| 1 write (start the add-double)
| 6 cycles (do the add.double)
| 2 reads (get the 64-bit result back)
| 2 stores (put the result back in memory; again assume no offset calculations)

No, local doubles are kept in the Am29000 register file, so no
loads/stores will occur to the memory stack.  The Am29000 has two
methods of generating floating-point code, either emitting floating
point instructions (which trap in the current Am29000 implementation) or
emitting inline '027 code directly.  The fp instruction code for:

double
g(double x, double y)
{
	return x+y;
}

(essentially the same as your test case, but I had to revise it to make
it emit *any* code) is:

	jmpi	lr0
	dadd	gr96,lr4,lr2

(i.e. return, adding the incoming parameters (lr2-3, lr4-5) into the
return-result registers (gr96,gr97) in the delay slot of the return.)

The in-line '027 code for this looks like:

	const	gr96,1	; (0x1)
	consth	gr96,65536	; (0x10000)
	store	1,38,gr96,gr96
	store	1,32,lr4,lr5
	store	1,97,lr2,lr3
	load	1,1,gr97,gr96
	load	1,0,gr96,gr96

The first two instructions "build" the '027 instruction that is to be
performed (in this case, a DADD).  The first store stores that
instruction to the '027 coprocessor.  The second store transfers the 'y'
parameter to the coprocessor in a single cycle, and the third store
transfers the 'x' parameter, as well as starts the coprocessor
operation.  The Am29000 then stalls on the load of the result lsb's, (5
cycles) then grabs the msb's and returns.  This takes 12 cycles total,
counting the building of the add instruction (which would be cached in a
local register if it were to be used again). 
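
Re-deriving that count from the listing: seven single-cycle instructions
plus the five stall cycles on the first result load (step costs as stated
in the post):

```c
#include <assert.h>

/* const + consth build the '027 instruction (skipped if it is cached
 * in a local register); then 3 stores (instruction, y, x -- the last
 * starts the op), 2 result loads, and 5 stall cycles on the first
 * result load. */
static int am29027_dadd_cycles(int insn_cached)
{
    int build  = insn_cached ? 0 : 2;
    int stores = 3;
    int loads  = 2;
    int stall  = 5;
    return build + stores + loads + stall;
}
```

This also reproduces the 2-cycle saving claimed below for reusing the
'027 instruction.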

| * instead of +, it would be 6 versus 5.  Still, as it stands, making worst
| case assumptions about the R3000 (that the nop gets left in), and some
| best-case assumptions about the 29K, you get a 9:17 ratio for this case,
| a 12:17 for y*z;  if you do the single-precision case with the same assumptions,
| you get a 6:12 ratio for x+y, and 8:12 for x*y.
| So, we get:
| 	DP +	DP *	SP +	SP *
| R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
| 29K	17	17	12	12

Nope, it is:

	DP +	DP *	SP +	SP *
  R3000 9	12	6	8
  29K	12	12	9	9

This again assumes that the '027 instruction is not reused (which it
would be in "real" code).  If it were reused, the counts would drop by 2
cycles.

| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.

Agreed.

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

mash@mips.COM (John Mashey) (12/02/88)

In article <1054@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
.....
>But sigh, I'm losing track of the point of the argument.  ....
>...  Of course, if it's a small array
>(up to 30 64-bit words each, or so), I'll just keep the whole thing in
>registers on the 29000...
It would be very interesting to see whether realistic high-level-language
programs end up allocating floating-point arrays in the registers...
especially given FORTRAN call-by-reference.....

>>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.

>Actually, that's half a cycle to loads and a cycle to stores.  But you're
>right, this is a silly sort of thing to benchmark.  How about some
>figures on the frequencies of the various ops in floating point programs,
>O great benchmark oracle? :-)  Just how often do I need to load and store?
Here's a few numbers, real quick, % instructions that are FP load/store:
		load	store
spice		15.2%	8.8%		(scalar)
doduc		26.1%	8%		(scalar)
linpack, 64bit	34.5%	18.6%		(vector)
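
One way to read these numbers: on an idealized one-cycle-per-instruction
machine, adding a fixed cost to each FP load and store inflates total cycles
by roughly the weighted sum of the two columns.  This back-of-the-envelope
model is an illustration implied by, not stated in, the post; the half-cycle
load cost is Colin's figure from his earlier message.

```c
#include <assert.h>
#include <math.h>

/* Slowdown factor for a base CPI of 1, given the fraction of
 * instructions that are FP loads/stores and the extra cycles each
 * kind of access pays for the memory-mapped interface. */
static double slowdown(double load_pct, double store_pct,
                       double load_cost, double store_cost)
{
    return 1.0 + (load_pct * load_cost + store_pct * store_cost) / 100.0;
}
```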

>And don't forget that on the 29000, if it's just a local variable, I
>store it in the register file/stack cache (guaranteed one cycle) and
>the actual memory move may be obviated entirely.
>
>Even without all this logic, I think I can safely say that for vector
>operations, the memory->fpu->memory speed is essential, thus all the
>tricky things they do in Crays, avoiding the two steps of mem->reg
>and reg->fpu.  For all-register work, like Mandelbrot kernels, it
>doesn't matter.  And in between, I dunno.  I still think it doesn't
>hurt *that* bad.  What's the silicon cost for the coprocessor interface
>on the R2000/R3000?

There are algorithms where FP values stick in the registers.  Many
very scalar real programs do many loads&stores that simply will not
go away with zillions of on-chip registers [unless they're stack cache
like CRISP's, where the registers have addresses just like memory].  Even
then, it appears that typical allocatable-on-the-stack arrays blow
away any reasonable on-chip register caches, for a while.

>Quote from the R2000 architecture manual:
>[The Exception Program Counter register]
>This register contains the virtual address of the instruction that caused
>the exception.  When that instruction resides in a branch delay slot, the
>EPC register contains the virtual address of the immediately preceding
>Branch or Jump instruction.
>
>What I'm wondering is, in the instruction sequence
>
>	foo
>	bar
>	jump (untaken)
>	baz
>	quux
>
>where the jump is not taken, is "baz" considered to be in the jump's delay
>slot?  I.e. if baz faults, will the EPC point to it, or to the jump.
>Of course, if the jump *is* taken, then EPC will point to the jump, but
>I'm not sure if a "branch delay slot" is the instruction after a change-
>flow-of-control instruction, or a change in the flow of control.  I.e.
>is the labelling static or dynamic?  If dynamic, an instruction emulator
>wouldn't have to recompute the condition; it would know that the branch
>should be taken to find the correct return address.

It's static, i.e., it's irrelevant whether jump is taken or not.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (12/02/88)

In article <23656@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes:
>In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>| Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>| I don't understand the 29K FPU interface well enough.  Here's a small
>| example (*always be wary of small examples; I don't have time right now
>| for anything more substantive):
>| main() {
>| 	double x, y, z;
>| 	x = y + z;
>| }

>No, local doubles are kept in the Am29000 register file, so no
>loads/stores will occur to the memory stack.  The Am29000 has two
Sorry, I was using a simple example, without optimization, to create
code typical of references to variables in memory.  The optimizer 
of course erases this code.
>methods of generating floating-point code, either emitting floating
>point instructions (which trap in the current Am29000 implementation) or
>emitting inline '027 code directly.  The fp instruction code for:

>double
>g(double x, double y)
>{
>	return x+y;
>}
>
>(essentially the same as your test case, but I had to revise it to make
>it emit *any* code) is:
>
>	jmpi	lr0
>	dadd	gr96,lr4,lr2

As I understand it, this is a trap to a routine that ends up issuing
something like the code in your next example....
The R3000 code for this routine is of course:
  [y.c:   3] 0x0:	03e00008	jr	ra
  [y.c:   3] 0x4:	462e6000	add.d	f0,f12,f14


>The in-line '027 code for this looks like:
>
>	const	gr96,1	; (0x1)
>	consth	gr96,65536	; (0x10000)
>	store	1,38,gr96,gr96
>	store	1,32,lr4,lr5
>	store	1,97,lr2,lr3
>	load	1,1,gr97,gr96
>	load	1,0,gr96,gr96

>Nope, it is:
>
>	DP +	DP *	SP +	SP *
>  R3000 9	12	6	8
>  29K	12	12	9	9
>
>This again assumes that the '027 instruction is not reused (which it
>would be in "real" code).  If it were reused, the counts would drop by 2
>cycles.
Given the same assumptions (variables already in registers,
and we have enough of them to make that happen occasionally also),
but giving 29K benefit of the doubt on the constants, we end up with:
	DP +	DP *	SP +	SP *
  R3000 2	5	2	4
  29K	10	10	7	7
This, of course, is one of the most favorable-to-MIPS comparisons,
which is why I didn't use it in the first case.  Needless to say,
29K code generators will want to allocate FP variables to the FP unit
whenever possible, to avoid this kind of hit.

>| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.
>Agreed.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@crackle.amd.com (Tim Olson) (12/03/88)

In article <9139@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Given the same assumptions (variables already in registers,
| and we have enough of them to make that happen occasionally also),
| but giving 29K benefit of the doubt on the constants, we end up with:
| 	DP +	DP *	SP +	SP *
|   R3000 2	5	2	4
|   29K	10	10	7	7
| This, of course, is one of the most favorable-to-MIPS comparisons,
| which is why I didn't use it in the first case.  Needless to say,
| 29K code generators will want to allocate FP variables to the FP unit
| whenever possible, to avoid this kind of hit.

Of course.  I was trying to make a fair comparison.  Our local scalar
stack *is* in the local register file, which is what we were comparing. 
If you want to compare operations where the operands are already in the
FP registers, then:

 	DP +	DP *	SP +	SP *
   R3000 2	5	2	4
   29K	 5	5	5	5

(Assuming non-pipelined '027 operation).

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)