[comp.arch] m88000 benchmarks

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (06/14/88)

We don't have much benchmarking info yet about the Motorola 88000.
However, the May issue of "VLSI Systems Design" contains a pipeline timing
chart for an FFT inner loop.

The (compiler generated) code does 4 loads, 4 stores, 10 single precision
float calculations, and 4 other things, in 27 clocks. At 20 MHz, that's
7.4 MFLOPS.

A 16KB CMMU can hold 4K floats, but they all have to be faulted in.  A
recent post suggested counting 10 clocks per 16 byte fault. That's 2.5
clocks per float, but since a large FFT visits each data point several
times (say, 11) we can amortize the startup cost to about 1 clock per
inner loop.
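
As a quick check, the arithmetic works out as below.  This is a throwaway
C fragment; every number in it is one of the estimates quoted above, not a
measurement:

	#include <stdio.h>

	int main(void)
	{
	    double clock_mhz    = 20.0;  /* 20 MHz 88100                      */
	    double flops        = 10.0;  /* SP float ops per inner loop       */
	    double clocks       = 27.0;  /* clocks per inner loop (per chart) */
	    double fault_clocks = 10.0;  /* clocks per 16-byte line fault     */
	    double floats_line  =  4.0;  /* floats per 16-byte line           */
	    double visits       = 11.0;  /* passes over the data              */
	    double loads        =  4.0;  /* loads per inner loop              */
	    double extra;

	    /* raw rate: 10 flops in 27 clocks at 20 MHz */
	    printf("raw: %.1f MFLOPS\n", flops / clocks * clock_mhz);

	    /* 2.5 fault clocks per float, amortized over 11 visits, times
	       4 floats touched per loop: about 1 extra clock per loop */
	    extra = (fault_clocks / floats_line) / visits * loads;
	    printf("with faults: %.1f MFLOPS\n",
	           flops / (clocks + extra) * clock_mhz);
	    return 0;
	}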

So, "about 7 MFLOPS on an FFT benchmark" seems fair.
-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

"Imitation is not the sincerest form of flattery. Payments are."
- a British artist who died penniless before copyright law.

pajari@grads.cs.ubc.ca (George Pajari) (06/17/88)

In message <1941@pt.cs.cmu.edu>, lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
> We don't have much benchmarking info yet about the Motorola 88000....
> The (compiler generated) code does 4 loads, 4 stores, 10 single precision
> float calculations, and 4 other things, in 27 clocks. At 20 MHz, that's
> 7.4 MFLOPS.

Don't believe it!

Although the benchmark shows both the C code and the assembler, the
assembler is NOT compiler generated, though you are supposed to draw
that conclusion!  (P.S. it's 29 clocks, not 27.)

When pressed at a local presentation on the M88000, Motorola reps conceded
that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
RISC instructions ordered to take maximal advantage of the pipelining in
the 88100.  Furthermore, they admitted that the best code by any compiler
resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.

It would seem more reasonable to call it about 3.5 MFLOPS.  (Impressive
nonetheless.)

Aren't MIPS/MFLOPS fun?  (And we thought Intel had the record for most cooked
benchmark figures in marketing literature.)


George Pajari
sometime grad student

(MIPS = Meaningless Indicator of Processor Speed)

nather@ut-sally.UUCP (Ed Nather) (06/17/88)

In article <3208@ubc-cs.UUCP>, pajari@grads.cs.ubc.ca (George Pajari) writes:
> 
> When pressed at a local presentation on the M88000, Motorola reps conceded
> that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
> RISC instructions ordered to take maximal advantage of the pipelining in
> the 88100.  Furthermore, they admitted that the best code by any compiler
> resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.
> 

Gosh!  Does this mean that careful hand-coding can yield a factor of 2
faster code even on a RISC machine?  I know it means that on a CISC machine
because I've done better than that myself. (Current record: a factor of 22
on a Nova computer, me vs. Fortran, with the sieve benchmark).

But where does this leave all the CS statements that assembly-language
coding is no longer needed?  Are factors of 2 really unimportant?
As I understand it, one of the major arguments for RISC architecture is
that "...only those instructions that a compiler can easily generate are
included in the instruction set."  So the compiler-builders stack the
deck and still fall a factor of 2 short.

Before you reach for the "r" key to send flames, be aware that I am guilty
of once writing a compiler, long ago.  But I wrote it in ASM.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU

bzs@bu-cs.BU.EDU (Barry Shein) (06/17/88)

>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>faster code even on a RISC machine?  I know it means that on a CISC machine
>because I've done better than that myself. (Current record: a factor of 22
>on a Nova computer, me vs. Fortran, with the sieve benchmark).
>
>But where does this leave all the CS statements that assembly-language
>coding is no longer needed?  Are factors of 2 really unimportant?
>-- 
>Ed Nather
>Astronomy Dept, U of Texas @ Austin

Um, Ed, before you jump (or is that JRST) to all sorts of conclusions,
note that the 88000 is a very new architecture.  I doubt they have the
compilers up to snuff yet, nor would I be disappointed if they were still
working on them.

It might be interesting to see what the case is on RISC systems with
more mature compilers (which generally means "working for a year
now".)

	-Barry Shein, Boston University

jgk@speech2.cs.cmu.edu (Joe Keane) (06/18/88)

All this tells me is that we badly need better compilers.  Sad to say,
most Unix computers are still using some variant of pcc (don't get me
wrong, pcc was a good idea, but that was a long time ago).  I think
gcc is starting to look good; it's nice to recompile a program and get
a 20% across-the-board speedup.  Some of the new architectures out
there are really going to put a strain on the compiler writers, more
power to them.

And no one should be ashamed of hand-coding a hundred lines of
assembly.  There's always that inner loop that really needs to be
fast.

--Joe

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (06/18/88)

In article <3208@ubc-cs.UUCP> pajari@grads.cs.ubc.ca (George Pajari) writes:
>When pressed at a local presentation on the M88000, Motorola reps conceded
>that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
>RISC instructions ordered to take maximal advantage of the pipelining in
>the 88100.  Furthermore, they admitted that the best code by any compiler
>resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.

The article says that the code was "compiled with the scheduler algorithm
(release 1.0)". They also say that the same code, executed without
overlaps, requires 53 clocks. 

If your rep is correct, then it's likely that the article was submitted for
publication some months ago. The authors were probably just optimistic
about how quickly everything would come together.

Writing a "code scheduler" for a compiler is an interesting task. The 88000
has several nice properties in this regard; only branches have data
dependent timing; and only load/store can take cache hits.  Plus,
imperfections in the scheduling algorithm don't lead to incorrect code.
(The scoreboard ensures that you don't, e.g., read a register before incoming
data has arrived in it.)

Viewed from a distance, an 88000 scheduler looks quite simple: the "machine
model" isn't too complex, and the interface (machine code in, machine code
out) is straightforward. Viewed close up, I'm sure that the implementers
have hit some kind of grief: a problem with register lifetimes and live
sets; or problems with labels and directives; or just suboptimal results
on the biggest customer's favorite benchmark. Or, more likely, everybody
was too busy fixing bugs in some other component.

I'd like to hear from Tadpole Technology about what they did that gave such
a goose to the Dhrystone figures.


-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

"Imitation is not the sincerest form of flattery. Payments are."
- a British artist who died penniless before copyright law.

mash@mips.COM (John Mashey) (06/18/88)

In article <23406@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:

>>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>>faster code even on a RISC machine?  I know it means that on a CISC machine
>>because I've done better than that myself. (Current record: a factor of 22
>>on a Nova computer, me vs. Fortran, with the sieve benchmark).

1) Remember that by now we have a whole collection of things labeled
RISC machines, whose architectures varied quite a lot.

2) At least one flavor of RISC design style [i.e., as practiced by,
at least, IBM, HP, and MIPS] selects instructions and features as seen
in their effects upon compiled code, using optimizing compiler statistics
to drive the design, i.e., the COMPILERS EXIST (at least substantially)
BEFORE THE INSTRUCTION SET ARCHITECTURE.

>>But where does this leave all the CS statements that assembly-language
>>coding is no longer needed?  Are factors of 2 really unimportant?
>
>Um, Ed, before you jump (or is that JRST) to all sorts of conclusions,
>note that the 88000 is a very new architecture.  I doubt they have the
>compilers up to snuff yet, nor would I be disappointed if they were still
>working on them.

>It might be interesting to see what the case is on RISC systems with
>more mature compilers (which generally means "working for a year
>now".)
>
>	-Barry Shein, Boston University

3) Just because something is labeled a RISC doesn't mean that even a
good compiler can always do almost as well as assembler; it depends on:
	a) Was the instruction set REALLY designed from compilation
	statistics?
	b) What is the current state of the compilers? (see Barry's note)
	c) Who's writing the assembler code?
For example, if a) is true, and the compilers are good, then c) matters a
little less, i.e., the opportunities for usefully writing assembler
go down.  As an example, now that we have experience of several years
in running systems:
	a) We wrote low-level interrupt code, block-copy, machine-dependent
	code (like tlb-flushing), and floating-point-emulation code
	in assembler.
	b) We wrote heavily-tuned math libraries in assembler.
	c) We wrote a couple of the C string routines in assembler.
	(actually, at one point, I wrote ALL of the string routines in
	assembler, but the compiled C code was close enough that I threw
	away most of the assembler routines in disgust at the time I'd wasted.)
As another example, in our case FORTRAN Linpack and Coded Linpack are usually
within 10%.

There is almost always some use for assembler.  From past experience, I'd
say that compiler-driven RISC designs + good compilers reduce the
temptation considerably.

Since this started with 88000 numbers, does anybody have any Double
Precision 88000 numbers (like FORTRAN Linpack, or DP Doduc, or Spice)?
The only DP number I've seen is DP Whetstones @ around 3Mwhets.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

smryan@garth.UUCP (Steven Ryan) (06/19/88)

>Viewed from a distance, an 88000 scheduler looks quite simple: the "machine
>model" isn't too complex, and the interface (machine code in, machine code
>out) is straightforward. Viewed close up, I'm sure that the implementers
>have hit some kind of grief: a problem with register lifetimes and live
>sets; or problems with labels and directives; or just suboptimal results
>on the biggest customer's favorite benchmark.

The problem with writing a scheduler is finding an efficient (read: heuristic)
method of traversing the code graph. The simplest method produces the best code
at exponential compilation time: try every possible sequence. The grief is
trying to find a linear to quadratic time method which still gives a good
sequence.
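
A toy sketch of that compromise in C: greedy "list scheduling" over a
dependence DAG, roughly quadratic in the block size.  The block, the
latencies, the dep[][] matrix and the longest-latency-first tie-break are
all invented for illustration; this is nobody's production scheduler.

	#include <stdio.h>

	#define N 6

	int latency[N] = {3, 1, 5, 5, 1, 1};  /* e.g. ld, addu, fadd, ...   */
	int dep[N][N];                 /* dep[i][j] != 0: j must follow i   */
	int ready_at[N];               /* earliest clock j may issue        */
	int done[N];

	int main(void)
	{
	    int cycle = 0, issued = 0, i, j, best;

	    dep[0][2] = dep[1][2] = dep[2][3] = dep[2][4] = dep[3][5] = 1;

	    while (issued < N) {
	        best = -1;
	        for (j = 0; j < N; j++) {        /* find a ready instruction */
	            int ok = !done[j] && ready_at[j] <= cycle;
	            for (i = 0; ok && i < N; i++)
	                if (dep[i][j] && !done[i])
	                    ok = 0;              /* a producer not issued    */
	            if (ok && (best < 0 || latency[j] > latency[best]))
	                best = j;                /* prefer long-latency ops  */
	        }
	        if (best >= 0) {                 /* issue it, delay its users */
	            printf("clock %2d: instruction %d\n", cycle, best);
	            done[best] = 1;
	            issued++;
	            for (j = 0; j < N; j++)
	                if (dep[best][j] && ready_at[j] < cycle + latency[best])
	                    ready_at[j] = cycle + latency[best];
	        }
	        cycle++;                         /* single issue per clock   */
	    }
	    return 0;
	}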

We humans seem to have an innate ability to find the best graph traversal
(or perhaps we are not overwhelmed until the graph becomes large), so human
experts will continue to outcode compilers for the near future.

                                             sm ryan

            Nature is immaculately fair and implacably cruel.
                                 -- Sara James

aglew@urbsdc.Urbana.Gould.COM (06/20/88)

>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>faster code even on a RISC machine?  I know it means that on a CISC machine
>because I've done better than that myself. (Current record: a factor of 22
>on a Nova computer, me vs. Fortran, with the sieve benchmark).

My personal record is a 3-fold improvement -- translating assembly code
into C, and then improving the algorithm. It goes either way.
   I've done better going from C into assembly, but I'm most proud
of the speedups I've obtained going from assembly into C. I somehow
think that that code is more likely to be running 2 processor 
families from now than the assembly.

aglew@urbsdc.Urbana.Gould.COM (06/20/88)

>The problem with writing a scheduler is finding an efficient (read: heuristic)
>method of traversing the code graph. The simplest method produces the best code
>at exponential compilation time: try every possible sequence. The grief is
>trying to find a linear to quadratic time method which still gives a good
>sequence.

Do any compilers contain the simple, brute-force, exponential algorithm,
to be turned on when you are willing to spend the time, or when the graph
for a procedure is small enough?
    There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
searcher. It might be nice to have that in your toolbox...
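
A sketch of what such a brute-force switch might do: enumerate every
ordering of a tiny block that respects a dependence DAG, simulate each one,
and keep the shortest.  The block, latencies and DAG below are invented;
the point is only that the search space grows like n!, so it is tolerable
only for very small blocks.

	#include <stdio.h>

	#define N 5

	static int latency[N] = {3, 1, 5, 1, 5};
	static int dep[N][N];               /* dep[i][j]: j must follow i  */
	static int order[N], used[N], best = 1 << 30;

	static int schedule_length(void)    /* single issue, stall on deps */
	{
	    int finish[N], cycle = 0, end = 0, i, k;

	    for (k = 0; k < N; k++) {
	        int start = cycle;
	        for (i = 0; i < k; i++)     /* wait for producers' results */
	            if (dep[order[i]][order[k]] && finish[i] > start)
	                start = finish[i];
	        finish[k] = start + latency[order[k]];
	        if (finish[k] > end)
	            end = finish[k];
	        cycle = start + 1;          /* next issue slot             */
	    }
	    return end;
	}

	static void search(int k)           /* try every legal ordering    */
	{
	    int i, j;

	    if (k == N) {
	        int t = schedule_length();
	        if (t < best)
	            best = t;
	        return;
	    }
	    for (j = 0; j < N; j++) {
	        int ok = !used[j];
	        for (i = 0; ok && i < N; i++)
	            if (dep[i][j] && !used[i])
	                ok = 0;             /* a predecessor not placed yet */
	        if (!ok)
	            continue;
	        used[j] = 1;
	        order[k] = j;
	        search(k + 1);
	        used[j] = 0;
	    }
	}

	int main(void)
	{
	    dep[0][2] = dep[1][2] = dep[2][4] = dep[3][4] = 1;
	    search(0);
	    printf("best schedule: %d clocks\n", best);
	    return 0;
	}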

smryan@garth.UUCP (Steven Ryan) (06/22/88)

In article <28200167@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>Do any compilers contain the simple, brute-force, exponential algorithm,
>to be turned on when you are willing to spend the time, or when the graph
>for a procedure is small enough?
>    There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
>searcher. It might be nice to have that in your toolbox...

I remember something like that--the running time was something like a week
on a 10-line program, and even then it had to be hand-checked.

It would be interesting to combine profilers and optimisers to locate that
one innermost loop and then exhaustively enumerate the possible paths. If
the loop is small and this is only done once or twice in a large program,
the exponential time could be subsumed by the rest of the compilation.

jesup@cbmvax.UUCP (Randell Jesup) (06/23/88)

In article <28200167@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>>The problem with writing a scheduler is finding an efficient (read: heuristic)
>>method of traversing the code graph. The simplest method produces the best code
>>at exponential compilation time: try every possible sequence. The grief is
>>trying to find a linear to quadratic time method which still gives a good
>>sequence.

>Do any compilers contain the simple, brute-force, exponential algorithm,
>to be turned on when you are willing to spend the time, or when the graph
>for a procedure is small enough?
>    There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
>searcher. It might be nice to have that in your toolbox...

	It certainly wouldn't be hard.  You'd want to limit it to simple
graphs for obvious reasons, since it can get out of control pretty fast.
However, in a properly designed version, the only limit would be how much
VM you have, and how long you're willing to wait.  You only need this sort
of optimization for final production code, anyway.

	I think (if I remember correctly) that the Gross et al. methods
run in O(n^4) time, which is a lot better than the brute force method, at
least for blocks bigger than 10 or 15 instructions.  However, there are a
LOT of small (< 20) instruction blocks for which brute force can make sense.

	However, for the really big blocks you see sometimes (especially
from certain types of numerical/FORTRAN/bad/bizarre code), even O(n^4) is
painful.  There are faster (O(n^2)) methods that produce less optimal code,
but are LOTS faster.  Or you might try breaking the large blocks artificially
into blocks of some maximum size, so n doesn't get so big.

Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup

chris@mimsy.UUCP (Chris Torek) (06/28/88)

In article <754@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>The problem with writing a scheduler is finding an efficient (read: heuristic)
>method of traversing the code graph. The simplest method produces the best code
>at exponential compilation time: try every possible sequence. ...

>We humans seem to have innate ability to find the best graph traversal (or
>perhaps we are not overwhelmed until the graph becomes large), so human
>experts will continue outcode compilers for the near future.

A recent (I think) ASPLOS has a paper on something called
`Superoptimizer'.  This is a program that uses exponential time to find
the best 68000 or 80x86 code sequence to replace some other sequence.
It is not suitable for more than about 13 instructions at a time, but
the sequences it produces are better than those that humans usually
derive.  It came up with a four-68000-instruction `signum' function,
and its versions of min and max (and minmax together) are also quite
surprising.  None of them have any branches.
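
For reference, the generic branchless idea can be written in C as below.
This is not the Superoptimizer's 68000 sequence, just the effect it
achieves; a decent compiler can turn the comparisons into set-on-condition
instructions rather than branches.

	int signum(int x)
	{
	    return (x > 0) - (x < 0);    /* -1, 0 or +1, with no branches */
	}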
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

dik@cwi.nl (Dik T. Winter) (06/28/88)

In article <12171@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
 >          It came up with a four-68000-instruction `signum' function,
 > and its versions of min and max (and minmax together) are also quite
 > surprising.  None of them have any branches.
(What is signum?)
Not so surprising for min and max (and minmax).  However, they require a
sign extending shift, so that implementation is not possible on all
machines.  Also, if a branch takes only one cycle (with delay slot),
you do not gain anything (in general).
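
(In C the shift trick reads roughly as below.  It assumes 32-bit
two's-complement ints and an arithmetic right shift of negative values,
which the language does not guarantee, and the subtraction can overflow
when the operands have opposite signs and large magnitudes.)

	int min(int a, int b)
	{
	    int d = a - b;
	    return b + (d & (d >> 31));   /* mask is all ones iff a < b */
	}

	int max(int a, int b)
	{
	    int d = a - b;
	    return a - (d & (d >> 31));
	}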
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

aglew@urbsdc.Urbana.Gould.COM (06/29/88)

..> [Talking (I assume - the original note passed me by) about superoptimization]:
>Not so surprising for min and max (and minmax).  However, they require a
>sign extending shift, so that implementation is not possible on all
>machines.  Also, if a branch takes only one cycle (with delay slot),
		  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>you do not gain anything (in general).
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>dik t. winter, cwi, amsterdam, nederland

You don't gain anything by avoiding branches if branches are cheap
- until the next generation of your machine comes by, more deeply
pipelined, with more expensive branches.

I've talked with representatives of several RISC manufacturers about
"the next generation" - I think MIPS was one of them, John Mashey can correct
me, and I know SPARC was. I keep asking "how many delay slots are you going
to need on your next generation". They say 2 - but to be backward compatible
they'll interlock, with the time delay for the interlock circuitry masked
by the delay slot. 

wca@oakhill.UUCP (william anderson) (07/01/88)

In messages <1941@pt.cs.cmu.edu> and <3208@ubc-cs.UUCP> Donald Lindsay
and George Pajari discussed a particular benchmark for the M88000, the
inner loop of an FFT calculation.  In this article, we examine the
output from two compilers for this code, and present the results of
instruction rescheduling on the assembly code produced by these
compilers.  We also include a hand-coded version of the code which is
nearly optimal for the given task.

INTRODUCTION

The subprogram which is considered here is a C language version of a fast
Fourier transform subroutine, to wit:

	fft(xr,xi,n,m,wr,wi)
		float xr[],xi[],wr[],wi[];
		unsigned int n,m;
	{

		unsigned int n1,n2,ia,ie,i,j,k,l;
		float c,s,xrt,xit;

		n2 = n;
		ie = 1;
		for (k=0 ; k<m ; k++)
		{
			n1 = n2;
			n2 = n2>>1;
			ia = 0;
			for (j=0 ; j<n2 ; j++)
			{
				c = wr[ia];
				s = wi[ia];
				ia = ia+ie;

				/* BEGIN INNER LOOP */

				for (i=0 ; i<n ; i+=n1)
				{
					l = i+n2;
					xrt = xr[i] - xr[l];
					xr[i] = xr[i] + xr[l];
					xit = xi[i] - xi[l];
					xi[i] = xi[i] + xi[l];
					xr[l] = c*xrt - s*xit;
					xi[l] = c*xit + s*xrt;
				}

				/* END INNER LOOP */

			}
			ie = ie<<1;
		}
	}

In particular, we will focus on the inner loop code which is bracketed by
appropriate comments in the source above.

We will examine the assembly output for this inner loop from two compilers,
a commercial C compiler (which we will designate "compiler A") and a public-
domain C compiler ("compiler B"), both of which  currently exist for the
M88000 architecture.

Finally, we will look at a hand-coded inner loop and discuss some of the
instruction timing for the M88000.

COMPILER A

The code from compiler A for the inner loop is 39 instructions and, as
scheduled by compiler A, requires 75 clocks to execute.  Although
compiler A is supposed to have a Motorola-developed instruction
scheduling strategy incorporated in it, we can filter the above code
through the Motorola instruction scheduler to get a code sequence which
executes in 64 clocks.  Both code sequences are given below:

Code scheduled by compiler A		Code scheduled by Motorola scheduler
	(75 clocks)				(64 clocks)

@L18:                                   @L18:
        mak             r5,r23,2                ld              r8,r25,0
        addu            r10,r5,r17              mak             r5,r23,2
        st              r10,r31,92              addu            r10,r5,r17
        ld              r9,r10,0                ld              r9,r10,0
        ld              r8,r25,0                st              r10,r31,92
        fsub.sss        r10,r8,r9               ld              r12,r31,92
        ld              r12,r31,92              fsub.sss        r10,r8,r9
        st              r10,r31,80              ld              r6,r12,0
        ld              r6,r12,0                st              r10,r31,80
        addu            r10,r5,r16              addu            r10,r5,r16
        fadd.sss        r7,r8,r6                st              r10,r31,88
        st              r10,r31,88              fadd.sss        r7,r8,r6
        st              r7,r25,0                st              r7,r25,0
        ld              r4,r24,0                ld              r4,r24,0
        ld              r5,r10,0                ld              r5,r10,0
        ld              r10,r31,76              ld              r10,r31,76
        ld              r3,r24,0                ld              r3,r24,0
        fsub.sss        r4,r4,r5                fsub.sss        r4,r4,r5
        fadd.sss        r3,r3,r5                fmul.dss        r7,r10,r21
        fmul.dss        r7,r10,r21              fadd.sss        r3,r3,r5
        fmul.dss        r5,r4,r20               ld              r10,r31,80
        fsub.sdd        r9,r7,r5                addu            r25,r25,r22
        st              r3,r24,0                fmul.dss        r5,r4,r20
        st              r9,r12,0                st              r3,r24,0
        ld              r10,r31,80              addu            r24,r24,r22
        fmul.dss        r2,r10,r20              fsub.sdd        r9,r7,r5
        fmul.dss        r7,r4,r21               fmul.dss        r2,r10,r20
        fadd.sdd        r4,r2,r7                fmul.dss        r7,r4,r21
        ld              r12,r31,88              st              r9,r12,0
        st              r4,r12,0                ld              r12,r31,88
        ld              r10,r31,84              ld              r10,r31,84
        ld              r12,r31,76              fadd.sdd        r4,r2,r7
        addu            r24,r24,r22             st              r4,r12,0
        addu            r25,r25,r22             ld              r12,r31,76
        addu            r23,r23,r12             addu            r23,r23,r12
        addu            r10,r10,r12             addu            r10,r10,r12
        st              r10,r31,84              st              r10,r31,84
        cmp             r10,r14,r10             cmp             r10,r14,r10
        bb1             hi,r10,@L18             bb1             hi,r10,@L18

Several comments are in order regarding the code generated by compiler A:

 1 - Compiler A generates code which clocks at 2.67 Mflops using its own
     scheduling algorithm, and which clocks at 3.13 Mflops using Motorola's
     rescheduling filter (speeds are for a 20 MHz M88000).

 2 - Compiler A fails to use the scaled index addressing mode for either
     load (ld) or store (st) instructions.  The above "mak" and "addu"
     instructions followed by a "ld" or "st" can be collapsed into a
     single scaled indexed load or store.  Interestingly, compiler A
     was written by a software company that just finished a C compiler
     for a RISC architecture which lacks indexed addressing, either
     scaled or unscaled.  Compiler A does use scaled indexed addressing,
     however, when the arrays are integers and not floats.

 3 - Even though the most aggressive optimization was called for, the code
     in the inner loop was still loading/storing values to/from the stack
     instead of promoting the temporary stack-resident variables into
     register variables.  Half of the load/store traffic in the above inner
     loop (10 of 20 loads or stores) is unnecessary if this promotion is
     done.

 4 - Note that compiler A is using mixed-mode arithmetic in its loop.
     In particular, it is keeping the intermediate products in doubles
     (which are register-resident) while doing the final add or subtract
     with a single result for the store.  This preserves the spirit of
     K&R C and does not greatly impact the loop timing, since the
     mixed-mode instructions are scheduled to avoid stalls.  The mixed-
     mode arithmetic costs an additional 2 clocks per loop.  (A short C
     illustration of the two language rules follows this list.)

 5 - Since the final cmp instruction reuses r10 instead of another register,
     the Motorola rescheduler cannot put the final store of the loop into
     the branch delay slot.  If another register is used for the compare
     target, we can save yet another clock from the loop.
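
For reference, the two language rules behind the fmul.dss/fsub.sdd versus
fmul.sss/fsub.sss difference can be written out in C as below.  This is only
an illustration of the semantics, not code from either compiler:

	/* K&R C (compiler A): float operands promote to double, the products
	   and the subtract are carried in double precision, and the result is
	   narrowed for the store.  The casts just make the implicit promotion
	   explicit. */
	float knr_style(float c, float s, float xrt, float xit)
	{
	    return (float)((double)c * (double)xrt - (double)s * (double)xit);
	}

	/* ANSI C (compiler B): the arithmetic may be carried out in single
	   precision throughout. */
	float ansi_style(float c, float s, float xrt, float xit)
	{
	    return c * xrt - s * xit;
	}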

COMPILER B

Compiler B is a public-domain C compiler which has recently been
adapted to generate 88K code.  Compiler B generated an inner loop which
was 22 instructions in length and which required 52 clocks as
scheduled.  After scheduling with Motorola's rescheduler, the code
required 39 clocks to execute.  As above, we list the assembly code
output using the compiler's internal scheduling algorithm side by side
with the code as scheduled by Motorola's instruction rescheduler:

Code scheduled by compiler B		Code scheduled by Motorola scheduler
	(52 clocks)				(39 clocks)

@L12:                                   @L12:
        lda.b           r2,r7,r10               ld              r4,r12[r7]
        ld              r4,r12[r7]              lda.b           r2,r7,r10
        ld              r5,r12[r2]              ld              r5,r12[r2]
        fsub.sss        r3,r4,r5                fsub.sss        r3,r4,r5
        fadd.sss        r4,r4,r5                fadd.sss        r4,r4,r5
        st              r4,r12[r7]              st              r4,r12[r7]
        ld              r5,r11[r7]              ld              r5,r11[r7]
        ld              r6,r11[r2]              ld              r6,r11[r2]
        fsub.sss        r4,r5,r6                fsub.sss        r4,r5,r6
        fadd.sss        r5,r5,r6                fadd.sss        r5,r5,r6
        st              r5,r11[r7]              fmul.sss        r6,r13,r4
        fmul.sss        r5,r14,r3               fmul.sss        r4,r14,r4
        fmul.sss        r6,r13,r4               st              r5,r11[r7]
        fsub.sss        r5,r5,r6                fmul.sss        r5,r14,r3
        st              r5,r12[r2]              fmul.sss        r3,r13,r3
        fmul.sss        r4,r14,r4               lda.b           r7,r7,r18
        fmul.sss        r3,r13,r3               cmp             r25,r7,r17
        fadd.sss        r4,r4,r3                fsub.sss        r5,r5,r6
        st              r4,r11[r2]              fadd.sss        r4,r4,r3
        lda.b           r7,r7,r18               st              r5,r12[r2]
        cmp             r25,r7,r17              bb1.n           lo,r25,@L12
        bb1             lo,r25,@L12             st              r4,r11[r2]

Several comments about the code from compiler B:

 1 - Compiler B generates code which clocks at 3.85 Mflops using its own
     scheduling algorithm, and which clocks at 5.13 Mflops using Motorola's
     rescheduling filter (speeds are for a 20 MHz M88000).

 2 - Compiler B uses the scaled index addressing mode for loads and stores,
     resulting in more efficient code than compiler A.  Note also that the
     Motorola rescheduler changed the last branch-on-bit-set (bb1) instruction
     into a bb1.n and put the last store into the delay slot.

 3 - Compiler B promoted the temporary stack-resident variables into
     register variables, thus minimizing load/store traffic.

 4 - Compiler B uses only single-precision arithmetic (in the spirit of
     ANSI C).

HAND-CODED EXAMPLE

As a limit to the optimal instruction ordering for this fft inner loop
example, we include here an equivalent inner loop, hand-coded in M88000
assembler by Marvin Denman of the MC88100 Design Team.  Although the
set-up and register usage for this loop differ slightly from the compiler
examples given above, the number and selection of instructions are identical
to those of compiler B.  This loop of 22 instructions executes in 23 clocks.

Hand-scheduled code
   (23 clocks)

@L12:
	fsub.sss	r7,r8,r6
	fadd.sss	r6,r8,r6
	fsub.sss	r10,r9,r5
	fadd.sss	r5,r9,r5
	st		r1,r2[r16]
	st		r20,r3[r16]
	lda.b		r16,r15,r11
	ld		r8,r21[r15]
	ld		r9,r22[r15]
	st		r6,r2[r15]
	fmul.sss	r1,r13,r7
	fmul.sss	r20,r14,r7
	st		r5,r3[r15]
	fmul.sss	r17,r14,r10
	fmul.sss	r18,r13,r10
	lda.b		r15,r15,r19
	cmp		r12,r15,r25
	ld		r6,r23[r15]
	ld		r5,r24[r15]
	fsub.sss	r1,r1,r17
	bb1.n		lo,r12,@L12
	fadd.sss	r20,r18,r20

Although the last two FP instructions in the loop do not complete before
the next iteration, there are no data dependencies which cause a stall
from loop to loop.  This code loop executes at 8.70 Mflops on an M88000
system running at 20 MHz.  This corresponds to 19.1 (native) Mips.

Note that the hand-coded assembler makes an assumption that the
compilers do not make: the data arrays are assumed not to overlap.  If
the arrays do overlap, then the minimum number of clocks for the loop
is about 36 clocks, less than 10% different from the compiler B code
above.
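
To make the overlap point concrete: a caller is allowed to pass aliased
arrays, as in the contrived fragment below (the buffer and the n/2 offset
are invented for the illustration).  A compiler that cannot rule this out
must keep each load behind any earlier store that might hit the same word,
which is where the extra clocks come from.

	extern void fft(float xr[], float xi[], unsigned n, unsigned m,
	                float wr[], float wi[]);

	static float buf[2048];

	void overlapping_call(unsigned n, unsigned m, float *wr, float *wi)
	{
	    /* xi starts halfway into xr's storage: xi[i] is xr[i + n/2] */
	    fft(buf, buf + n / 2, n, m, wr, wi);
	}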

INSTRUCTION TIMING

Since this discussion has included the results of Motorola's instruction
rescheduler for the MC88000, a short discussion of the instruction timing
for the 88K instructions mentioned in this article is in order.  First, we
present a table of the timings of several common instructions:

Instruction Latency Comments
----------- ------- ------------------------------------------------------
addu           1    add unsigned
mak            1    make bit field
lda            1    load address
st             1    store register (optional indexed, scaled-index modes)
ld             3    load register (optional indexed, scaled-index modes)
fadd.sss       5    FP add (SP)
fadd.sdd       6    FP add (DP operands, SP result)
fsub.sss       5    FP subtract (SP)
fsub.sdd       6    FP subtract (DP operands, SP result)
fmul.sss       6    FP multiply (SP)
fmul.dss       6    FP multiply (SP operands, DP result)

The M88000 has a fully-pipelined data memory unit, so that loads and
stores can be issued each cycle in an overlapped fashion.  The results
of a load, of course, cannot be used until the load completes, and the
scoreboard enforces this.  Note also that the indexed and
scaled-indexed addressing modes supported by the data unit obviate the
need to use the integer unit adder for data array access.  The
low-level concurrency provided by this feature is one reason why
compiler B's inner loop was shorter and faster than compiler A's loop.

The FP unit actually has two pipes, one for multiply and one for add, which
share the first and last stages.  Note that the latency for the mixed
precision multiply (DP result) is the same as for the SP result, since the
register file's feed-forward circuitry manages this case.  The FP unit is
pipelined to allow the FP instructions listed to be issued each cycle.

We now look at the timing diagrams for the rescheduled loop of compiler B
and the hand-coded loop, both listed above.  These diagrams give an idea of
the concurrency available at a low level in the M88000 architecture.  In
these diagrams, the instruction duration is signified with a dash ("-")
for every clock that the instruction is active.  Note that the time that
an instruction is active may not be the same as its latency (e.g. st).
Note also that the effective latency may be greater than the nominal
latency for a particular instruction due to contention for write-back
slots to the register file.

Rescheduled compiler B code (39 clocks)

               5    10   15   20   25   30   35   40
CLOCKS     ----+----+----+----+----+----+----+----+
ld         ---
lda.b       -
ld           ---
fsub.sss        -----
fadd.sss         -----
st                    --
ld                     ---
ld                      ---
fsub.sss                   -----
fadd.sss                    -----
fmul.sss                        ------
fmul.sss                         ------
st                                --
fmul.sss                           ------
fmul.sss                            ------
lda.b                                -
cmp                                   -
fsub.sss                                  -----
fadd.sss                                   -----
st                                             --
bb1.n                                           --
st                                               --


Hand-coded loop (23 clocks)

               5    10   15   20   25   30   35   40
CLOCKS     ----+----+----+----+----+----+----+----+
fsub.sss   -----
fadd.sss    -----
fsub.sss     -----
fadd.sss      -----
st             --
st              --
lda.b            -
ld                ---
ld                 ---
st                  --
fmul.sss             ------
fmul.sss              ------
st                     --
fmul.sss                ------
fmul.sss                 ------
lda.b                     -
cmp                        -
ld                          ---
ld                           ---
fsub.sss                       -----
bb1.n                           --
fadd.sss                         -----

This diagram is particularly interesting in showing the FP pipeline
performance of the M88000.  Note that in 2 places there are 5 instructions
in progress at once, 4 of which are FP instructions.  This illustrates
Frank McMahon's assertion that, for pipelined architectures, the numerical
computation rate will be directly correlated to basic code block size
(roughly, the number of instructions in the inner loop).  For an interesting
comparison of the sensitivity of overall FP performance to the number of
FP instructions per inner loop, see McMahon's "The Livermore Fortran Kernels:
A Computer Test of the Numerical Performance Range", Lawrence Livermore
National Laboratory, UCRL-53745, December 1986, pp. 8-9.

CONCLUSIONS

The M88000 assembly code output from 2 different compilers (as well as
an optimum hand-coded example) for the inner loop of an FFT routine has
been compared.  The value of instruction scheduling for a given M88000
code fragment has been shown.  Some of the M88000 instruction timings
have been presented, along with timing diagrams for two of the loops.

       /\        /\         Mitch Alsup
      //\\      //\\        Willie Anderson
     ///\\\    ///\\\       Marvin Denman
    //    \\  //    \\      Mike Paton
   /        \/        \
  /                    \    Members of the Motorola 88000 Development Team

earl@mips.COM (Earl Killian) (07/02/88)

In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
> 	for (i=0 ; i<n ; i+=n1) {
> 		l = i+n2;
> 		xrt = xr[i] - xr[l];
> 		xr[i] = xr[i] + xr[l];
> 		xit = xi[i] - xi[l];
> 		xi[i] = xi[i] + xi[l];
> 		xr[l] = c*xrt - s*xit;
> 		xi[l] = c*xit + s*xrt;
> 	}
> As a limit to the optimal instruction ordering for this fft inner loop
> example, we include here an equivalent inner loop, hand-coded in M88000
> assembler by Marvin Denman of the MC88100 Design Team.  Although the
> set-up and register usage for this loop differ slightly from the compiler
> examples given above, the number and selection of instructions is identical
> to compiler B.  This loop of 22 instructions executes in 23 clocks.
> [88100 code excised]
> Although the last two FP instructions in the loop do not complete before
> the next iteration, there are no data dependencies which cause a stall
> from loop to loop.  This code loop executes at 8.70 Mflops on an M88000
> system running at 20 MHz.  This corresponds to 19.1 (native) Mips.

A fun exercise.  Here's the MIPSco R3000 code for the inner loop:

	l.s	f0, 0(t0)		# xr[i]
l:	l.s	f2, 0(t1)		# xr[l]
	l.s	f4, 0(t2)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t3)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t0, 4
	addu	t1, 4
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t2, 4
	addu	t3, 4
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t0)		# xr[i+1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	bne	t0, t4, l
	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt

Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
native mips.  No pipelining within an op unit required (the R3010 has
none).  Multiply/add/load/store overlap is extensively used.

Analyzing in terms of the previous posting m=.8 (8 load/store cycles
per 10 flops), f1=.6 (6 add/subtract out of 10 flops), f2=.4 (4
multiply out of 10 flops).  Thus a new add/subtract required every
(m+1)/f1 = 3 cycles, and a new multiply required every (m+1)/f2 = 4.5
cycles.  Since the respective latencies are 2 and 4, no pipelining is
required.
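
(The same arithmetic as a few lines of C, for anyone who wants to plug in a
different loop; the numbers are just the ones above.)

	#include <stdio.h>

	int main(void)
	{
	    double m  = 8.0 / 10.0;   /* load/store cycles per flop      */
	    double f1 = 6.0 / 10.0;   /* fraction of flops that add/sub  */
	    double f2 = 4.0 / 10.0;   /* fraction of flops that multiply */

	    /* each flop costs one issue cycle plus its share of load/store
	       traffic, so compare these intervals against latencies 2 and 4 */
	    printf("add/sub needed every %.1f cycles\n",  (m + 1.0) / f1);
	    printf("multiply needed every %.1f cycles\n", (m + 1.0) / f2);
	    return 0;
	}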
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

brooks@maddog.llnl.gov (Eugene D. Brooks III) (07/06/88)

In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
[A detailed exposition of the performance of two compilers and hand assembly
for the FFT inner loop.]

For even better pipeline utilization from compiler B, the public domain
compiler, change the register allocation strategy so that when looking
for a scratch register it remembers the last one allocated and starts
upward, folding around to R0 when it hits the top of the register file.
By doing this, the false dependencies caused by register reuse will be
minimized and the postprocessing rescheduler will have more code that
can be moved around.  We used this strategy with PCC for the Cerberus
instruction set with a postprocessing optimizer and it exploited the
functional unit pipelines very well.  The same will apply for the 88000.
The situation for Cerberus will be even better using a GCC/{postprocessing
scheduler} combo.  I am willing to bet that that "public domain" compiler
was GCC, which is an ANSI C compiler that is to be taken quite seriously!
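
A sketch of that allocation policy in C: instead of always handing out the
lowest-numbered free register (which keeps reusing the same few registers
and creates false dependencies for a postprocessing rescheduler), start the
search just past the last register handed out and wrap around.  The register
count, the in_use[] bookkeeping and the r0 exclusion are simplifications,
not PCC's or GCC's actual allocator.

	#define NREGS 32

	static int in_use[NREGS];
	static int last_alloc = 0;

	int alloc_scratch(void)
	{
	    int i, r;

	    for (i = 1; i <= NREGS; i++) {
	        r = (last_alloc + i) % NREGS;    /* rotate through the file */
	        if (r == 0)
	            continue;                    /* r0 is hardwired to zero */
	        if (!in_use[r]) {
	            in_use[r] = 1;
	            last_alloc = r;
	            return r;
	        }
	    }
	    return -1;                           /* no free register: spill */
	}

	void free_scratch(int r)
	{
	    in_use[r] = 0;
	}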

smryan@garth.UUCP (Steven Ryan) (07/07/88)

>For even better pipeline utilization from compiler B, the public domain
>compiler, change the register allocation strategy so that when looking
>for a scratch register is remembers the last one allocated and starts
>upward folding around to R0 when it hits the top of the register file.
>By doing this the false dependencies caused by register reuse will be

Or do the scheduling before assigning registers (FTN200 EBBO).

yuval@taux01.UUCP (Gideon Yuval) (07/13/88)

In his 6/88 posting, William Anderson of Motorola says:
> The code from compiler A for the inner loop is 39 instructions and, as
> scheduled by compiler A, requires 75 clocks to execute.  Although
> compiler A is supposed to have a Motorola-developed instruction
> scheduling strategy incorporated in it, we can filter the above code
> through the Motorola instruction scheduler to get a code sequence which
> executes in 64 clocks.  Both code sequences are given below:

Is this instruction-scheduler available to the public?
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
             (alternative E-mail address: decwrl!nsc!taux01!yuval@uunet.uu.net)

wca@oakhill.UUCP (william anderson) (07/15/88)

In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:

  [FFT inner loop in C:]

>> 	for (i=0 ; i<n ; i+=n1) {
>> 		l = i+n2;
>> 		xrt = xr[i] - xr[l];
>> 		xr[i] = xr[i] + xr[l];
>> 		xit = xi[i] - xi[l];
>> 		xi[i] = xi[i] + xi[l];
>> 		xr[l] = c*xrt - s*xit;
>> 		xi[l] = c*xit + s*xrt;
>> 	}

Mr. Killian responded with a hand-coded inner loop for the MIPS
architecture:

>	l.s	f0, 0(t0)		# xr[i]
>l:	l.s	f2, 0(t1)		# xr[l]
>	l.s	f4, 0(t2)		# xi[i]
>	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
>	l.s	f6, 0(t3)		# xi[l]
>	mul.s	f20, f28, f10		# c*xrt
>	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
>	addu	t0, 4
>	addu	t1, 4
>	mul.s	f22, f30, f12		# s*xit
>	add.s	f14, f0, f2		# xr[i] + xr[l]
>	addu	t2, 4
>	addu	t3, 4
>	mul.s	f24, f28, f12		# c*xit
>	add.s	f16, f4, f6		# xi[i] + xi[l]
>	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
>	l.s	f0, 0(t0)		# xr[i+1]
>	mul.s	f26, f30, f10		# s*xrt
>	sub.s	f20, f20, f22		# c*xrt - s*xit
>	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
>	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
>	add.s	f24, f24, f26		# c*xit + s*xrt
>	bne	t0, t4, l
>	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt
>
>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>native mips.  No pipelining within an op unit required (the R3010 has
>none).  Multiply/add/load/store overlap is extensively used.

One problem with this code is that it assumes the "stride" of the loop
(the variable "n1" in the C code segment above) is unity!

What about the code for the inner loop in the general case?  What
effect does the assumption of non-unity stride have on the MIPS loop
timing?  Since the results must be stored at the same addresses as the
loads used, must the MIPS code be altered for the general case to do
its pointer incrementing in some other place than the FP latency
slots?  Since it lacks an indexed addressing mode, in this case the
MIPS architecture must increment 4 pointers instead of one array
index.

The Motorola 88000's ability to do more sophisticated (indexed and
scaled-indexed) addressing in its data memory unit allows a compiler
(or an assembly code writer) to loop-induce the index l out of the loop
and allows reference to all array elements with only one index
increment/loop (as opposed to 4 pointer increments/loop).  Thus the
loop instruction count for the M88K is reduced and performance is
maximized (22 instructions/23 clocks/8.7 MFLOPS/19.1 native MIPS at 20
MHz as per <1359@claude.oakhill.UUCP>).

>UUCP: {ames,decwrl,prls,pyramid}!mips!earl

Thanks to Marvin Denman and Mitch Alsup of the M88K Design Group for
their help with this note.

The statements and opinions presented in this article are my own.
They should not be interpreted as being the opinions or policy,
official or otherwise, of Motorola Inc.

       /\        /\ 		William C. Anderson
      //\\      //\\		Member of the M88000 Design Group
     ///\\\    ///\\\		Motorola Microprocessor Division
    //    \\  //    \\		Oak Hill, TX.
   /        \/        \
  /                    \

jmd@granite.dec.com (John Danskin) (07/16/88)

I have screwed around a little bit with Earl's code to try to
answer some of wca@oakhill (William C. Anderson)'s questions
about performance with non-unity stride.  There is probably a better
answer, but here goes:


>In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:
>
>  [FFT inner loop in C:]
>
>>> 	for (i=0 ; i<n ; i+=n1) {
>>> 		l = i+n2;
>>> 		xrt = xr[i] - xr[l];
>>> 		xr[i] = xr[i] + xr[l];
>>> 		xit = xi[i] - xi[l];
>>> 		xi[i] = xi[i] + xi[l];
>>> 		xr[l] = c*xrt - s*xit;
>>> 		xi[l] = c*xit + s*xrt;
>>> 	}
>
>Mr. Killian responded with a hand-coded inner loop for the MIPS
>architecture:
>
>>	l.s	f0, 0(t0)		# xr[i]
>>l:	l.s	f2, 0(t1)		# xr[l]
>>	l.s	f4, 0(t2)		# xi[i]
>>	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
>>	l.s	f6, 0(t3)		# xi[l]
>>	mul.s	f20, f28, f10		# c*xrt
>>	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
>>	addu	t0, 4
>>	addu	t1, 4
>>	mul.s	f22, f30, f12		# s*xit
>>	add.s	f14, f0, f2		# xr[i] + xr[l]
>>	addu	t2, 4
>>	addu	t3, 4
>>	mul.s	f24, f28, f12		# c*xit
>>	add.s	f16, f4, f6		# xi[i] + xi[l]
>>	s.s	f14, -4(t0)		# xr[i] = xr[i] + xr[l]
>>	l.s	f0, 0(t0)		# xr[i+1]
>>	mul.s	f26, f30, f10		# s*xrt
>>	sub.s	f20, f20, f22		# c*xrt - s*xit
>>	s.s	f16, -4(t2)		# xi[i] = xi[i] + xi[l]
>>	s.s	f20, -4(t1)		# xr[l] = c*xrt - s*xit
>>	add.s	f24, f24, f26		# c*xit + s*xrt
>>	bne	t0, t4, l
>>	 s.s	f24, -4(t3)		# xi[l] = c*xit + s*xrt
>>
>>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25
>>native mips.  No pipelining within an op unit required (the R3010 has
>>none).  Multiply/add/load/store overlap is extensively used.
>
>One problem with this code is that is assumes the "stride" of the loop
>(the varible "n1" in the C code segment above) is unity!
>
>What about the code for the inner loop in the general case?  What
>effect does the assumption of non-unity stride have on the MIPS loop
>timing?


# 4 * n1 is in a register named 'n1'

	l.s	f0, 0(t0)		# xr[i]
l:	l.s	f2, 0(t1)		# xr[l]
	l.s	f4, 0(t2)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t3)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t5, t0, n1
	addu	t6, t1, n1
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t7, t2, n1
	addu	t8, t3, n1
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, 0(t0)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t5)		# xr[i+1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, 0(t2)		# xi[i] = xi[i] + xi[l]
	s.s	f20, 0(t1)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	beq	t5, t4, e
	s.s	f24, 0(t3)		# xi[l] = c*xit + s*xrt

	l.s	f2, 0(t6)		# xr[l]
	l.s	f4, 0(t7)		# xi[i]
	sub.s	f10, f0, f2		# xrt = xr[i] - xr[l]
	l.s	f6, 0(t8)		# xi[l]
	mul.s	f20, f28, f10		# c*xrt
	sub.s	f12, f4, f6		# xit = xi[i] - xi[l]
	addu	t0, t5, n1
	addu	t1, t6, n1
	mul.s	f22, f30, f12		# s*xit
	add.s	f14, f0, f2		# xr[i] + xr[l]
	addu	t2, t7, n1
	addu	t3, t8, n1
	mul.s	f24, f28, f12		# c*xit
	add.s	f16, f4, f6		# xi[i] + xi[l]
	s.s	f14, 0(t5)		# xr[i] = xr[i] + xr[l]
	l.s	f0, 0(t0)		# xr[i+1]
	mul.s	f26, f30, f10		# s*xrt
	sub.s	f20, f20, f22		# c*xrt - s*xit
	s.s	f16, 0(t7)		# xi[i] = xi[i] + xi[l]
	s.s	f20, 0(t6)		# xr[l] = c*xrt - s*xit
	add.s	f24, f24, f26		# c*xit + s*xrt
	bne	t0, t4, l
	s.s	f24, 0(t8)		# xi[l] = c*xit + s*xrt

e:

This code has exactly the same performance as Earl's code except that
it also handles the non-unity-stride case.  Yet another example of needing
lots more registers to write good code on pipelined machines.  Can I have
lots more registers please?
-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.