lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (06/14/88)
We don't have much benchmarking info yet about the Motorola 88000.  However,
the May issue of "VLSI Systems Design" contains a pipeline timing chart for
an FFT inner loop.  The (compiler generated) code does 4 loads, 4 stores,
10 single precision float calculations, and 4 other things, in 27 clocks.
At 20 MHz, that's 7.4 MFLOPS.

A 16KB CMMU can hold 4K floats, but they all have to be faulted in.  A
recent post suggested counting 10 clocks per 16 byte fault.  That's 2.5
clocks per float, but since a large FFT visits each data point several
times (say, 11) we can amortize the startup cost to about 1 clock per
inner loop.

So, "about 7 MFLOPS on an FFT benchmark" seems fair.
-- 
Don    lindsay@k.gp.cs.cmu.edu    CMU Computer Science
"Imitation is not the sincerest form of flattery. Payments are."
    - a British artist who died penniless before copyright law.
pajari@grads.cs.ubc.ca (George Pajari) (06/17/88)
From: lindsay@k.gp.cs.cmu.edu (Donald Lindsay) message <1941@pt.cs.cmu.edu>
> We don't have much benchmarking info yet about the Motorola 88000....
> The (compiler generated) code does 4 loads, 4 stores, 10 single precision
> float calculations, and 4 other things, in 27 clocks.  At 20 MHz, that's
> 7.4 MFLOPS.

Don't believe it!

Although the benchmark shows both the C code and the assembler, the
assembler is NOT compiler generated, although you are supposed to draw
that conclusion!  (P.S. it's 29 clocks, not 27.)

When pressed at a local presentation on the M88000, Motorola reps conceded
that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
RISC instructions ordered to take maximal advantage of the pipelining in
the 88100.  Furthermore, they admitted that the best code by any compiler
resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.

It would seem more reasonable to call it about 3.5 MFLOPS.  (Impressive
nonetheless.)

Aren't MIPS/MFLOPS fun?  (And we thought Intel had the record for most
cooked benchmark figures in marketing literature.)

George Pajari, sometime grad student
(MIPS = Meaningless Indicator of Processor Speed)
nather@ut-sally.UUCP (Ed Nather) (06/17/88)
In article <3208@ubc-cs.UUCP>, pajari@grads.cs.ubc.ca (George Pajari) writes:
> When pressed at a local presentation on the M88000, Motorola reps conceded
> that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
> RISC instructions ordered to take maximal advantage of the pipelining in
> the 88100.  Furthermore, they admitted that the best code by any compiler
> resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.

Gosh!  Does this mean that careful hand-coding can yield a factor of 2
faster code even on a RISC machine?  I know it means that on a CISC
machine, because I've done better than that myself.  (Current record: a
factor of 22 on a Nova computer, me vs. Fortran, with the sieve benchmark.)

But where does this leave all the CS statements that assembly-language
coding is no longer needed?  Are factors of 2 really unimportant?

As I understand it, one of the major arguments for RISC architecture is
that "...only those instructions that a compiler can easily generate are
included in the instruction set."  So the compiler-builders stack the deck
and still fall a factor of 2 short.

Before you reach for the "r" key to send flames, be aware that I am guilty
of once writing a compiler, long ago.  But I wrote it in ASM.
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{allegra,ihnp4}!{noao,ut-sally}!utastro!nather
nather@astro.AS.UTEXAS.EDU
bzs@bu-cs.BU.EDU (Barry Shein) (06/17/88)
>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>faster code even on a RISC machine?  I know it means that on a CISC machine
>because I've done better than that myself.  (Current record: a factor of 22
>on a Nova computer, me vs. Fortran, with the sieve benchmark).
>
>But where does this leave all the CS statements that assembly-language
>coding is no longer needed?  Are factors of 2 really unimportant?
>--
>Ed Nather
>Astronomy Dept, U of Texas @ Austin

Um, Ed, before you jump (or is that JRST) to all sorts of conclusions, note
that the 88000 is a very new architecture.  I doubt they have the compilers
up to snuff yet, nor would I be disappointed if they were still working on
them.

It might be interesting to see what the case is on RISC systems with more
mature compilers (which generally means "working for a year now").

    -Barry Shein, Boston University
jgk@speech2.cs.cmu.edu (Joe Keane) (06/18/88)
All this tells me is that we badly need better compilers.  Sad to say, most
Unix computers are still using some variant of pcc (don't get me wrong, pcc
was a good idea, but that was a long time ago).  I think gcc is starting to
look good; it's nice to recompile a program and get a 20% across-the-board
speedup.

Some of the new architectures out there are really going to put a strain
on the compiler writers; more power to them.  And no one should be ashamed
of hand-coding a hundred lines of assembly.  There's always that inner
loop that really needs to be fast.

--Joe
lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (06/18/88)
In article <3208@ubc-cs.UUCP> pajari@grads.cs.ubc.ca (George Pajari) writes:
>When pressed at a local presentation on the M88000, Motorola reps conceded
>that the 'infamous' FFT benchmark represented carefully crafted HAND-CODED
>RISC instructions ordered to take maximal advantage of the pipelining in
>the 88100.  Furthermore, they admitted that the best code by any compiler
>resulted in a 54 cycle loop...ALMOST TWICE THE LENGTH OF THE BENCHMARK CODE.

The article says that the code was "compiled with the scheduler algorithm
(release 1.0)".  They also say that the same code, executed without
overlaps, requires 53 clocks.  If your rep is correct, then it's likely
that the article was submitted for publication some months ago.  The
authors were probably just optimistic about how quickly everything would
come together.

Writing a "code scheduler" for a compiler is an interesting task.  The
88000 has several nice properties in this regard: only branches have data
dependent timing, and only load/store can take cache hits.  Plus,
imperfections in the scheduling algorithm don't lead to incorrect code.
(The scoreboard ensures that you don't, e.g., read a register before
incoming data has arrived in it.)

Viewed from a distance, an 88000 scheduler looks quite simple: the "machine
model" isn't too complex, and the interface (machine code in, machine code
out) is straightforward.  Viewed close up, I'm sure that the implementers
have hit some kind of grief: a problem with register lifetimes and live
sets; or problems with labels and directives; or just suboptimal results
on the biggest customer's favorite benchmark.  Or, more likely, everybody
was too busy fixing bugs in some other component.

I'd like to hear from Tadpole Technology about what they did that gave
such a goose to the Dhrystone figures.
-- 
Don    lindsay@k.gp.cs.cmu.edu    CMU Computer Science
"Imitation is not the sincerest form of flattery. Payments are."
    - a British artist who died penniless before copyright law.
mash@mips.COM (John Mashey) (06/18/88)
In article <23406@bu-cs.BU.EDU> bzs@bu-cs.BU.EDU (Barry Shein) writes:
>>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>>faster code even on a RISC machine?  I know it means that on a CISC machine
>>because I've done better than that myself.  (Current record: a factor of 22
>>on a Nova computer, me vs. Fortran, with the sieve benchmark).

1) Remember that by now we have a whole collection of things labeled RISC
machines, whose architectures vary quite a lot.

2) At least one flavor of RISC design style [i.e., as practiced by, at
least, IBM, HP, and MIPS] selects instructions and features as seen in
their effects upon compiled code, using optimizing compiler statistics to
drive the design, i.e., the COMPILERS EXIST (at least substantially)
BEFORE THE INSTRUCTION SET ARCHITECTURE.

>>But where does this leave all the CS statements that assembly-language
>>coding is no longer needed?  Are factors of 2 really unimportant?
>
>Um, Ed, before you jump (or is that JRST) to all sorts of conclusions,
>note that the 88000 is a very new architecture.  I doubt they have the
>compilers up to snuff yet, nor would I be disappointed if they were still
>working on them.
>
>It might be interesting to see what the case is on RISC systems with
>more mature compilers (which generally means "working for a year now").
>
>    -Barry Shein, Boston University

3) Just because something is labeled a RISC doesn't mean that even a good
compiler can always do almost as well as assembler; it depends on:
    a) Was the instruction set REALLY designed from compilation statistics?
    b) What is the current state of the compilers? (see Barry's note)
    c) Who's writing the assembler code?
For example, if a) is true, and the compilers are good, then c) matters a
little less, i.e., the opportunities for usefully writing assembler go down.

As an example, now that we have experience of several years in running
systems:
    a) We wrote low-level interrupt code, block-copy, machine-dependent
       code (like tlb-flushing), and floating-point-emulation code in
       assembler.
    b) We wrote heavily-tuned math libraries in assembler.
    c) We wrote a couple of the C string routines in assembler.
       (Actually, at one point, I wrote ALL of the string routines in
       assembler, but the compiled C code was close enough that I threw
       away most of the assembler routines in disgust at the time I'd
       wasted.)
As another example, in our case FORTRAN Linpack and Coded Linpack are
usually within 10%.

There is almost always some use for assembler.  From past experience, I'd
say that compiler-driven RISC designs + good compilers reduce the
temptation considerably.

Since this started with 88000 numbers, does anybody have any Double
Precision 88000 numbers (like FORTRAN Linpack, or DP Doduc, or Spice)?
The only DP number I've seen is DP Whetstones @ around 3 Mwhets.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
smryan@garth.UUCP (Steven Ryan) (06/19/88)
>Viewed from a distance, an 88000 scheduler looks quite simple: the "machine
>model" isn't too complex, and the interface (machine code in, machine code
>out) is straightforward.  Viewed close up, I'm sure that the implementers
>have hit some kind of grief: a problem with register lifetimes and live
>sets; or problems with labels and directives; or just suboptimal results
>on the biggest customer's favorite benchmark.

The problem with writing a scheduler is finding an efficient (read:
heuristic) method of traversing the code graph.  The simplest method
produces the best code at exponential compilation time: try every possible
sequence.  The grief is trying to find a linear to quadratic time method
which still gives a good sequence.

We humans seem to have an innate ability to find the best graph traversal
(or perhaps we are just not overwhelmed until the graph becomes large), so
human experts will continue to outcode compilers for the near future.

sm ryan

Nature is immaculately fair and implacably cruel. -- Sara James
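To make the heuristic side of this concrete, here is a toy "list
scheduler" in C: among the instructions whose operands are ready, it
issues the longest-latency one first, in roughly quadratic time.  The
block, the latencies, and the single-issue cost model are invented for
illustration; this is a sketch of the general idea, not code from any
88000 compiler.

    /* Toy greedy list scheduler: quadratic-time heuristic on a made-up block. */
    #include <stdio.h>

    #define N 6

    static const char *name[N] = { "ld a", "ld b", "fadd", "fmul", "st c", "st d" };
    static int latency[N]      = {    3,      3,      5,      6,      1,      1  };
    /* dep[i][j] != 0 means instruction i must wait for j's result */
    static int dep[N][N] = {
        {0,0,0,0,0,0},   /* ld a                    */
        {0,0,0,0,0,0},   /* ld b                    */
        {1,1,0,0,0,0},   /* fadd uses both loads    */
        {1,0,0,0,0,0},   /* fmul uses ld a          */
        {0,0,1,0,0,0},   /* st c stores fadd result */
        {0,0,0,1,0,0},   /* st d stores fmul result */
    };

    int main(void)
    {
        int done[N] = {0}, finish[N] = {0}, clock = 0;

        for (int issued = 0; issued < N; issued++) {
            int pick = -1;
            for (int i = 0; i < N; i++) {           /* find ready instructions */
                if (done[i]) continue;
                int ready = 1;
                for (int j = 0; j < N; j++)
                    if (dep[i][j] && (!done[j] || finish[j] > clock))
                        ready = 0;
                if (ready && (pick < 0 || latency[i] > latency[pick]))
                    pick = i;                        /* greedy: longest latency first */
            }
            if (pick < 0) { clock++; issued--; continue; }   /* stall one clock */
            printf("cycle %2d: %s\n", clock, name[pick]);
            done[pick] = 1;
            finish[pick] = clock + latency[pick];
            clock++;
        }
        return 0;
    }

Picking the longest-latency ready instruction is only one possible
priority; real schedulers also weigh critical-path length, register
pressure, and load/store slots.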
aglew@urbsdc.Urbana.Gould.COM (06/20/88)
>Gosh!  Does this mean that careful hand-coding can yield a factor of 2
>faster code even on a RISC machine?  I know it means that on a CISC machine
>because I've done better than that myself.  (Current record: a factor of 22
>on a Nova computer, me vs. Fortran, with the sieve benchmark).

My personal record is a 3-fold improvement -- translating assembly code
into C, and then improving the algorithm.  It goes either way.  I've done
better going from C into assembly, but I'm most proud of the speedups I've
obtained going from assembly into C.  I somehow think that that code is
more likely to be running 2 processor families from now than the assembly.
aglew@urbsdc.Urbana.Gould.COM (06/20/88)
>The problem with writing a scheduler is finding an efficient (read:
>heuristic) method of traversing the code graph.  The simplest method
>produces the best code at exponential compilation time: try every possible
>sequence.  The grief is trying to find a linear to quadratic time method
>which still gives a good sequence.

Do any compilers contain the simple, brute-force, exponential algorithm,
to be turned on when you are willing to spend the time, or when the graph
for a procedure is small enough?

There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
searcher.  It might be nice to have that in your toolbox...
smryan@garth.UUCP (Steven Ryan) (06/22/88)
In article <28200167@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>Do any compilers contain the simple, brute-force, exponential algorithm,
>to be turned on when you are willing to spend the time, or when the graph
>for a procedure is small enough?
>   There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
>searcher.  It might be nice to have that in your toolbox...

I remember something like that--the running time was something like a week
on a 10-line program, and even then it had to be hand-checked.

It would be interesting to combine profilers and optimisers to locate that
one innermost loop and then exhaustively enumerate the possible paths.  If
the loop is small and this is only done once or twice in a large program,
the exponential time could be subsumed by the rest of the compilation.
jesup@cbmvax.UUCP (Randell Jesup) (06/23/88)
In article <28200167@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>>The problem with writing a scheduler is finding an efficient (read:
>>heuristic) method of traversing the code graph.  The simplest method
>>produces the best code at exponential compilation time: try every possible
>>sequence.  The grief is trying to find a linear to quadratic time method
>>which still gives a good sequence.
>Do any compilers contain the simple, brute-force, exponential algorithm,
>to be turned on when you are willing to spend the time, or when the graph
>for a procedure is small enough?
>   There was a paper, an ASPLOS back, on "Super-Optimizer", a brute force
>searcher.  It might be nice to have that in your toolbox...

It certainly wouldn't be hard.  You'd want to limit it to simple graphs
for obvious reasons, since it can get out of control pretty fast.
However, in a properly designed version, the only limit would be how much
VM you have, and how long you're willing to wait.  You only need this
sort of optimization for final production code, anyway.

I think (if I remember correctly) that the Gross, et al, methods run in
O(n^4) time, which is a lot better than the brute force method, at least
for blocks bigger than 10 or 15 instructions.  However, there are a LOT
of small (< 20 instruction) blocks for which brute force can make sense.

However, for the really big blocks you see sometimes (especially from
certain types of numerical/FORTRAN/bad/bizarre code), even O(n^4) is
painful.  There are faster (O(n^2)) methods that produce less optimal
code, but LOTS faster.  Or you might try breaking the large blocks into
some minimum block size artificially, so n doesn't get so big.

Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup
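To show what the brute-force version amounts to, here is a companion
sketch in C that enumerates every dependence-respecting order of a tiny
block under a toy single-issue cost model and keeps the shortest.  The
data and the cost model are invented; the point is only that the number
of legal orders grows roughly factorially with block size, which is why
this is plausible only for blocks of a dozen or two instructions.

    /* Brute-force block scheduler sketch: exponential search over a toy block. */
    #include <stdio.h>
    #include <string.h>

    #define N 5
    static int latency[N] = { 3, 3, 5, 6, 1 };            /* made-up latencies     */
    static int dep[N][N]  = {                             /* dep[i][j]: i needs j  */
        {0,0,0,0,0}, {0,0,0,0,0}, {1,1,0,0,0}, {1,0,0,0,0}, {0,0,1,0,0},
    };

    static int best_len = 1 << 30;
    static int best_order[N], order[N];

    static int cost(const int *ord)                       /* clocks to drain block */
    {
        int finish[N] = {0}, clock = 0, end = 0;
        for (int k = 0; k < N; k++) {
            int i = ord[k], start = clock;
            for (int j = 0; j < N; j++)                   /* wait for operands     */
                if (dep[i][j] && finish[j] > start) start = finish[j];
            finish[i] = start + latency[i];
            clock = start + 1;                            /* one issue per clock   */
        }
        for (int i = 0; i < N; i++) if (finish[i] > end) end = finish[i];
        return end;
    }

    static void search(int depth, unsigned placed)
    {
        if (depth == N) {
            int c = cost(order);
            if (c < best_len) { best_len = c; memcpy(best_order, order, sizeof order); }
            return;
        }
        for (int i = 0; i < N; i++) {
            if (placed & (1u << i)) continue;
            int ok = 1;
            for (int j = 0; j < N; j++)                   /* all predecessors placed? */
                if (dep[i][j] && !(placed & (1u << j))) ok = 0;
            if (!ok) continue;
            order[depth] = i;
            search(depth + 1, placed | (1u << i));
        }
    }

    int main(void)
    {
        search(0, 0);
        printf("best schedule: %d clocks, order:", best_len);
        for (int k = 0; k < N; k++) printf(" %d", best_order[k]);
        printf("\n");
        return 0;
    }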
chris@mimsy.UUCP (Chris Torek) (06/28/88)
In article <754@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>The problem with writing a scheduler is finding an efficient (read:
>heuristic) method of traversing the code graph.  The simplest method
>produces the best code at exponential compilation time: try every possible
>sequence. ...
>We humans seem to have an innate ability to find the best graph traversal
>(or perhaps we are just not overwhelmed until the graph becomes large), so
>human experts will continue to outcode compilers for the near future.

A recent (I think) ASPLOS has a paper on something called `Superoptimizer'.
This is a program that uses exponential time to find the best 68000 or
80x86 code sequence to replace some other sequence.  It is not suitable
for more than about 13 instructions at a time, but the sequences it
produces are better than those that humans usually derive.  It came up
with a four-68000-instruction `signum' function, and its versions of min
and max (and minmax together) are also quite surprising.  None of them
have any branches.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
dik@cwi.nl (Dik T. Winter) (06/28/88)
In article <12171@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
 > It came up with a four-68000-instruction `signum' function,
 > and its versions of min and max (and minmax together) are also quite
 > surprising.  None of them have any branches.
(What is signum?)

Not so surprising for min and max (and minmax).  However, they require a
sign extending shift, so that implementation is not possible on all
machines.  Also, if a branch takes only one cycle (with delay slot), you
do not gain anything (in general).
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax
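For anyone else wondering: signum(x) is -1, 0, or +1 according to whether
x is negative, zero, or positive.  Below is a branch-free rendering in C
of signum, min, and max.  The min/max forms lean on exactly the feature
mentioned above -- an arithmetic, sign-extending right shift of a negative
integer -- which is common but implementation-defined in C, and they
assume x - y does not overflow.  This is an illustration of the idea, not
the superoptimizer's 68000 output.

    #include <stdio.h>

    /* signum(x): -1, 0, or +1.  Portable C, no explicit branch (the
     * comparisons usually compile to compare/set rather than jumps). */
    int signum(int x)
    {
        return (x > 0) - (x < 0);
    }

    /* Branch-free min/max via a sign-extending shift.  Assumes 32-bit int,
     * arithmetic right shift of negatives, and no overflow in x - y. */
    int min32(int x, int y)
    {
        int d = x - y;
        return y + (d & (d >> 31));   /* add d only when d < 0      */
    }

    int max32(int x, int y)
    {
        int d = x - y;
        return x - (d & (d >> 31));   /* subtract d only when d < 0 */
    }

    int main(void)
    {
        printf("%d %d %d\n", signum(-7), signum(0), signum(42));  /* -1 0 1 */
        printf("%d %d\n", min32(3, -5), max32(3, -5));            /* -5 3   */
        return 0;
    }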
aglew@urbsdc.Urbana.Gould.COM (06/29/88)
..> [Talking (I assume - the original note passed me by) about superoptimization]:
>Not so surprising for min and max (and minmax).  However, they require a
>sign extending shift, so that implementation is not possible on all
>machines.  Also, if a branch takes only one cycle (with delay slot),
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>you do not gain anything (in general).
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>dik t. winter, cwi, amsterdam, nederland

You don't gain anything by avoiding branches if branches are cheap - until
the next generation of your machine comes by, more deeply pipelined, with
more expensive branches.

I've talked with representatives of several RISC manufacturers about "the
next generation" - I think MIPS was one of them, John Mashey can correct
me, and I know SPARC was.  I keep asking "how many delay slots are you
going to need on your next generation".  They say 2 - but to be backward
compatible they'll interlock, with the time delay for the interlock
circuitry masked by the delay slot.
wca@oakhill.UUCP (william anderson) (07/01/88)
In messages <1941@pt.cs.cmu.edu> and <3208@ubc-cs.UUCP> Donald Lindsay and
George Pajari discussed a particular benchmark for the M88000, the inner
loop of an FFT calculation.  In this article, we examine the output from
two compilers for this code, and present the results of instruction
rescheduling on the assembly code produced by these compilers.  We also
include a hand-coded version of the code which is nearly optimal for the
given task.

INTRODUCTION

The subprogram which is considered here is a C language version of a fast
Fourier transform subroutine, to wit:

    fft(xr,xi,n,m,wr,wi)
    float xr[],xi[],wr[],wi[];
    unsigned int n,m;
    {
        unsigned int n1,n2,ia,ie,i,j,k,l;
        float c,s,xrt,xit;

        n2 = n;
        ie = 1;
        for (k=0 ; k<m ; k++) {
            n1 = n2;
            n2 = n2>>1;
            ia = 0;
            for (j=0 ; j<n2 ; j++) {
                c = wr[ia];
                s = wi[ia];
                ia = ia+ie;
                /* BEGIN INNER LOOP */
                for (i=0 ; i<n ; i+=n1) {
                    l = i+n2;
                    xrt = xr[i] - xr[l];
                    xr[i] = xr[i] + xr[l];
                    xit = xi[i] - xi[l];
                    xi[i] = xi[i] + xi[l];
                    xr[l] = c*xrt - s*xit;
                    xi[l] = c*xit + s*xrt;
                }
                /* END INNER LOOP */
            }
            ie = ie<<1;
        }
    }

In particular, we will focus on the inner loop code which is bracketed by
appropriate comments in the source above.  We will examine the assembly
output for this inner loop from two compilers, a commercial C compiler
(which we will designate "compiler A") and a public-domain C compiler
("compiler B"), both of which currently exist for the M88000 architecture.
Finally, we will look at a hand-coded inner loop and discuss some of the
instruction timing for the M88000.

COMPILER A

The code from compiler A for the inner loop is 39 instructions and, as
scheduled by compiler A, requires 75 clocks to execute.  Although compiler
A is supposed to have a Motorola-developed instruction scheduling strategy
incorporated in it, we can filter the above code through the Motorola
instruction scheduler to get a code sequence which executes in 64 clocks.
Both code sequences are given below:

    Code scheduled by compiler A (75 clocks)

    @L18:
        mak        r5,r23,2
        addu       r10,r5,r17
        st         r10,r31,92
        ld         r9,r10,0
        ld         r8,r25,0
        fsub.sss   r10,r8,r9
        ld         r12,r31,92
        st         r10,r31,80
        ld         r6,r12,0
        addu       r10,r5,r16
        fadd.sss   r7,r8,r6
        st         r10,r31,88
        st         r7,r25,0
        ld         r4,r24,0
        ld         r5,r10,0
        ld         r10,r31,76
        ld         r3,r24,0
        fsub.sss   r4,r4,r5
        fadd.sss   r3,r3,r5
        fmul.dss   r7,r10,r21
        fmul.dss   r5,r4,r20
        fsub.sdd   r9,r7,r5
        st         r3,r24,0
        st         r9,r12,0
        ld         r10,r31,80
        fmul.dss   r2,r10,r20
        fmul.dss   r7,r4,r21
        fadd.sdd   r4,r2,r7
        ld         r12,r31,88
        st         r4,r12,0
        ld         r10,r31,84
        ld         r12,r31,76
        addu       r24,r24,r22
        addu       r25,r25,r22
        addu       r23,r23,r12
        addu       r10,r10,r12
        st         r10,r31,84
        cmp        r10,r14,r10
        bb1        hi,r10,@L18

    Code scheduled by Motorola scheduler (64 clocks)

    @L18:
        ld         r8,r25,0
        mak        r5,r23,2
        addu       r10,r5,r17
        ld         r9,r10,0
        st         r10,r31,92
        ld         r12,r31,92
        fsub.sss   r10,r8,r9
        ld         r6,r12,0
        st         r10,r31,80
        addu       r10,r5,r16
        st         r10,r31,88
        fadd.sss   r7,r8,r6
        st         r7,r25,0
        ld         r4,r24,0
        ld         r5,r10,0
        ld         r10,r31,76
        ld         r3,r24,0
        fsub.sss   r4,r4,r5
        fmul.dss   r7,r10,r21
        fadd.sss   r3,r3,r5
        ld         r10,r31,80
        addu       r25,r25,r22
        fmul.dss   r5,r4,r20
        st         r3,r24,0
        addu       r24,r24,r22
        fsub.sdd   r9,r7,r5
        fmul.dss   r2,r10,r20
        fmul.dss   r7,r4,r21
        st         r9,r12,0
        ld         r12,r31,88
        ld         r10,r31,84
        fadd.sdd   r4,r2,r7
        st         r4,r12,0
        ld         r12,r31,76
        addu       r23,r23,r12
        addu       r10,r10,r12
        st         r10,r31,84
        cmp        r10,r14,r10
        bb1        hi,r10,@L18

Several comments are in order regarding the code generated by compiler A:

1 - Compiler A generates code which clocks at 2.67 Mflops using its own
    scheduling algorithm, and which clocks at 3.13 Mflops using Motorola's
    rescheduling filter (speeds are for a 20 MHz M88000).
2 - Compiler A fails to use the scaled index addressing mode for either
    load (ld) or store (st) instructions.  The above "mak" and "addu"
    instructions followed by a "ld" or "st" can be collapsed into a single
    scaled indexed load or store.  Interestingly, compiler A was written
    by a software company that just finished a C compiler for a RISC
    architecture which lacks indexed addressing, either scaled or
    unscaled.  Compiler A does use scaled indexed addressing, however,
    when the arrays are integers and not floats.

3 - Even though the most aggressive optimization was called for, the code
    in the inner loop was still loading/storing values to/from the stack
    instead of promoting the temporary stack-resident variables into
    register variables.  Half of the load/store traffic in the above inner
    loop (10 of 20 loads or stores) is unnecessary if this promotion is
    done.

4 - Note that compiler A is using mixed-mode arithmetic in its loop.  In
    particular, it is keeping the intermediate products in doubles (which
    are register-resident) while doing the final add or subtract with a
    single result for the store.  This preserves the spirit of K&R C and
    does not greatly impact the loop timing since the mixed-mode
    instructions are scheduled to avoid stalls.  The mixed-mode arithmetic
    costs an additional 2 clocks per loop.

5 - Since the final cmp instruction reuses r10 instead of another
    register, the Motorola rescheduler cannot put the final store of the
    loop into the branch delay slot.  If another register is used for the
    compare target, we can save yet another clock from the loop.

COMPILER B

Compiler B is a public-domain C compiler which has recently been adapted
to generate 88K code.  Compiler B generated an inner loop which was 22
instructions in length and which required 52 clocks as scheduled.  After
scheduling with Motorola's rescheduler, the code required 39 clocks to
execute.  As above, we list the assembly code output using the compiler's
internal scheduling algorithm, followed by the code as scheduled by
Motorola's instruction rescheduler:

    Code scheduled by compiler B (52 clocks)

    @L12:
        lda.b      r2,r7,r10
        ld         r4,r12[r7]
        ld         r5,r12[r2]
        fsub.sss   r3,r4,r5
        fadd.sss   r4,r4,r5
        st         r4,r12[r7]
        ld         r5,r11[r7]
        ld         r6,r11[r2]
        fsub.sss   r4,r5,r6
        fadd.sss   r5,r5,r6
        st         r5,r11[r7]
        fmul.sss   r5,r14,r3
        fmul.sss   r6,r13,r4
        fsub.sss   r5,r5,r6
        st         r5,r12[r2]
        fmul.sss   r4,r14,r4
        fmul.sss   r3,r13,r3
        fadd.sss   r4,r4,r3
        st         r4,r11[r2]
        lda.b      r7,r7,r18
        cmp        r25,r7,r17
        bb1        lo,r25,@L12

    Code scheduled by Motorola scheduler (39 clocks)

    @L12:
        ld         r4,r12[r7]
        lda.b      r2,r7,r10
        ld         r5,r12[r2]
        fsub.sss   r3,r4,r5
        fadd.sss   r4,r4,r5
        st         r4,r12[r7]
        ld         r5,r11[r7]
        ld         r6,r11[r2]
        fsub.sss   r4,r5,r6
        fadd.sss   r5,r5,r6
        fmul.sss   r6,r13,r4
        fmul.sss   r4,r14,r4
        st         r5,r11[r7]
        fmul.sss   r5,r14,r3
        fmul.sss   r3,r13,r3
        lda.b      r7,r7,r18
        cmp        r25,r7,r17
        fsub.sss   r5,r5,r6
        fadd.sss   r4,r4,r3
        st         r5,r12[r2]
        bb1.n      lo,r25,@L12
        st         r4,r11[r2]

Several comments about the code from compiler B:

1 - Compiler B generates code which clocks at 3.85 Mflops using its own
    scheduling algorithm, and which clocks at 5.13 Mflops using Motorola's
    rescheduling filter (speeds are for a 20 MHz M88000).

2 - Compiler B uses the scaled index addressing mode for loads and stores,
    resulting in more efficient code than compiler A.  Note also that the
    Motorola rescheduler changed the last branch-on-bit-set (bb1)
    instruction into a bb1.n and put the last store into the delay slot.

3 - Compiler B promoted the temporary stack-resident variables into
    register variables, thus minimizing load/store traffic.
4 - Compiler B uses only single-precision arithmetic (in the spirit of
    ANSI C).

HAND-CODED EXAMPLE

As a limit to the optimal instruction ordering for this fft inner loop
example, we include here an equivalent inner loop, hand-coded in M88000
assembler by Marvin Denman of the MC88100 Design Team.  Although the
set-up and register usage for this loop differ slightly from the compiler
examples given above, the number and selection of instructions is
identical to compiler B.  This loop of 22 instructions executes in 23
clocks.

    Hand-scheduled code (23 clocks)

    @L12:
        fsub.sss   r7,r8,r6
        fadd.sss   r6,r8,r6
        fsub.sss   r10,r9,r5
        fadd.sss   r5,r9,r5
        st         r1,r2[r16]
        st         r20,r3[r16]
        lda.b      r16,r15,r11
        ld         r8,r21[r15]
        ld         r9,r22[r15]
        st         r6,r2[r15]
        fmul.sss   r1,r13,r7
        fmul.sss   r20,r14,r7
        st         r5,r3[r15]
        fmul.sss   r17,r14,r10
        fmul.sss   r18,r13,r10
        lda.b      r15,r15,r19
        cmp        r12,r15,r25
        ld         r6,r23[r15]
        ld         r5,r24[r15]
        fsub.sss   r1,r1,r17
        bb1.n      lo,r12,@L12
        fadd.sss   r20,r18,r20

Although the last two FP instructions in the loop do not complete before
the next iteration, there are no data dependencies which cause a stall
from loop to loop.  This code loop executes at 8.70 Mflops on an M88000
system running at 20 MHz.  This corresponds to 19.1 (native) Mips.

Note that the hand-coded assembler makes an assumption that the compilers
do not make: the data arrays are assumed not to overlap.  If the arrays do
overlap, then the minimum number of clocks for the loop is about 36
clocks, less than 10% different from the compiler B code above.

INSTRUCTION TIMING

Since this discussion has included the results of Motorola's instruction
rescheduler for the MC88000, a short discussion of the instruction timing
for the 88K instructions mentioned in this article is in order.  First, we
present a table of the timings of several common instructions:

    Instruction  Latency  Comments
    -----------  -------  ----------------------------------------------------
    addu            1     add unsigned
    mak             1     make bit field
    lda             1     load address
    st              1     store register (optional indexed, scaled-index modes)
    ld              3     load register (optional indexed, scaled-index modes)
    fadd.sss        5     FP add (SP)
    fadd.sdd        6     FP add (DP operands, SP result)
    fsub.sss        5     FP subtract (SP)
    fsub.sdd        6     FP subtract (DP operands, SP result)
    fmul.sss        6     FP multiply (SP)
    fmul.dss        6     FP multiply (SP operands, DP result)

The M88000 has a fully-pipelined data memory unit, so that loads and
stores can be issued each cycle in an overlapped fashion.  The results of
a load, of course, cannot be used until the load completes, and the
scoreboard enforces this.  Note also that the indexed and scaled-indexed
addressing modes supported by the data unit obviate the need to use the
integer unit adder for data array access.  The low-level concurrency
provided by this feature is one reason why compiler B's inner loop was
shorter and faster than compiler A's loop.

The FP unit actually has two pipes, one for multiply and one for add,
which share the first and last stages.  Note that the latency for the
mixed precision multiply (DP result) is the same as for the SP result,
since the register file's feed-forward circuitry manages this case.  The
FP unit is pipelined to allow the FP instructions listed to be issued each
cycle.

We now look at the timing diagrams for the rescheduled loop of compiler B
and the hand-coded loop, both listed above.  These diagrams give an idea
of the concurrency available at a low level in the M88000 architecture.
In these diagrams, the instruction duration is signified with a dash ("-")
for every clock that the instruction is active.
Note that the time that an instruction is active may not be the same as
its latency (e.g. st).  Note also that the effective latency may be
greater than the nominal latency for a particular instruction due to
contention for write-back slots to the register file.

    Rescheduled compiler B code (39 clocks)

    ld         ---
    lda.b      -
    ld         ---
    fsub.sss   -----
    fadd.sss   -----
    st         --
    ld         ---
    ld         ---
    fsub.sss   -----
    fadd.sss   -----
    fmul.sss   ------
    fmul.sss   ------
    st         --
    fmul.sss   ------
    fmul.sss   ------
    lda.b      -
    cmp        -
    fsub.sss   -----
    fadd.sss   -----
    st         --
    bb1.n      --
    st         --

    Hand-coded loop (23 clocks)

    fsub.sss   -----
    fadd.sss   -----
    fsub.sss   -----
    fadd.sss   -----
    st         --
    st         --
    lda.b      -
    ld         ---
    ld         ---
    st         --
    fmul.sss   ------
    fmul.sss   ------
    st         --
    fmul.sss   ------
    fmul.sss   ------
    lda.b      -
    cmp        -
    ld         ---
    ld         ---
    fsub.sss   -----
    bb1.n      --
    fadd.sss   -----

This diagram is particularly interesting in showing the FP pipeline
performance of the M88000.  Note that in 2 places there are 5 instructions
in progress at once, 4 of which are FP instructions.  This illustrates
Frank McMahon's assertion that, for pipelined architectures, the numerical
computation rate will be directly correlated to basic code block size
(roughly, the number of instructions in the inner loop).  For an
interesting comparison of the sensitivity of overall FP performance to the
number of FP instructions per inner loop, see McMahon's "The Livermore
Fortran Kernels: A Computer Test of the Numerical Performance Range",
Lawrence Livermore National Laboratory, UCRL-53745, December 1986, pp. 8-9.

CONCLUSIONS

The M88000 assembly code output from 2 different compilers (as well as an
optimum hand-coded example) for the inner loop of an FFT routine has been
compared.  The value of instruction scheduling for a given M88000 code
fragment has been shown.  Some of the M88000 instruction timings have been
given, and the timing diagrams for two of the loops were given.

  /\      /\      Mitch Alsup
 //\\    //\\     Willie Anderson
///\\\  ///\\\    Marvin Denman
// \\   // \\     Mike Paton
/ \/ \  / \/ \    Members of the Motorola 88000 Development Team
earl@mips.COM (Earl Killian) (07/02/88)
In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
> for (i=0 ; i<n ; i+=n1) {
>     l = i+n2;
>     xrt = xr[i] - xr[l];
>     xr[i] = xr[i] + xr[l];
>     xit = xi[i] - xi[l];
>     xi[i] = xi[i] + xi[l];
>     xr[l] = c*xrt - s*xit;
>     xi[l] = c*xit + s*xrt;
> }
> As a limit to the optimal instruction ordering for this fft inner loop
> example, we include here an equivalent inner loop, hand-coded in M88000
> assembler by Marvin Denman of the MC88100 Design Team.  Although the
> set-up and register usage for this loop differ slightly from the compiler
> examples given above, the number and selection of instructions is identical
> to compiler B.  This loop of 22 instructions executes in 23 clocks.
> [88100 code excised]
> Although the last two FP instructions in the loop do not complete before
> the next iteration, there are no data dependencies which cause a stall
> from loop to loop.  This code loop executes at 8.70 Mflops on an M88000
> system running at 20 MHz.  This corresponds to 19.1 (native) Mips.

A fun exercise.  Here's the MIPSco R3000 code for the inner loop:

        l.s     f0, 0(t0)       # xr[i]
    l:  l.s     f2, 0(t1)       # xr[l]
        l.s     f4, 0(t2)       # xi[i]
        sub.s   f10, f0, f2     # xrt = xr[i] - xr[l]
        l.s     f6, 0(t3)       # xi[l]
        mul.s   f20, f28, f10   # c*xrt
        sub.s   f12, f4, f6     # xit = xi[i] - xi[l]
        addu    t0, 4
        addu    t1, 4
        mul.s   f22, f30, f12   # s*xit
        add.s   f14, f0, f2     # xr[i] + xr[l]
        addu    t2, 4
        addu    t3, 4
        mul.s   f24, f28, f12   # c*xit
        add.s   f16, f4, f6     # xi[i] + xi[l]
        s.s     f14, -4(t0)     # xr[i] = xr[i] + xr[l]
        l.s     f0, 0(t0)       # xr[i+1]
        mul.s   f26, f30, f10   # s*xrt
        sub.s   f20, f20, f22   # c*xrt - s*xit
        s.s     f16, -4(t2)     # xi[i] = xi[i] + xi[l]
        s.s     f20, -4(t1)     # xr[l] = c*xrt - s*xit
        add.s   f24, f24, f26   # c*xit + s*xrt
        bne     t0, t4, l
        s.s     f24, -4(t3)     # xi[l] = c*xit + s*xrt

Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25 native
mips.  No pipelining within an op unit required (the R3010 has none).
Multiply/add/load/store overlap is extensively used.

Analyzing in terms of the previous posting: m=.8 (8 load/store cycles per
10 flops), f1=.6 (6 add/subtract out of 10 flops), f2=.4 (4 multiply out
of 10 flops).  Thus a new add/subtract is required every (m+1)/f1 = 3
cycles, and a new multiply is required every (m+1)/f2 = 4.5 cycles.  Since
the respective latencies are 2 and 4, no pipelining is required.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
brooks@maddog.llnl.gov (Eugene D. Brooks III) (07/06/88)
In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (william anderson) writes:
[A detailed exposition of the performance of two compilers and hand
assembly for the FFT inner loop.]
For even better pipeline utilization from compiler B, the public domain
compiler, change the register allocation strategy so that when looking
for a scratch register it remembers the last one allocated and starts
upward, folding around to R0 when it hits the top of the register file.
By doing this the false dependencies caused by register reuse will be
minimized and the postprocessing rescheduler will have more code that
can be moved around. We used this strategy with PCC for the Cerberus
instruction set with a postprocessing optimizer and it exploited the
functional unit pipelines very well. The same will apply for the 88000.
The situation for Cerberus will be even better using a GCC/{postprocessing
scheduler} combo. I am willing to bet that that "public domain" compiler
was GCC which is an ANSI C compiler that is to be taken quite seriously!
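A toy illustration in C of the allocation policy described above, next to
the usual lowest-numbered-free-register policy.  The register count, the
bitmask bookkeeping, and the function names are made up for the sketch and
are not taken from PCC or GCC; the point is only that remembering the last
register handed out postpones reuse, and with it the false
(write-after-read / write-after-write) dependencies that tie a
postprocessing scheduler's hands.

    #include <stdio.h>

    #define NREGS 32

    /* Usual policy: always hand out the lowest-numbered free register, so a
     * register freed one instruction ago is the first candidate for reuse. */
    static int alloc_lowest(unsigned *free_mask)
    {
        for (int r = 0; r < NREGS; r++)
            if (*free_mask & (1u << r)) { *free_mask &= ~(1u << r); return r; }
        return -1;                              /* spill needed */
    }

    /* Round-robin policy: remember the last register handed out and keep
     * walking upward, wrapping past the top of the file, so recently-freed
     * registers are reused as late as possible. */
    static int alloc_round_robin(unsigned *free_mask, int *last)
    {
        for (int i = 1; i <= NREGS; i++) {
            int r = (*last + i) % NREGS;
            if (*free_mask & (1u << r)) {
                *free_mask &= ~(1u << r);
                *last = r;
                return r;
            }
        }
        return -1;                              /* spill needed */
    }

    int main(void)
    {
        unsigned mask;
        int last = -1, a, b, c;

        mask = 0xFFFFFFFFu;
        a = alloc_lowest(&mask);
        mask |= 1u << a;                        /* value in a is dead: free it */
        b = alloc_lowest(&mask);                /* immediately hands back r0   */
        printf("lowest-free : r%d then r%d (instant reuse)\n", a, b);

        mask = 0xFFFFFFFFu;
        a = alloc_round_robin(&mask, &last);
        mask |= 1u << a;                        /* free it again               */
        b = alloc_round_robin(&mask, &last);
        c = alloc_round_robin(&mask, &last);
        printf("round-robin : r%d then r%d then r%d (reuse postponed)\n", a, b, c);
        return 0;
    }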
smryan@garth.UUCP (Steven Ryan) (07/07/88)
>For even better pipeline utilization from compiler B, the public domain
>compiler, change the register allocation strategy so that when looking
>for a scratch register it remembers the last one allocated and starts
>upward, folding around to R0 when it hits the top of the register file.
>By doing this the false dependencies caused by register reuse will be

Or do the scheduling before assigning registers (FTN200 EBBO).
yuval@taux01.UUCP (Gideon Yuval) (07/13/88)
In his 6/88 posting, William Anderson of Motorola says:
> The code from compiler A for the inner loop is 39 instructions and, as
> scheduled by compiler A, requires 75 clocks to execute.  Although
> compiler A is supposed to have a Motorola-developed instruction
> scheduling strategy incorporated in it, we can filter the above code
> through the Motorola instruction scheduler to get a code sequence which
> executes in 64 clocks.  Both code sequences are given below:

Is this instruction-scheduler available to the public?
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home), -52-522255 (work)
Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
(alternative E-mail address: decwrl!nsc!taux01!yuval@uunet.uu.net)
wca@oakhill.UUCP (william anderson) (07/15/88)
In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:

[FFT inner loop in C:]

>>  for (i=0 ; i<n ; i+=n1) {
>>      l = i+n2;
>>      xrt = xr[i] - xr[l];
>>      xr[i] = xr[i] + xr[l];
>>      xit = xi[i] - xi[l];
>>      xi[i] = xi[i] + xi[l];
>>      xr[l] = c*xrt - s*xit;
>>      xi[l] = c*xit + s*xrt;
>>  }

Mr. Killian responded with a hand-coded inner loop for the MIPS
architecture:

> [MIPS R3000 inner loop excised]
>
>Summary: 23 instructions, 23 cycles, 10.9 mflops @ 25MHz, and 25 native
>mips.  No pipelining within an op unit required (the R3010 has none).
>Multiply/add/load/store overlap is extensively used.

One problem with this code is that it assumes the "stride" of the loop
(the variable "n1" in the C code segment above) is unity!

What about the code for the inner loop in the general case?  What effect
does the assumption of non-unity stride have on the MIPS loop timing?
Since the results must be stored at the same addresses as the loads used,
must the MIPS code be altered for the general case to do its pointer
incrementing in some other place than the FP latency slots?

Since it lacks an indexed addressing mode, in this case the MIPS
architecture must increment 4 pointers instead of one array index.  The
Motorola 88000's ability to do more sophisticated (indexed and
scaled-indexed) addressing in its data memory unit allows a compiler (or
an assembly code writer) to loop-induce the index l out of the loop and
allows reference to all array elements with only one index increment per
loop (as opposed to 4 pointer increments per loop).  Thus the loop
instruction count for the M88K is reduced and performance is maximized
(22 instructions / 23 clocks / 8.7 MFLOPS / 19.1 native MIPS at 20 MHz,
as per <1359@claude.oakhill.UUCP>).

Thanks to Marvin Denman and Mitch Alsup of the M88K Design Group for their
help with this note.  The statements and opinions presented in this
article are my own.  They should not be interpreted as being the opinions
or policy, official or otherwise, of Motorola Inc.

  /\      /\      William C. Anderson
 //\\    //\\     Member of the M88000 Design Group
///\\\  ///\\\    Motorola Microprocessor Division
// \\   // \\     Oak Hill, TX.
/ \/ \  / \/ \
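The addressing-mode point can be restated at the C level, as a rough
sketch (the function names and argument lists are invented, and no claim
is made that either form matches any particular compiler's output): with
scaled-index loads and stores the first form can keep its array indexing
as indices, while a base+offset-only machine effectively ends up with the
second form, maintaining and bumping a separate pointer for each array
element accessed.

    /* The same FFT update written two ways.  Toy sketch only. */

    void fft_indexed(float *xr, float *xi, float c, float s,
                     unsigned n, unsigned n1, unsigned n2)
    {
        for (unsigned i = 0; i < n; i += n1) {
            unsigned l = i + n2;              /* indices; hardware scales by 4 */
            float xrt = xr[i] - xr[l];
            float xit = xi[i] - xi[l];
            xr[i] = xr[i] + xr[l];
            xi[i] = xi[i] + xi[l];
            xr[l] = c*xrt - s*xit;
            xi[l] = c*xit + s*xrt;
        }
    }

    void fft_pointers(float *xr, float *xi, float c, float s,
                      unsigned n, unsigned n1, unsigned n2)
    {
        float *p0 = xr, *p1 = xr + n2;        /* four live pointers            */
        float *p2 = xi, *p3 = xi + n2;
        for (unsigned i = 0; i < n; i += n1) {
            float xrt = *p0 - *p1, xit = *p2 - *p3;
            *p0 = *p0 + *p1;
            *p2 = *p2 + *p3;
            *p1 = c*xrt - s*xit;
            *p3 = c*xit + s*xrt;
            p0 += n1; p1 += n1; p2 += n1; p3 += n1;   /* four bumps per loop   */
        }
    }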
jmd@granite.dec.com (John Danskin) (07/16/88)
I have screwed around a little bit with Earl's code to try to answer some
of wca@oakhill (William C. Anderson)'s questions about performance with
non-unity stride.  There is probably a better answer, but here goes:

>In article <2532@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>>In article <1359@claude.oakhill.UUCP> wca@oakhill.UUCP (I) wrote:
>
> [FFT inner loop in C and MIPS R3000 inner loop excised]
>
>One problem with this code is that it assumes the "stride" of the loop
>(the variable "n1" in the C code segment above) is unity!
>
>What about the code for the inner loop in the general case?  What
>effect does the assumption of non-unity stride have on the MIPS loop
>timing?

        # 4 * n1 is in a register named 'n1'
        l.s     f0, 0(t0)       # xr[i]
    l:  l.s     f2, 0(t1)       # xr[l]
        l.s     f4, 0(t2)       # xi[i]
        sub.s   f10, f0, f2     # xrt = xr[i] - xr[l]
        l.s     f6, 0(t3)       # xi[l]
        mul.s   f20, f28, f10   # c*xrt
        sub.s   f12, f4, f6     # xit = xi[i] - xi[l]
        addu    t5, t0, n1
        addu    t6, t1, n1
        mul.s   f22, f30, f12   # s*xit
        add.s   f14, f0, f2     # xr[i] + xr[l]
        addu    t7, t2, n1
        addu    t8, t3, n1
        mul.s   f24, f28, f12   # c*xit
        add.s   f16, f4, f6     # xi[i] + xi[l]
        s.s     f14, 0(t0)      # xr[i] = xr[i] + xr[l]
        l.s     f0, 0(t5)       # xr[i+1]
        mul.s   f26, f30, f10   # s*xrt
        sub.s   f20, f20, f22   # c*xrt - s*xit
        s.s     f16, 0(t2)      # xi[i] = xi[i] + xi[l]
        s.s     f20, 0(t1)      # xr[l] = c*xrt - s*xit
        add.s   f24, f24, f26   # c*xit + s*xrt
        be      t5, t4, e
        s.s     f24, 0(t3)      # xi[l] = c*xit + s*xrt
        l.s     f2, 0(t6)       # xr[l]
        l.s     f4, 0(t7)       # xi[i]
        sub.s   f10, f0, f2     # xrt = xr[i] - xr[l]
        l.s     f6, 0(t8)       # xi[l]
        mul.s   f20, f28, f10   # c*xrt
        sub.s   f12, f4, f6     # xit = xi[i] - xi[l]
        addu    t0, t5, n1
        addu    t1, t6, n1
        mul.s   f22, f30, f12   # s*xit
        add.s   f14, f0, f2     # xr[i] + xr[l]
        addu    t2, t7, n1
        addu    t3, t8, n1
        mul.s   f24, f28, f12   # c*xit
        add.s   f16, f4, f6     # xi[i] + xi[l]
        s.s     f14, 0(t5)      # xr[i] = xr[i] + xr[l]
        l.s     f0, 0(t0)       # xr[i+1]
        mul.s   f26, f30, f10   # s*xrt
        sub.s   f20, f20, f22   # c*xrt - s*xit
        s.s     f16, 0(t7)      # xi[i] = xi[i] + xi[l]
        s.s     f20, 0(t6)      # xr[l] = c*xrt - s*xit
        add.s   f24, f24, f26   # c*xit + s*xrt
        bne     t0, t4, l
        s.s     f24, 0(t3)      # xi[l] = c*xit + s*xrt
    e:

This code has exactly the same performance as Earl's code except that it
also handles the non-unity case.

Yet another example of needing lots more registers to write good code in
pipelined machines.  Can I have lots more registers please?
-- 
John Danskin                  | decwrl!jmd
DEC Technology Development    | (415) 853-6724
100 Hamilton Avenue           | My comments are my own.
Palo Alto, CA 94306           | I do not speak for DEC.