wood@dg-rtp.dg.com (Tom Wood) (01/13/90)
In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
> As near as I can tell 68 is not equal to 48. Do you have actual assembler
> code that does this inner loop in 48 cycles? Could you post it?

Apples and oranges time again. You're right about 68, but the example was
single precision. Here's the code for 4 unrolls:

;       DO 10 J = I,N
;  10   A(I,J) = A(I,J) + B(I,K) * C(K,J)
;
;  r2 = address of A(I,1)
;  r3 = address of C(K,1)
;  r4 = value of B(I,K)
;  r5 = N
@L1:
        ld        r6,r3,0x0000     ; D6--
        ld        r8,r3,0x0004     ; D86-
        ld        r10,r3,0x0008    ; Da86
        fmul.sss  r6,r4,r6         ; D-a8 F6 WB: r6 FF
        ld        r12,r3,0x000c    ; Dc-a *6--- WB: r8
        fmul.sss  r8,r4,r8         ; D-c- F8 *-6-- WB: r10
        ld        r7,r2,0x0000     ; D7-c *8-6-
        fmul.sss  r10,r4,r10       ; D-7- Fa *-8-6 WB: r12
        ld        r9,r2,0x0004     ; D9-7 *a-8- L6
        fmul.sss  r12,r4,r12       ; D-97 Fc *-a-8 WB: r6
        ld        r11,r2,0x0008    ; Db-9 *c-a- L8 WB: r7
        fadd.sss  r7,r6,r7         ; D-b9 F7 *-c-a WB: r8
        ld        r13,r2,0x000c    ; Dd-b +7-- *--c- La WB: r9
        fadd.sss  r9,r8,r9         ; D-db F9 +-7- *---c WB: r10
        fadd.sss  r11,r10,r11      ; D--d Fb +9-7 Lc WB: r11 FF
        ;stall mem,no fp ff        ; D--d +b9- L7 WB: r12
        ;stall mem                 ; D--d +-b9 WB: r7
        fadd.sss  r13,r12,r13      ; Fd +--b L9 WB: r13 FF
        st        r7,r2,0x0000     ; D7-- +d-- Lb WB: r9
        ;stall pipe full           ; D7-- +-d- WB: r11
        st        r9,r2,0x0004     ; D97- +--d WB:<st>
        st        r11,r2,0x0008    ; Db97 Ld WB:<st>
        ;stall pipe full           ; Db-9 WB: r13
        st        r13,r2,0x000c    ; Ddb- WB:<st>
        addu      r2,r2,0x0010     ; I2 D-db WB:<st>
        cmp       r14,r2,r5        ; Ie D--d WB: r2 FF
        bb0.n     hi,r14,@L1       ; WB: r14 FF
        addu      r3,r3,0x0010     ; I3

(Disclaimer: We don't have 100% confidence in the above annotation, however
the 28 cycle count agrees with the simulator.)

I'd also like to acknowledge Dave Morey (a former DG employee) for the idea
that the 88k is a good vector processor. The timings and examples are his.
---
Tom Wood (919) 248-6067
Data General, Research Triangle Park, NC
{the known world}!rti!xyzzy!wood, wood@dg-rtp.dg.com