[comp.sys.m88k] Effective use of the 88k pipeline

wood@dg-rtp.dg.com (Tom Wood) (01/13/90)
In article <TOM.90Jan9101628@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:

> As near as I can tell 68 is not equal to 48. Do you have actual assembler
> code that does this inner loop in 48 cycles? Could you post it?

Apples and oranges time again.  You're right about 68, but the example was
single precision.  Here's the code with the loop unrolled four times:

;	DO 10 J = 1,N
;   10	    A(I,J) = A(I,J) + B(I,K) * C(K,J)
;
;	r2 = address of A(I,1)
;	r3 = address of C(K,1)
;	r4 = value of B(I,K)
;	r5 = N

@L1:
	ld	 r6,r3,0x0000		;    D6--
	ld	 r8,r3,0x0004		;    D86-
	ld	 r10,r3,0x0008		;    Da86
	fmul.sss r6,r4,r6		;    D-a8 F6               WB:  r6 FF
	ld	 r12,r3,0x000c		;    Dc-a         *6---    WB:  r8
	fmul.sss r8,r4,r8		;    D-c- F8      *-6--    WB: r10
	ld	 r7,r2,0x0000		;    D7-c         *8-6-
	fmul.sss r10,r4,r10		;    D-7- Fa      *-8-6    WB: r12
	ld	 r9,r2,0x0004		;    D9-7         *a-8- L6
	fmul.sss r12,r4,r12		;    D-97 Fc      *-a-8    WB:  r6
	ld	 r11,r2,0x0008		;    Db-9         *c-a- L8 WB:  r7
	fadd.sss r7,r6,r7		;    D-b9 F7      *-c-a    WB:  r8
	ld	 r13,r2,0x000c		;    Dd-b    +7-- *--c- La WB:  r9
	fadd.sss r9,r8,r9		;    D-db F9 +-7- *---c    WB: r10
	fadd.sss r11,r10,r11		;    D--d Fb +9-7       Lc WB: r11 FF
	;stall	 mem,no fp ff		;    D--d    +b9-       L7 WB: r12
	;stall	 mem			;    D--d    +-b9          WB:  r7
	fadd.sss r13,r12,r13		;         Fd +--b       L9 WB: r13 FF
	st	 r7,r2,0x0000		;    D7--    +d--       Lb WB:  r9
	;stall	 pipe full		;    D7--    +-d-          WB: r11
	st	 r9,r2,0x0004		;    D97-    +--d          WB:<st>
	st	 r11,r2,0x0008		;    Db97               Ld WB:<st>
	;stall	 pipe full		;    Db-9                  WB: r13
	st	 r13,r2,0x000c		;    Ddb-                  WB:<st>
	addu	 r2,r2,0x0010		; I2 D-db                  WB:<st>
	cmp	 r14,r2,r5		; Ie D--d                  WB:  r2 FF
	bb0.n	 hi,r14,@L1		;                          WB: r14 FF
	addu	 r3,r3,0x0010		; I3

(Disclaimer: We don't have 100% confidence in the above annotation; however,
the 28-cycle count agrees with the simulator.)
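
For readers who don't want to trace the assembler, here is a rough C sketch of
what one trip through the unrolled loop computes.  It is not the source that
produced the code above: the function name, the argument order, and the
cleanup loop for leftover elements are assumptions made for illustration, and
it assumes the elements of A and C are contiguous 4-byte floats, as the
0x0000 to 0x000c offsets suggest.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post):
 * a[j] = a[j] + b * c[j] for j in [0, n), single precision,
 * with the main loop unrolled four times like the assembler above. */
void row_update_unroll4(float *a, const float *c, float b, size_t n)
{
    size_t j = 0;

    /* Main loop: four elements per trip. */
    for (; j + 4 <= n; j += 4) {
        /* Four loads from c and four multiplies by b
         * (the ld r6/r8/r10/r12 and fmul.sss instructions). */
        float p0 = b * c[j + 0];
        float p1 = b * c[j + 1];
        float p2 = b * c[j + 2];
        float p3 = b * c[j + 3];

        /* Four loads from a, four adds, four stores
         * (the ld r7/r9/r11/r13, fadd.sss, and st instructions). */
        a[j + 0] = a[j + 0] + p0;
        a[j + 1] = a[j + 1] + p1;
        a[j + 2] = a[j + 2] + p2;
        a[j + 3] = a[j + 3] + p3;
    }

    /* Cleanup for n not a multiple of 4. This part is an addition of the
     * sketch; the posted assembler shows only the unrolled loop body. */
    for (; j < n; j++) {
        a[j] = a[j] + b * c[j];
    }
}
```

Each trip does 8 loads, 4 multiplies, 4 adds, and 4 stores.  At 28 cycles per
trip, that works out to 7 cycles per element.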

I'd also like to acknowledge Dave Morey (a former DG employee) for the idea
that the 88k is a good vector processor.  The timings and examples are his.
---
			Tom Wood	(919) 248-6067
			Data General, Research Triangle Park, NC
			{the known world}!rti!xyzzy!wood, wood@dg-rtp.dg.com