[comp.sys.atari.st] improved long multiply for GCC -m68000

egisin@watmath.waterloo.edu (Eric Gisin) (11/30/88)

Here's an improved long multiply for the 68000.
The original came from some program in comp.sources.amiga.
It uses a single hardware multiply when the arguments are
the range 0 to 2^^16 - 1. The signed version is the
same as the unsigned, since a non-widening twos complement
multiply is the same for signed and unsigned operands
(except for overflow detection, which we ignore).

| optimized long multiply - author unknown

.globl	__mulsi3, __umulsi3

	.text; .even
__mulsi3:
__umulsi3:
	movel	SP@(4), D0
	movel	SP@(8), D1
	movel	D2, A1
	movel	D0, D2
	mulu	D1, D2
	movel	D2, A0
	movel	D0, D2
	orl	D1, D2
	swap	D2
	tstw	D2		| if (D0.hi | D1.hi) == 0
	beq	L1		| return A0
	movel	D0, D2
	swap	D2
	mulu	D1, D2
	swap	D1
	mulu	D1, D0
	addl	D2, D0
	swap	D0
	clrw	D0
	addl	D0, A0
L1:	movel	A0, D0
	movel	A1, D2
	rts

leo@philmds.UUCP (Leo de Wit) (12/06/88)

In article <22486@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes:
|Here's an improved long multiply for the 68000.
|The original came from some program in comp.sources.amiga.
|It uses a single hardware multiply when the arguments are
|the range 0 to 2^^16 - 1. The signed version is the
|same as the unsigned, since a non-widening twos complement
|multiply is the same for signed and unsigned operands
|(except for overflow detection, which we ignore).
|
|| optimized long multiply - author unknown
	[source left out]...

It is even feasible to do some more optimization:
(if we call the upper half of d0 d0.hi, the lower half d0.lo etc):
the routine currently calculates d0.lo x d1.lo, and, if one of (d0.hi,
d1.hi) is non-zero, also d0.lo x d1.hi and d0.hi x d1.lo. That is, 1
mult.  for d0, d1 in the 0..65535 range, 3 mult.'s in all other cases.

Since multiplications are rather expensive (as compared to many other
operations) some more testing for special cases come to mind:
If EITHER one of (d0.hi, d1.hi) is zero, we can leave out the
multiplications that use them (for obvious reasons). If either one of
(d0.hi, d1.hi) is -1 (that is, 0xffff in this case), we can reduce this
to the 'zero case', after doing a neg.l d0 (or d1, whichever it was),
and negating the result in the end, or alternatively subtracting d1.lo
or d0.lo from the partial sum, since adding a number multiplied by -1
is the same as subtracting it.

This isn't such a special case if you consider the fact that either one of
(d0.hi, d1.hi) must be zero (possibly after the 'negation trick') for the
result not to overflow. The conclusion is that two multiplications should
suffice (in case of no overflow).

If someone isn't already hacking on this one, I'll take a look at it and
post the results when I'm done.

                                 Leo.

P.S. Lattice C passes the arguments for multiplication in registers
(d0 and d1 I think). This can further increase the routine's speed,
since you save 4 stack references (and a stack justification ala 
addq.l #8,sp), although I don't know if it is possible to force the
compiler into this action for user supplied library routines.

leo@philmds.UUCP (Leo de Wit) (12/12/88)

After doing some homework I can now present a long multiply that is
time efficient also in cases of small negative operands, and only needs
one or two unsigned hardware multiplies plus additional overhead.

Since the discussion also started here I post it to this newsgroup
(instead of submitting to comp.sources); besides I'm having trouble
getting stuff posted in that group (still waiting to see 'nlqps2' and
'life' appear in the binaries group, Steven!).  Take it ('s') or leave
it ('n').

I tested it for all combinations of operands that are positive or
negative powers of two, and furthermore for a number of random cases.
This revealed no errors (except of course the cases that overflowed).
Enjoy!
          Leo.

N.B. Assemble with GST Assembler, or modify to suit your favorite asm.
-------------- s t a r t   o f   m u l l i ( ) ----------------------
******************************************************************************
*                                                                            *
*    mulli.asm  version 1.0 of 12 December 1988   (C) L.J.M. de Wit 1988     *
*                                                                            *
* This software may be used and distributed freely if not used commercially  *
* and the originator (me) is mentioned in the source (just leave this 9 line *
* header intact).                                                            *
*                                                                            *
******************************************************************************
*
* NAME
*    (u)mulli - improved long (un)signed multiply
*
* SYNTAX
*    long mulli(a,b) long a, b;
*    unsigned long umulli(a,b) unsigned long a, b;
*
* DESCRIPTION
*    Mulli was designed after a proposal presented by Eric Gisin to provide a
*    fast long multiply (signed & unsigned); thanks, Eric!
*    Like in the original, since a non-sign extending multiply is the same for
*    signed and unsigned cases, one version suffices. The drawback is that
*    there is no overflow detection, because this depends on the case, e.g.
*
*        0x00000010 x 0x01000000 == 0x10000000
*
*    which is correct in the unsigned, but overflows in the signed case.
*    IN CASE OF OVERFLOW, NO ASSUMPTIONS SHOULD BE MADE UPON THE RESULT.
*
*    A long x long -> long multiply can be written as follows (p,q,r,s
*    are 16 bit numbers):
*
*                     High word    Low word
*    First operand:     q           p
*    Second operand:    s           r
*    --------------------------------------------- *
*    (q * s << 32) + (q * r + p * s << 16) + p * r
*
*    Since the result is a 32 bit number, this can be written:
*        ((LOW(q * r) + LOW(p*s)) << 16) + p * r
*    or
*        MED(q * r) + MED(p*s) + p * r.
*    where LOW() denotes the low order word (16 bits) of a 16 or 32 bit number,
*    and MED() is this low order word shifted 16 places (so a 32 bit number).
*    This demands effectively three (or less) multiplies (and some additional
*    stuff like shifts and extended adds).
*    It is clear that whenever the high word part of an input operand is 0,
*    as in the case of small positive numbers, some part of the result does not
*    need to be calculated and so one multiply less is required (the original
*    used this in the case both operands had their high parts 0, so that the
*    operation degrades to p * r).
*    For instance, if q == 0, the operation degrades to
*        MED(p*s) + p * r.
*    An other simplification can be made for small negative numbers, that is,
*    when the high part of an operand is 0xffff. For instance, when q == 0xffff
*    the part involving q can be written:
*        LOW(0xffff * r) << 16  ==
*        LOW(0x10000 * r - r) << 16 ==
*        LOW(-r) << 16 ==
*        MED(-r) ==
*        -MED(r)
*    So in this case we can also save a multiply.
*    As in the original, shifting by 16 is accomplished by swapping a register
*    and clearing the low order word, which is faster than doing a
*        asl.l #16,d0.
*    Since the result goes into a 32 bit longword, add is used instead of addx
*    for addition of partial results (the carry would go into the next longword
*    anyway).
*    The routine splits up into the combinations of the following cases:
*        a) the high word of an operand is 0 (0x0000).
*        b) the high word of an operand is -1 (0xffff).
*        c) the high word of an operand is neither 0 nor -1.
*    which amounts to 9 different cases (of which one will always overflow and
*    is not split up as a separate case).
*
* SPEED
*    As in the original, register d2 is used as an alternative calculation
*    register and its contents saved in a1. In the case of small x small
*    multiplication, there is no need for d2 and we save 12 ticks by not using
*    it. In the other cases we save, depending on the case, from 44 to 148
*    ticks (with respect to the original). This effectively means that all 
*    small x small multiplies with at least one negative operand are now twice
*    as fast.
*
*    Further speed increases could be gained by:
*    a) Passing the operands in registers. Lattice C uses this default
*        for its builtin multiply (CXM33 for signed and CXM22 for unsigned
*        multiply). This saves 2 x 16 == 32 ticks in the callee
*            (move.l 4(sp),d0 and move.l 8(sp),d1)
*        and 2 x 12 + 4 == 28 ticks in the caller
*            (move.l d1,-(sp) , move.l d2,-(sp) and addq.l #8,sp)
*        , thus gaining 60 ticks!
*    b) Having d2 as a scratch register. Lattice C does not need to save
*        the d2 register, since it is used as a scratch (temporary, calculation)
*        register. This means that all move.l a1,d2 and move.l d2,a1 sequences
*        can be removed. Since most cases require one of each (see code), this
*        saves 2 x 4 == 8 ticks on these cases.
*    c) Inlining.
*    d) Whatever shrewd idea you might have. I'm very interested!
*
* BUGS/SHORTCOMINGS
*    Since the code was created with speed as main motive, no objections should
*    be made against e.g. the code having multiple exit points (this saves us
*    another 12 ticks!) and having separate regions for parts that could easily
*    be shared (again saving branch instructions).
*
*    No overflow detection, and incorrect results in case of overflow.
*
*    Care has been taken to use meaningless labels 8-).
*
* NOTATION
*    The notation Dreg = (a,b) is used to mean Dreg.hi = a, Dreg.lo = b.
*

   section s.ccode

   xdef mulli
   xdef umulli

mulli
umulli
   move.l   4(sp),d0    ; q = d0.hi, p = d0.lo, d0 = (q,p)
   move.l   8(sp),d1    ; s = d1.hi, r = d1.lo, d1 = (s,r)
   swap     d0          ; d0 = (p,q)
   tst.w    d0          ;
   bne.s    alfa        ;
   swap     d0          ;
   swap     d1          ;
   tst.w    d1          ;
   bne.s    beta        ;
   swap     d1          ; d0 = (0,p), d1 = (0,r), q = 0, s = 0
   mulu.w   d1,d0       ; d0 = p * q
   rts                  ; case: q == 0 && s == 0

alfa
   move.l   d2,a1       ; save d2
   addq.w   #1,d0       ; d0 = (p,q+1), d1 = (s,r)
   bne.s    gamma       ;
   swap     d0          ; d0 = (0,p), q = -1
   move.w   d0,d2       ; 
   mulu.w   d1,d2       ; d2 = p * r
   swap     d1          ;
   tst.w    d1          ;
   bne.s    delta       ;
   move.l   d2,d0       ; d0 = p * r, d1 = (r, 0), s = 0
   sub.l    d1,d0       ; d0 = p * r - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == 0

delta
   addq.w   #1,d1       ; d0 = (0,p), d1 = (r,s+1), d2 = p * r, q = -1
   bne.s    epsilon     ;
   swap     d0          ; d0 = (p,0), d1 = (r,0), s = 0
   sub.l    d0,d2       ; d2 = p * r - MED(p)
   sub.l    d1,d2       ; d2 = p * r - MED(p) - MED(r)
   move.l   d2,d0       ; d0 = p * r - MED(p) - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == -1

epsilon
   subq.w   #1,d1       ; d0 = (0,p), d1 = (r,s), d2 = p * r
   mulu.w   d1,d0       ; d0 = p * s
   swap     d0          ;
   clr.w    d0          ; d0 = MED(p * s)
   clr.w    d1          ; d1 = MED(r)
   add.l    d0,d2       ; d2 = p * r + MED(p * s)
   sub.l    d1,d2       ; d2 = p * r + MED(p * s) - MED(r)
   move.l   d2,d0       ; d0 = p * r + MED(p * s) - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == n

gamma
   subq.w   #1,d0       ; d0 = (p,q), d1 = (s,r)
   move.w   d0,d2       ;
   mulu.w   d1,d2       ; d2 = q * r
   swap     d0          ; d0 = (q,p)
   tst.l    d1          ; 
   bmi.s    zeta        ;
   mulu.w   d1,d0       ; d0 = p * r, s >= 0
   swap     d2          ;
   clr.w    d2          ; d2 = MED(q * r)
   add.l    d2,d0       ; d0 = p * r + MED(q * r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == n && (s == 0 || s == n)
; note that MED(p,s) is ignored since if s != 0 there will be overflow anyway
; and thus s is assumed to be 0

zeta
   mulu.w   d0,d1       ; d0 = (q,p), d1 = p * r, d2 = q * r, s < 0
   sub.w    d0,d2       ; 
   swap     d2          ;
   clr.w    d2          ; d2 = MED(q * r) - MED(p)
   add.l    d2,d1       ; d1 = p * r + MED(q * r) - MED(p)
   move.l   d1,d0       ; d0 = p * r + MED(q * r) - MED(p)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == n && (s == -1 || s == n)
; note that MED(p,s) is ignored since if s != -1 there will be overflow anyway
; and thus s is assumed to be -1

beta
   addq.w   #1,d1       ; d0 = (0,p), d1 = (r,s+1), q = 0
   bne.s    theta       ;
   swap     d1          ; d1 = (0,r), s = -1
   mulu.w   d0,d1       ; d1 = p * r
   swap     d0          ; d0 = MED(p)
   sub.l    d0,d1       ; d1 = p * r - MED(p)
   move.l   d1,d0       ; d0 = p * r - MED(p)
   rts                  ; case: q == 0 && s == -1

theta
   subq.w   #1,d1       ; d0 = (0,p), d1 = (r,s), q = 0
   move.l   d2,a1       ; save d2
   move.w   d1,d2       ; 
   mulu.w   d0,d2       ; d2 = p * s
   swap     d1          ; d1 = (s,r)
   mulu.w   d1,d0       ; d0 = p * r
   swap     d2          ;
   clr.w    d2          ; d2 = MED(p * s)
   add.l    d2,d0       ; d0 = p * r + MED(p * s)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == 0 && s == n

   end