egisin@watmath.waterloo.edu (Eric Gisin) (11/30/88)
Here's an improved long multiply for the 68000.
The original came from some program in comp.sources.amiga.
It uses a single hardware multiply when the arguments are
the range 0 to 2^^16 - 1. The signed version is the
same as the unsigned, since a non-widening twos complement
multiply is the same for signed and unsigned operands
(except for overflow detection, which we ignore).
| optimized long multiply - author unknown
.globl __mulsi3, __umulsi3
.text; .even
__mulsi3:
__umulsi3:
movel SP@(4), D0
movel SP@(8), D1
movel D2, A1
movel D0, D2
mulu D1, D2
movel D2, A0
movel D0, D2
orl D1, D2
swap D2
tstw D2 | if (D0.hi | D1.hi) == 0
beq L1 | return A0
movel D0, D2
swap D2
mulu D1, D2
swap D1
mulu D1, D0
addl D2, D0
swap D0
clrw D0
addl D0, A0
L1: movel A0, D0
movel A1, D2
rts
leo@philmds.UUCP (Leo de Wit) (12/06/88)
In article <22486@watmath.waterloo.edu> egisin@watmath.waterloo.edu (Eric Gisin) writes: |Here's an improved long multiply for the 68000. |The original came from some program in comp.sources.amiga. |It uses a single hardware multiply when the arguments are |the range 0 to 2^^16 - 1. The signed version is the |same as the unsigned, since a non-widening twos complement |multiply is the same for signed and unsigned operands |(except for overflow detection, which we ignore). | || optimized long multiply - author unknown [source left out]... It is even feasible to do some more optimization: (if we call the upper half of d0 d0.hi, the lower half d0.lo etc): the routine currently calculates d0.lo x d1.lo, and, if one of (d0.hi, d1.hi) is non-zero, also d0.lo x d1.hi and d0.hi x d1.lo. That is, 1 mult. for d0, d1 in the 0..65535 range, 3 mult.'s in all other cases. Since multiplications are rather expensive (as compared to many other operations) some more testing for special cases come to mind: If EITHER one of (d0.hi, d1.hi) is zero, we can leave out the multiplications that use them (for obvious reasons). If either one of (d0.hi, d1.hi) is -1 (that is, 0xffff in this case), we can reduce this to the 'zero case', after doing a neg.l d0 (or d1, whichever it was), and negating the result in the end, or alternatively subtracting d1.lo or d0.lo from the partial sum, since adding a number multiplied by -1 is the same as subtracting it. This isn't such a special case if you consider the fact that either one of (d0.hi, d1.hi) must be zero (possibly after the 'negation trick') for the result not to overflow. The conclusion is that two multiplications should suffice (in case of no overflow). If someone isn't already hacking on this one, I'll take a look at it and post the results when I'm done. Leo. P.S. Lattice C passes the arguments for multiplication in registers (d0 and d1 I think). This can further increase the routine's speed, since you save 4 stack references (and a stack justification ala addq.l #8,sp), although I don't know if it is possible to force the compiler into this action for user supplied library routines.
leo@philmds.UUCP (Leo de Wit) (12/12/88)
After doing some homework I can now present a long multiply that is time efficient also in cases of small negative operands, and only needs one or two unsigned hardware multiplies plus additional overhead. Since the discussion also started here I post it to this newsgroup (instead of submitting to comp.sources); besides I'm having trouble getting stuff posted in that group (still waiting to see 'nlqps2' and 'life' appear in the binaries group, Steven!). Take it ('s') or leave it ('n'). I tested it for all combinations of operands that are positive or negative powers of two, and furthermore for a number of random cases. This revealed no errors (except of course the cases that overflowed). Enjoy! Leo. N.B. Assemble with GST Assembler, or modify to suit your favorite asm. -------------- s t a r t o f m u l l i ( ) ---------------------- ****************************************************************************** * * * mulli.asm version 1.0 of 12 December 1988 (C) L.J.M. de Wit 1988 * * * * This software may be used and distributed freely if not used commercially * * and the originator (me) is mentioned in the source (just leave this 9 line * * header intact). * * * ****************************************************************************** * * NAME * (u)mulli - improved long (un)signed multiply * * SYNTAX * long mulli(a,b) long a, b; * unsigned long umulli(a,b) unsigned long a, b; * * DESCRIPTION * Mulli was designed after a proposal presented by Eric Gisin to provide a * fast long multiply (signed & unsigned); thanks, Eric! * Like in the original, since a non-sign extending multiply is the same for * signed and unsigned cases, one version suffices. The drawback is that * there is no overflow detection, because this depends on the case, e.g. * * 0x00000010 x 0x01000000 == 0x10000000 * * which is correct in the unsigned, but overflows in the signed case. * IN CASE OF OVERFLOW, NO ASSUMPTIONS SHOULD BE MADE UPON THE RESULT. * * A long x long -> long multiply can be written as follows (p,q,r,s * are 16 bit numbers): * * High word Low word * First operand: q p * Second operand: s r * --------------------------------------------- * * (q * s << 32) + (q * r + p * s << 16) + p * r * * Since the result is a 32 bit number, this can be written: * ((LOW(q * r) + LOW(p*s)) << 16) + p * r * or * MED(q * r) + MED(p*s) + p * r. * where LOW() denotes the low order word (16 bits) of a 16 or 32 bit number, * and MED() is this low order word shifted 16 places (so a 32 bit number). * This demands effectively three (or less) multiplies (and some additional * stuff like shifts and extended adds). * It is clear that whenever the high word part of an input operand is 0, * as in the case of small positive numbers, some part of the result does not * need to be calculated and so one multiply less is required (the original * used this in the case both operands had their high parts 0, so that the * operation degrades to p * r). * For instance, if q == 0, the operation degrades to * MED(p*s) + p * r. * An other simplification can be made for small negative numbers, that is, * when the high part of an operand is 0xffff. For instance, when q == 0xffff * the part involving q can be written: * LOW(0xffff * r) << 16 == * LOW(0x10000 * r - r) << 16 == * LOW(-r) << 16 == * MED(-r) == * -MED(r) * So in this case we can also save a multiply. * As in the original, shifting by 16 is accomplished by swapping a register * and clearing the low order word, which is faster than doing a * asl.l #16,d0. * Since the result goes into a 32 bit longword, add is used instead of addx * for addition of partial results (the carry would go into the next longword * anyway). * The routine splits up into the combinations of the following cases: * a) the high word of an operand is 0 (0x0000). * b) the high word of an operand is -1 (0xffff). * c) the high word of an operand is neither 0 nor -1. * which amounts to 9 different cases (of which one will always overflow and * is not split up as a separate case). * * SPEED * As in the original, register d2 is used as an alternative calculation * register and its contents saved in a1. In the case of small x small * multiplication, there is no need for d2 and we save 12 ticks by not using * it. In the other cases we save, depending on the case, from 44 to 148 * ticks (with respect to the original). This effectively means that all * small x small multiplies with at least one negative operand are now twice * as fast. * * Further speed increases could be gained by: * a) Passing the operands in registers. Lattice C uses this default * for its builtin multiply (CXM33 for signed and CXM22 for unsigned * multiply). This saves 2 x 16 == 32 ticks in the callee * (move.l 4(sp),d0 and move.l 8(sp),d1) * and 2 x 12 + 4 == 28 ticks in the caller * (move.l d1,-(sp) , move.l d2,-(sp) and addq.l #8,sp) * , thus gaining 60 ticks! * b) Having d2 as a scratch register. Lattice C does not need to save * the d2 register, since it is used as a scratch (temporary, calculation) * register. This means that all move.l a1,d2 and move.l d2,a1 sequences * can be removed. Since most cases require one of each (see code), this * saves 2 x 4 == 8 ticks on these cases. * c) Inlining. * d) Whatever shrewd idea you might have. I'm very interested! * * BUGS/SHORTCOMINGS * Since the code was created with speed as main motive, no objections should * be made against e.g. the code having multiple exit points (this saves us * another 12 ticks!) and having separate regions for parts that could easily * be shared (again saving branch instructions). * * No overflow detection, and incorrect results in case of overflow. * * Care has been taken to use meaningless labels 8-). * * NOTATION * The notation Dreg = (a,b) is used to mean Dreg.hi = a, Dreg.lo = b. * section s.ccode xdef mulli xdef umulli mulli umulli move.l 4(sp),d0 ; q = d0.hi, p = d0.lo, d0 = (q,p) move.l 8(sp),d1 ; s = d1.hi, r = d1.lo, d1 = (s,r) swap d0 ; d0 = (p,q) tst.w d0 ; bne.s alfa ; swap d0 ; swap d1 ; tst.w d1 ; bne.s beta ; swap d1 ; d0 = (0,p), d1 = (0,r), q = 0, s = 0 mulu.w d1,d0 ; d0 = p * q rts ; case: q == 0 && s == 0 alfa move.l d2,a1 ; save d2 addq.w #1,d0 ; d0 = (p,q+1), d1 = (s,r) bne.s gamma ; swap d0 ; d0 = (0,p), q = -1 move.w d0,d2 ; mulu.w d1,d2 ; d2 = p * r swap d1 ; tst.w d1 ; bne.s delta ; move.l d2,d0 ; d0 = p * r, d1 = (r, 0), s = 0 sub.l d1,d0 ; d0 = p * r - MED(r) move.l a1,d2 ; restore d2 rts ; case: q == -1 && s == 0 delta addq.w #1,d1 ; d0 = (0,p), d1 = (r,s+1), d2 = p * r, q = -1 bne.s epsilon ; swap d0 ; d0 = (p,0), d1 = (r,0), s = 0 sub.l d0,d2 ; d2 = p * r - MED(p) sub.l d1,d2 ; d2 = p * r - MED(p) - MED(r) move.l d2,d0 ; d0 = p * r - MED(p) - MED(r) move.l a1,d2 ; restore d2 rts ; case: q == -1 && s == -1 epsilon subq.w #1,d1 ; d0 = (0,p), d1 = (r,s), d2 = p * r mulu.w d1,d0 ; d0 = p * s swap d0 ; clr.w d0 ; d0 = MED(p * s) clr.w d1 ; d1 = MED(r) add.l d0,d2 ; d2 = p * r + MED(p * s) sub.l d1,d2 ; d2 = p * r + MED(p * s) - MED(r) move.l d2,d0 ; d0 = p * r + MED(p * s) - MED(r) move.l a1,d2 ; restore d2 rts ; case: q == -1 && s == n gamma subq.w #1,d0 ; d0 = (p,q), d1 = (s,r) move.w d0,d2 ; mulu.w d1,d2 ; d2 = q * r swap d0 ; d0 = (q,p) tst.l d1 ; bmi.s zeta ; mulu.w d1,d0 ; d0 = p * r, s >= 0 swap d2 ; clr.w d2 ; d2 = MED(q * r) add.l d2,d0 ; d0 = p * r + MED(q * r) move.l a1,d2 ; restore d2 rts ; case: q == n && (s == 0 || s == n) ; note that MED(p,s) is ignored since if s != 0 there will be overflow anyway ; and thus s is assumed to be 0 zeta mulu.w d0,d1 ; d0 = (q,p), d1 = p * r, d2 = q * r, s < 0 sub.w d0,d2 ; swap d2 ; clr.w d2 ; d2 = MED(q * r) - MED(p) add.l d2,d1 ; d1 = p * r + MED(q * r) - MED(p) move.l d1,d0 ; d0 = p * r + MED(q * r) - MED(p) move.l a1,d2 ; restore d2 rts ; case: q == n && (s == -1 || s == n) ; note that MED(p,s) is ignored since if s != -1 there will be overflow anyway ; and thus s is assumed to be -1 beta addq.w #1,d1 ; d0 = (0,p), d1 = (r,s+1), q = 0 bne.s theta ; swap d1 ; d1 = (0,r), s = -1 mulu.w d0,d1 ; d1 = p * r swap d0 ; d0 = MED(p) sub.l d0,d1 ; d1 = p * r - MED(p) move.l d1,d0 ; d0 = p * r - MED(p) rts ; case: q == 0 && s == -1 theta subq.w #1,d1 ; d0 = (0,p), d1 = (r,s), q = 0 move.l d2,a1 ; save d2 move.w d1,d2 ; mulu.w d0,d2 ; d2 = p * s swap d1 ; d1 = (s,r) mulu.w d1,d0 ; d0 = p * r swap d2 ; clr.w d2 ; d2 = MED(p * s) add.l d2,d0 ; d0 = p * r + MED(p * s) move.l a1,d2 ; restore d2 rts ; case: q == 0 && s == n end