[misc.wanted] MIPS Assembler Procedure

bs@linus.UUCP (Robert D. Silverman) (06/21/89)

The MIPS processor has a multiply instruction which will produce a double
length product from two 32 bit words, storing the result in the special
LO and HI registers.

However, there is no instruction that will divide that double length
product by a 32 bit word, giving a 32 bit quotient and/or remainder.

One can, of couse, program a nitty gritty double precision divide routine.

However, I don't know the MIPS assembler well enough to be confident that 
I would write an efficient routine.

Does anyone have a routine that will take a double length product out
of the LO/HI register pair and divide it by a 32 bit (unsigned) integer,
yielding a quotient and/or remainder?

This, in my opinion is one of the major faults of RISC processors. They
do not provide basic arithmetic instructions. 

Note that the above operation is trivial on many microprocessors. I have
routines which do it well for the 68020/68030, 80286/80386, NSC 320000,
VAX, etc.

Bob Silverman

henry@utzoo.uucp (Henry Spencer) (06/25/89)

In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>This, in my opinion is one of the major faults of RISC processors. They
>do not provide basic arithmetic instructions. 

When the list of "basic" arithmetic instructions is pages long, one starts
to wonder how many of them are really "basic".  The instruction you ask
for -- divide double length by single length yielding single-length result --
is not exactly frequently needed.  Just how much silicon is it worth to make
it run faster than an implementation as a subroutine?
-- 
NASA is to spaceflight as the  |     Henry Spencer at U of Toronto Zoology
US government is to freedom.   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mash@mips.COM (John Mashey) (06/25/89)

In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>>This, in my opinion is one of the major faults of RISC processors. They
>>do not provide basic arithmetic instructions. 

>When the list of "basic" arithmetic instructions is pages long, one starts
>to wonder how many of them are really "basic".  The instruction you ask
>for -- divide double length by single length yielding single-length result --
>is not exactly frequently needed.  Just how much silicon is it worth to make
>it run faster than an implementation as a subroutine?

Actually, for many languages, on 32-bit machines, use of 32-bit
divisor and 64-bit dividend is often worse than useless, especially
if it's the only one you have.  For example, consider S/360 & followons:
	1) The divisor is 64-bits, and must occupy an even-odd register
	pair.
	2) The divide leaves remainder in the even and quotient in the odd.

1) If the dividend (which naturally starts as 32-bits in most cases)
	is already in an even register (R)
	is a value not needed any more
	has an odd-register partner that isn't need any more
then
	you need one instruction, a shift-right-double algebraic (SRDA)
	to prepare the divisor, i.e., to get:
	SRDA	R,32
	DR	R,divisor
	LR	result,R	(reminder, or R+1 for quotient)

2) If the dividend
	is in an odd register
	has a value that should be retained because it's useful later,
		i.e., J = I/7... use of I
	has an odd-register partner whose value should be retained
then
	you probably end up generating usage of a scratch register pair:
	LR	RTEMP,R
	SRDA	RTEMP,32
	DR	RTEMP,divisor
	LR	result,RTEMP	(or RTEMP+1, dep. on d

3) Needless to say, the double-register stuff is highly disliked by
optimizing compiler writers, as it disrupts the regularity of some of
the algorithms; as a result, many compilers end up having to use version 2),
at least some of the time.  (maybe people current on S/360 architectures
might say how often?  I'm too rusty to know.)

4) The MIPS sequence is basically what's done by 2) above:
	div	R,divisor
	.....		other instructions, if compiler can find some
			and zero-divide check if needed
	mfhi	result		(or mflo, to get quotient or remainder)

5) Now, none of this is a terrible deal, as divides aren't that frequent,
which is why some RISC architectures omit them entirely.  [It depends
on which benchmarks you chose whether you think this is good or not.]
But note that something that hardware designers might (or might not)
think was helping the compiler folks was really getting in their road....

6) When we were doing this, we couldn't think of any languages whose
natural implementation on a 32-bit machine would expect generation
of a 64 by 32 bit divide as the expected operation.  Does anybody
know of any such, or is thsi strictly something one would use in
handcoded routines?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

slackey@bbn.com (Stan Lackey) (06/26/89)

In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>>This, in my opinion is one of the major faults of RISC processors. They
>>do not provide basic arithmetic instructions. 
>
>When the list of "basic" arithmetic instructions is pages long, one starts
>to wonder how many of them are really "basic".  The instruction you ask
>for -- divide double length by single length yielding single-length result --
>is not exactly frequently needed.  Just how much silicon is it worth to make
>it run faster than an implementation as a subroutine?

The RISC suppliers would like us (users) to believe that RISC is
ALWAYS faster.  I can't help thinking back to when Patterson's first
RISC papers came out.  He managed to tune the code on his RISC until
all of the benchmarks ran faster than the VAX 11/780, except for one,
which happened to exercise one of the microcoded character string
instructions a lot.

The moral of the story: If your application needs mul and div
performance, get a micro with hardware support for them.  If that
means using a CISC, so be it.

BTW:  divide-double-by-single-yielding-single is the one the 68000
included.  Is there a chicken-and-egg situation here?
-Stan

bs@linus.UUCP (Robert D. Silverman) (06/27/89)

In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
:In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
:>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
:>>This, in my opinion is one of the major faults of RISC processors. They
:>>do not provide basic arithmetic instructions. 
:>
:>When the list of "basic" arithmetic instructions is pages long, one starts
:>to wonder how many of them are really "basic".  The instruction you ask
:>for -- divide double length by single length yielding single-length result --
:>is not exactly frequently needed.  Just how much silicon is it worth to make
:>it run faster than an implementation as a subroutine?
:
:The RISC suppliers would like us (users) to believe that RISC is
:ALWAYS faster.  I can't help thinking back to when Patterson's first
:RISC papers came out.  He managed to tune the code on his RISC until
:all of the benchmarks ran faster than the VAX 11/780, except for one,
:which happened to exercise one of the microcoded character string
:instructions a lot.
:
:The moral of the story: If your application needs mul and div
:performance, get a micro with hardware support for them.  If that
:means using a CISC, so be it.
:
:BTW:  divide-double-by-single-yielding-single is the one the 68000
:included.  Is there a chicken-and-egg situation here?
 
No, to the last question. Such an instruction is required to compute
AB/C or AB MOD C exactly, when AB can overflow and C > A,B. 

The problem with CISC is that no one is likely to come out with a 40MIP
CISC in the near future [e.g. i860]. I simply contend that operations
like multiplication and division should be fundamental to ANY computer.
I also include add, subtract and add/subtract with carry. How else
would you compute AB/C or AB % C exactly?

Note that I've also seen some pretty arcane instructions on so called
RISC processors. The i860 even has a custom purpose piece of real estate
on the chip to support special graphics operations. However, it can't
multiply or divide!!!! This is RISC?

Bob Silverman
:-Stan

firth@sei.cmu.edu (Robert Firth) (06/27/89)

In article <22218@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>6) When we were doing this, we couldn't think of any languages whose
>natural implementation on a 32-bit machine would expect generation
>of a 64 by 32 bit divide as the expected operation.  Does anybody
>know of any such, or is thsi strictly something one would use in
>handcoded routines?

Nope, I don't know of any high-level language either that requires
64/32 => 32,32 division.  For expressions such as (a*b)/c, all 32 bit,
some languages permit the compiler to perform the rule-of-three
operation, ie hold the intermediate product in 64 bits.  Ada is one
such.  But no language requires it.

And, of course, the operation is pretty useless for true 64-bit
arithmetic since there you want a 64-bit quotient.  Performing
full 64/32 => 64,32 on an MC68020, for instance, is pretty tedious
even with the extended division operations, since one has to correct
signs and remainders manually.

slackey@bbn.com (Stan Lackey) (06/29/89)

In article <57725@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>:The RISC suppliers would like us (users) to believe that RISC is
>:ALWAYS faster.

>:The moral of the story: If your application needs mul and div
>:performance, get a micro with hardware support for them.  If that
>:means using a CISC, so be it.

>The problem with CISC is that no one is likely to come out with a 40MIP
>CISC in the near future [e.g. i860].

I believe that now that the transistors/chip is high enough, you WILL
see CISC's that perform equal to or greater than RISC's of the same
technology.  Wait and see.  I'll wait and see if there is really ever
an i860 that delivers a consistent 40MIPS across a broad range of
applications.

>I also include add, subtract and add/subtract with carry. How else
>would you compute AB/C or AB % C exactly?

This is my point - it's OK to have slower cycles, if each cycle does
more work.  If it takes a RISC 100 20ns cycles to do the operation you
need, and the CISC 10 50ns cycles, the CISC is faster.  Microcoded
transcendentals are an example.
-Stan