[comp.arch] MIPS Assembler Procedure

bs@linus.UUCP (Robert D. Silverman) (06/21/89)

The MIPS processor has a multiply instruction which will produce a double
length product from two 32 bit words, storing the result in the special
LO and HI registers.

However, there is no instruction that will divide that double length
product by a 32 bit word, giving a 32 bit quotient and/or remainder.

One can, of couse, program a nitty gritty double precision divide routine.

However, I don't know the MIPS assembler well enough to be confident that 
I would write an efficient routine.

Does anyone have a routine that will take a double length product out
of the LO/HI register pair and divide it by a 32 bit (unsigned) integer,
yielding a quotient and/or remainder?

This, in my opinion is one of the major faults of RISC processors. They
do not provide basic arithmetic instructions. 

Note that the above operation is trivial on many microprocessors. I have
routines which do it well for the 68020/68030, 80286/80386, NSC 320000,
VAX, etc.

Bob Silverman

henry@utzoo.uucp (Henry Spencer) (06/25/89)

In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>This, in my opinion is one of the major faults of RISC processors. They
>do not provide basic arithmetic instructions. 

When the list of "basic" arithmetic instructions is pages long, one starts
to wonder how many of them are really "basic".  The instruction you ask
for -- divide double length by single length yielding single-length result --
is not exactly frequently needed.  Just how much silicon is it worth to make
it run faster than an implementation as a subroutine?
-- 
NASA is to spaceflight as the  |     Henry Spencer at U of Toronto Zoology
US government is to freedom.   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mash@mips.COM (John Mashey) (06/25/89)

In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>>This, in my opinion is one of the major faults of RISC processors. They
>>do not provide basic arithmetic instructions. 

>When the list of "basic" arithmetic instructions is pages long, one starts
>to wonder how many of them are really "basic".  The instruction you ask
>for -- divide double length by single length yielding single-length result --
>is not exactly frequently needed.  Just how much silicon is it worth to make
>it run faster than an implementation as a subroutine?

Actually, for many languages, on 32-bit machines, use of 32-bit
divisor and 64-bit dividend is often worse than useless, especially
if it's the only one you have.  For example, consider S/360 & followons:
	1) The divisor is 64-bits, and must occupy an even-odd register
	pair.
	2) The divide leaves remainder in the even and quotient in the odd.

1) If the dividend (which naturally starts as 32-bits in most cases)
	is already in an even register (R)
	is a value not needed any more
	has an odd-register partner that isn't need any more
then
	you need one instruction, a shift-right-double algebraic (SRDA)
	to prepare the divisor, i.e., to get:
	SRDA	R,32
	DR	R,divisor
	LR	result,R	(reminder, or R+1 for quotient)

2) If the dividend
	is in an odd register
	has a value that should be retained because it's useful later,
		i.e., J = I/7... use of I
	has an odd-register partner whose value should be retained
then
	you probably end up generating usage of a scratch register pair:
	LR	RTEMP,R
	SRDA	RTEMP,32
	DR	RTEMP,divisor
	LR	result,RTEMP	(or RTEMP+1, dep. on d

3) Needless to say, the double-register stuff is highly disliked by
optimizing compiler writers, as it disrupts the regularity of some of
the algorithms; as a result, many compilers end up having to use version 2),
at least some of the time.  (maybe people current on S/360 architectures
might say how often?  I'm too rusty to know.)

4) The MIPS sequence is basically what's done by 2) above:
	div	R,divisor
	.....		other instructions, if compiler can find some
			and zero-divide check if needed
	mfhi	result		(or mflo, to get quotient or remainder)

5) Now, none of this is a terrible deal, as divides aren't that frequent,
which is why some RISC architectures omit them entirely.  [It depends
on which benchmarks you chose whether you think this is good or not.]
But note that something that hardware designers might (or might not)
think was helping the compiler folks was really getting in their road....

6) When we were doing this, we couldn't think of any languages whose
natural implementation on a 32-bit machine would expect generation
of a 64 by 32 bit divide as the expected operation.  Does anybody
know of any such, or is thsi strictly something one would use in
handcoded routines?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

slackey@bbn.com (Stan Lackey) (06/26/89)

In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>>This, in my opinion is one of the major faults of RISC processors. They
>>do not provide basic arithmetic instructions. 
>
>When the list of "basic" arithmetic instructions is pages long, one starts
>to wonder how many of them are really "basic".  The instruction you ask
>for -- divide double length by single length yielding single-length result --
>is not exactly frequently needed.  Just how much silicon is it worth to make
>it run faster than an implementation as a subroutine?

The RISC suppliers would like us (users) to believe that RISC is
ALWAYS faster.  I can't help thinking back to when Patterson's first
RISC papers came out.  He managed to tune the code on his RISC until
all of the benchmarks ran faster than the VAX 11/780, except for one,
which happened to exercise one of the microcoded character string
instructions a lot.

The moral of the story: If your application needs mul and div
performance, get a micro with hardware support for them.  If that
means using a CISC, so be it.

BTW:  divide-double-by-single-yielding-single is the one the 68000
included.  Is there a chicken-and-egg situation here?
-Stan

culberts@hpccc.HP.COM (Bruce Culbertson) (06/27/89)

mash@mips.COM (John Mashey) writes:

> 6) When we were doing this, we couldn't think of any languages whose
> natural implementation on a 32-bit machine would expect generation
> of a 64 by 32 bit divide as the expected operation.  Does anybody
> know of any such, or is thsi strictly something one would use in
> handcoded routines?

A frequently used application that occurs to me is conversion of floating
point numbers to text.  A mantissa of more than 32 bits must be divided
by 10 repeatedly to convert it from base 2 to base 10.

Bruce Culbertson
culberts@hplabs.hp.com

bs@linus.UUCP (Robert D. Silverman) (06/27/89)

In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
:In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
:>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
:>>This, in my opinion is one of the major faults of RISC processors. They
:>>do not provide basic arithmetic instructions. 
:>
:>When the list of "basic" arithmetic instructions is pages long, one starts
:>to wonder how many of them are really "basic".  The instruction you ask
:>for -- divide double length by single length yielding single-length result --
:>is not exactly frequently needed.  Just how much silicon is it worth to make
:>it run faster than an implementation as a subroutine?
:
:The RISC suppliers would like us (users) to believe that RISC is
:ALWAYS faster.  I can't help thinking back to when Patterson's first
:RISC papers came out.  He managed to tune the code on his RISC until
:all of the benchmarks ran faster than the VAX 11/780, except for one,
:which happened to exercise one of the microcoded character string
:instructions a lot.
:
:The moral of the story: If your application needs mul and div
:performance, get a micro with hardware support for them.  If that
:means using a CISC, so be it.
:
:BTW:  divide-double-by-single-yielding-single is the one the 68000
:included.  Is there a chicken-and-egg situation here?
 
No, to the last question. Such an instruction is required to compute
AB/C or AB MOD C exactly, when AB can overflow and C > A,B. 

The problem with CISC is that no one is likely to come out with a 40MIP
CISC in the near future [e.g. i860]. I simply contend that operations
like multiplication and division should be fundamental to ANY computer.
I also include add, subtract and add/subtract with carry. How else
would you compute AB/C or AB % C exactly?

Note that I've also seen some pretty arcane instructions on so called
RISC processors. The i860 even has a custom purpose piece of real estate
on the chip to support special graphics operations. However, it can't
multiply or divide!!!! This is RISC?

Bob Silverman
:-Stan

firth@sei.cmu.edu (Robert Firth) (06/27/89)

In article <22218@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>6) When we were doing this, we couldn't think of any languages whose
>natural implementation on a 32-bit machine would expect generation
>of a 64 by 32 bit divide as the expected operation.  Does anybody
>know of any such, or is thsi strictly something one would use in
>handcoded routines?

Nope, I don't know of any high-level language either that requires
64/32 => 32,32 division.  For expressions such as (a*b)/c, all 32 bit,
some languages permit the compiler to perform the rule-of-three
operation, ie hold the intermediate product in 64 bits.  Ada is one
such.  But no language requires it.

And, of course, the operation is pretty useless for true 64-bit
arithmetic since there you want a 64-bit quotient.  Performing
full 64/32 => 64,32 on an MC68020, for instance, is pretty tedious
even with the extended division operations, since one has to correct
signs and remainders manually.

scarter@gryphon.COM (Scott Carter) (06/28/89)

In article <22218@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>>>This, in my opinion is one of the major faults of RISC processors. They
>>When the list of "basic" arithmetic instructions is pages long, one starts
>>to wonder how many of them are really "basic".  The instruction you ask
>>for -- divide double length by single length yielding single-length result --
>>is not exactly frequently needed.  Just how much silicon is it worth to make
>>it run faster than an implementation as a subroutine?

Given the SRT divide algorithm we used (in the MD484 GaAs micro) the extra
silicon is rather trivial.  I don't have exact figures, but I would guess
<100-200 devices, probably much less.  Guy who designed that part of the
CPU is not available at the moment or I would give more precise numbers.
Performance difference, IF you need the operation frequently (see below)
is enormous, i.e 35 cycles vs hundreds. 

>6) When we were doing this, we couldn't think of any languages whose
>natural implementation on a 32-bit machine would expect generation
>of a 64 by 32 bit divide as the expected operation.  Does anybody
>know of any such, or is thsi strictly something one would use in
>handcoded routines?

64 divide by 32 giving 32 bit result occurs rather frequently when you
are using fixed point numbers whose "engineering unit" representation
is not a binary radix.  This can be expressed directly in Ada by

type foo is delta 0.001 range 0.0 .. 100.0;
for foo'SMALL use 0.001;

(There was a way to do this in JOVIAL too but I forget what it was)

This sort of thing shows up in e.g. missile guidance (not a very high-
volume application :) ) a great deal.  Some signal processing functions
use it too (don't ask me - I don't have the compartment to know).

Also, if your integer multiply is significantly faster than your floating
multiply (integer multiply giving 64-bit product, that is), there can
be some nice performance advantage to using (binary radix!!) fixed point
rather than floating point.  When this is the case we see enough fixed-
point divides to make the extra silicon worth it.

Scott Carter
McDonnell Douglas Electronics Systems Co.
(714)-896-3097

using CTS until our Usenet connection arrives

slackey@bbn.com (Stan Lackey) (06/29/89)

In article <57725@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>:The RISC suppliers would like us (users) to believe that RISC is
>:ALWAYS faster.

>:The moral of the story: If your application needs mul and div
>:performance, get a micro with hardware support for them.  If that
>:means using a CISC, so be it.

>The problem with CISC is that no one is likely to come out with a 40MIP
>CISC in the near future [e.g. i860].

I believe that now that the transistors/chip is high enough, you WILL
see CISC's that perform equal to or greater than RISC's of the same
technology.  Wait and see.  I'll wait and see if there is really ever
an i860 that delivers a consistent 40MIPS across a broad range of
applications.

>I also include add, subtract and add/subtract with carry. How else
>would you compute AB/C or AB % C exactly?

This is my point - it's OK to have slower cycles, if each cycle does
more work.  If it takes a RISC 100 20ns cycles to do the operation you
need, and the CISC 10 50ns cycles, the CISC is faster.  Microcoded
transcendentals are an example.
-Stan

bb@tetons.UUCP (Bob Blau) (06/30/89)

In article <22218@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
>In article <1989Jun24.230056.27774@utzoo.uucp> Henry Spencer writes:
>>When the list of "basic" arithmetic instructions is pages long, one starts
>>to wonder how many of them are really "basic".  The instruction you ask
>>for -- divide double length by single length yielding single-length result -
>>is not exactly frequently needed.  Just how much silicon is it worth to make
>>it run faster than an implementation as a subroutine?
> 
> Actually, for many languages, on 32-bit machines, use of 32-bit
> divisor and 64-bit dividend is often worse than useless, especially
> if it's the only one you have.  For example, consider S/360 & followons:
> 	1) The divisor is 64-bits, and must occupy an even-odd register
> 	pair.
> 	2) The divide leaves remainder in the even and quotient in the odd.
>   ...
> 	you probably end up generating usage of a scratch register pair:
> 	LR	RTEMP,R
> 	SRDA	RTEMP,32
> 	DR	RTEMP,divisor
> 	LR	result,RTEMP
>   ...
> 5) Now, none of this is a terrible deal, as divides aren't that frequent,

I dug up some ancient (10 year old) instruction mix statistics from my
bookshelf which back up the observation that integer divides aren't too
common in the 370 environment.  I didn't look at actual compiler output to see
the relationship between DR and SRDA, but it seems that the FORTRAN compiler
used the D instruction (with the dividend in memory) more often than DR.
This makes sense, since a memory access out of cache can be considerably
faster than a double word arithmetic shift right (SRDA). Note that even
two Load Registers (LR's) and a single word arithmetic shift right (SRA)
can be faster than an SRDA depending on the implementation.

These numbers are the mean instruction frequency out of traces of over
15,000,000 instructions.
                            DR        SRDA          D
Commercial COBOL Mix      .00003     .00007      .00000 
Scientific FORTRAN Mix    .00016     .00115      .00071

				           -Bob

(Std. disclaimers apply)
-- 
  Bob Blau       Amdahl Corporation    143 N. 2 E., Rexburg, Idaho 83440
  UUCP:{ames,decwrl,uunet}!amdahl!tetons!bb               (208) 356-8915
  INTERNET: bb@tetons.idaho.amdahl.com

bb@tetons.UUCP (Bob Blau) (07/01/89)

In article <812@tetons.UUCP>, I write:
> ... it seems that the FORTRAN compiler
> used the D instruction (with the dividend in memory) more often than DR.
> This makes sense, since a memory access out of cache can be considerably
> faster than a double word arithmetic shift right (SRDA). ...

ack - This is irrelevant since both the D and DR insructions have the
dividend in an even odd register pair, so you would need the SRDA in 
either case - sorry for the misinformation.

More to the point perhaps is that the SRDA time is small compared to the
divide time, so the overhead is not too significant, especially given
the rarity of the operation.

			-Bob

-- 
  Bob Blau       Amdahl Corporation    143 N. 2 E., Rexburg, Idaho 83440
  UUCP:{ames,decwrl,uunet}!amdahl!tetons!bb               (208) 356-8915
  INTERNET: bb@tetons.idaho.amdahl.com