[comp.arch] Integer Multiply/Divide on Sparc

bs@linus.UUCP (Robert D. Silverman) (12/23/89)

Does anyone have, or know of, software for the SPARC [SUN-4] that will
perform the following:

(1) Multiply two unsigned 32-bit integers, yielding a 64 bit product
    stored in two registers?

(2) Take a 64 bit product and divide it by a 32 bit (unsigned) integer
    yielding a 32-bit quotient and remainder?
 
or (3) Compute A*B Mod C directly?


The SPARC is brain dead [as were its designers] when it comes to doing
integer arithmetic. It can't multiply and it can't divide.

Trying to convert to floating point, do the arithmetic, then convert
back is far too slow. (I've already tried).

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

dgr@hpfcso.HP.COM (Dave Roberts) (12/27/89)

>The SPARC is brain dead [as were its designers] when it comes to doing
>integer arithmetic. It can't multiply and it can't divide.

>-- 
>Bob Silverman
>#include <std.disclaimer>
>Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
>Mitre Corporation, Bedford, MA 01730
>----------


Geeze Bob,
	The thing is a SPARC.  It's a RISC machine.  Integer mult and
divide are the first things to go when you design a RISC.  There should
be some funky instructions to help you out, like "shift and add" for
multiplication.  Trust me, you're better off (in speed, that is) for
not having those functions, and I'll bet that you can write a routine
that can do them just about as fast as they could internally.
I don't really know much about SPARCs but I know that the designers
at Sun weren't "brain dead".

Note that I am in no way endorsing the purchase of any SPARC system.
Quite the contrary, of course, competition being what it is. :-)


Dave Roberts
Hewlett-Packard Co.
dgr@hpfcla.hp.com
dgr%hpfcla@hplabs.hp.com

davidc@vlsisj.VLSI.COM (David Chapman) (12/27/89)

In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>Does anyone have, or know of, software for the SPARC [SUN-4] that will
>perform the following:
>
> [standard multiply and divide]
>
>The SPARC is brain dead [as were its designers] when it comes to doing
>integer arithmetic. It can't multiply and it can't divide.

There should be instructions on the order of "multiply step" and "divide 
step", each of which will do one of the 32 adds/subtracts and then shift.  
I'm not particularly fond of the SPARC architecture (don't like register 
windows), but this is a theoretical viewpoint and is not based on any 
direct exposure to assembly-language programming for it (translation:
sorry, I can't give you any more help).

Neither SPARC nor its designers were brain-dead when it was built.  It's just
that it is difficult to get multiplication and division (especially the 
latter) to run in 1 or 2 clock cycles.  All instructions are supposed to
execute in the ALU in 1 cycle; if the multiply and divide instructions take
more time then the front of the processor pipeline has to be able to stall
and this added complexity will slow down the entire processor.

Thus they provide you with the tools to do your own multiply and divide.  
One of the benefits is that a compiler can optimize small multiplies and 
divides to make them execute more quickly (e.g. multiply by 10 takes 4 steps 
instead of 32).
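
To make the multiply-by-10 case concrete: the 4 steps above count one
multiply-step per multiplier bit, while with plain shifts and adds,
since 10 = 8 + 2, it comes to two shifts and an add.  A minimal C
sketch of the kind of sequence a compiler can emit (the function name
is mine):

    /* x * 10 computed as x*8 + x*2 */
    unsigned times10(x)
    unsigned x;
    {
        return (x << 3) + (x << 1);
    }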

It is important that you understand this if you are to write assembly
language programs for a SPARC.  If your instructions are not carefully
optimized, the result could be slower than if you write in a high-level
language and compile with its optimizer!  (Unless the SPARC assembler
performs instruction reordering.)

P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.
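
P.P.S.  For reference, a C analogue of the unrolled shift-and-add loop --
      a minimal sketch of the control structure, not tuned SPARC code,
      and the names are mine.  Each line of the loop body is one
      multiply step; the loop branch is paid once per four steps:

    /* low 32 bits of a*b by shift-and-add, unrolled 4x */
    unsigned mulu32(a, b)
    unsigned a, b;
    {
        unsigned prod = 0;
        int i;
        for (i = 0; i < 32; i += 4) {
            if (b & 1) prod += a;  a <<= 1;  b >>= 1;
            if (b & 1) prod += a;  a <<= 1;  b >>= 1;
            if (b & 1) prod += a;  a <<= 1;  b >>= 1;
            if (b & 1) prod += a;  a <<= 1;  b >>= 1;
        }
        return prod;
    }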
-- 
		David Chapman

{known world}!decwrl!vlsisj!fndry!davidc
vlsisj!fndry!davidc@decwrl.dec.com

cik@l.cc.purdue.edu (Herman Rubin) (12/27/89)

In article <8840004@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) writes:
> 
> >The SPARC is brain dead [as were its designers] when it comes to doing
> >integer arithmetic. It can't multiply and it can't divide.
> 
> >-- 
> >Bob Silverman
> >#include <std.disclaimer>
> >Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
> >Mitre Corporation, Bedford, MA 01730
> >----------
> 
> 
> Geeze Bob,
> 	The thing is a SPARC.  It's a RISC machine.  Integer mult and
> divide are the first things to go when you design a RISC.  There should
> be some funky instructions to help you out, like "shift and add" for
> multiplication.  Trust me, you're better off (in speed, that is) for
                   ^^^^^^^^
> not having those functions, and I'll bet that you can write a routine
> that can do them just about as fast as they could internally.
> I don't really know much about SPARCs but I know that the designers
> at Sun weren't "brain dead".

It is clear that you are not to be trusted (see above).  To multiply
two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
the 32 bit numbers must be divided into 16 bit parts.  The whole operation
takes about 20 operations (count them).  Shift and add are far slower.
Divide is even worse.   Also, there is considerable overhead in a
subroutine call; there are registers to save and restore.  Open
subroutines (in-line functions) are a way around it, but they still
have the problem.
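
For concreteness, here is that decomposition in C -- a minimal sketch,
assuming a machine whose multiply returns only the low 32 bits of a
product (so each 16x16 partial product fits); the routine and variable
names are mine:

    /* 32x32 -> 64 unsigned multiply via 16-bit parts */
    void mulu64(a, b, hi, lo)
    unsigned a, b, *hi, *lo;
    {
        unsigned a0 = a & 0xFFFF, a1 = a >> 16;
        unsigned b0 = b & 0xFFFF, b1 = b >> 16;
        unsigned p00 = a0 * b0;    /* low  x low  */
        unsigned p01 = a0 * b1;    /* low  x high */
        unsigned p10 = a1 * b0;    /* high x low  */
        unsigned p11 = a1 * b1;    /* high x high */
        /* middle column; cannot overflow 32 bits */
        unsigned mid = (p00 >> 16) + (p01 & 0xFFFF) + (p10 & 0xFFFF);

        *lo = (p00 & 0xFFFF) | (mid << 16);
        *hi = p11 + (p01 >> 16) + (p10 >> 16) + (mid >> 16);
    }

Four multiplies plus the masks, shifts, and adds -- about the 20
operations claimed above.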

I am sure that Bob Silverman knows how to write efficient subroutines.
He has to use them anyhow, as he is multiplying and dividing numbers of
several hundred bits.  But even if less is wanted, good integer arithmetic
is needed.  If more precision than is designed for is wanted in floating
operations, integer arithmetic must be used.

There are also many other kinds of operations cheap in hardware and
expensive in software.  RISC machines may be good for the types of
operations the designers anticipated, but it is difficult to do much
about the ones left out.  The CRAYs can be considered RISC vector 
machines, and the vector operations omitted are extremely difficult
to get around.  The above instruction count for double precision was
derived from the CRAY.

We even have a chicken-and-egg problem.  Any fairly good programmer
designs the program to take into account the capabilities of the machine.
I know that the gurus claim that this should not be so, but it is not
unusual for me to think of modifications or even totally new ways of
doing things which the compiler cannot find unless those specific ways are
put into the compiler.  If a machine does not have hardware square roots,
one avoids square roots, as there are usually faster ways.

One thing which might help is if there were a mailing list to discuss these
ideas, and to collect the numerous operations efficient in hardware and
expensive in software.  Those who know me will agree that I am not the
person to run this.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

bs@linus.UUCP (Robert D. Silverman) (12/27/89)

In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
 
I wrote:

:>The SPARC is brain dead [as were its designers] when it comes to doing
:>integer arithmetic. It can't multiply and it can't divide.
:
:Geeze Bob,
:	The thing is a SPARC.  It's a RISC machine.  Integer mult and
:divide are the first things to go when you design a RISC.  There should
 
Huh? A computer that can't perform basic arithmetic? [add, subtract,
multiply, divide]. What have you been smoking? Integer mult/divide
should NOT be the first things to go. [look at the MIPS chip, a far
superior design in my opinion]. Integer mult/divide may not be much
used for most applications, but when they are required, not having
hardware support is an absolute KILLER of algorithms. I have code that
runs **slower** on the newest SPARCstation [SUN-4/330] than it does on
a SUN-3/60, ONLY BECAUSE it can't multiply and divide. This is
despite the fact that the SPARC is purported to be at least 10x faster.
The RISC designers who put together SPARC may have been looking at the 
'broad picture', but they sure as hell didn't know a lot about
computer arithmetic or algorithms.

:be some funky instructions to help you out, like "shift and add" for
:multiplication.  Trust me, you're better off (in speed, that is) for
:not having those functions, and I'll bet that you can write a routine
 
I AM NOT BETTER OFF, and don't speak for me. The fact that you say
'trust me ....', indicates that you don't know diddly when it comes
to the implementation and design of numerical and semi-numerical
algorithms.
 
Shift and add to perform general multiplication is hopelessly
slow because of the overhead/bookkeeping/control mechanisms involved.


:that can do them just about as fast as they could internally.
 
Huh? The MIPS-R3000 does integer multiplies in hardware in just a
couple of cycles. The SPARC takes a minimum of 47 cycles using
its so-called multiply-step function to multiply two integers.
Division is even worse by almost an order of magnitude.

It never ceases to amaze me how RISC zealots immediately jump
up anytime anyone should [GOD forbid!] criticize a RISC instruction
set. Especially when those zealots know very little about numerical
algorithms.

:I don't really know much about SPARCs but I know that the designers
 
Yet, you are willing to say 'trust me I know better' in a knee-jerk
response to my posting.

:at Sun weren't "brain dead".

I didn't say they were. I said that they were with respect to their
knowledge of how arithmetic should be done. The fact that they
perceived integer mult/divide as unimportant indicates that they
were unaware of how important these operations are for applications
that require them. On the other hand, the SPARC may have been designed ONLY for
the majority of programs that don't need mult/divide. If so, its
design was driven by MARKETING decisions, not technical ones.
Criticism of the chip on a theoretical basis is more than justified.
 
-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

hui@joplin.mpr.ca (Michael Hui) (12/28/89)

For a frame of reference, T.I.'s DSP chip TMS320C25 does a 16x16
multiply in a single cycle.

AMD's Am29000 uses multiple instructions to accomplish a general 32x32
multiply. I say general because, quoting from section 7.1.6 of the User's
Manual (c)1987:

"It may be beneficial to precede a full multiply procedure with a routine
to discover whether or not the number of multiply steps may be reduced."

In other words, the compiler is given more room for optimization here
when faced with two arguments of different precision.
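
A hedged C sketch of that idea (the names are mine; the 29000 does the
steps with its own instructions, this only shows the control
structure): loop over the smaller operand and stop as soon as its
remaining bits are zero.

    /* low 32 bits of a*b, with an early out for small operands */
    unsigned mulu_early(a, b)
    unsigned a, b;
    {
        unsigned prod = 0, t;

        if (a < b) { t = a; a = b; b = t; }  /* step over the smaller one */
        while (b != 0) {
            if (b & 1)
                prod += a;
            a <<= 1;
            b >>= 1;
        }
        return prod;
    }

A 32 x 5 bit multiply then costs 5 steps instead of 32.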

Michael Hui  604-985-4214  hui@joplin.mpr.ca

reha@cbnewsi.ATT.COM (reha.gur) (12/28/89)

In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:

> It is clear that you are not to be trusted (see above).  To multiply
> two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> the 32 bit numbers must be divided into 16 bit parts.  The whole operation
> takes about 20 operations (count them).  Shift and add are far slower.
> Divide is even worse.   Also, there is considerable overhead in a
> subroutine call; there are registers to save and restore.  Open
> subroutines (in-line functions) are a way around it, but they still
> have the problem.
> 
> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
> Phone: (317)494-6054
> hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

The numbers I get (from looking at the data sheets and other info) for
two machines, a 25MHz i486 and a 25MHz SPARC, are as below:

Assuming no cache hits and various other items:

i486: 	18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
SPARC:	48-52 cycles for same (including subroutine call and return time)

i486:	32 cycles for signed 32 bit division (acc by reg)
SPARC:	41 (approximate best case) to 211 (approximate worst case)
	(depends on bits in dividend and divisor)

The numbers above are approximate and results may vary.

The SPARC subroutine call does not need any registers saved across the call.
The code for multiplication is as given in the SPARC architecture manual.

Also note that some SPARC machines do have (or might have) integer mul and
divide in hardware.

reha gur
attunix!reha

krueger@tms390.Micro.TI.COM (Steve Krueger) (12/28/89)

OK, enough already.  I'll tell.

While no SPARC microprocessors to date implement them, there are
integer multiply instructions which have been specified for the SPARC
architecture.  Some (possibly many) future implementations will find
it advantageous to implement them.  They are intended to produce
results in very few cycles on the "fastest" implementations.  Current
implementations and those that don't support these instructions will
trap and presumably emulate the operation.  The multiply step
instruction will remain as well.

Briefly, a little more detail:

Multiply is 32x32 -> 64.  The low order portion of the result goes
into the destination register of the instruction and the high order
part goes into the Y register (same one used by multiply step
instruction).  The instruction comes in signed and unsigned forms and
each may set the condition codes or not.  (All of that is standard for
arithmetic instructions in SPARC.)

There is a similar set of divide instructions that are pretty much the
inverse of the multiply instructions.

In article <1979@eric.mpr.ca>, hui@joplin.mpr.ca (Michael Hui) writes:
> 
> For a frame of reference, T.I.'s DSP chip TMS320C25 does a 16x16
> multiply in a single cycle.

I need to comment on this since I know a little about these chips.

The fastest TMS320C25 is available with an 80ns cycle time, and the
'C25 can do a multiply-accumulate at that rate.  The 32x32 -> 64
integer multiply that is desired for 32-bit processors is 4 times as
complex as the 16x16 -> 32 in the 'C25.  DSPs allocate a lot of
their chip area to multiply.  GP micros have other priorities, and a
full 32x32 -> 64 multiplier array can take *serious* Si area.  So
you can expect that DSPs will have better performance/price for
multiply than GP micros.  All of that said, SPARC needs integer
multiply instructions and so it will get them.

BTW, there are several TI parts with higher integer multiply
performance than the 'C25.  I don't mean this as a commercial, just
for informational purposes.  The TMS320C50, another integer DSP chip,
was announced last year.  The 'C50 does an integer multiply in 50ns.
A TI floating-point datapath chip, the SN74ACT8847, can also perform integer
multiply.  A 32x32 -> 64 integer multiply takes just 30ns on the 33MHz
version of that chip.  It does pretty well on floating point stuff too :-)
but it's not a processor.

	Steve Krueger		krueger@micro.ti.com
	SPARC Applications
	Texas Instruments
	Houston, Texas
	(713) 274-2479

** Mod any actual facts found above, these are my thoughts alone. **

rec@dg.dg.com (Robert Cousins) (12/28/89)

In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
>In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
>:be some funky instructions to help you out, like "shift and add" for
>:multiplication.  Trust me, you're better off (in speed, that is) for
>:not having those functions, and I'll bet that you can write a routine
> 
>I AM NOT BETTER OFF, and don't speak for me. The fact that you say
>'trust me ....', indicates that you don't know diddly when it comes
>to the implementation and design of numerical and semi-numerical
>algorithms.
>Shift and add to perform general multiplication is hopelessly
>slow because of the overhead/bookkeeping/control mechanisms involved.
>:that can do them just about as fast as they could internally.
>It never ceases to amaze me how RISC zealots immediately jump
>up anytime anyone should [GOD forbid!] criticize a RISC instruction
>set. Especially when those zealots know very little about numerical
>algorithms.
>Bob Silverman
>#include <std.disclaimer>
>Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
>Mitre Corporation, Bedford, MA 01730

Actually, this debate is now closing in on 45 years old.  If you will simply
go pull a copy of the original ENIAC papers by von Neumann and Co. you will
find a long and drawn-out discussion of the pros and cons of adding a number
of instructions to the basic architecture: multiplication, division, square root,
etc. The conclusions were based upon a then totally different set of criteria,
where gates were made of vacuum tubes and microcode had not been invented yet.
Given the constraints of the time, both multiply and divide were considered
quite justifiable.  However, it is interesting to note that von Neumann didn't
believe in floating point.  He only believed in signed fixed point.

Furthermore, in his justifications for future projects, he used the multiplication
time of various machines as a first-order approximation of the performance
of the machines, since the jobs which were then commonly run on the machines
were numerical in nature and tended to scale on various machines more or less
proportionally to the multiply speed.

This is not to say that multiply and divide are REQUIRED; it is just to say
that many of these considerations are not new and should perhaps be viewed
in a more historical context.  After all, we know that technology has changed
massively in the past 45 years. By discussing the implications of the changes,
it is reasonable to come up with some criteria under which these changes are
justified and not.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

beyer@cbnewsh.ATT.COM (jean-david.beyer) (12/28/89)

In article <1535@cbnewsi.ATT.COM>, reha@cbnewsi.ATT.COM (reha.gur) writes:
> In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> 
> > To multiply
> > two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> > the 32 bit numbers must be divided into 16 bit parts.  The whole operation
> > takes about 20 operations (count them).  Shift and add are far slower.
> 
> The numbers I get (from looking at the data sheets and other info) for
> two machines, a 25MHz i486 and a 25MHz SPARC, are as below:
> 
> Assuming no cache hits and various other items:
> 
> i486: 	18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
> SPARC:	48-52 cycles for same (including subroutine call and return time)
> 
> i486:	32 cycles for signed 32 bit division (acc by reg)
> SPARC:	41 (approximate best case) to 211 (approximate worst case)
> 	(depends on bits in dividend and divisor)

I do not have the numbers for SPARC handy, but would tend to trust reha.

However, having spent some time working on optimizers that work at the
assembly level, I notice that, for the benchmarks we use anyway (the
output of our C compilers), most integer multiplies arise when dealing
with array subscripts or some kinds of pointer dereferencing (pointers
to structures). In these cases, the multiply instructions are mostly
multiplications by constants. These constants frequently have a small
number of 1's. Consequently, instead of calling a general-purpose
multiply subroutine, it suffices to insert special-purpose inline code
that multiplies by the desired constant.
(It might even pay to do a strength-reduction optimization.) This can be
much faster than the general-purpose multiply subroutine, and may be faster
than a hardware multiply instruction, depending on the design of the
hardware multiply.
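
The expansion itself is mechanical: one shifted add per 1 bit in the
constant.  Here is a hedged C rendering of what such a compiler does at
compile time (names are mine; a real compiler emits the shifts and adds
directly, with no loop):

    /* x * c as a sum of shifted x's, one add per 1 bit in c */
    unsigned mul_const(x, c)
    unsigned x, c;
    {
        unsigned prod = 0;
        int i;
        for (i = 0; c != 0; i++, c >>= 1)
            if (c & 1)
                prod += x << i;
        return prod;
    }

So a constant like 12 (binary 1100) costs two shifts and one add.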

-- 
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer

mash@mips.COM (John Mashey) (12/29/89)

In article <6908@cbnewsh.ATT.COM> beyer@cbnewsh.ATT.COM (jean-david.beyer) writes:
>pointer dereferencing (pointers to structures). In these cases,
>the multiply instructions are mostly multiplications by constants. These
>constants frequently have a small number of 1's. Consequently, instead of
>calling a general-purpose multiply subroutine, it suffices to insert
>special-purpose inline code that multiplies by the desired constant.
>(It might even pay to do a strength-reduction optimization.) This can be
>much faster than the general-purpose multiply subroutine, and may be faster
>than a hardware multiply instruction, depending on the design of the
>hardware multiply.

Many compilers do this.  For sure, HP PA, MIPS, and SPARC, but I suspect
most of the other current RISCs as well, and I imagine many CISCs.
It's helpful to have the assembler offer a multiply pseudo-op that turns into
the best sequence of shifts/adds/subs (or whatever, such as the combined
ops that HP PA has), up to the point where just issuing a multiply
is better [i.e., on MIPS or 88K, which have H/W multiplies.]

At least in our case, multiply/divide was put in even knowing how much of
the structure-reference multiplies was going to go away, because there
were enough benchmarks where muls & divs wouldn't go away; as usual,
it depends on the benchmark set you use as to which way you jump on this,
and which implementation technologies you've got, etc, and thoughtful
people have picked both ways, recently, meaning it's at least a
yet-to-be-resolved argument :-)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mslater@cup.portal.com (Michael Z Slater) (12/29/89)

With regard to SPARC's lack of integer multiply/divide:

I suspect this is primarily a consequence not of the designers' attitude
toward arithmetic, but of the goal of implementing the first
CPU chip in a 20,000-gate ASIC.  They had to be economical with the
transistors.

I expect that future SPARC implementations will add integer multiply and
divide instructions.  Unfortunately, existing software won't use these
instructions.

Michael Slater, Microprocessor Report   mslater@cup.portal.com

cprice@mips.COM (Charlie Price) (12/29/89)

> In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
> >In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
> > 
> >Huh? The MIPS-R3000 does integer multiplies in hardware in just a
> >couple of cycles. The SPARC takes a minimum of 47 cycles using
> >its so-called multiply-step function to multiply two integers.
> >Division is even worse by almost an order of magnitude.

The MIPS R2000, R3000, and R6000 do indeed have multiply and divide,
but your idea of the time they take is not quite right.

proc	mult	div	(times in cycles to put result in special regs)
R2000	12	35
R3000	12	35

I don't believe that the R6000 numbers are yet public,
but I'm willing to say that they aren't better,
in cycle counts, than the R3000.

The operations work the following way:

Multiply:   Rs x Rt -> LO, HI
multiplies two general purpose (32-bit) registers
and produces a 64-bit result that is put into a
couple of special-purpose registers named LO and HI.

Divide:     Rs / Rt -> LO, HI
divides one 32-bit general register by
another 32-bit general register producing a
32-bit quotient (in LO), and a 32-bit remainder (in HI).

The LO and HI register pair are used only for the results
of multiply and divide.  Special instructions exist to move
data from (and to) each of these registers.
The multiply/divide unit operates as an autonomous unit.
After a multiply/divide is issued,
other non-multiply instructions continue to issue and execute.
At some point the program wants the result of the operation,
presumably after everything that can be done in parallel is done,
and you issue a move-from-LO (or HI) instruction to move the result
into a general register.  If the operation is not yet done,
the instruction interlocks and the processor is stalled till the
result is available.

The times above are the time needed to produce the result.
Depending on what you mean by "how fast is X",
you might want to include the instruction(s) necessary
to get the result from LO (and/or HI) and put it (or them)
in a general register (or registers).
This would add 1 cycle for one 32-bit result register
or 2 cycles if you wanted the contents of both LO and HI.

These operation times aren't especially fast.
Being able to execute the op in parallel with "regular" instructions
reduces the effective cost of the operation somewhat,
though how much that helps depends a lot on what you can do
in parallel with any multiply or divide.
Finding 35 real instructions to occupy your time during a divide seems
somewhat unlikely.

One can make these operations faster by spending more hardware on them.
The existing implementations have not chosen (or been able) to
spend a lot of die area on these functions.
For the current MIPS processors it is generally worth figuring out
whether you can get the multiply/divide job done with some short
sequence of faster instructions rather than use the general instruction.

As John Mashey has mentioned, multiply and divide were included
in the MIPS instruction set because reasonably-fast general
multiply and divide, though unnecessary for many applications
in our architecture benchmark suite, are very important for some of them.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

roy@phri.nyu.edu (Roy Smith) (12/29/89)

	Just to be obnoxious (and thought provoking), postulate a hardware
implementation in which a 32x32->64 bit integer multiply is faster than a
32+32->32+carry bit integer addition.  Estimate the change in coding habits
and compiler technology caused by this new technology.  Use additional
pages if necessary.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

johnl@esegue.segue.boston.ma.us (John R. Levine) (12/29/89)

In article <245@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>Actually, this debate is now closing in on 45 years old.  If you will simply
>go pull a copy of the original ENIAC papers by von Neumann and Co. you will
>find a long and drawn out discussion of the pros and cons of adding a number
>of instructions to the basic architecture: multiplication ...

>Given the constraints of the time, both multiply and divide were considered
>quite justifiable.  However, it is interesting to note that von Neumann didn't
>believe in floating point.  He only believed in signed fixed point.

Well, yes and no.  When von Neumann went back to Princeton to build his
own computer, the IAS machine, he used some surprisingly modern-sounding
arguments ("make a small set of operations as fast as possible") to decide not
to include a hardware multiplier.  I'll dig up the citation; it's quite
interesting.

Remember that the ENIAC was nearly impossible to program, and if they had
to implement software multiplication, they'd have never gotten anything done
at all.  It was quite common to spend three days setting up a program that
would take a few seconds to run.  The IAS machine had a modern stored program
architecture so a multiplication subroutine was no big deal.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."

cik@l.cc.purdue.edu (Herman Rubin) (12/29/89)

In article <6908@cbnewsh.ATT.COM>, beyer@cbnewsh.ATT.COM (jean-david.beyer) writes:
> In article <1535@cbnewsi.ATT.COM>, reha@cbnewsi.ATT.COM (reha.gur) writes:
> > In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:

			........................

> However, having spent some time working on optimizers that work at the
> assembly level, I notice that, for the benchmarks we use anyway (the
> output of our C compilers), most integer multiplies arise when dealing
> with array subscripts or some kinds of pointer dereferencing (pointers
> to structures). In these cases, the multiply instructions are mostly
> multiplications by constants. These constants frequently have a small
> number of 1's. Consequently, instead of calling a general-purpose
> multiply subroutine, it suffices to insert special-purpose inline code
> that multiplies by the desired constant.
> (It might even pay to do a strength-reduction optimization.) This can be
> much faster than the general-purpose multiply subroutine, and may be faster
> than a hardware multiply instruction, depending on the design of the
> hardware multiply.

The idea that special multiplications should be inlined is nothing new.  The
Fortran compiler on the CDC 6500 expanded multiplication and exponentiation by
small integers.  Whether this should be done by the hardware for variable
integers with few 1's is not clear.

But multiplication is used for other purposes, and here the loss is huge.
Also, it operations are pipelined, as on many machines, the benefits from
inlining are reduced.

The costs of using a multiply subroutine are quite high compared to using
even a complicated instruction.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

bs@linus.UUCP (Robert D. Silverman) (12/29/89)

In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
:In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
:>Does anyone have, or know of, software for the SPARC [SUN-4] that will
:>perform the following:
:>
:> [standard multiply and divide]
:>
:>The SPARC is brain dead [as were its designers] when it comes to doing
:>integer arithmetic. It can't multiply and it can't divide.
:
:There should be instructions on the order of "multiply step" and "divide 
:step", each of which will do one of the 32 adds/subtracts and then shift.  
 
There is a multiply step instruction. There is no such support for division.
It can take 200+ cycles to do a division on the SPARC [worst case]. 
A 32 x 32 bit unsigned multiply takes 45-47 cycles. Programs that have a
significant number of multiplies and divides can run SLOWER on a SPARC
than on a SUN-3. [I have such!]  ONLY because of the slow multiply/divides.
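
Without a divide step, software falls back on something like the
classic restoring shift-and-subtract loop.  A hedged C sketch of the
64/32 case (names are mine; assumes d != 0 and hi < d, so the quotient
fits in 32 bits):

    /* (hi,lo) / d -> 32-bit quotient; remainder in *rem */
    unsigned divu64(hi, lo, d, rem)
    unsigned hi, lo, d, *rem;
    {
        unsigned q = 0, carry;
        int i;
        for (i = 0; i < 32; i++) {
            carry = hi >> 31;             /* bit shifted out of hi */
            hi = (hi << 1) | (lo >> 31);  /* shift dividend left   */
            lo <<= 1;
            q <<= 1;
            if (carry || hi >= d) {       /* 33-bit trial subtract */
                hi -= d;
                q |= 1;
            }
        }
        *rem = hi;
        return q;
    }

One compare-and-subtract per quotient bit, plus the loop overhead:
the 200+ cycle figure follows directly from a loop like this.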

:I'm not particularly fond of the SPARC architecture (don't like register 
:windows), but this is a theoretical viewpoint and is not based on any 
:direct exposure to assembly-language programming for it (translation:
:sorry, I can't give you any more help).
:
:Neither SPARC nor its designers were brain-dead when it was built.  It's just
 
I didn't say they were. I said they were with respect to arithmetic. I stand
by that assertion. Most programs may not need multiply/divide in hardware.
However, for those that do require it, not having it is a real KILLER 
of algorithms.

:that it is difficult to get multiplication and division (especially the 
:latter) to run in 1 or 2 clock cycles.  All instructions are supposed to
 
I know of quite a few DSP chips that do multiplies in 1 cycle. Divides
take a little longer [but not much; Ernie Brickell of SANDIA invented a
hardware divide that works much faster than standard conditional-shift/
subtract].

:execute in the ALU in 1 cycle; if the multiply and divide instructions take
:more time then the front of the processor pipeline has to be able to stall
:and this added complexity will slow down the entire processor.
:
:Thus they provide you with the tools to do your own multiply and divide.  
 
See above. They are too slow.

:One of the benefits is that a compiler can optimize small multiplies and 
:divides to make them execute more quickly (e.g. multiply by 10 takes 4 steps 
 
That's fine for multiply-by-constant. Most programs that NEED multiply/divide
are multiplying variables.

:P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
:      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
:      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.
 
Agreed. In fact my 32 x 32 bit multiply consists of 32 calls to multstep
and no looping at all. It is still slow.

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

bs@linus.UUCP (Robert D. Silverman) (12/29/89)

In article <34000@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
:> In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
:> >In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:> > 
:> >Huh? The MIPS-R3000 does integer multiplies in hardware in just a
:> >couple of cycles. The SPARC takes a minimum of 47 cycles using
:> >its so-called multiply-step function to multiply two integers.
:> >Division is even worse by almost an order of magnitude.
:
:The MIPS R2000, R3000, and R6000 do indeed have multiply and divide,
:but your idea of the time they take is not quite right.
:
:proc	mult	div	(times in cycles to put result in special regs)
:R2000	12	35
:R3000	12	35
:
 
The MIPS takes about a quarter as many cycles to do a multiply as the SPARC.
It also yields a 64 bit product.

Division takes about one-sixth as many cycles AND ALSO gives the remainder
[this is a big plus]. A 64 bit by 32 bit divide on the SPARC can take
500-600 cycles in the worst case. The SUN-3 does a 32x32 multiply in 41
cycles and a 64 by 32 bit divide [with remainder] in 76. 

The MIPS only does a 32 by 32 bit divide. 64 bit by 32 bit takes extra
coding.


:I don't believe that the R6000 numbers are yet public,
:but I'm willing to say that they aren't better,
:in cycle counts, than the R3000.
:
:The operations work the following way:
:
Yes, I have a MIPS architecture book. I have benchmarked some multiple
precision integer arithmetic code on an R-2000 and on a SUN-4/280.
The R2000 system [a DECstation] was 2x faster.
Note that this wasn't even the R3000. The time difference is mostly
due to integer multiply/divides.

 
-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

beyer@cbnewsh.ATT.COM (jean-david.beyer) (12/29/89)

As I recall, the Naval Ordnance Research Calculator (NORC) had a completely
combinatorial multiplier: it could do a 10-digit by 10-digit multiply in
the same number of cycles as an add. My recollection is that the multiplier
took a refrigerator-sized box. The rest of the CPU took another
refrigerator-sized box. And that was in vacuum-tube days.

So a multiply instruction need take no longer than the total gate delay
of such a combinatorial array. The main design question remains:
for the kind of work to be done with the processor, is that the best place
to put the complexity, delays, and chip area?  One must know the work to
be done, i.e., the benchmarks to be run (and whether they are really
representative of the work to be run).
-- 
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer

tim@nucleus.amd.com (Tim Olson) (12/29/89)

In article <1535@cbnewsi.ATT.COM> reha@cbnewsi.ATT.COM (reha.gur) writes:
| Also note that some SPARC machines do have (or might have) integer mul and
| divide in hardware.

If they do, then they are not instruction-set compatible.  My copy of
the SPARC architecture manual lists only a MULScc (Multiply Step and
modify icc), and no divide or divide step instructions.  Even if the
hardware is there, it is hard to use without an instruction to specify
the exact semantics of the operation!



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

slackey@bbn.com (Stan Lackey) (12/30/89)

In article <1804@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
 bunch of stuff deleted ...
>I know that the gurus claim that this should not be so, but it is not
>unusual for me to think of modifications or even totally new ways of
>doing things which the compiler cannot find unless those specific ways are
>put into the compiler.  If a machine does not have hardware square roots,
>one avoids square roots, as there are usually faster ways.

>One thing which might help is if there were a mailing list to discuss these
>ideas, and to collect the numerous operations efficient in hardware and
>expensive in software.  Those who know me will agree that I am not the
>person to run this.

>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

I agree.  I think that due to (at least) the worsening ratio of CPU
speeds to memory speeds, there will be changes in architecture trends
in the next generation or two.

I would like to be the maintainer of the proposed mailing list.
-Stan

andrew@dtg.nsc.com (Lord Snooty @ The Giant Poisoned Electric Head ) (12/30/89)

I am reading this thread with interest, and would like to see some realistic
numbers on SPARC performance. How does it perform on, say, a large CAD
package compared with the Sun-3? - does anyone have any benchmark results?
-- 
...........................................................................
Andrew Palfreyman	a wet bird never flies at night		time sucks
andrew@dtg.nsc.com	there are always two sides to a broken window

mark@mips.COM (Mark G. Johnson) (12/30/89)

In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes:
  >
  >I am reading this thread with interest, and would like to see some realistic
  >numbers on SPARC performance. How does it perform on, say, a large CAD
  >package compared with the Sun-3? - does anyone have any benchmark results?
  >

One large CAD package is "SPICE2G6" from U.C. Berkeley.  It's a circuit
simulator.  Here are measurements of a Sun3 and a Sun4, both running
SPICE2G6, on three different input files.  The Sun3 model (3/260) is among
the fastest Sun3's ever made, while the Sun4 (4/260) is a medium-performer,
employing the 16 MHz Fujitsu gate array SPARC.  I apologize for having no
measured data to present on the non-gate-array (custom) CMOS SPARCs, on
BIT's ECL SPARC, or on Prisma's gallium arsenide SPARC.

 ********************* SPICE 2G.6   USER CPU TIMES ***********************
  Dataset (input circuit)     Sun 3/260       Sun 4/260    Sun4 faster by?
===========================================================================
       digsr                  361.2 sec        218.3 sec      1.65 X
       bipole                 112.0 sec         61.1 sec      1.83 X
       comparator             129.4 sec         66.0 sec      1.96 X


Other large CAD packages have been measured for the Sun4 but not, to my
knowledge, the Sun3.  For example, the SPEC benchmark suite includes
"Espresso" (a logic equation minimizer and PLA generator).  SPEC
measurements of the SPARCstation_1 have been published, but not the Sun3.
On Espresso the SPARCstation_1 is 8.9X faster than a VAX 11-780, making
it *roughly* 3X faster than a Sun3/260 for Espresso.  (However, this is a
guess in the absence of data.)
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (12/30/89)

In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:


   One large CAD package is "SPICE2G6" from U.C. Berkeley.  It's a circuit
   simulator.  Here are measurements of a Sun3 and a Sun4, both running
   SPICE2G6, ...

Using the SPEC versions of               SPICE2G6    and     ESPRESSO

Sun3/470 (top of the line sun3 cpu)       4761s                 461s
Sun4/490 (top of the line sun4 cpu)       1486s                 151s

Which is a factor of 3x. 

Those with a serious interest in such things should examine the
results of commercial codes (which are often much faster, and have
been tuned for various platforms). The folks at HSPICE, for example,
would probably be happy to chat with you about your CAD needs.


--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

bs@linus.UUCP (Robert D. Silverman) (12/30/89)

In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
:In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes:
:  >I am reading this thread with interest, and would like to see some realistic
:  >numbers on SPARC performance. How does it perform on, say, a large CAD
:  >package compared with the Sun-3? - does anyone have any benchmark results?
:  >
:One large CAD package is "SPICE2G6" from U.C. Berkeley.  It's a circuit
:simulator.  Here are measurements of a Sun3 and a Sun4, both running
 
I may be wrong but.....

So far as I know [and I've seen the code] SPICE is not a heavy user
of integer multiply/divides. It does use a fair amount of floating 
point. Therefore, trying to get a measure of integer multiply/divide
performance using SPICE is grasping at straws at best.

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

roy@phri.nyu.edu (Roy Smith) (12/31/89)

	I've only been loosely following this thread for a while, so excuse
me if this has been gone through already.  I assume that SPARC based machines
have (at least as an option) some sort of floating point coprocessor, which
obviously would be used for floating point multiplies.  Most (handwave)
non-computational integer multiplies (by which I mean operations invented by
the compiler for array subscripting, pointer arithmetic, etc) involve one
constant factor with few ones, ideal for shift-and-add strength reductions.
Non-power-of-2 integer division is even rarer.

	The bottom line is that I'm hard pressed to think of an application
where full 32 x 32 integer multiplies constitute a significant fraction of
the operations performed.  Most large utility programs (operating systems,
compilers, editors, windowing systems (except maybe NeWS?) don't do many
32x32 multiplies and most scientific number crunchers are all floating point.
In fact, we have one large number cruncher that does macromolecule energy
minimization using fixed point (integral numbers of tenths of kcals/mole) but
that's all addition, subtraction, and table lookups.  What's left?  Digital
signal processing maybe?

	Slightly off the subject, is it possible, using the current SPARC
architecture, to take advantage of the floating point data paths to speed up
integer multiplies for those few applications that really need it?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"

irf@kuling.UUCP (Bo Thide') (12/31/89)

In order to put the discussion of SPARC arithmetic slowness into
some perspective, we ran the Plum Hall benchmark on one of our CISC
and two of our RISC machines.  The machines we used were:

HP9000/370 MC68030/33 MHz with and without floating point accelerator (fpa)
HP9000/835 HP-PA RISC/15 MHz
Sun SparcStation1 (SS1) SPARC RISC/20 MHz

================================================================================
                      register      auto      auto       int  function      auto
                           int     short      long  multiply  call+ret    double
 HP9000/370 (fpa -O)     0.22      0.26      0.22      1.35      3.96      0.62
       HP9000/370 (-O)   0.21      0.26      0.22      1.35      3.08      1.21
 HP9000/370 (fpa no -O)  0.26      0.40      0.36      1.44      4.42      1.56
    HP9000/370 (no -O)   0.26      0.40      0.37      1.45      3.38      2.72
       HP9000/835 (-O)   0.27      0.29      0.27      5.49      0.31      0.27 
    HP9000/835 (no -O)   0.29      0.53      0.45      5.62      0.31      0.59
          Sun SS1 (-O)   0.29      0.33      0.30     19.5       0.49      0.59
       Sun SS1 (no -O)   0.38      0.40      0.35     19.7       0.51      0.72
================================================================================
There is no question that the Sun SparcStation 1 is *extremely* slow on integer
multiply, even for a RISC architecture -- scaling the HP-PA results to the
same clock speed as the SPARC (20 MHz), we see that HP-PA is about 4.7 times
as fast as SPARC!!  Our results also show that integer arithmetic on the CISC
(MC68030) is much faster than on the RISCs (HP-PA, SPARC).

For those of you not familiar with the Plum Hall benchmark, here is some info:
----------------------------------------------------------------------------

[The following article appeared in  "C Users Journal" May 1988.
 It describes the purpose and use of the enclosed benchmarks. ]


SIMPLE BENCHMARKS FOR C COMPILERS

by Thomas Plum

Dr. Plum is the author of several books on  C,  including  Efficient  C  (co-
authored  with  Jim  Brodie).  He is Vice-Chair of the ANSI X3J11 Committee,
and Chairman of Plum Hall Inc, which offers introductory and  advanced  sem-
inars on C.

Copyright (c) 1988, Plum Hall Inc


We are placing into the public domain some simple  benchmarks  with  several
appealing properties:

    They are short enough to type while browsing at trade shows.

    They are protected against overly-aggressive compiler optimizations.

    They reflect empirically-observed operator frequencies in C programs.

    They give a C programmer information directly relevant to programming.

In Efficient C, Jim Brodie and I described how useful it can be for  a  pro-
grammer  to have a general idea of how many microseconds it takes to execute
the "average operator" on   register  int's,  on   auto  short's,  on   auto
long's,  and  on  double  data, as well as the time for an integer multiply,
and the time to call-and-return from a function.  These six numbers allow  a
programmer  to  make  very good first-order estimates of the CPU time that a
particular algorithm will take.

seanf@sco.COM (Sean Fagan) (12/31/89)

In article <85138@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>:There should be instructions on the order of "multiply step" and "divide 
>:step", each of which will do one of the 32 adds/subtracts and then shift.  
> 
>There is a multiply step instruction. There is no such support for division.
>It can take 200+ cycles to do a division on the SPARC [worst case]. 
>A 32 x 32 bit unsigned multiply takes 45-47 cycles. 

Oh, posh.  A CDC Cyber 170/760 can do a 60x60 -> 60 multiply (fp) in 5
cycles, worst case.  Divide is atrocious, being the only instruction slower
than a load (a load is 26 cycles, but I forget exactly how long a divide
takes).  However, proper instruction ordering means you don't have to worry
too much about how long it takes.  (Multiply is pipelined [does two 30-bit
multiplies at the same time], but divide isn't.  Sequential algorithm, if I
remember correctly.  Apparently Seymour *hates* divides 8-).)

Just had to plug Seymour 8-).

-- 
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

raob@mullian.ee.mu.oz.au (Richard Oxbrow) (12/31/89)

In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
:In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes:
:  >I am reading this thread with interest, and would like to see some realistic
:  >numbers on SPARC performance. How does it perform on, say, a large CAD
:  >package compared with the Sun-3? - does anyone have any benchmark results?
:  >
:One large CAD package is "SPICE2G6" from U.C. Berkeley.  It's a circuit
:simulator.  Here are measurements of a Sun3 and a Sun4, both running
 
	Spice2G/3C1 is mainly a floating-point program, while espresso is a
logic minimization program. It should also be kept in mind that,
for one reason or another, the attached floating point units
on the sun3 series are always slower than their sun4 cousins. If I remember
correctly, the latest FPU+ on a sun3/470 is only rated at 1.2 MFlops while
the sun4/390 (FPU2+?) is rated at something like 2.2 MFlops peak!  (I don't
think Sun wants to see its sun3 series giving the sun4s a hard time.)
	
richard ..
Richard Oxbrow			   |ACSnet	raob@mullian.oz
dept. of ee eng ,uni of melbourne  |Internet	raob@mullian.ee.mu.OZ.AU
parkville 3052 australia 	   |Arpa-relay  raob%mullian.oz@uunet.uu.net
fax   +[061][03]344 6678	   |Uunet	uunet!munnari!mullian!raob   

dtynan@altos86.Altos.COM (Dermot Tynan) (01/01/90)

In article <KHB.89Dec29150110@chiba.kbierman@sun.com>, khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) writes:
| 
| Using the SPEC versions of               SPICE2G6    and     ESPRESSO
|
| Sun3/470 (top of the line sun3 cpu)       4761s                 461s
| Sun4/490 (top of the line sun4 cpu)       1486s                 151s
| 
| Which is a factor of 3x. 
|
| Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com

Wait a minute though.  Doesn't SPICE use mostly floating-point for its
simulations (I don't have a copy, so I don't know)?  If so, it's rather
foolish to benchmark integer M/D with a floating-point package...
						- Der
-- 
	dtynan@altos86.Altos.COM		(408) 946-6700 x4237
	Dermot Tynan,  Altos Computer Systems,  San Jose, CA   95134

    "Far and few, far and few, are the lands where the Jumblies live..."

ok@quintus.UUCP (Richard A. O'Keefe) (01/01/90)

In article <1989Dec30.161619.22893@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>                                                        Most (handwave)
>non-computational integer multiplies (by which I mean operations invented by
>the compiler for array subscripting, pointer arithmetic, etc) involve one
>constant factor with few ones, ideal for shift-and-add strength reductions.

Roy Smith isn't the only one to say this recently in comp.arch.
What I'm afraid of here is the coupling I see between RISC design and
crippled programming languages.  Smith's claim about array subscripting
is true in some pathetically crippled languages where array bounds are
completely known at compile time (C and Pascal, and some derivatives of
Pascal).  It wasn't true in Algol 60 or Simula 67, and it isn't true in
Fortran 77.  (The reason that the multiplications in Fortran 77 tend to
involve "hard" multiplication is that good compilers already use
strength reduction to remove multiplication entirely when they can.)
Fortran 8X and Ada also allow arrays with dynamic bounds.
Actually, there _is_ a way that addressing polynomials could be handled
efficiently on machines which have slow multiplication, and that is for
a dynamic array declaration like
	real array a[1:P,1:Q,1:R];
to generate a local "addressing subroutine" at _run_ time
	a_poly := lambda(i,j,k) ((check(1,P,i)*Q+check(1,Q,j))*R+
				  check(1,R,k))*sizeof real + &a_data;
using whatever multiply-step sequence was best.  But that, of course,
requires a cheap way of generating code on the fly.  And there are some
RISC systems that make _that_ hard too.  The remaining alternative is to
precompute the offsets in some way, so that a[i,j,k] turns into
	a_data[QR_times[i] + R_times[j] + k]
which is all very well, but think about Fortran 8X cross-sections...
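
Spelling the precomputed-offset scheme out in C (a minimal sketch, my
names; the tables are built once per array allocation, and each
reference thereafter is two table lookups and two adds):

    #include <stdlib.h>

    unsigned *QR_times, *R_times;  /* offset tables for a P x Q x R array */

    void make_tables(P, Q, R)
    unsigned P, Q, R;
    {
        unsigned i;
        QR_times = (unsigned *) malloc(P * sizeof(unsigned));
        R_times  = (unsigned *) malloc(Q * sizeof(unsigned));
        for (i = 0; i < P; i++) QR_times[i] = i * Q * R;  /* only multiplies */
        for (i = 0; i < Q; i++) R_times[i]  = i * R;
    }

    /* thereafter a[i,j,k] is a_data[QR_times[i] + R_times[j] + k] */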

(This article is not a claim that fast integer multiplication and division
 is necessary to support Fortran and Ada well.  I haven't the figures to
 prove this.
)

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (01/01/90)

In article <1296@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>In article <1989Dec30.161619.22893@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>>                                                        Most (handwave)
>>non-computational integer multiplies (by which I mean operations invented by
>>the compiler for array subscripting, pointer arithmetic, etc) involve one
>>constant factor with few ones, ideal for shift-and-add strength reductions.
>
>Roy Smith isn't the only one to say this recently in comp.arch.
>What I'm afraid of here is the coupling I see between RISC design and
>crippled programming languages.  Smith's claim about array subscripting
>is true in some pathetically crippled languages where array bounds are
>completely known at compile time (C and Pascal, and some derivatives of
>Pascal).  It wasn't true in Algol 60 or Simula 67, and it isn't true in
>Fortran 77.

>Actually, there _is_ a way that addressing polynomials could be handled
>efficiently on machines which have slow multiplication, and that is for
>a dynamic array declaration like
>	real array a[1:P,1:Q,1:R];
>to generate a local "addressing subroutine" at _run_ time
>using whatever multiply-step sequence was best.  But that, of course,
>requires a cheap way of generating code on the fly.  And there are some
>RISC systems that make _that_ hard too.
>

These are two interesting and important points. Especially the latter one:
some machine architectures may NEED the ability to use self-modifying
code to get good speed. 

********This should be the subject of an entire new flame-thread. I
have found that architectures (or OS's) that don't allow EASY 
self-modifying code (or for most of my stuff, incremental compilation)
are sufficiently brain-dead to be unusable for many purposes.*****

Doug McDonald (mcdonald@aries.scs.uiuc.edu)

tot@frend.fi (Teemu Torma) (01/02/90)

In article <1313@kuling.UUCP> irf@kuling.UUCP (Bo Thide') writes:

   ================================================================================
			 register      auto      auto       int  function      auto
			      int     short      long  multiply  call+ret    double
    HP9000/370 (fpa -O)     0.22      0.26      0.22      1.35      3.96      0.62
	  HP9000/370 (-O)   0.21      0.26      0.22      1.35      3.08      1.21
    HP9000/370 (fpa no -O)  0.26      0.40      0.36      1.44      4.42      1.56
       HP9000/370 (no -O)   0.26      0.40      0.37      1.45      3.38      2.72
	  HP9000/835 (-O)   0.27      0.29      0.27      5.49      0.31      0.27 
       HP9000/835 (no -O)   0.29      0.53      0.45      5.62      0.31      0.59
	     Sun SS1 (-O)   0.29      0.33      0.30     19.5       0.49      0.59
	  Sun SS1 (no -O)   0.38      0.40      0.35     19.7       0.51      0.72
   ================================================================================
   There is no question that the Sun SparcStation 1 is *extremely* slow on integer
   multiply, even for a RISC architecture -- scaling the HP-PA results to the
   same clock speed as the SPARC (20 MHz), we see that HP-PA is about 4.7 times
   as fast as SPARC!!  Our results also show that integer arithmetic on the CISC
   (MC68030) is much faster than on the RISCs (HP-PA, SPARC).

Strange. I got much better int multiply results from Sun SS1. Gcc
version was 1.36.

			 register      auto      auto       int  function      auto
			      int     short      long  multiply  call+ret    double
Sun 4/60 (cc, no -O)	  0.38      0.40      0.36      3.52      0.28      0.72    
Sun 4/60 (cc, -O)	  0.32      0.35      0.32      3.62      0.28      0.62    
Sun 4/60 (cc, -O2)	  0.30      0.33      0.30      3.32      0.26      0.61    
Sun 4/60 (cc, -O3)	  0.30      0.33      0.30      3.45      0.28      0.63    
Sun 4/60 (gcc, no -O)	  0.21      0.38      0.38      3.50      0.27      0.77    
Sun 4/60 (gcc, -O)	  0.18      0.21      0.18      3.57      0.27      0.42    
Sun 4/60 (gcc, <2>)	  0.17      0.22      0.19      3.67      0.28      0.42    
Sun 4/60 (gcc, <3>)	  0.17      0.21      0.18      3.52      0.27      0.40    

gcc <2>: -fstrength-reduce -fcombine-regs -fomit-frame-pointer
gcc <3>: as <2> including -mno-epilogue
--
Teemu Torma
Front End Oy, Helsinki, Finland
Internet: tot@nyse.frend.fi

vestal@SRC.Honeywell.COM (Steve Vestal) (01/03/90)

In article <1296@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>   Fortran 8X and Ada also allow arrays with dynamic bounds.
>   [...]
>	   real array a[1:P,1:Q,1:R];

I believe that induction variable optimizations, which replace multiplications
by additions for certain array address computations within loops, work just
fine even with dynamic bounds.  My limited experience evaluating such
optimizations has been that they are surprisingly widely applicable for
matrix/vector manipulation code.  Personally, I suspect explicit
multiplies/divides/mods/rems in the application code might be the hazard to
watch out for rather than integer multiplications used in addressing
computations.
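
[To make the optimization concrete, here is a minimal C sketch -- an
invented example, not from any compiler discussed here -- of what
induction-variable strength reduction does to an addressing multiply,
even when the bound Q is dynamic:]

/* Before: every iteration multiplies to form the address a[i*Q + j],
   where Q is a run-time array bound. */
double row_sum(const double *a, int Q, int i)
{
    double s = 0.0;
    for (int j = 0; j < Q; j++)
        s += a[i * Q + j];               /* multiply inside the loop */
    return s;
}

/* After strength reduction: one multiply to set up a pointer, and the
   loop then advances by addition only. */
double row_sum_reduced(const double *a, int Q, int i)
{
    double s = 0.0;
    const double *p = a + (long)i * Q;   /* single multiply, hoisted */
    const double *end = p + Q;
    while (p < end)
        s += *p++;                       /* address advances by addition */
    return s;
}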

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/03/90)

In article <135@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:

.....

   Wait a minute though.  Doesn't SPICE use mostly floating-point for its
   simulations (I don't have a copy, so I don't know)??  If so, it's rather
   foolish to benchmark Integer M/D, with a floating-point package...

I didn't choose the code; another poster did. As it happens, the
circuit selected by the SPEC group causes SPICE to spend its time in a
very different fashion than one usually expects (most of the
commercial SPICEs have lavished great attention on fast solution of
systems of equations, viz. linpack problem-space stuff), and that
isn't true for this data set. Had it been true, the speed differential
would be much more favorable for SPARC.
--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

alan@oz.nm.paradyne.com (Alan Lovejoy) (01/05/90)

In article <1535@cbnewsi.ATT.COM> reha@cbnewsi.ATT.COM (reha.gur) writes:
<In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
<
<> It is clear that you are not to be trusted (see above).  To multiply
<> two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
<> the 32 bit numbers must be divided into 16 bit parts.  The whole operation
<> takes about 20 operations (count them).  Shift and add are far slower.
<> Divide is even worse.   Also, there is considerable overhead in a
<> subroutine call; there are registers to save and restore.  Open
<> subroutines (in-line functions) are a way around it, but they still
<> have the problem.
<> 
<> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
<> Phone: (317)494-6054
<> hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
<
<The numbers I get (from looking at the data sheets and other info) for
<two machines: a 25MHz i486 and a 25MHz SPARC are as below:
<
<Assuming no cache hits and various other items:

-- did you mean "no cache misses"?

<i486: 	18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
<SPARC:	48-52 cycles for same (including subroutine call and return time)
<
<i486:	32 cycles for signed 32 bit division (acc by reg)
<SPARC:	41 (approximate best case) to 211 (approximate worst case)
<	(depends on bits in dividend and divisor)
<

Sounds like an interesting contest to me!  Here's my try (for multiply, 
anyway) using the mc88k instruction set:

(NOTE: r0 is the constant 0 (a hardware protocol); jsr sets r1 with the 
return address)

;dmulu -- multiply unsigned 32x32=<64
;r2 and r3 contain two 32-bit unsigned integers to be multiplied
;r12 will contain the high word (32 bits) of the product (r2 * r3)
;r13 will contain the low word (32 bits) of the product (r2 * r3)

	.proc	dmulu

dmulu:
;divide the two 32-bit numbers into 4 16-bit parts:
extu 	r4, r2, 16<16<; 	r4 = r2 >> 16; (1 cycle)
extu    r5, r3, 16<16<;		r5 = r3 >> 16; (1 cycle)
mask 	r2, r2, $FFFF;		r2 = r2 & 0xFFFF (1 cycle)
mask 	r3, r3, $FFFF;		r3 = r3 & 0xFFFF (1 cycle)

;calculate partial products:
mul	r6, r3, r3;		r6 = r2 * r3; (4 cycles!)
mul	r7, r3, r4;             r7 = r3 * r4; (4 cycles!)
mul     r8, r2, r5;             r8 = r2 * r5; (4 cycles!)
mul	r9, r4, r5;             r9 = r4 * r5; (4 cycles!)

;sum partial products:
extu	r10, r6, 16<16<;	r10 = r6 >> 16; (1 cycle)
addu	r7, r7, r10;		r7 = r7 + r10; (1 cycle)
addu.co	r10, r7, r8;		r10 = r7 + r8; generate carry bit (1 cycle)
addu.ci	r12, r0, r0;		r12 = carry from previous addu (1 cycle)
mak	r12, r12, 16<16<;	r12 = r12 << 16; (1 cycle)
addu  	r12, r12, r9;		r12 = r12 + r9; (1 cycle)
mak	r10, r10, 16<16<;	r10 = r10 << 16; (1 cycle) 
mask	r13, r6, $FFFF;		r13 = r6 & 0xFFFF; (1 cycle)
jmp.n	r1;                     return to caller after next instruction (1 cycle)
or	r13, r13, r10;		r13 = r13 | r10; (1 cycle)
; 	done: 			30 cycles total (without short circuit code)
 	.end

; invoking dmulu:
;       it is the caller's responsibility to save registers r2-13, which
;       the caller may or may not need to do...
or	r25, r0, #dmuluLo16;    r25 = low 16 bits of dmulu address (1 cycle) 
or.u	r25, r25, #dmuluHi16;   r25 = r25 | high 16 bits of dmulu address (1 c)
ld	r2, r30, #factor1;	r2 = *(framePtr + offset of factor1) (1 cycle) 
jsr.n	r25;			call dmulu after next instruction (1 cycle) 
ld	r3, r30, #factor2;	r3 = *(framePtr + offset of factor2) (1 cycle) 
st.d	r12, r30, #product;	*(framePtr + offset of product) = r12, r13

; register to register: 1 cycle call excluding execution time for dmulu 
; memory to memory: 6 cycle call excluding execution time for dmulu
; GRAND TOTAL register to register: 31 cycles
; GRAND TOTAL memory to memory: 36 cycles 

____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

alan@oz.nm.paradyne.com (Alan Lovejoy) (01/05/90)

Excuuuuuuuuuuse me! Three errors, one very slight, two not so slight!

In article <6903@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes:
>Sounds like an interesting contest to me!  Here's my try (for multiply, 
>anyway) using the mc88k instruction set:
>
>(NOTE: r0 is the constant 0 (a hardware protocol); jsr sets r1 with the 
>return address)

;dmulu -- multiply unsigned 32x32=<64
;r2 and r3 contain two 32-bit unsigned integers to be multiplied
;r12 will contain the high word (32 bits) of the product (r2 * r3)
;r13 will contain the low word (32 bits) of the product (r2 * r3)

	.proc	dmulu

dmulu:
;divide the two 32-bit numbers into 4 16-bit parts:
extu 	r4, r2, 16<16>; 	r4 = r2 >> 16; (1 cycle)

*** error #1:
The original posting consistently had "16<16<" instead of "16<16>".  The
wonders of global replace... (The very slight error).

extu	r5, r3, 16<16>;		r5 = r3 >> 16; (1 cycle)
mask 	r2, r2, $FFFF;		r2 = r2 & 0xFFFF (1 cycle)
mask 	r3, r3, $FFFF;		r3 = r3 & 0xFFFF (1 cycle)

;calculate partial products:
>mul	r6, r3, r3;		r6 = r2 * r3; (4 cycles!)
 
*** error #2:
One of the r3's in the above instruction must be an r2 (as the comment shows).

>mul	r7, r3, r4;             r7 = r3 * r4; (4 cycles!)
>mul    r8, r2, r5;             r8 = r2 * r5; (4 cycles!)
>mul	r9, r4, r5;             r9 = r4 * r5; (4 cycles!)

*** error #3:
I completely forgot that the mul instruction is fully pipelined!
It takes 4 cycles to complete, yes, but a new instruction can be
issued on the very next cycle. Up to six mul and/or fmul instructions
can be in the pipeline at one time.  So, (after some reordering to
avoid stalls) we have instead:

;calculate partial products:
mul	r6, r2, r3;		r6 = r2 * r3; (1 cycle)
mul	r8, r2, r5;             r8 = r2 * r5; (1 cycle)
mul	r7, r3, r4;             r7 = r3 * r4; (1 cycle)
mul	r9, r4, r5;             r9 = r4 * r5; (1 cycle)

;sum partial products:
extu	r10, r6, 16<16>;	r10 = r6 >> 16; (1 cycle)
addu	r8, r8, r10;		r8 = r8 + r10; (1 cycle)
*** note that r7 changed to r8 to avoid a stall
addu.co	r10, r7, r8;		r10 = r7 + r8; generate carry bit (1 cycle)
addu.ci	r12, r0, r0;		r12 = carry from previous addu (1 cycle)
mak	r12, r12, 16<16>;	r12 = r12 << 16; (1 cycle)
addu  	r12, r12, r9;		r12 = r12 + r9; (1 cycle)
mak	r10, r10, 16<16>;	r10 = r10 << 16; (1 cycle) 
mask	r13, r6, $FFFF;		r13 = r6 & 0xFFFF; (1 cycle)
jmp.n	r1;                     return to caller after next instruction (1 cycle)
or	r13, r13, r10;		r13 = r13 | r10; (1 cycle)
; 	done: 			18 cycles total (without short circuit code)
 	.end

; invoking dmulu:
;       it is the caller's responsibility to save registers r2-13, which
;       the caller may or may not need to do...
or	r25, r0, #dmuluLo16;    r25 = low 16 bits of dmulu address (1 cycle) 
or.u	r25, r25, #dmuluHi16;   r25 = r25 | high 16 bits of dmulu address (1 c)
ld	r2, r30, #factor1;	r2 = *(framePtr + offset of factor1) (1 cycle) 
jsr.n	r25;			call dmulu after next instruction (1 cycle) 
ld	r3, r30, #factor2;	r3 = *(framePtr + offset of factor2) (1 cycle) 
st.d	r12, r30, #product;	*(framePtr + offset of product) = r12, r13

; register to register: 1 cycle call excluding execution time for dmulu 
; memory to memory: 6 cycle call excluding execution time for dmulu
; GRAND TOTAL register to register: 19 cycles (was 30)
; GRAND TOTAL memory to memory: 24 cycles (was 36) 

*** A significant performance improvement: 39 per cent register to
*** register, 33 per cent memory to memory.
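
[For reference, the same four-partial-product scheme rendered as a
portable C sketch -- a paraphrase of the algorithm above, not Lovejoy's
code, and assuming C99 fixed-width types:]

#include <stdint.h>

/* 32x32 -> 64 unsigned multiply via 16-bit halves, mirroring the mc88k
   routine: four partial products, then a carry-aware summation. */
void dmulu_c(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;           /* contributes to bits  0..31 */
    uint32_t p1 = b_lo * a_hi;           /* contributes to bits 16..47 */
    uint32_t p2 = a_lo * b_hi;           /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;           /* contributes to bits 32..63 */

    uint32_t mid = p1 + (p0 >> 16);      /* cannot overflow 32 bits */
    uint32_t sum = mid + p2;             /* this one can overflow ... */
    uint32_t carry = (sum < p2);         /* ... so capture the carry */

    *hi = p3 + (sum >> 16) + (carry << 16);
    *lo = (sum << 16) | (p0 & 0xFFFF);
}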



____"Congress shall have the power to prohibit speech offensive to Congress"____
Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL.
Disclaimer: I do not speak for AT&T Paradyne.  They do not speak for me. 
Mottos:  << Many are cold, but few are frozen. >>     << Frigido, ergo sum. >>

dgr@hpfcso.HP.COM (Dave Roberts) (01/05/90)

Sorry Guys and Gals,
	I didn't intend to start a pounce on RISC thread.  When I answered
Bob's question it was intended to show him that there was a way to do
multiply and divide.
	Now for some comments:
	(1) SPARCs will get multiply and divide.  This is from a guy at
	    Sun.  Coming soon to a SPARC station near you...
	(2) By suggesting that Bob was "much better off" (unclear on my
	    part) I didn't mean to suggest that he was going to get stellar
	    integer performance all the time.  Rather, in general, his
	    whole program should run faster.  I guess it didn't, but then
	    again I didn't look at the code.
	(3) As some have pointed out, the reason for removing those
	    instructions from a RISC architecture is because *most* programs
	    don't do a whole lot of multiplications between arbitrary 32 bit
	    integers.  Usually it is an arbitrary integer and a known (though
	    not necessarily small) integer constant.  With the known constant
	    you can reduce the mult to a known sequence of shift and adds,
	    which a good compiler will do (in fact, many CISC machines would
	    run faster if their compilers did this for them too, instead
	    of just inserting an XX-cycle multiply instruction; see the
	    sketch after this list).
	(4) If you need the speed, you write the code inline.  Loops kill
	    you in whatever architecture you use.  If you do huge numbers
	    of arbitrary 32x32 mults, your code will explode, but hey,
	    this is a RISC machine and your code size is already through
	    the roof, right?  If you call a subroutine every time you want
	    to do a multiply the overhead of the call will kill you.  But
	    notice that this wasn't what I suggested, either.
	(5) The original point was that most programs don't need the kind
	    of integer numerical performance that, I guess, Bob's does,
	    and in general the shift and adds (for computing things like
	    array indices and so forth) are just fine.
	    It's a (semi)pathological case in the whole universe of computer
	    programs.  As a user who doesn't generate programs like that,
	    I'd rather all the other instructions be speeded up a bit by
	    allowing higher clock speeds, etc.  And most users don't generate
	    or use programs like that.
	(6) If you really need the blazing integer speed, buy a coprocessor.
	    That is also one of the fundamental RISC ideas.  There are times
	    when things just aren't done well by software and do need hardware
	    help.  This option also allows you to get *really, really* fast
	    integer speed by using a multiplier array (works by generating
	    all the product terms all at once and then adding the whole
	    shebang together.  It's fast as hell but it uses tons of chip
	    area.  Perfect for a coprocessor).  Someone else pointed this
	    out a few postings back (in the DSP entries, I think).  Sure it
	    costs more, but I'd rather save the cost when I don't
	    need it.  (Remember that floating point is also a coprocessor.
	    Only naivety would hold that integer operations can't be also.)
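
[To make point (3) concrete -- a toy C sketch with an invented helper
name, not any particular compiler's output: the known constant 10 is
8 + 2, so the multiply reduces to two shifts and an add.]

/* x * 10 expanded the way a compiler might do it: (x<<3) + (x<<1) */
unsigned mul_by_10(unsigned x)
{
    return (x << 3) + (x << 1);          /* 8x + 2x, no multiply needed */
}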



Dave Roberts
Hewlett-Packard Co.
dgr@hpfcla.hp.com

tim@nucleus.amd.com (Tim Olson) (01/05/90)

In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
| 	(4) If you need the speed, you write the code inline.  Loops kill
| 	    you in whatever architecture you use.  If you do huge numbers
| 	    of arbitrary 32x32 mults, your code will explode, but hey,
| 	    this is a RISC machine and your code size is already through
| 	    the roof, right?  If you call a subroutine every time you want
| 	    to do a multiply the overhead of the call will kill you.  But
| 	    notice that this wasn't what I suggested, either.

Inlining code for performing multiplies is an option, but the call
overhead isn't going to "kill you" -- the overhead would probably be
less than 10% -- especially if these kinds of routines used a special
calling sequence, known to the compiler, without the overhead of a
standard call.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

bs@linus.UUCP (Robert D. Silverman) (01/09/90)

In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
 
stuff deleted..

:	(6) If you really need the blazing integer speed, buy a coprocessor.
:	    That is also one of the fundamental RISC ideas.  There are times
:	    when things just aren't done well by software and do need hardware
:	    help.  This option also allows you to get *really, really* fast
 
I'd love to. The only problem is: How do I integrate (say) an integer DSP
chip into my SUN so it will act as a co-processor. Who modifies the compilers
to use it, etc. etc... I could rewrite the compilers but I don't have the
time. Do you know of a commercially available integer coprocessor that I 
can plug into a workstation? I don't. 

:	    integer speed by using a multiplier array (works by generating
:	    all the product terms all at once and then adding the whole
 
etc.

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

gregw@otc.otca.oz (Greg Wilkins) (01/11/90)

in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
> 	Now for some comments:
> 	(1) SPARCs will get multiply and divide.  This is from a guy at
> 	    Sun.  Coming soon to a SPARC station near you...

OK Sun Microsystems!  We know you're out there listening.  How about an
official comment on this????     What about the ABI (application binary
interface) standard you guys are pushing.  Signing up all those
clone manufacturers etc etc.  So you're going to spring a multiply
instruction on them and make it all outdated (when compared to the Sun 5???)

Even if it is not a new instruction but an integer co-processor, it
still needs to be included in the ABI.

Without an official yeh or nay on this one, nobody can have confidence in
Sun's commitment to open systems and the ABI.


DISCLAIMER:  I do not know my employers opinion on these matters.

Greg Wilkins              ACSnet:gregw@otc.oz.au  igc nets:   igc:peg:gwilkins
 "To sin by silence when  Phone(w):  (02)2874862  Telex:           OTCAA120591
  they should  speak out  Phone(h):  (02)8104592  Snail:OTC Services R&D,
  makes  cowards of men"  Fax:       (02)2874990        GPO Box 7000,
           - Abe Lincoln  O/S ph: (prefix) 612 #        Sydney 2001, Australia

dgr@hpfcso.HP.COM (Dave Roberts) (01/11/90)

Bob Silverman writes,
>
>In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
> 
>stuff deleted..
>
>:	(6) If you really need the blazing integer speed, buy a coprocessor.
>:	    That is also one of the fundamental RISC ideas.  There are times
>:	    when things just aren't done well by software and do need hardware
>:	    help.  This option also allows you to get *really, really* fast
> 
>I'd love to. The only problem is: How do I integrate (say) an integer DSP
>chip into my SUN so it will act as a co-processor. Who modifies the compilers
>to use it, etc. etc... I could rewrite the compilers but I don't have the
>time. Do you know of a commercially available integer coprocessor that I 
>can plug into a workstation? I don't. 
>
>:	    integer speed by using a multiplier array (works by generating
>:	    all the product terms all at once and then adding the whole
> 
>etc.
>
>-- 
>Bob Silverman
>#include <std.disclaimer>
>Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
>Mitre Corporation, Bedford, MA 01730
>----------

Sorry Bob,
	I don't have all the answers.  I guess you do.  I was just trying
to propose some solutions.  I guess you're perfectly right.  The SPARC
is a joke.  Any machine that doesn't have integer multiply and divide is
crippled for life.  Whoops, sorry, any machine that doesn't meet your
performance requirements in any way is crippled for life, and all the
people who designed it are sick in the head.  You're right Bob.  How
could I have been so stupid.  I guess all those people who are building
chips without those functions you seem to require are just silly,
never mind that they meet the needs of thousands of people.  Honestly,
what was I thinkin', huh?

I just thought that since you seem to have bought the wrong machine (in
your opinion), I'd try and help you make it work.  I'm sorry I even suggested
you do any extra work.  I guess I should have included another option:
	(7) Sell the damn thing since you seem to hate it so much and buy
	    something else.  Of course this time, try running some of your
	    code on it first so you'll know that it performs up to your
	    standards.


Sorry people.  No smiley faces on this one.  I'm just too tired of trying
to deal with a bunch of whiners who'll never be satisfied with anything
I suggest.

People, RISC is not the cat's meow to everybody.  If you're one of those
for whom it doesn't work, feel free to try something else, but don't
tell me that it doesn't work for me either.

- Dave Roberts
"Just another `brain dead' designer of RISC machines"

mike@umn-cs.CS.UMN.EDU (Mike Haertel) (01/11/90)

In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>Even if it is not a new instruction but an integer co-processor, it
>still needs to be included in the ABI.
>
>Without an official yeh or nay on this one, nobody can have confidence in
>Sun's commitment to open systems and the ABI.

I don't see how ABI is related to "open systems."

Source level compatibility ought to be enough for anyone.  With the
emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the
X window system, it will be substantially easier to write portable
programs, and these standards are likely to have a far longer viable
life than any ABI.

I see source level standards as being far less likely than ABIs
to kill future innovation.  RISC architectures, and surely the
SPARC, are far from the last word in computer architecture!

How will your ABI relate to the massively parallel machines
that are likely to become increasingly prevalent?  And yet the
source level standards will still be useful, particularly
as even some of today's programs might conceivably benefit from
parallelizing compilers.
-- 
Mike Haertel <mike@ai.mit.edu>
"Everything there is to know about playing the piano can be taught
 in half an hour, I'm convinced of it." -- Glenn Gould

andrew@frip.WV.TEK.COM (Andrew Klossner) (01/12/90)

[]

	"What about the ABI (application binary interface) standard you
	guys are pushing ... Even if it is not a new instruction but an
	integer co-processor, it still needs to be included in the
	ABI."

This is backward.  There is no need to include a new, commonly
implemented instruction in the ABI, and there are good reasons not to
do it.  The ABI details a common subset, the intersection of conformant
systems, not the union.

The SPARC ABI describes the virtual machine which is (will be?)
implemented on all SPARC systems.  It gives the interface which an
ABI-conformant application can depend on.  If a machine implements both
the ABI and the new multiply instruction, programs which conform to the
ABI (and therefore do not use multiply) will still execute correctly.
This is all that matters.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

henry@utzoo.uucp (Henry Spencer) (01/12/90)

In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>OK Sun Microsystems!  We know you're out there listening.  How about an
>official comment on this????     What about the ABI (application binary
>interface) standard you guys are pushing.  Signing up all those
>clone manufacturers etc etc.  So you're going to spring a multiply
>instruction on them and make it all outdated (when compared to the Sun 5???)

Why are you so surprised?  Some of us have been wondering about this
all along.  Sun as a company is *not* committed to open systems, despite
their marketing hype; just try to get hardware documentation out of them
for the Sun 3 line.  When the party line on SPARC, S-bus, etc. is "well,
we're not really a workstation company, we see no problem in everybody
being able to build compatible hardware", and the party line on the Sun 3
is "the innards of that workstation are secret, and we can't let out any
information on it, and we can't understand why you persist in making such
ridiculous requests for what is obviously necessarily Top Secret stuff",
then either the left hand doesn't know what the right hand is doing or
there are some really heavy-duty ulterior motives involved.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

kck@mips.COM (Ken Klingman) (01/12/90)

From article <18132@umn-cs.CS.UMN.EDU>, by mike@umn-cs.CS.UMN.EDU (Mike Haertel):
> I don't see how ABI is related to "open systems."
> 
> Source level compatibility ought to be enough for anyone.  With the

Source level compatibility is not enough.  An applications developer
wants to minimize the number of ports and different copies of the 
same package that must be produced.  Most applications developers do
not distribute source and would like to only distribute one version
per architecture.  That's the purpose of an Application Binary
Interface: standardize the binary interface to a system so that an
application developer need only distribute one version per architectural
version of a system.

This means that if a processor supplier adds some whizzy new feature
or adds some hardware performance enhancement in an upward compatible
fashion, but which requires recompilation to take advantage of, then
the chip vendor is effectively creating a new architecture.  An application
developer has to weigh the cost of producing a new version of an application
for a new architecture.  If the new features or the improved performance
warrants it, then the developer will take advantage of it.

This doesn't stifle creativity, but it tends to require much more
substantial architectural improvements than just minor tweaks.


Ken Klingman				Mips Computer Systems
kck@mips.com				928 Arques Avenue
{uunet,decwrl,pyramid,ames}!mips!kck	Sunnyvale, CA  94086
					(408)991-7826

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/12/90)

In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:


   in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
   > 	Now for some comments:
   > 	(1) SPARCs will get multiply and divide.  This is from a guy at
   > 	    Sun.  Coming soon to a SPARC station near you...

No, it was a guy at TI.

   OK Sun Microsystems!  We know you're out there listening.  How about

Yes. Didn't you notice my posting BEFORE the fellow from TI's? .... I
explained a perfectly ABI-compliant way to use such instructions w/o
mandating ABI noncompliance (though one might want to go for full
speedup and not care about ABI compliance for some applications).

>   interface) standard you guys are pushing.  Signing up all those
>   clone manufacturers etc etc.  So you're going to spring a multiply
>   instruction on them and make it all outdated (when compared to

Ours ?? TI doesn't belong to Sun. Chip vendors sign up to build chips,
as long as they don't break specs (and there are compliant ways to add
instructions ... buy a license and learn how :>) who are we to stop
them ????

>   Even if it is not a new instruction but an integer co-processor, it
>   still needs to be included in the ABI.

ABI compliant codes can get use of additional instructions .... shared
libraries have their uses :>

>   Without an official yeh or nay on this one, nobody can have confidence in
>   Sun's commitment to open systems and the ABI.

I'm no official.

But I don't recall seeing the formal ABI publication .... there is a de
facto ABI (Solb works or haven't you noticed). Unless you have seen an
ABI spec which asserts that there is such and such an instruction,
you'd better not generate that instruction directly (if you care about
complying with the ABI). If you use the current Sun compilers, you get
.mul, .div, .rem etc. .... and if you have a chip which had those as
hw instructions you (as an end user, or software author, or what have
you) could replace those library codes with ones which employed your
nifty instruction(s). No ABI breakage.

You jump to some amazing conclusions from a posting from _TI_ about
Sun's plans and business practices.


--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

bs@linus.UUCP (Robert D. Silverman) (01/12/90)

In article <8840008@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
:Bob Silverman writes,
:>
:>In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
:Sorry Bob,
:	I don't have all the answers.  I guess you do.  I was just trying
:to propose some solutions.  I guess you're perfectly right.  The SPARC
:is a joke.  Any machine that doesn't have integer multiply and divide is
:crippled for life.  Whoops, sorry, any machine that doesn't meet your
 
(1) I never said the SPARC was crippled for life. Here we have the old
slippery-slope argument: I criticize one aspect of SPARC, one which makes
it unsuitable for a class of applications, and this clown immediately
extrapolates the criticism into my saying that the SPARC is worthless.
Your arguments also lack credibility when you start attacking me personally.

:performance requirements in any way is crippled for life, and all the
:people who designed it are sick in the head.  You're right Bob.  How
:could I have been so stupid.  I guess all those people who are building
:chips without those functions you seem to need require are just silly,
:never mind that they meet the needs of thousands of people.  Honestly,
:what was I thinkin', huh?
:
:I just thought that since you seem to have bought the wrong machine (in
:your opinion), I'd try and help you make it work.  I'm sorry I even suggested
:you doing any extra work.  I guess I should have included another option:
 
(2) I didn't buy any SPARCs. I am fully aware of their defects for my type
of applications. However, others with whom I work are gradually acquiring
SPARC-based workstations and replacing other (68020/80386) workstations that
are better suited. What are those of us to do who want the greater speed of
newer processors, yet who find that the architecture of the new products
is deficient for our application?

:	(7) Sell the damn thing since you seem to hate it so much and buy
:	    something else.  Of course this time, try running some of your
:	    code on it first so you'll know that it performs up to your
 
Try getting the facts before shooting your mouth off. You'll look less like
a fool that way. I do not own a SPARC, I have not bought one, nor have I
ever said I bought one. Yet, you jump to the conclusion that since I have
a criticism of a computer architecture it must be because I am dissatisfied
with one I bought.

:
:
:Sorry people.  No smiley faces on this one.  I'm just too tired of trying
:to deal with a bunch of whiners who'll never be satisfied with anything
 
Incredible. I try to carry on a conversation about the scientific merits
of a processor, and point out that the market doesn't have an answer to the 
criticism and this retarded dweeb takes it personally!!

He calls me a 'whiner' when I simply point out that the SPARC is deficient
in a particular area.

This is whining? It sounds to me like a statement of fact.
 
-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

gerry@zds-ux.UUCP (Gerry Gleason) (01/12/90)

In article <18132@umn-cs.CS.UMN.EDU> mike@umn-cs.cs.umn.edu (Mike Haertel) writes:
>In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>>Even if it is not a new instruction but an integer co-processor, it
>>still needs to be included in the ABI.

>>Without an official yeh or nay on this one, nobody can have confidence in
>>Sun's commitment to open systems and the ABI.

>I don't see how ABI is related to "open systems."

Well, it is and it isn't.

>Source level compatibility ought to be enough for anyone.  With the
>emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the
>X window system, it will be substantially easier to write portable
>programs, and these standards are likely to have a far longer viable
>life than any ABI.

  While I agree that source level standards are what allow for open
systems in the broadest sense, "enough for anyone," no.  ABI's are
important in relationship to "shrink-wrap" selling of software, the
selling of software in pre-packaged executable form for standard
platforms.  There must be enough machines out there compliant
with any particular ABI for the market to be big enough for
distributers to stock products for it.

  A related issue is how many different ABI's the market has
room for.  I suspect the number is quite small (maybe only 2, certainly
not more than 5), so anyone care to guess which architectures will
win?  For my money, I don't see how Intel 386/486 could not be on
the list -- there are simply too many roughly compatible machines out
there already -- and then I would bet on one or two of SPARC, MIPS,
or 88000.  Greg is right that it is critical to lock down all
aspects of the hardware (and systems software standards) that
affect the ABI.  If Sun or any of the others falter on these points,
the marketplace gets divided and will never develop.

>I see source level standards as being far less likely than ABIs
>to kill future innovation.  RISC architectures, and surely the
>SPARC, are far from the last word in computer architecture!

Standards that do not evolve to include future innovation die
off, but that's not the point.  Source standards do a lot to
preserve/extend software investment, particularly over time,
but all of us know that there is no such thing as just compiling
any source over about 10k (and few smaller) on a different
family of machine (even on different '386 PCs you must recite
the proper incantations for things to work).  If there is not
a commodity market in software for a particular architecture,
the only company guaranteed to have an incentive to compile
and package binaries is the manufacturer.  Admittedly,
the situation is different in the upper ranges of the market;
MIPS R6000 machines, for example, won't likely be in
environments where the user is the system administrator and
package installer, so installation and maintenance don't
have to be simple.  But for the high-volume low end of the
market the requirements for strict standards that support
drop-in compatibility will become ever greater.

What I am curious about is whether there have been any reasonable
proposals for the RFP that OSF put out last year.  The one for
an architecture independent distribution standard, or whatever
they called it.  To me it sounded like a great idea that might
be impossible in practice.

>How will your ABI relate to the massively parallel machines
>that are likely to become increasingly prevalent?  And yet the
>source level standards will still be useful, particularly
>as even some of today's programs might conceivably benefit from
>parallelizing compilers.

For the short term, ABI is not that relevant to parallel machines
because they are mid- to high-end machines at this point.  Also,
to my knowledge, UNIX has a very narrow range of parallel
architectures that it maps well to (various closely and loosely
coupled MIMD architectures).  So for mainstream machines (where
ABI's are relevant) I expect that parallelism will be supported
in standards by MACH features or MACH-like features; in other
words, parallelism isn't really a problem now.

Gerry Gleason

rogerk@mips.COM (Roger B.A. Klorese) (01/13/90)

In article <18132@umn-cs.CS.UMN.EDU> mike@umn-cs.cs.umn.edu (Mike Haertel) writes:
>Source level compatibility ought to be enough for anyone.  With the
>emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the
>X window system, it will be substantially easier to write portable
>programs, and these standards are likely to have a far longer viable
>life than any ABI.

At least 90% of computer users do not have the source of the applications
they use.  And most software houses will not recompile their products for
an arbitrarily large number of platforms, because each compilation also
requires a (long, one hopes) test cycle.  API's are useful, but ABI's are
most important to open systems.

-- 
ROGER B.A. KLORESE      MIPS Computer Systems, Inc.      phone: +1 408 720-2939
928 E. Arques Ave.  Sunnyvale, CA  94086                        rogerk@mips.COM
{ames,decwrl,pyramid}!mips!rogerk
"Two guys, one cart, fresh pasta... *you* figure it out." -- Suzanne Sugarbaker

dmocsny@uceng.UC.EDU (daniel mocsny) (01/13/90)

In article <96@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>What I am curious about is whether there have been any reasonable
>proposals for the RFP that OSF put out last year.  The one for
>an architecture independent distribution standard, or whatever
>they called it.  To me it sounded like a great idea that might
>be impossible in practice.

From Byte Magazine, Nov. 1989, p. 368, "DOS at RISC" (ouch!),
by Colin Hunter:

"...Hunter Systems has submitted the Analyzer and XDOS to the OSF
as an Architecturally Neutral Distribution Format, or ANDF."

The interesting article describes XDOS, a system that the article
claims generates "true binary ports" of MS-DOS programs to many
UNIX platforms. I have no experience with this product. Does
anyone else?

Dan Mocsny
dmocsny@uceng.uc.edu

gregw@otc.otca.oz (Greg Wilkins) (01/13/90)

In article <5837@orca.wv.tek.com> andrew@frip.wv.tek.com writes:
>The SPARC ABI describes the virtual machine which is (will be?)
>implemented on all SPARC systems.  It gives the interface which an
>ABI-conformant application can depend on.  If a machine implements both
>the ABI and the new multiply instruction, programs which conform to the
>ABI (and therefore do not use multiply) will still execute correctly.
>This is all that matters.
>
>  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
>                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

This is not all that matters.  Let's pretend you have just bought a
brand new SPARCstation X, complete with integer multiply.  You now want to
buy some software for it: you have a choice of paying $300 for the ABI
version, which cannot use the multiply instruction (which is not part of
the ABI), but which is ready to run (albeit very slowly, as it is a multiply
intensive application).  OR you can pay $100,000 and buy the source code,
then study law for a while so you can understand the licence agreement,
then hire another system administrator (the existing one is too busy finding
disk space to install and compile applications for every Tom, Dick and
Harry).

Basically it has to be in the ABI or it might as well not be in the machine.
That is, if this ABI thing gets off the ground, which it will never do if
every so often a new instruction/co-processor is added.


-gregw@otc.oz

mash@mips.COM (John Mashey) (01/14/90)

In article <1253@otc.otca.oz> gregw@hls0.oz.au writes:

>Basically it has to be in the ABI or it might as well not be in the machine.
>That is, if this ABI thing gets off the ground, which it will never do if
>every so often a new instruction/co-processor is added.

Well, let us observe that:
	a) Most architecture families have added user-visible instructions
	over time, haven't they:
	S/360 -> S/370 -> etc
	68000->68010->68020
	8086->80286->80386->80486
	for example.
	b) For microprocessors, there are PLENTY of opportunities to use
	new things while still retaining backwards binary compatibility,
	like:
		use them in the kernel [50% of your cycles, often....]
		(as khb mentions) shared libraries
		kernel emulation of instructions
		translation [remember that people often translate RISC
		instructions on the fly, because it's especially easy]
	and finally:
	c) Many microprocessors are also used in embedded control, where
	ABI's are mostly irrelevant.
	d) Finally, some programs are ONLY compiled and run on local
	machines, and if somebody wants to compile with appropriate
	flags, so be it.  [After all, consider the history of floating
	point accelerators at Sun & elsewhere, and of Weitek chips that
	go in place of 80387s, etc., etc.]

Clearly, one wants to avoid gratuitous changes, and too many different
versions of things, but it is certainly a fact of life that:
	a) at any given time, you may not have enough silicon to do everything
	you'd like to.
	b) Successive generations will almost always have some differences
	relative to the kernel.
	c) And finally, as families evolve, the optimal code generation
	may well change from machine to machine anyway, even if they're
	binary compatible in both directions.  This effect has been seen for
	a long time on S/360.... and VAXen, and 386 vs. 486, etc., etc.

Anyway, let's observe that what's IMPORTANT is getting enough volume
on anything to make it interesting.... the fact that some 386s run MSDOS,
and some run UNIX, and some run OS/2 doesn't stop people from buying them.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

gregw@otc.otca.oz (Greg Wilkins) (01/15/90)

In article <KHB.90Jan11212327@chiba.kbierman@sun.com> you write:
>In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>   in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
>   > 	Now for some comments:
>   > 	(1) SPARCs will get multiply and divide.  This is from a guy at
>   > 	    Sun.  Coming soon to a SPARC station near you...
>
>No, it was a guy at TI.

Well I didn't know that!!!!!

>complying with the ABI). If you use the current Sun compilers, you get
>.mul, .div, .rem etc. .... and if you have a chip which had those as
>hw instructions you (as an end user, or software author, or what have
>you) could replace those library codes with ones which employed your
>nifty instruction(s). No ABI breakage.

Well, the compiler I use on a Sun 4/110 doesn't produce .mul, .div etc.,
but then it is running an ancient version of the operating system (3.2!!!).
But assuming that newer compilers will generate them, surely they are
expanded when the a.out file is generated, so they cannot be evaluated at
load time.  I guess they can be expanded to a call to a shared library
routine (I don't know how this mechanism works), so libC.a is not linked
in until run time.  But I don't know if shared libraries are included in
the ABI.  If they are not, then libC.a will be linked in before an application
is distributed, and thus local libC.a routines will be ignored.

Well, let's assume that via some mechanism multiplies are performed by an
undefined function that is linked in at load time.  Then the best you can do
is cop a function call and then a multiply instruction (possibly with moves to
and from a co-processor).  I guess this is not too bad, as function calls are
pretty fast on a SPARC anyway.

But what about multiplies by constants?  The compiler will have turned these
into wonderful sequences of shifts and adds.  If a mul instruction became
available, these should be replaced by a load with constant followed by a
multiply.  But then I guess nothing can solve this one (without greatly
slowing the performance of all machines without a multiply instruction).


>You jump to some amazing conclusions from a posting from _TI_ about
>Sun's plans and business practices.

I was not jumping to amazing conclusions, but simply inviting Sun to
clarify a point of obvious public concern.  Your posting has indicated that
maybe it is not such a big deal, but it still would be better to get an
official position from Sun on the future of SPARC and multiply.

It is obvious that
  1) SPARC multiply performance can be improved.
  2) There is a class of users who would greatly benefit from it.
  3) There is a larger class of users who probably won't notice it, but
     who would sure as hell feel nicer about their machines when reading
     comparative benchmarks.
  4) There are many rumours about how/when/if the multiply will be
     speeded up.
  5) Everybody would appreciate a definitive statement from Sun.


DISCLAIMER:  I do not know my employers opinion on these matters.

Greg Wilkins              ACSnet:gregw@otc.oz.au  igc nets:   igc:peg:gwilkins
 "To sin by silence when  Phone(w):  (02)2874862  Telex:           OTCAA120591
  they should  speak out  Phone(h):  (02)8104592  Snail:OTC Services R&D,
  makes  cowards of men"  Fax:       (02)2874990        GPO Box 7000,
           - Abe Lincoln  O/S ph: (prefix) 612 #        Sydney 2001, Australia

preston@titan.rice.edu (Preston Briggs) (01/16/90)

In article <1255@otc.otca.oz> gregw@otc.UUCP (Greg Wilkins) writes:

>But what about multiplies by constants?  The compiler will have turned these
>into wonderful sequences of shifts and adds.  If a mul instruction became
>available, these should be replaced by a load with constant followed by a
>multiply.  But then I guess nothing can solve this one (without greatly
>slowing the performance of all machines without a multiply instruction).

Converting I x C into a sequence of shifts, adds, and subtracts
is really a machine-dependent optimization -- it's not always
profitable.  In a short paper

	Multiplication by Integer Constants
	Robert Bernstein
	Software -- Practice and Experience, July 1986

Bernstein describes the method used by the PL.8 compiler to
handle 32x32 signed multiplies.  The cost of a multiply,
on an 801, using multiply-step instructions is 19 cycles.
The cost, using his scheme, is summarized below

	constant		# instructions
	---------------------------------------
	2			1
	3			2
	6			3
	11			4
	22			5
	43			6
	86			7
	171			8
	342			9
	683			10
	1366			11
	2731			12
	5462			13
	...

that is, multiplications by 2731 require 12 instructions.
Multiplications by C < 2731 require 11 or fewer instructions.
So, if you've got a multiplier that's quicker than 10 cycles,
you'd rather not convert some constants >= 683.

On the other hand, if you have lots of multiplies by
strange integer constants, converting them all
might expose opportunities for common subexpression
elimination.

Note that the above numbers are misleading for some machines.
If you don't have a barrel shifter or 3-address instructions or
adequate registers, the cost will be higher.
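
[As a worked example -- hand-derived C, not PL.8 output: the table's
6-instruction entry for 43 falls out of the factoring 43 = 2*21 + 1,
21 = 4*5 + 1, 5 = 4*1 + 1, i.e. three shift/add pairs.]

/* x * 43 in six operations (three shifts, three adds) */
unsigned mul_by_43(unsigned x)
{
    unsigned t = (x << 2) + x;           /* t = 5x  */
    t = (t << 2) + x;                    /* t = 21x */
    return (t << 1) + x;                 /* 43x     */
}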

Preston Briggs
preston@titan.rice.edu

khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/16/90)

In article <1253@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
....
   brand new SPARCstation X, complete with integer multiply.  You now want to
   buy some software for it: you have a choice of paying $300 for the ABI
   version, which cannot use the multiply instruction (which is not part of
   the ABI), but which is ready to run (albeit very slowly, as it is a multiply
   intensive application).
...

No. As I pointed out in an earlier posting, it is possible to add
instructions AND get the benefit delivered to ABI compliant users.

Consider the following gedanken experiment:

ACME HiTech Corp's VP of engineering (Wily E. Coyote) concludes that
some nifty mass market project of theirs (say HDTV/Workstation combo,
or part of a navigation system which gets its data 1 point at a time)
absolutely must have a general purpose CPU ... which can execute the
following function at full hw speed

	f(ix,ifac) returns  int(ifac*cos(x))
		            int(ifac*sin(x))
			    int(ifac*cos(-x))
			    int(ifac*sin(-x))

(viz. a mutant Givens transformation)

Clearly no merchant chip (well, aside from some nifty CORDIC chips)
has this now.

Coyote contacts the SPARC licensing board (or its real-life equivalent
:>) and negotiates an opcode (perhaps in supervisor space ... where it
doesn't affect anyone else) to be available only in their
chips/systems (I have no idea how this is done, but assume for money
it can be arranged).

Consider the following cases:

1)	ACME engineers build their software on Sun 4/60's which lack the
	instruction. 

2)	First spin of the chip can't fit the whole thing ... so they
	compute sincos(x) in hw.

3)	Second spin does it all .. but has a bug which requires that
	arg reduction be performed before the rest of the instruction.

4)	Third spin does it all ... correctly.

Quiz:

1) 	How many a.out's (to use SunOS 4.x lingo) are necessary ? 
	
2)	How many can be ABI compliant ?


Answers:

1)	Only 1 a.out is required .. and can benefit from the hw.
2)	It can be ABI compliant.
3)	Yes, being non-ABI compliant might improve performance.

The solution is via shared libraries. The a.out only knows that at
runtime there will be a routine called wily_givens.

At runtime the runtime loader links in the "right" shared library ...
where right means the one which matches the local hardware.

1 ABI compliant a.out runs on 4, and goes faster as the hw improves.

Best performance (probably 1-5% faster in this case) would be obtained
by generating the wily_givens instruction directly in the 4th case.
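
[A sketch of the mechanism in C -- every name here belongs to the
gedanken experiment above, and the real dispatch is done by the
run-time loader, not by anything in the application itself:]

/* The single ABI-compliant a.out only ever calls this entry point;
   each machine ships the libwily.so variant matching its hardware.
   Variant for plain hardware: a pure-software fallback. */
#include <math.h>

void wily_givens(double x, double ifac, int out[4])
{
    out[0] = (int)(ifac * cos(x));
    out[1] = (int)(ifac * sin(x));
    out[2] = (int)(ifac * cos(-x));
    out[3] = (int)(ifac * sin(-x));
}

/* The variant shipped with ACME hardware would implement the same
   entry point using the wily_givens instruction (e.g. via inline
   assembly); the run-time loader links in whichever library is
   installed locally, so the a.out never changes. */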

If it were to happen that SPARCs with wily_givens captured a huge
chunk of the market (say 90% for grins) perhaps the ABI would be
altered .... causing the 10% to trap to the OS for emulation. This
would only adversely impact the 10% of users who lacked the
hardware but employed codes rich in the instruction .... most of the
10% probably wouldn't notice or care.

Clearly integer multiply would be somewhat more popular than
wily_givens and is less computationally intense .... thus the
tradeoffs aren't so obvious. But the general case of how to add an
instruction, yet be ABI compliant, is easily handled.
--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

"There is NO defense against the attack of the KILLER MICROS!"
			Eugene Brooks

guy@auspex.auspex.com (Guy Harris) (01/16/90)

>But assuming that newer compilers will generate them, surely they are
>expanded when the a.out file is generated, so they cannot be evaluated at
>load time.

Newer compilers will presumably include a command-line flag instructing
them to either produce the multiply/divide instructions themselves or
calls to ".mul"/".div" and company.  And such calls are surely *not*
expanded when the "a.out" file is generated, unless you linked with
"-Bstatic" - shared libraries, remember?

>I guess they can be expanded to a call to a shared library routine (I
>don't know how this mechanism works), so libC.a is not linked in until
>run time,

Exactly.

>But I don't know if shared libraries are included in the ABI.

If the ABI is anything like the ones for which I've seen drafts, not
only are they included, they are *required* - i.e., the way you do a
"stat()" call in an ABI-conforming application is you make a call to the
dynamically-linked routine "stat()" in the appropriate library, passing
it certain arguments.  You don't shove specified stuff into registers
and execute trap # N.

The same could apply to ".mul"/".div", and probably *would* apply (the
drafts I saw hadn't gotten around to specifying those particular
routines yet; they came in the processor-specific part of the ABI).

(It would also apply to, e.g., "getpwnam()" and company, so ABI
implementations will pick up the local "getpwnam()"-and-company
implementation, whether it be a linear scan through "/etc/passwd", a
"dbm"-based implementation like 4.3BSD's, a Hesiod-based implementation,
an HPollo Registry-based implementation, a YP-based implementation,
etc.)

>Well, let's assume that via some mechanism multiplies are performed by an
>undefined function that is linked in at load time.  Then the best you can do
>is cop a function call and then a multiply instruction (possibly with moves to
>and from a co-processor).  I guess this is not too bad, as function calls are
>pretty fast on a SPARC anyway.

Yes, if you want an ABI-conforming program.

If speed, rather than shrink-wrap portability, is important, you could
build the program with whatever the "generate multiply/divide in line"
flag is. 

If both are important, either:

	1) portability to *all* SPARC-based machines *isn't* important
	   (e.g., an application that won't work fast enough if you
	   don't have multiply/divide instructions), in which case you
	   might be able to build your program with the "generate
	   multiply/divide in line" flag, and label it "this should run
	   on any ABI-conforming machine that *also* has the standard
	   multiply/divide instructions".

	2) portability to all SPARC-based machines is important, but so
	   is getting the extra performance from the instructions if
	   they're there, in which case you may build two versions.

These do somewhat conflict with the principle of the ABI, but life isn't
always perfect (are there PC applications that demand floating-point
hardware, or applications that come in multiple versions, one of which
does and one of which doesn't?).

>But what about multiplies by constants?  The compiler will have turned these
>into wonderful sequences of shifts and adds.  If a mul instruction became
>available, these should be replaced by a load with constant followed by a
>multiply.

Why?  Do you have hard evidence (not guesses) to suggest that a load
with constant followed by a multiply will be faster than the sequence of
shifts and adds?  Note that the Sun 68020 compiler generates sequences
of shifts and adds for multiplies by constants *even though the 68020
has a 32x32 multiply instruction*; I don't think this was done because
the compiler writers were too lazy to change the compiler, I think it
was done because it was *still* faster to do the shifts and adds.  I
can't speak for MIPS, but I wouldn't be surprised to hear that even
though they had a multiply instruction since Day 1 they still did shifts
and adds for multiplies by constants.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/16/90)

In article <2819@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:

| Newer compilers will presumably include a command-line flag instructing
| them to either produce the multiply/divide instructions themselves or
| calls to ".mul"/".div" and company.  And such calls are surely *not*
| expanded when the "a.out" file is generated, unless you linked with
| "-Bstatic" - shared libraries, remember?

  Has anyone measured the time taken to just generate the mpy and trap
it vs the time for a procedure call? We used to trap some instructions
on the old GE series 20 years ago, and the time to trap and decode
(table lookup for decode) was only a few % slower than a call, when the
total time to execute the "instruction" was taken into account.

  Would it be better to just generate the instruction all the time and
trap it, rather than use the various libraries? It would certainly give
better performance on the machines with the mpy hardware, and based on
the very slow times reported here might not be a notable loss on
standard ABI SPARC.

  Has anyone measured these numbers to get a ballpark figure? I don't
have a good feel for how long the partial context change would take on
the trap.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

mash@mips.COM (John Mashey) (01/17/90)

In article <2819@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
....
>the compiler writers were too lazy to change the compiler, I think it
>was done because it was *still* faster to do the shifts and adds.  I
>can't speak for MIPS, but I wouldn't be surprised to hear that even
>though they had a multiply instruction since Day 1 they still did shifts
>and adds for multiplies by constants.

Yes, of course: shifts, adds, and subtracts, up to the point where
it takes about as long as a multiply. (I think there's a bias to use
multiply if it's equal or close, since the code sequence is shorter,
and there may be code scheduling opportunities in the shadow of the
multiply, although not many.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

sjc@key.COM (Steve Correll) (01/17/90)

In article <2819@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
> ...I can't speak for MIPS, but I wouldn't be surprised to hear that even
> though they had a multiply instruction since Day 1 they still did shifts
> and adds for multiplies by constants.

Right. (As of 2 years ago, anyway) the MIPSCo compilers compare the number of
cycles for hardware multiply against the cycles for the necessary series of
shifts and adds, and emit the cheaper alternative. They also shift and mask
(adjusting for the sign bit as needed) in lieu of div and rem by powers of 2.
This happens in the non-ASCII back end of the assembler, which the compilers
employ to create the object file, so compilers and humans alike can code "mul"
and let the assembler figure out the tradeoff.
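
[The sign fixup in C, as a minimal sketch -- it assumes 32-bit ints
and an arithmetic right shift, neither of which the language
guarantees: a bare shift would round negative dividends toward minus
infinity, so a bias of 2^k - 1 is added first.]

/* x / 2^k for signed x, by shift and mask plus a sign adjustment */
int div_pow2(int x, int k)
{
    int bias = (x >> 31) & ((1 << k) - 1);   /* 2^k - 1 if x < 0, else 0 */
    return (x + bias) >> k;                  /* now truncates toward zero */
}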
-- 
...{sun,pyramid}!pacbell!key!sjc 				Steve Correll

whit@milton.acs.washington.edu (John Whitmore) (01/24/90)

In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
>In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>>Does any have, of know of software for the SPARC [SUN-4] that will
>>perform the following:
>>
>> [standard multiply and divide]
> . . .
>There should be instructions on the order of "multiply step" and "divide 
>step", each of which will do one of the 32 adds/subtracts and then shift.  
>
	I strongly disagree.  Smarter routines can optimize a lot of 
that kind of "microcode" away, and should do so given the opportunity.

>Thus they provide you with the tools to do your own multiply and divide.  
>One of the benefits is that a compiler can optimize small multiplies and 
>divides to make them execute quicker (i.e. multiply by 10 takes 4 steps 
>instead of 32).
	Yes, EXACTLY.  So extend the principle; take four-bit nibbles
of the argument and do a 16-way JUMP (whatever the equivalent is
on a SPARC) to sixteen cases like
CASE x0000:	RTN
CASE x0001:	add (to accumulator)
CASE x0010:	shift +1, add
CASE x0011:	subtract, shift+3, add
CASE x0100:	shift+3, add
CASE x0101:	add, shift+3, add
CASE x0110:	shift+2, add, shift+3, add
CASE x0111:	subtract, shift+4, add
	and if I can figure it out, you experts are getting bored
by now.  Four operations MAX for a four-bit multiplicand, as opposed
to 12 operations (estimated) for the one-bit "MULSTEP" approach.
>
>P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
>      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
>      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.

	Hm.  The principle is good, but don't think small.  Unroll it
to really large chunks of code.  The "BNZ" is a bottleneck that
shouldn't be employed when really large fanout of the code path
can be done in ONE step.
	I seem to recall that the trick (above) was employed by a
hardware multiplier IBM made, some decade or more ago.  It still works.
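
[A rough C rendering of the nibble dispatch -- the switch stands in
for the 16-way jump, most cases are elided, and the function name is
invented; only the low 32 bits of the product are kept:]

#include <stdint.h>

uint32_t mul_nibbles(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    for (int i = 0; i < 32; i += 4) {        /* one nibble of b per trip */
        uint32_t n = (b >> i) & 0xF;
        uint32_t part;
        switch (n) {
        case 0x0: part = 0;            break;
        case 0x1: part = a;            break;
        case 0x2: part = a << 1;       break;
        case 0x3: part = (a << 1) + a; break;
        case 0xF: part = (a << 4) - a; break;   /* 15a = 16a - a */
        /* ... cases 0x4 through 0xE follow the same pattern ... */
        default:  part = n * a;        break;   /* stand-in for brevity */
        }
        acc += part << i;
    }
    return acc;
}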

I am known for my brilliance,                 John Whitmore
 by those who do not know me well.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/90)

  Actually memory is cheap enough to use 64k for a lookup table. Make a
16 bit address from the two bytes to multiply as
	 0      7 8     15
	|________|________|
	|  byte1 |  byte2 |
	|________|________|
and just pull the answer out of memory.
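
[The same idea in C -- note the bookkeeping: 64K 16-bit products is
actually 128K bytes of table, though the scheme works either way; all
names are invented.]

#include <stdint.h>

static uint16_t products[1 << 16];    /* one entry per (byte1, byte2) pair */

/* Build the table once; these multiplies happen only at startup. */
void init_products(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            products[(a << 8) | b] = (uint16_t)(a * b);
}

/* 8x8 -> 16 multiply by a single lookup, no multiply instruction. */
unsigned mul8x8(unsigned a, unsigned b)
{
    return products[((a & 0xFF) << 8) | (b & 0xFF)];
}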

  Actually, the Intel practice of treating a 32-bit register as lots of
little registers, with the 8- and 16-bit portions addressable by name,
would be handy for some of this stuff, eliminating a shift. It should
still be quite fast.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me