bs@linus.UUCP (Robert D. Silverman) (12/23/89)
Does anyone have, or know of, software for the SPARC [SUN-4] that will
perform the following:

(1) Multiply two unsigned 32-bit integers, yielding a 64-bit product
    stored in two registers?
(2) Take a 64-bit product and divide it by a 32-bit (unsigned) integer,
    yielding a 32-bit quotient and remainder? or
(3) Compute A*B Mod C directly?

The SPARC is brain dead [as were its designers] when it comes to doing
integer arithmetic. It can't multiply and it can't divide. Trying to
convert to floating point, do the arithmetic, then convert back is far
too slow. (I've already tried.)
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
dgr@hpfcso.HP.COM (Dave Roberts) (12/27/89)
>The SPARC is brain dead [as were its designers] when it comes to doing
>integer arithmetic. It can't multiply and it can't divide.
>--
>Bob Silverman
>#include <std.disclaimer>
>Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
>Mitre Corporation, Bedford, MA 01730
>----------

Geeze Bob,
	The thing is a SPARC. It's a RISC machine. Integer mult and
divide are the first things to go when you design a RISC. There should
be some funky instructions to help you out, like "shift and add" for
multiplication. Trust me, you're better off (in speed, that is) for
not having those functions, and I'll bet that you can write a routine
that can do them just about as fast as they could internally.
	I don't really know much about SPARCs but I know that the
designers at Sun weren't "brain dead".

Note that I am in no way endorsing the purchase of any SPARC system.
Quite the contrary, of course, competition being what it is.  :-)

Dave Roberts
Hewlett-Packard Co.
dgr@hpfcla.hp.com
dgr%hpfcla@hplabs.hp.com
davidc@vlsisj.VLSI.COM (David Chapman) (12/27/89)
In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>Does anyone have, or know of, software for the SPARC [SUN-4] that will
>perform the following:
>
> [standard multiply and divide]
>
>The SPARC is brain dead [as were its designers] when it comes to doing
>integer arithmetic. It can't multiply and it can't divide.

There should be instructions on the order of "multiply step" and "divide
step", each of which will do one of the 32 adds/subtracts and then
shift.  I'm not particularly fond of the SPARC architecture (don't like
register windows), but this is a theoretical viewpoint and is not based
on any direct exposure to assembly-language programming for it
(translation: sorry, I can't give you any more help).

Neither SPARC nor its designers were brain-dead when it was built.  It's
just that it is difficult to get multiplication and division (especially
the latter) to run in 1 or 2 clock cycles.  All instructions are
supposed to execute in the ALU in 1 cycle; if the multiply and divide
instructions take more time, then the front of the processor pipeline
has to be able to stall, and this added complexity will slow down the
entire processor.

Thus they provide you with the tools to do your own multiply and divide.
One of the benefits is that a compiler can optimize small multiplies and
divides to make them execute quicker (e.g. multiply by 10 takes 4 steps
instead of 32).

It is important that you understand this if you are to write assembly
language programs for a SPARC.  If your instructions are not carefully
optimized, the result could be slower than if you write in a high-level
language and compile with its optimizer!  (Unless the SPARC assembler
performs instruction reordering.)

P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will
     be incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP,
     MULSTEP, MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.
--
David Chapman
{known world}!decwrl!vlsisj!fndry!davidc
vlsisj!fndry!davidc@decwrl.dec.com
cik@l.cc.purdue.edu (Herman Rubin) (12/27/89)
In article <8840004@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) writes:
> >The SPARC is brain dead [as were its designers] when it comes to doing
> >integer arithmetic. It can't multiply and it can't divide.
> >--
> >Bob Silverman
> >#include <std.disclaimer>
> >Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
> >Mitre Corporation, Bedford, MA 01730
> >----------
>
> Geeze Bob,
> 	The thing is a SPARC. It's a RISC machine. Integer mult and
> divide are the first things to go when you design a RISC. There should
> be some funky instructions to help you out, like "shift and add" for
> multiplication. Trust me, you're better off (in speed, that is) for
                  ^^^^^^^^
> not having those functions, and I'll bet that you can write a routine
> that can do them just about as fast as they could internally.
> 	I don't really know much about SPARCs but I know that the designers
> at Sun weren't "brain dead".

It is clear that you are not to be trusted (see above).  To multiply
two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
the 32 bit numbers must be divided into 16 bit parts.  The whole
operation takes about 20 operations (count them).  Shift and add are
far slower.  Divide is even worse.  Also, there is considerable
overhead in a subroutine call; there are registers to save and restore.
Open subroutines (in-line functions) are a way around it, but they
still have the problem.

I am sure that Bob Silverman knows how to write efficient subroutines.
He has to use them anyhow, as he is multiplying and dividing numbers of
several hundred bits.  But even if less is wanted, good integer
arithmetic is needed.  If more precision than is designed for is wanted
in floating operations, integer arithmetic must be used.  There are
also many other kinds of operations cheap in hardware and expensive in
software.  RISC machines may be good for the types of operations the
designers anticipated, but it is difficult to do much about the ones
left out.

The CRAYs can be considered RISC vector machines, and the vector
operations omitted are extremely difficult to get around.  The above
instruction count for double precision was derived from the CRAY.

We even have a chicken-and-egg problem.  Any fairly good programmer
designs the program to take into account the capabilities of the
machine.  I know that the gurus claim that this should not be so, but
it is not unusual for me to think of modifications or even totally new
ways of doing things which the compiler cannot, unless those specific
ways are put into the compiler.  If a machine does not have hardware
square roots, one avoids square roots, as there are usually faster
ways.

One thing which might help is if there were a mailing list to discuss
these ideas, and to collect the numerous operations efficient in
hardware and expensive in software.  Those who know me will agree that
I am not the person to run this.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
bs@linus.UUCP (Robert D. Silverman) (12/27/89)
In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
I wrote:
:>The SPARC is brain dead [as were its designers] when it comes to doing
:>integer arithmetic. It can't multiply and it can't divide.
:
:Geeze Bob,
: The thing is a SPARC. It's a RISC machine. Integer mult and
:divide are the first things to go when you design a RISC. There should
Huh? A computer that can't perform basic arithmetic? [add, subtract,
multiply, divide]. What have you been smoking? Integer mult/divide
should NOT be the first things to go. [look at the MIPS chip, a far
superior design in my opinion]. Integer mult/divide may not be much
used for most applications, but when they are required, not having
hardware support is an absolute KILLER of algorithms. I have code that
runs **slower** on the newest SPARCstation [SUN-4/330] than it does on
a SUN-3/60, ONLY BECAUSE it can't multiply and divide. This is
despite the fact that the SPARC is purported to be at least 10x faster.
The RISC designers who put together SPARC may have been looking at the
'broad picture', but they sure as hell didn't know a lot about
computer arithmetic or algorithms.
:be some funky instructions to help you out, like "shift and add" for
:multiplication. Trust me, you're better off (in speed, that is) for
:not having those functions, and I'll bet that you can write a routine
I AM NOT BETTER OFF, and don't speak for me. The fact that you say
'trust me ....', indicates that you don't know diddly when it comes
to the implementation and design of numerical and semi-numerical
algorithms.
Shift and add to perform general multiplication is hopelessly
slow because of the overhead/bookkeeping/control mechanisms involved.
:that can do them just about as fast as they could internally.
Huh? The MIPS-R3000 does integer multiplies in hardware in just a
couple of cycles. The SPARC takes a minimum of 47 cycles using
its so-called multiply-step function to multiply two integers.
Division is even worse by almost an order of magnitude.
It never ceases to amaze me how RISC zealots immediately jump
up anytime anyone should [GOD forbid!] criticize a RISC instruction
set. Especially when those zealots know very little about numerical
algorithms.
:I don't really know much about SPARCs but I know that the designers
Yet, you are willing to say 'trust me I know better' in a knee-jerk
response to my posting.
:at Sun weren't "brain dead".
I didn't say they were. I said that they were with respect to their
knowledge of how arithmetic should be done. The fact that they
perceived integer mult/divide as unimportant is indicative that they
were unaware how important they are for applications that require
them. On the other hand, the SPARC may have been designed ONLY for
the majority of programs that don't need mult/divide. If so, its
design was driven by MARKETING decisions, not technical ones.
Criticism of the chip on a theoretical basis is more than justified.
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
hui@joplin.mpr.ca (Michael Hui) (12/28/89)
For a frame of reference, T.I.'s DSP chip TMS320C25 does a 16x16
multiply in a single cycle.  AMD's Am29000 uses multiple instructions
to accomplish a general 32x32 multiply.  I say general because, quoting
from 7.1.6 of the User's Manual (c)1987:

  "It may be beneficial to precede a full multiply procedure with a
  routine to discover whether or not the number of multiply steps may
  be reduced."

In other words, the compiler is given more room for optimization here
when faced with two arguments of different precision.

Michael Hui   604-985-4214   hui@joplin.mpr.ca
reha@cbnewsi.ATT.COM (reha.gur) (12/28/89)
In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> It is clear that you are not to be trusted (see above).  To multiply
> two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> the 32 bit numbers must be divided into 16 bit parts.  The whole operation
> takes about 20 operations (count them).  Shift and add are far slower.
> Divide is even worse.  Also, there is considerable overhead in a
> subroutine call; there are registers to save and restore.  Open
> subroutines (in-line functions) are a way around it, but they still
> have the problem.
>
> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

The numbers I get (from looking at the data sheets and other info) for
two machines, a 25MHz i486 and a 25MHz SPARC, are as below.

Assuming no cache hits and various other items:

i486:  18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
SPARC: 48-52 cycles for same (including subroutine call and return time)

i486:  32 cycles for signed 32 bit division (acc by reg)
SPARC: 41 (approximate best case) to 211 (approximate worst case)
       (depends on bits in dividend and divisor)

The numbers above are approximate and results may vary.  The SPARC
subroutine call does not need any registers saved across the call.  The
code for multiplication is as given in the SPARC architecture manual.
Also note that some SPARC machines do have (or might have) integer
multiply and divide in hardware.

reha gur
attunix!reha
krueger@tms390.Micro.TI.COM (Steve Krueger) (12/28/89)
OK, enough already.  I'll tell.

While no SPARC microprocessors to date implement them, there are
integer multiply instructions which have been specified for the SPARC
architecture.  Some (possibly many) future implementations will find it
advantageous to implement them.  They are intended to produce results
in very few cycles on the "fastest" implementations.  Current
implementations, and those that don't support these instructions, will
trap and presumably emulate the operation.  The multiply step
instruction will remain as well.

Briefly, a little more detail: multiply is 32x32 -> 64.  The low-order
portion of the result goes into the destination register of the
instruction and the high-order part goes into the Y register (the same
one used by the multiply step instruction).  The instruction comes in
signed and unsigned forms, and each may set the condition codes or not.
(All of that is standard for arithmetic instructions in SPARC.)  There
is a similar set of divide instructions that are pretty much the
inverse of the multiply instructions.

In article <1979@eric.mpr.ca>, hui@joplin.mpr.ca (Michael Hui) writes:
>
> For a frame of reference, T.I.'s DSP chip TMS320C25 does a 16x16
> multiply in a single cycle.

I need to comment on this since I know a little about these chips.  The
fastest TMS320C25 is available with an 80ns cycle time, and the 'C25
can do multiply-and-accumulate at that rate.  The 32x32 -> 64 integer
multiply that is desired for 32-bit processors is 4 times as complex as
the 16x16 -> 32 in the 'C25.  The DSPs allocate a lot of their chip
area to multiply.  GP micros have other priorities, and a full
32x32 -> 64 multiplier array can take *serious* Si area.  So you can
expect that DSPs will have better performance/price for multiply than
GP micros.  All of that said, SPARC needs integer multiply instructions
and so it will get them.

BTW, there are several TI parts with higher integer multiply
performance than the 'C25.  I don't mean this as a commercial, just for
informational purposes.  The TMS320C50, another integer DSP chip, was
announced last year.  The 'C50 does an integer multiply in 50ns.  A TI
floating point datapath chip, the SN74ACT8847, can also perform integer
multiply.  A 32x32 -> 64 integer multiply takes just 30ns on the 33MHz
version of that chip.  It does pretty well on floating point stuff too
:-) but it's not a processor.

Steve Krueger              krueger@micro.ti.com
SPARC Applications
Texas Instruments, Houston, Texas   (713) 274-2479
** Modulo any actual facts found above, these are my thoughts alone. **
rec@dg.dg.com (Robert Cousins) (12/28/89)
In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
>In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
>:be some funky instructions to help you out, like "shift and add" for
>:multiplication. Trust me, you're better off (in speed, that is) for
>:not having those functions, and I'll bet that you can write a routine
>
>I AM NOT BETTER OFF, and don't speak for me. The fact that you say
>'trust me ....', indicates that you don't know diddly when it comes
>to the implementation and design of numerical and semi-numerical
>algorithms.
>Shift and add to perform general multiplication is hopelessly
>slow because of the overhead/bookkeeping/control mechanisms involved.
>:that can do them just about as fast as they could internally.
>It never ceases to amaze me how RISC zealots immediately jump
>up anytime anyone should [GOD forbid!] criticize a RISC instruction
>set. Especially when those zealots know very little about numerical
>algorithms.
>Bob Silverman

Actually, this debate is now closing in on 45 years old.  If you will
simply go pull a copy of the original ENIAC papers by von Neumann and
Co. you will find a long and drawn-out discussion of the pros and cons
of adding a number of instructions to the basic architecture:
multiplication, division, square root, etc.  The conclusions were based
upon a then totally different set of criteria, where gates were made of
vacuum tubes and microcode had not been invented yet.  Given the
constraints of the time, both multiply and divide were considered quite
justifiable.  However, it is interesting to note that von Neumann
didn't believe in floating point.  He only believed in signed fixed
point.

Furthermore, in his justifications for future projects, he used the
multiplication time of various machines as a first-order approximation
of the performance of the machines, since the jobs which were then
commonly run on the machines were numerical in nature and tended to
scale on various machines more or less proportionally to the multiply
speed.

This is not to say that multiply and divide are REQUIRED; it is just to
say that many of these considerations are not new and should perhaps be
viewed in a more historical context.  After all, we know that
technology has changed massively in the past 45 years.  By discussing
the implications of the changes, it is reasonable to come up with some
criteria under which these changes are and are not justified.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
beyer@cbnewsh.ATT.COM (jean-david.beyer) (12/28/89)
In article <1535@cbnewsi.ATT.COM>, reha@cbnewsi.ATT.COM (reha.gur) writes:
> In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> > To multiply
> > two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine,
> > the 32 bit numbers must be divided into 16 bit parts. The whole operation
> > takes about 20 operations (count them). Shift and add are far slower.
>
> The numbers I get (from looking at the data sheets and other info) for
> two machines, a 25MHz i486 and a 25MHz SPARC, are as below:
>
> Assuming no cache hits and various other items:
>
> i486:  18-31 cycles for signed 32 x 32 bit multiplication (reg to reg)
> SPARC: 48-52 cycles for same (including subroutine call and return time)
>
> i486:  32 cycles for signed 32 bit division (acc by reg)
> SPARC: 41 (approximate best case) to 211 (approximate worst case)
>        (depends on bits in dividend and divisor)

I do not have the numbers for SPARC handy, but would tend to trust
reha.  However, having spent some time working on optimizers that work
at the assembly level, I notice that for the benchmarks we use (the
result of running our C compilers, anyway), most integer multiplies
seem to occur when dealing with the subscripts of arrays or some kinds
of pointer dereferencing (those pointing to structures).  In these
cases, the multiply instructions are mostly multiplications by
constants.  These constants frequently have a small number of 1's.
Consequently, instead of calling a general-purpose multiply subroutine,
it suffices to insert special-purpose inline code that multiplies by
the desired constant.  (It might even pay to do a strength-reduction
optimization.)  This can be much faster than the general-purpose
multiply subroutine, and may be faster than a hardware multiply
instruction, depending on the design of the hardware multiply.
--
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer
mash@mips.COM (John Mashey) (12/29/89)
In article <6908@cbnewsh.ATT.COM> beyer@cbnewsh.ATT.COM (jean-david.beyer) writes:
>pointer dereferencing (those pointing to structures). In these cases,
>the multiply instructions are mostly multiplications by constants. These
>constants frequently have a small number of 1's. Consequently, instead of
>calling a general purpose multiply subroutine, it suffices to insert
>special purpose inline code that multiplies by the desired constant.
>(it might even pay to do a strength reduction optimization). This can be
>much faster than the general purpose multiply subroutine, and may be faster
>than a hardware multiply instruction, depending on the design of the
>hardware multiply.

Many compilers do this.  For sure, HP PA, MIPS, and SPARC, but I
suspect most of the other current RISCs as well, and I imagine many
CISCs.  It's helpful to have the assembler offer a multiply pseudo-op
that turns into the best sequence of shifts/adds/subs (or whatever,
such as the combined ops that HP PA has), up to the point where just
issuing a multiply is better [i.e., on MIPS or 88K, which have H/W
multiplies].

At least in our case, multiply/divide was put in even knowing how much
of the structure-reference multiplies was going to go away, because
there were enough benchmarks where muls & divs wouldn't go away.  As
usual, it depends on the benchmark set you use as to which way you jump
on this, and which implementation technologies you've got, etc., and
thoughtful people have recently picked both ways, meaning it's at least
a yet-to-be-resolved argument :-)
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mslater@cup.portal.com (Michael Z Slater) (12/29/89)
With regard to SPARC's lack of integer multiply/divide: I suspect this
is primarily a consequence not of the designers' attitude toward
arithmetic, but of the goal to implement the first CPU chip in a
20,000-gate ASIC.  They had to be economical with the transistors.

I expect that future SPARC implementations will add integer multiply
and divide instructions.  Unfortunately, existing software won't use
these instructions.

Michael Slater, Microprocessor Report   mslater@cup.portal.com
cprice@mips.COM (Charlie Price) (12/29/89)
> In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
> >In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
> >
> >Huh? The MIPS-R3000 does integer multiplies in hardware in just a
> >couple of cycles. The SPARC takes a minimum of 47 cycles using
> >its so-called multiply-step function to multiply two integers.
> >Division is even worse by almost an order of magnitude.

The MIPS R2000, R3000, and R6000 do indeed have multiply and divide,
but your idea of the time they take is not quite right.

proc    mult  div   (times in cycles to put result in special regs)
R2000    12    35
R3000    12    35

I don't believe that the R6000 numbers are yet public, but I'm willing
to say that they aren't better, in cycle counts, than the R3000.

The operations work the following way:

Multiply:  Rs x Rt -> LO, HI
    multiplies two general-purpose (32-bit) registers and produces a
    64-bit result that is put into a pair of special-purpose registers
    named LO and HI.

Divide:    Rs / Rt -> LO, HI
    divides one 32-bit general register by another, producing a 32-bit
    quotient (in LO) and a 32-bit remainder (in HI).

The LO and HI register pair are used only for the results of multiply
and divide.  Special instructions exist to move data from (and to) each
of these registers.

The multiply/divide unit operates as an autonomous unit.  After a
multiply/divide is issued, other non-multiply instructions continue to
issue and execute.  At some point the program wants the result of the
operation, presumably after everything that can be done in parallel is
done, and you issue a move-from-LO (or HI) instruction to move the
result into a general register.  If the operation is not yet done, the
instruction interlocks and the processor is stalled until the result is
available.  The times above are the time needed to produce the result.

Depending on what you mean by "how fast is X", you might want to
include the instruction(s) necessary to get the result from LO (and/or
HI) and put it (or them) in general register(s).  This would add 1
cycle for one 32-bit result register, or 2 cycles if you wanted the
contents of both LO and HI.

These operation times aren't especially fast.  Being able to execute
the op in parallel with "regular" instructions makes the effect of the
operation length a little less, though how much that helps depends a
lot on what you can do in parallel with any given multiply or divide.
Finding 35 real instructions to occupy your time during a divide seems
somewhat unlikely.

One can make these operations faster by spending more hardware on them.
The existing implementations have not chosen to (or been able to) spend
a lot of die area on these functions.  For the current MIPS processors
it is generally worth figuring out whether you can get the
multiply/divide job done with some short sequence of faster
instructions rather than use the general instruction.

As John Mashey has mentioned, multiply and divide were included in the
MIPS instruction set because reasonably fast general multiply and
divide, though unnecessary for many applications in our architecture
benchmark suite, are very important for some of them.
--
Charlie Price    cprice@mips.mips.com   (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086
roy@phri.nyu.edu (Roy Smith) (12/29/89)
Just to be obnoxious (and thought-provoking), postulate a hardware
implementation in which a 32x32->64 bit integer multiply is faster than
a 32+32->32+carry bit integer addition.  Estimate the change in coding
habits and compiler technology caused by this new technology.  Use
additional pages if necessary.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy
"My karma ran over my dogma"
johnl@esegue.segue.boston.ma.us (John R. Levine) (12/29/89)
In article <245@dg.dg.com> uunet!dg!rec (Robert Cousins) writes:
>Actually, this debate is now closing in on 45 years old. If you will simply
>go pull a copy of the original ENIAC papers by von Neumann and Co. you will
>find a long and drawn out discussion of the pros and cons of adding a number
>of instructions to the basic architecture: multiplication ...
>Given the constraints of the time, both multiply and divide were considered
>quite justifiable. However, it is interesting to note that von Neumann didn't
>believe in floating point. He only believed in signed fixed point.

Well, yes and no.  When von Neumann went back to Princeton to build his
own computer, the IAS machine, he used some surprisingly modern-
sounding arguments ("make a small set of operations as fast as
possible") to decide not to include a hardware multiplier.  I'll dig up
the citation; it's quite interesting.

Remember that the ENIAC was nearly impossible to program, and if they
had had to implement software multiplication, they'd have never gotten
anything done at all.  It was quite common to spend three days setting
up a program that would take a few seconds to run.  The IAS machine had
a modern stored-program architecture, so a multiplication subroutine
was no big deal.
--
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."
cik@l.cc.purdue.edu (Herman Rubin) (12/29/89)
In article <6908@cbnewsh.ATT.COM>, beyer@cbnewsh.ATT.COM (jean-david.beyer) writes:
> However, having spent some time working on optimizers that work at the
> assembly level for machines, I notice that, for the benchmarks we use,
> anyway, the result of running our C compilers, most integer multiplies
> seem to be when dealing with the subscripts of arrays or some kinds of
> pointer dereferencing (those pointing to structures). In these cases,
> the multiply instructions are mostly multiplications by constants. These
> constants frequently have a small number of 1's.

The idea that special multiplications should be inlined is nothing new.
The Fortran compiler on the CDC 6500 expanded multiplication and
exponentiation by small integers.  Whether this should be done by the
hardware for variable integers with few 1's is not clear.

But multiplication is used for other purposes, and here the loss is
huge.  Also, if operations are pipelined, as on many machines, the
benefits from inlining are reduced.  The costs of using a multiply
subroutine are quite high compared to using even a complicated
instruction.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
bs@linus.UUCP (Robert D. Silverman) (12/29/89)
In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
:In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
:>Does anyone have, or know of, software for the SPARC [SUN-4] that will
:>perform the following:
:>
:> [standard multiply and divide]
:>
:>The SPARC is brain dead [as were its designers] when it comes to doing
:>integer arithmetic. It can't multiply and it can't divide.
:
:There should be instructions on the order of "multiply step" and "divide
:step", each of which will do one of the 32 adds/subtracts and then shift.

There is a multiply step instruction.  There is no such support for
division.  It can take 200+ cycles to do a division on the SPARC [worst
case].  A 32 x 32 bit unsigned multiply takes 45-47 cycles.  Programs
that have a significant number of multiplies and divides can run SLOWER
on a SPARC than on a SUN-3 [I have such!], ONLY because of the slow
multiply/divides.

:I'm not particularly fond of the SPARC architecture (don't like register
:windows), but this is a theoretical viewpoint and is not based on any
:direct exposure to assembly-language programming for it (translation:
:sorry, I can't give you any more help).
:
:Neither SPARC nor its designers were brain-dead when it was built. It's just

I didn't say they were.  I said they were with respect to arithmetic.
I stand by that assertion.  Most programs may not need multiply/divide
in hardware.  However, for those that do require it, not having it is a
real KILLER of algorithms.

:that it is difficult to get multiplication and division (especially the
:latter) to run in 1 or 2 clock cycles. All instructions are supposed to

I know of quite a few DSP chips that do multiplies in 1 cycle.  Divides
take a little longer [but not much; Ernie Brickell of SANDIA invented a
hardware divide that works much faster than the standard conditional
shift/subtract].

:execute in the ALU in 1 cycle; if the multiply and divide instructions take
:more time then the front of the processor pipeline has to be able to stall
:and this added complexity will slow down the entire processor.
:
:Thus they provide you with the tools to do your own multiply and divide.

See above.  They are too slow.

:One of the benefits is that a compiler can optimize small multiplies and
:divides to make them execute quicker (i.e. multiply by 10 takes 4 steps

That's fine for multiply-by-constant.  Most programs that NEED
multiply/divide are multiplying variables.

:P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
:     incredibly slow. Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
:     MULSTEP, MULSTEP, SUB 4, BNZ). Branches are expensive.

Agreed.  In fact my 32 x 32 bit multiply consists of 32 calls to
multstep and no looping at all.  It is still slow.
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
bs@linus.UUCP (Robert D. Silverman) (12/29/89)
In article <34000@mips.mips.COM> cprice@mips.COM (Charlie Price) writes: :> In article <84983@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes: :> >In article <8840004@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes: :> > :> >Huh? The MIPS-R3000 does integer multiplies in hardware in just a :> >couple of cycles. The SPARC takes a minimum of 47 cycles using :> >its so-called multiply-step function to multiply two integers. :> >Division is even worse by almost an order of magnitude. : :The MIPS R2000, R3000, and R6000 do indeed have multiply and divide, :but your idea of the time they take is not quite right. : :proc mult div (times in cycles to put result in special regs) :R2000 12 35 :R3000 12 35 : The MIPS takes 4 times fewer cycles to do a multiply than the SPARC. It also yields a 64 bit product. Division takes 6 times fewer cycles AND ALSO gives the remainder [this is a big plus]. A 64 bit by 32 bit divide on the SPARC can take 500-600 cycles in the worst case. The SUN-3 does a 32x32 multiply in 41 cycles and a 64 by 32 bit divide [with remainder] in 76. The MIPS only does a 32 by 32 bit divide. 64 bit by 32 bit takes extra coding. :I don't believe that the R6000 numbers are yet public, :but I'm willing to say that they aren't better, :in cycle counts, than the R3000. : :The operations work the following way: : Yes, I have a MIPS architecture book. I have benchmarked some multiple precision integer arithmetic code on an R-2000 and on a SUN-4/280. The R2000 system [a DECstation] was 2x faster. Note that this wasn't even the R3000. The time difference is mostly due to integer multiply/divides. -- Bob Silverman #include <std.disclaimer> Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs Mitre Corporation, Bedford, MA 01730
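[Moderator's note: the "extra coding" for a 64-bit by 32-bit divide with remainder is a restoring shift-and-subtract loop, which also shows why the worst case is so slow on a machine with no divide step: one test-and-subtract per quotient bit. A hedged C sketch — the name and exact form are invented for illustration, not taken from any vendor's library.]

```c
#include <stdint.h>

/* Restoring shift-and-subtract division: 64-bit dividend (hi:lo) by a
   32-bit divisor, yielding a 32-bit quotient and remainder.  The caller
   must ensure hi < div, otherwise the quotient overflows 32 bits. */
uint32_t divu64by32(uint32_t hi, uint32_t lo, uint32_t div, uint32_t *rem)
{
    uint32_t q = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t carry = hi >> 31;      /* bit shifted out of the top */
        hi = (hi << 1) | (lo >> 31);    /* shift the remainder left... */
        lo <<= 1;                       /* ...bringing in the next bit */
        q <<= 1;
        if (carry || hi >= div) {       /* 33-bit compare via the carry */
            hi -= div;                  /* wraps correctly mod 2^32 */
            q |= 1;
        }
    }
    *rem = hi;
    return q;
}
```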
beyer@cbnewsh.ATT.COM (jean-david.beyer) (12/29/89)
As I recall, the Naval Ordnance Research Calculator (NORC) had a completely combinatorial multiplier: it could do a 10 digit by 10 digit multiply in the same number of cycles as an add. My recollection is that the multiplier box took a refrigerator-size box. The rest of the cpu took another refrigerator-size box. And that was in vacuum tube days. So multiply instructions need take no longer than the total number of gate delays to make such a thing. The main design question remains: for the kind of work to be done with the processor, is that the best place to put the complexity and delays and chip area? One must know the work to be done; i.e., the benchmarks to be run (and whether they are really representative of the work to be run). -- Jean-David Beyer AT&T Bell Laboratories Holmdel, New Jersey, 07733 attunix!beyer
tim@nucleus.amd.com (Tim Olson) (12/29/89)
In article <1535@cbnewsi.ATT.COM> reha@cbnewsi.ATT.COM (reha.gur) writes: | Also note that some SPARC machines do have (or might have) integer mul and | divide in hardware. If they do, then they are not Instruction Set compatible. My copy of the SPARC architecture manual lists only a MULScc (Multiply Step and modify icc), and no divide or divide step instructions. Even if the hardware is there, it is hard to use without an instruction to specify the exact semantics of the operation! -- Tim Olson Advanced Micro Devices (tim@amd.com)
slackey@bbn.com (Stan Lackey) (12/30/89)
In article <1804@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: bunch of stuff deleted ... >I know that the gurus claim that this should not be so, but it is not >unusual for me to think of modifications or even totally new ways of >doing things which the compiler cannot unless those specific ways are >put into the compiler. If a machine does not have hardware square roots, >one avoids square roots, as there are usually faster ways. >One thing which might help is if there were a mailing list to discuss these >ideas, and to collect the numerous operations efficient in hardware and >expensive in software. Those who know me will agree that I am not the >person to run this. >Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 I agree. I think that due to (at least) the worsening ratio of CPU speeds to memory speeds, there will be changes in architecture trends in the next generation or two. I would like to be the maintainer of the proposed mailing list. -Stan
andrew@dtg.nsc.com (Lord Snooty @ The Giant Poisoned Electric Head ) (12/30/89)
I am reading this thread with interest, and would like to see some realistic numbers on SPARC performance. How does it perform on, say, a large CAD package compared with the Sun-3? - does anyone have any benchmark results? -- ........................................................................... Andrew Palfreyman a wet bird never flies at night time sucks andrew@dtg.nsc.com there are always two sides to a broken window
mark@mips.COM (Mark G. Johnson) (12/30/89)
In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes: > >I am reading this thread with interest, and would like to see some realistic >numbers on SPARC performance. How does it perform on, say, a large CAD >package compared with the Sun-3? - does anyone have any benchmark results? > One large CAD package is "SPICE2G6" from U.C. Berkeley. It's a circuit simulator. Here are measurements of a Sun3 and a Sun4, both running SPICE2G6, on three different input files. The Sun3 model (3/260) is among the fastest Sun3's ever made, while the Sun4 (4/260) is a medium-performer, employing the 16 MHz Fujitsu gate array SPARC. I apologize for having no measured data to present on the non gate array (custom) CMOS SPARCs, nor BIT's SPARC built out of ECL, nor Prisma's SPARC in Gallium Arsenide. ********************* SPICE 2G.6 USER CPU TIMES *********************** Dataset (input circuit) Sun 3/260 Sun 4/260 Sun4 faster by? =========================================================================== digsr 361.2 sec 218.3 sec 1.65 X bipole 112.0 sec 61.1 sec 1.83 X comparator 129.4 sec 66.0 sec 1.96 X Other large CAD packages have been measured for the Sun4 but not, to my knowledge, the Sun3. For example, the SPEC benchmark suite includes "Espresso" (a logic equation minimizer and PLA generator). SPEC measurements of the SPARCstation_1 have been published, but not the Sun3. On Espresso the SPARCstation_1 is 8.9X faster than a VAX 11-780, making it *roughly* 3X faster than a Sun3/260 for Espresso. (However this is a guess in the absence of data). -- -- Mark Johnson MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 (408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}
khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (12/30/89)
In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
One large CAD package is "SPICE2G6" from U.C. Berkeley. It's a circuit
simulator. Here are measurements of a Sun3 and a Sun4, both running
SPICE2G6, ...
Using the SPEC versions of SPICE2G6 and ESPRESSO
Sun3/470 (top of the line sun3 cpu) 4761s 461s
Sun4/490 (top of the line sun4 cpu) 1486s 151s
Which is a factor of 3x.
Those with a serious interest in such things should examine the
results of commercial codes (which are often much faster, and have
been tuned for various platforms). The folks at HSPICE, for example,
would probably be happy to chat with you about your CAD needs.
--
Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"
"There is NO defense against the attack of the KILLER MICROS!"
Eugene Brooks
bs@linus.UUCP (Robert D. Silverman) (12/30/89)
In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: :In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes: : >I am reading this thread with interest, and would like to see some realistic : >numbers on SPARC performance. How does it perform on, say, a large CAD : >package compared with the Sun-3? - does anyone have any benchmark results? : > :One large CAD package is "SPICE2G6" from U.C. Berkeley. It's a circuit :simulator. Here are measurements of a Sun3 and a Sun4, both running I may be wrong but..... So far as I know [and I've seen the code] SPICE is not a heavy user of integer multiply/divides. It does use a fair amount of floating point. Therefore, trying to get a measure of integer multiply/divide performance using SPICE is grasping at straws at best. -- Bob Silverman #include <std.disclaimer> Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs Mitre Corporation, Bedford, MA 01730
roy@phri.nyu.edu (Roy Smith) (12/31/89)
I've only been loosely following this thread for a while, so excuse me if this has been gone through already. I assume that SPARC based machines have (at least as an option) some sort of floating point coprocessor, which obviously would be used for floating point multiplies. Most (handwave) non-computational integer multiplies (by which I mean operations invented by the compiler for array subscripting, pointer arithmetic, etc) involve one constant factor with few ones, ideal for shift-and-add strength reductions. Non-power-of-2 integer division is even rarer. The bottom line is that I'm hard pressed to think of an application where full 32 x 32 integer multiplies constitute a significant fraction of the operations performed. Most large utility programs (operating systems, compilers, editors, windowing systems (except maybe NeWS?)) don't do many 32x32 multiplies and most scientific number crunchers are all floating point. In fact, we have one large number cruncher that does macromolecule energy minimization using fixed point (integral numbers of tenths of kcals/mole) but that's all addition, subtraction, and table lookups. What's left? Digital signal processing maybe? Slightly off the subject, is it possible, using the current SPARC architecture, to take advantage of the floating point data paths to speed up integer multiplies for those few applications that really need it? -- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,philabs,cmcl2,rutgers,hombre}!phri!roy "My karma ran over my dogma"
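[Moderator's note: Smith's point about constant factors "with few ones" can be made concrete in a few lines of C. The function names below are invented for illustration; this is what a strength-reducing compiler emits, not library code.]

```c
#include <stdint.h>

/* What strength reduction does with a constant factor:
   x * 10 = x * (8 + 2) becomes two shifts and an add. */
uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);    /* 8*x + 2*x */
}

/* The general scheme: one shift-and-add per set bit in the constant,
   so constants with few ones cost only a handful of cycles. */
uint32_t mul_by_const(uint32_t x, uint32_t c)
{
    uint32_t r = 0;
    while (c) {
        if (c & 1)
            r += x;                /* add the shifted multiplicand */
        x <<= 1;
        c >>= 1;
    }
    return r;
}
```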
irf@kuling.UUCP (Bo Thide') (12/31/89)
In order to put the discussion on SPARC arithmetics slowness into some perspective, we ran Plum Hall benchmark on one of our CISC and two of our RISC machines. The machines we used were: HP9000/370 MC68030/33 MHz with and without floating point accelerator (fpa) HP9000/835 HP-PA RISC/15 MHz Sun SparcStation1 (SS1) SPARC RISC/20 MHz ================================================================================ register auto auto int function auto int short long multiply call+ret double HP9000/370 (fpa -O) 0.22 0.26 0.22 1.35 3.96 0.62 HP9000/370 (-O) 0.21 0.26 0.22 1.35 3.08 1.21 HP9000/370 (fpa no -O) 0.26 0.40 0.36 1.44 4.42 1.56 HP9000/370 (no -O) 0.26 0.40 0.37 1.45 3.38 2.72 HP9000/835 (-O) 0.27 0.29 0.27 5.49 0.31 0.27 HP9000/835 (no -O) 0.29 0.53 0.45 5.62 0.31 0.59 Sun SS1 (-O) 0.29 0.33 0.30 19.5 0.49 0.59 Sun SS1 (no -O) 0.38 0.40 0.35 19.7 0.51 0.72 ================================================================================ There is no question that Sun SparcStation 1 is *extremely* slow on integer multiply even for a RISC architecture -- scaling the HP-PA results to the same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster than SPARC!! Our results also show that integer arithmetics on CISC (MC68030) is much faster than on RISC (HP-PA, SPARC). For those of you not familiar with the Plum-Hall benchmark here is some info: ---------------------------------------------------------------------------- [The following article appeared in "C Users Journal" May 1988. It describes the purpose and use of the enclosed benchmarks. ] SIMPLE BENCHMARKS FOR C COMPILERS by Thomas Plum Dr.Plum is the author of several books on C, including Efficient C (co- authored with Jim Brodie). He is Vice-Chair of the ANSI X3J11 Committee, and Chairman of Plum Hall Inc, which offers introductory and advanced sem- inars on C. 
Copyright (c) 1988, Plum Hall Inc We are placing into the public domain some simple benchmarks with several appealing properties: They are short enough to type while browsing at trade shows. They are protected against overly-aggressive compiler optimizations. They reflect empirically-observed operator frequencies in C programs. They give a C programmer information directly relevant to programming. In Efficient C, Jim Brodie and I described how useful it can be for a pro- grammer to have a general idea of how many microseconds it takes to execute the "average operator" on register int's, on auto short's, on auto long's, and on double data, as well as the time for an integer multiply, and the time to call-and-return from a function. These six numbers allow a programmer to make very good first-order estimates of the CPU time that a particular algorithm will take.
seanf@sco.COM (Sean Fagan) (12/31/89)
In article <85138@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes: >:There should be instructions on the order of "multiply step" and "divide >:step", each of which will do one of the 32 adds/subtracts and then shift. > >There is a multiply step instruction. There is no such support for division. >It can take 200+ cycles to do a division on the SPARC [worst case]. >A 32 x 32 bit unsigned multiply takes 45-47 cycles. Oh, posh. A CDC Cyber 170/760 can do a 60x60 -> 60 multiply (fp) in 5 cycles, worst case. Divide is atrocious, being the only instruction slower than a load (a load is 26 cycles, but I forget exactly how long a divide takes). However, proper instruction ordering means you don't have to worry too much about how long it takes. (Multiply is pipelined [does two 30-bit multiplies at the same time], but divide isn't. Sequential algorithm, if I remember correctly. Apparently Seymour *hates* divides 8-).) Just had to plug Seymour 8-). -- Sean Eric Fagan | "Time has little to do with infinity and jelly donuts." seanf@sco.COM | -- Thomas Magnum (Tom Selleck), _Magnum, P.I._ (408) 458-1422 | Any opinions expressed are my own, not my employers'.
raob@mullian.ee.mu.oz.au (Richard Oxbrow) (12/31/89)
In article <34019@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: :In article <435@berlioz.nsc.com> andrew@dtg.nsc.com (Andrew Palfreyman) writes: : >I am reading this thread with interest, and would like to see some realistic : >numbers on SPARC performance. How does it perform on, say, a large CAD : >package compared with the Sun-3? - does anyone have any benchmark results? : > :One large CAD package is "SPICE2G6" from U.C. Berkeley. It's a circuit :simulator. Here are measurements of a Sun3 and a Sun4, both running Spice2G/3C1 is mainly a floating-point program, while espresso is a logic minimization program. It should also be kept in mind that for one reason or another the attached Floating Point Units on the sun3 series are always slower than their sun4 cousins. If I remember correctly the latest FPU+ on a sun3/470 is only rated at 1.2 MFlops while the sun 4/390 (FPU2+?) is rated at something like 2.2 MFlops peak! (I don't think Sun wants to see its sun3 series giving the sun4s a hard time) richard .. Richard Oxbrow |ACSnet raob@mullian.oz dept. of ee eng ,uni of melbourne |Internet raob@mullian.ee.mu.OZ.AU parkville 3052 australia |Arpa-relay raob%mullian.oz@uunet.uu.net fax +[061][03]344 6678 |Uunet uunet!munnari!mullian!raob
dtynan@altos86.Altos.COM (Dermot Tynan) (01/01/90)
In article <KHB.89Dec29150110@chiba.kbierman@sun.com>, khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) writes: | | Using the SPEC versions of SPICE2G6 and EXPRESSO | | Sun3/470 (top of the line sun3 cpu) 4761s 461s | Sun4/490 (top of the line sun4 cpu) 1486s 151s | | Which is a factor of 3x. | | Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com Wait a minute though. Doesn't SPICE use mostly floating-point for its simulations (I don't have a copy, so I don't know)?? If so, it's rather foolish to benchmark Integer M/D, with a floating-point package... - Der -- dtynan@altos86.Altos.COM (408) 946-6700 x4237 Dermot Tynan, Altos Computer Systems, San Jose, CA 95134 "Far and few, far and few, are the lands where the Jumblies live..."
ok@quintus.UUCP (Richard A. O'Keefe) (01/01/90)
In article <1989Dec30.161619.22893@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes: > Most (handwave) >non-computational integer multiplies (by which I mean operations invented by >the compiler for array subscripting, pointer arithmetic, etc) involve one >constant factor with few ones, ideal for shift-and-add strength reductions. Roy Smith isn't the only one to say this recently in comp.arch. What I'm afraid of here is the coupling I see between RISC design and crippled programming languages. Smith's claim about array subscripting is true in some pathetically crippled languages where array bounds are completely known at compile time (C and Pascal, and some derivatives of Pascal). It wasn't true in Algol 60 or Simula 67, and it isn't true in Fortran 77. (The reason that the multiplications in Fortran 77 tend to involve "hard" multiplication is that good compilers already use strength reduction to remove multiplication entirely when they can.) Fortran 8X and Ada also allow arrays with dynamic bounds. Actually, there _is_ a way that addressing polynomials could be handled efficiently on machines which have slow multiplication, and that is for a dynamic array declaration like real array a[1:P,1:Q,1:R]; to generate a local "addressing subroutine" at _run_ time a_poly := lambda(i,j,k) ((check(1,P,i)*Q+check(1,Q,j))*R+ check(1,R,k))*sizeof real + &a_data; using whatever multiply-step sequence was best. But that, of course, requires a cheap way of generating code on the run. And there are some RISC systems that make _that_ hard too. The remaining alternative is to precompute the offsets in some way, so that a[i,j,k] turns into a_data[QR_times[i] + R_times[j] + k] which is all very well, but think about Fortran 8X cross-sections... (This article is not a claim that fast integer multiplication and division is necessary to support Fortran and Ada well. I haven't the figures to prove this. )
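[Moderator's note: the addressing polynomial O'Keefe refers to, written out in C for a row-major a[1:P,1:Q,1:R]. Bounds checking is omitted and the function name is invented; the point is the two general multiplies per access when Q and R are only known at run time.]

```c
#include <stddef.h>

/* Offset of a[i,j,k] in a dynamically-bounded array a[1:P,1:Q,1:R],
   row-major, in units of elements (indices are 1-based as in the
   Algol-style declaration).  Two full multiplies per access when
   Q and R are run-time values -- which is why slow integer multiply
   hurts languages with dynamic array bounds. */
size_t addr_poly(size_t i, size_t j, size_t k, size_t Q, size_t R)
{
    return ((i - 1) * Q + (j - 1)) * R + (k - 1);
}
```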
mcdonald@aries.scs.uiuc.edu (Doug McDonald) (01/01/90)
In article <1296@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes: >In article <1989Dec30.161619.22893@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes: >> Most (handwave) >>non-computational integer multiplies (by which I mean operations invented by >>the compiler for array subscripting, pointer arithmetic, etc) involve one >>constant factor with few ones, ideal for shift-and-add strength reductions. > >Roy Smith isn't the only one to say this recently in comp.arch. >What I'm afraid of here is the coupling I see between RISC design and >crippled programming languages. Smith's claim about array subscripting >is true in some pathetically crippled languages where array bounds are >completely known at compile time (C and Pascal, and some derivatives of >Pascal). It wasn't true in Algol 60 or Simula 67, and it isn't true in >Fortran 77. >Actually, there _is_ a way that addressing polynomials could be handled >efficiently on machines which have slow multiplication, and that is for >a dynamic array declaration like > real array a[1:P,1:Q,1:R]; >to generate a local "addressing subroutine" at _run_ time >using whatever multiply-step sequence was best. But that, of course, >requires a cheap way of generating code on the run. And there are some >RISC systems that make _that_ hard too. > These are two interesting and important points. Especially the latter one: some machine architectures may NEED the ability to use self-modifying code to get good speed. ********This should be the subject of an entire new flame-thread. I have found that architectures (or OS's) that don't allow EASY self-modifying code (or for most of my stuff, incremental compilation) are sufficiently brain-dead to be unusable for many purposes.***** Doug McDonald (mcdonald@aries.scs.uiuc.edu)
tot@frend.fi (Teemu Torma) (01/02/90)
In article <1313@kuling.UUCP> irf@kuling.UUCP (Bo Thide') writes:
================================================================================
register auto auto int function auto
int short long multiply call+ret double
HP9000/370 (fpa -O) 0.22 0.26 0.22 1.35 3.96 0.62
HP9000/370 (-O) 0.21 0.26 0.22 1.35 3.08 1.21
HP9000/370 (fpa no -O) 0.26 0.40 0.36 1.44 4.42 1.56
HP9000/370 (no -O) 0.26 0.40 0.37 1.45 3.38 2.72
HP9000/835 (-O) 0.27 0.29 0.27 5.49 0.31 0.27
HP9000/835 (no -O) 0.29 0.53 0.45 5.62 0.31 0.59
Sun SS1 (-O) 0.29 0.33 0.30 19.5 0.49 0.59
Sun SS1 (no -O) 0.38 0.40 0.35 19.7 0.51 0.72
================================================================================
There is no question that Sun SparcStation 1 is *extremely* slow on integer
multiply even for a RISC architecture -- scaling the HP-PA results to the
same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster
than SPARC!! Our results also show that integer arithmetics on CISC (MC68030)
is much faster than on RISC (HP-PA, SPARC).
Strange. I got much better int multiply results from Sun SS1. Gcc
version was 1.36.
register auto auto int function auto
int short long multiply call+ret double
Sun 4/60 (cc, no -O) 0.38 0.40 0.36 3.52 0.28 0.72
Sun 4/60 (cc, -O) 0.32 0.35 0.32 3.62 0.28 0.62
Sun 4/60 (cc, -O2) 0.30 0.33 0.30 3.32 0.26 0.61
Sun 4/60 (cc, -O3) 0.30 0.33 0.30 3.45 0.28 0.63
Sun 4/60 (gcc, no -O) 0.21 0.38 0.38 3.50 0.27 0.77
Sun 4/60 (gcc, -O) 0.18 0.21 0.18 3.57 0.27 0.42
Sun 4/60 (gcc, <2>) 0.17 0.22 0.19 3.67 0.28 0.42
Sun 4/60 (gcc, <3>) 0.17 0.21 0.18 3.52 0.27 0.40
gcc <2>: -fstrength-reduce -fcombine-regs -fomit-frame-pointer
gcc <3>: as <2> including -mno-epilogue
--
Teemu Torma
Front End Oy, Helsinki, Finland
Internet: tot@nyse.frend.fi
vestal@SRC.Honeywell.COM (Steve Vestal) (01/03/90)
In article <1296@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes: > Fortran 8X and Ada also allow arrays with dynamic bounds. > [...] > real array a[1:P,1:Q,1:R]; I believe that induction variable optimizations, which replace multiplications by additions for certain array address computations within loops, work just fine even with dynamic bounds. My limited experience evaluating such optimizations has been that they are surprisingly widely applicable for matrix/vector manipulation code. Personally, I suspect explicit multiplies/divides/mods/rems in the application code might be the hazard to watch out for rather than integer multiplications used in addressing computations. Steve Vestal Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 Phone: (612) 782-7049 Internet: vestal@src.honeywell.com
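[Moderator's note: the induction-variable optimization Vestal describes, shown in C. Both functions below are invented illustrations: the second is what a compiler derives from the first, replacing the per-row multiply with a running offset, and it works even when n is only known at run time.]

```c
/* Naive form: one integer multiply in the subscript per element. */
double sum_naive(const double *a, int n, int rows)
{
    double s = 0.0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            s += a[i * n + j];      /* multiply in the address */
    return s;
}

/* After induction-variable strength reduction: the multiply is gone,
   replaced by an addition per row (base always equals i * n). */
double sum_reduced(const double *a, int n, int rows)
{
    double s = 0.0;
    int base = 0;
    for (int i = 0; i < rows; i++, base += n)
        for (int j = 0; j < n; j++)
            s += a[base + j];
    return s;
}
```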
khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/03/90)
In article <135@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
.....
Wait a minute though. Doesn't SPICE use mostly floating-point for its
simulations (I don't have a copy, so I don't know)?? If so, it's rather
foolish to benchmark Integer M/D, with a floating-point package...
I didn't choose the code, another poster did. As it happens, the
circuit selected by the SPEC group causes SPICE to spend its time in a
very different fashion than one usually expects (most of the
commercial SPICEs have lavished great attention on fast solution of
systems of equations (viz. linpack problem space stuff)) and that
isn't true for this data set. Had it been true, the speed differential
would be much more favorable for SPARC.
--
Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"
"There is NO defense against the attack of the KILLER MICROS!"
Eugene Brooks
alan@oz.nm.paradyne.com (Alan Lovejoy) (01/05/90)
In article <1535@cbnewsi.ATT.COM> reha@cbnewsi.ATT.COM (reha.gur) writes: <In article <1804@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes: < <> It is clear that you are not to be trusted (see above). To multiply <> two 32 bit numbers to get a 64 bit product on a 32x32 -> 32 machine, <> the 32 bit numbers must be divided into 16 bit parts. The whole operation <> takes about 20 operations (count them). Shift and add are far slower. <> Divide is even worse. Also, there is considerable overhead in a <> subroutine call; there are registers to save and restore. Open <> subroutines (in-line functions) are a way around it, but they still <> have the problem. <> <> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 <> Phone: (317)494-6054 <> hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP) < <The numbers I get (from looking at the data sheets and other info) for <two machines: a 25 MHz i486 and a 25 MHz SPARC are as below: < <Assuming no cache hits and various other items: -- did you mean "no cache misses"? <i486: 18-31 cycles for signed 32 x 32 bit multiplication (reg to reg) <SPARC: 48-52 cycles for same (including subroutine call and return time) < <i486: 32 cycles for signed 32 bit division (acc by reg) <SPARC: 41 (approximate best case) to 211 (approximate worst case) < (depends on bits in dividend and divisor) < Sounds like an interesting contest to me! 
Here's my try (for multiply, anyway) using the mc88k instruction set: (NOTE: r0 is the constant 0 (a hardware protocol); jsr sets r1 with the return addrss) ;dmulu -- multiply unsigned 32x32=<64 ;r2 and r3 contain two 32-bit unsigned integers to be multiplied ;r12 will contain the high word (32 bits) of the product (r2 * r3) ;r13 will contain the low word (32 bits) of the product (r2 * r3) .proc dmulu dmulu: ;divide the two 32-bit numbers into 4 16-bit parts: extu r4, r2, 16<16<; r4 = r2 >> 16; (1 cycle) extu r5, r3, 16<16<; r5 = r3 >> 16; (1 cycle) mask r2, r2, $FFFF; r2 = r2 & 0xFFFF (1 cycle) mask r3, r3, $FFFF; r3 = r3 & 0xFFFF (1 cycle) ;calculate partial products: mul r6, r3, r3; r6 = r2 * r3; (4 cycles!) mul r7, r3, r4; r7 = r3 * r4; (4 cycles!) mul r8, r2, r5; r8 = r2 * r5; (4 cycles!) mul r9, r4, r5; r9 = r4 * r5; (4 cycles!) ;sum partial products: extu r10, r6, 16<16<; r10 = r6 >> 16; (1 cycle) addu r7, r7, r10; r7 = r7 + r10; (1 cycle) addu.co r10, r7, r8; r10 = r7 + r8; generate carry bit (1 cycle) addu.ci r12, r0, r0; r12 = carry from previous addu (1 cycle) mak r12, r12, 16<16<; r12 = r12 << 16; (1 cycle) addu r12, r12, r9; r12 = r12 + r9; (1 cycle) mak r10, r10, 16<16<; r10 = r10 << 16; (1 cycle) mask r13, r6, $FFFF; r13 = r6 & 0xFFFF; (1 cycle) jmp.n r1; return to caller after next instruction (1cycle) or r13, r13, r10; r13 = r13 | r10; (1 cycle) ; done: 30 cycles total (without short circuit code) .end ; invoking dmulu: ; it is the caller's responsibility to save registers r2-13, which ; the caller may or may not need to do... 
or r25, r0, #dmuluLo16; r25 = low 16 bits of dmulu address (1 cycle) or.u r25, r25, #dmuluHi16; r25 = r25 | high 16 bits of dmulu address (1 c) ld r2, r30, #factor1; r2 = *(framePtr + offset of factor1) (1 cycle) jsr.n r25; call dmulu after next instruction (1 cycle) ld r3, r30, #factor2; r3 = *(framePtr + offset of factor2) (1 cycle) st.d r12, r30, #product; *(frametPtr + offset of product) = r12, r13 ; register to register: 1 cycle call excluding execution time for dmulu ; memory to memory: 6 cycle call excluding execution time for dmulu ; GRAND TOTAL register to register: 31 cycles ; GRAND TOTAL memory to memory: 36 cycles ____"Congress shall have the power to prohibit speech offensive to Congress"____ Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL. Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me. Mottos: << Many are cold, but few are frozen. >> << Frigido, ergo sum. >>
alan@oz.nm.paradyne.com (Alan Lovejoy) (01/05/90)
Excuuuuuuuuuuse me! Three errors, one very slight, two not so slight! In article <6903@pdn.paradyne.com> alan@oz.paradyne.com (Alan Lovejoy) writes: >Sounds like an interesting contest to me! Here's my try (for multiply, >anyway) using the mc88k instruction set: > >(NOTE: r0 is the constant 0 (a hardware protocol); jsr sets r1 with the >return addrss) ;dmulu -- multiply unsigned 32x32=<64 ;r2 and r3 contain two 32-bit unsigned integers to be multiplied ;r12 will contain the high word (32 bits) of the product (r2 * r3) ;r13 will contain the low word (32 bits) of the product (r2 * r3) .proc dmulu dmulu: ;divide the two 32-bit numbers into 4 16-bit parts: extu r4, r2, 16<16>; r4 = r2 >> 16; (1 cycle) *** errror #1: The original posting consistently had "16<16<" instead of "16<16>". The wonders of global replace... (The very slight error). extu r5, r3, 16<16>; r5 = r3 >> 16; (1 cycle) mask r2, r2, $FFFF; r2 = r2 & 0xFFFF (1 cycle) mask r3, r3, $FFFF; r3 = r3 & 0xFFFF (1 cycle) ;calculate partial products: >mul r6, r3, r3; r6 = r2 * r3; (4 cycles!) *** error #2: One of the r3's in the above instruction must be an r2 (as the comment shows). >mul r7, r3, r4; r7 = r3 * r4; (4 cycles!) >mul r8, r2, r5; r8 = r2 * r5; (4 cycles!) >mul r9, r4, r5; r9 = r4 * r5; (4 cycles!) *** error #3: I completely forgot that the mul instruction is fully pipelined! It takes 4 cycles to complete, yes, but a new instruction can be issued on the very next cycle. Upto six mul and/or fmul instructions can be in the pipeline at one time. 
So, (after some reordering to avoid stalls) we have instead: ;calculate partial products: mul r6, r2, r3; r6 = r2 * r3; (1 cycle) mul r8, r2, r5; r8 = r2 * r5; (1 cycle) mul r7, r3, r4; r7 = r3 * r4; (1 cycle) mul r9, r4, r5; r9 = r4 * r5; (1 cycle) ;sum partial products: extu r10, r6, 16<16>; r10 = r6 >> 16; (1 cycle) addu r8, r8, r10; r8 = r8 + r10; (1 cycle) *** note that r7 changed to r8 to avoid a stall addu.co r10, r7, r8; r10 = r7 + r8; generate carry bit (1 cycle) addu.ci r12, r0, r0; r12 = carry from previous addu (1 cycle) mak r12, r12, 16<16>; r12 = r12 << 16; (1 cycle) addu r12, r12, r9; r12 = r12 + r9; (1 cycle) mak r10, r10, 16<16>; r10 = r10 << 16; (1 cycle) mask r13, r6, $FFFF; r13 = r6 & 0xFFFF; (1 cycle) jmp.n r1; return to caller after next instruction (1cycle) or r13, r13, r10; r13 = r13 | r10; (1 cycle) ; done: 18 cycles total (without short circuit code) .end ; invoking dmulu: ; it is the caller's responsibility to save registers r2-13, which ; the caller may or may not need to do... or r25, r0, #dmuluLo16; r25 = low 16 bits of dmulu address (1 cycle) or.u r25, r25, #dmuluHi16; r25 = r25 | high 16 bits of dmulu address (1 c) ld r2, r30, #factor1; r2 = *(framePtr + offset of factor1) (1 cycle) jsr.n r25; call dmulu after next instruction (1 cycle) ld r3, r30, #factor2; r3 = *(framePtr + offset of factor2) (1 cycle) st.d r12, r30, #product; *(frametPtr + offset of product) = r12, r13 ; register to register: 1 cycle call excluding execution time for dmulu ; memory to memory: 6 cycle call excluding execution time for dmulu ; GRAND TOTAL register to register: 19 cycles (was 30) ; GRAND TOTAL memory to memory: 24 cycles (was 36) *** A significant performance improvement of 33 per cent. ____"Congress shall have the power to prohibit speech offensive to Congress"____ Alan Lovejoy; alan@pdn; 813-530-2211; AT&T Paradyne: 8550 Ulmerton, Largo, FL. Disclaimer: I do not speak for AT&T Paradyne. They do not speak for me. 
Mottos: << Many are cold, but few are frozen. >> << Frigido, ergo sum. >>
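[Moderator's note: the corrected 88k routine above can be cross-checked against a C model of the same 16-bit partial-product scheme. Names are invented; the explicit carry variable plays the role of the addu.co/addu.ci pair.]

```c
#include <stdint.h>

/* Split each 32-bit operand into 16-bit halves, form the four partial
   products, and fold in the single possible carry from summing the
   middle pair, exactly as the 88k code does. */
void dmulu_model(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_hi = a >> 16, a_lo = a & 0xFFFF;
    uint32_t b_hi = b >> 16, b_lo = b & 0xFFFF;

    uint32_t p0 = a_lo * b_lo;        /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;        /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;        /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;        /* contributes to bits 32..63 */

    uint32_t mid   = p1 + (p0 >> 16); /* cannot overflow 32 bits */
    uint32_t mid2  = mid + p2;        /* may carry out... */
    uint32_t carry = (mid2 < mid);    /* ...captured like addu.co/addu.ci */

    *lo = (mid2 << 16) | (p0 & 0xFFFF);
    *hi = p3 + (mid2 >> 16) + (carry << 16);
}
```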
dgr@hpfcso.HP.COM (Dave Roberts) (01/05/90)
Sorry Guys and Gals, I didn't intend to start a pounce-on-RISC thread. When I answered Bob's question it was intended to show him that there was a way to do multiply and divide. Now for some comments: (1) SPARCs will get multiply and divide. This is from a guy at Sun. Coming soon to a SPARC station near you... (2) By suggesting that Bob was "much better off" (unclear on my part) I didn't mean to suggest that he was going to get stellar integer performance all the time. Rather, in general, his whole program should run faster. I guess it didn't, but then again I didn't look at the code. (3) As some have pointed out, the reason for removing those instructions from a RISC architecture is because *most* programs don't do a whole lot of multiplications between arbitrary 32 bit integers. Usually it is an arbitrary integer and a known (though not necessarily small) integer constant. With the known constant you can reduce the mult to a known sequence of shift and adds, which a good compiler will do (in fact, many CISC machines would run faster if the compilers would do this for them also instead of just inserting a XX cycle multiply instruction). (4) If you need the speed, you write the code inline. Loops kill you in whatever architecture you use. If you do huge numbers of arbitrary 32x32 mults, your code will explode, but hey, this is a RISC machine and your code size is already through the roof, right? If you call a subroutine every time you want to do a multiply the overhead of the call will kill you. But notice that this wasn't what I suggested, either. (5) The original point was that most programs don't need the kind of integer numerical performance that, I guess, Bob's does, and in general the shift and adds (for computing things like array indices and so forth) are just fine. It's a (semi)pathological case in the whole universe of computer programs. 
    As a user who doesn't generate programs like that, I'd rather all
    the other instructions be sped up a bit by allowing higher clock
    speeds, etc. And most users don't generate or use programs like
    that.

(6) If you really need the blazing integer speed, buy a coprocessor.
    That is also one of the fundamental RISC ideas. There are times
    when things just aren't done well by software and do need hardware
    help. This option also allows you to get *really, really* fast
    integer speed by using a multiplier array (it works by generating
    all the product terms at once and then adding the whole shebang
    together. It's fast as hell but it uses tons of chip area. Perfect
    for a coprocessor). Someone else pointed this out a few postings
    back (in the DSP entries, I think). Sure it costs more, but I'd
    rather save the cost when I don't need it. (Remember that floating
    point is also a coprocessor. Only naivety would hold that integer
    operations can't be also.)

Dave Roberts
Hewlett-Packard Co.
dgr@hpfcla.hp.com
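[Point (3) above -- reducing a multiply by a known constant to shifts and adds -- can be sketched in C. This is a hypothetical illustration, not actual compiler output; on a SPARC the compiler would emit the equivalent shift and add instructions inline.]

```c
#include <stdint.h>

/* Multiply by the known constant 10 without a multiply instruction:
 * x*10 == x*8 + x*2 == (x << 3) + (x << 1), i.e. two shifts and one
 * add.  This is the kind of strength reduction a good compiler
 * performs when one operand of the multiply is a compile-time
 * constant. */
static uint32_t mul10(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

The same idea generalizes to any constant, which is why the earlier posting claimed a multiply by 10 takes 4 steps instead of 32.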
tim@nucleus.amd.com (Tim Olson) (01/05/90)
In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
| (4) If you need the speed, you write the code inline. Loops kill
|     you in whatever architecture you use. If you do huge numbers
|     of arbitrary 32x32 mults, you're code will explode, but hey,
|     this is a RISC machine and your code size is already through
|     the roof, right? If you call a subroutine everytime you want
|     to do a multiply the overhead of the call will kill you. But
|     notice that this wasn't what I suggested, either.

Inlining code for performing multiplies is an option, but the call overhead isn't going to "kill you" -- it would probably be less than 10% -- especially if these kinds of routines used a special calling sequence, known to the compiler, that avoids the overhead of a standard call.
--
Tim Olson
Advanced Micro Devices
(tim@amd.com)
bs@linus.UUCP (Robert D. Silverman) (01/09/90)
In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
stuff deleted..
: (6) If you really need the blazing integer speed, buy a coprocessor.
: That is also one of the fundemental RISC ideas. There are times
: when things just aren't done well by software and do need hardware
: help. This option also allows you to get *really, really* fast
I'd love to. The only problem is: how do I integrate (say) an integer DSP
chip into my Sun so that it will act as a co-processor? Who modifies the
compilers to use it, etc., etc.? I could rewrite the compilers, but I don't
have the time. Do you know of a commercially available integer coprocessor
that I can plug into a workstation? I don't.
: integer speed by using a multiplier array (works by generating
: all the product terms all at once and then adding the whole
etc.
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
gregw@otc.otca.oz (Greg Wilkins) (01/11/90)
in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
> Now for some comments:
> (1) SPARCs will get multiply and divide. This is from a guy at
>     Sun. Coming soon to a SPARC station near you...

OK, Sun Microsystems! We know you're out there listening. How about an official comment on this? What about the ABI (application binary interface) standard you guys are pushing, signing up all those clone manufacturers, etc.? So you're going to spring a multiply instruction on them and make it all outdated (when compared to the Sun 5?)?

Even if it is not a new instruction but an integer co-processor, it still needs to be included in the ABI.

Without an official yea or nay on this one, nobody can have confidence in Sun's commitment to open systems and the ABI.

DISCLAIMER: I do not know my employer's opinion on these matters.

Greg Wilkins   ACSnet: gregw@otc.oz.au   igc nets: igc:peg:gwilkins
Phone(w): (02)2874862   Telex: OTCAA120591
Phone(h): (02)8104592   Snail: OTC Services R&D, GPO Box 7000, Sydney 2001, Australia
Fax: (02)2874990        O/S ph: (prefix) 612 #
"To sin by silence when they should speak out makes cowards of men" - Abe Lincoln
dgr@hpfcso.HP.COM (Dave Roberts) (01/11/90)
Bob Silverman writes:
>
>In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
>
>stuff deleted..
>
>: (6) If you really need the blazing integer speed, buy a coprocessor.
>:     That is also one of the fundemental RISC ideas. There are times
>:     when things just aren't done well by software and do need hardware
>:     help. This option also allows you to get *really, really* fast
>
>I'd love to. The only problem is: How do I integrate (say) an integer DSP
>chip into my SUN so it will act as a co-processor. Who modifies the compilers
>to use it, etc. etc... I could rewrite the compilers but I don't have the
>time. Do you know of a commercially available integer coprocessor that I
>can plug into a workstation? I don't.
>
>: integer speed by using a multiplier array (works by generating
>: all the product terms all at once and then adding the whole
>
>etc.
>
>--
>Bob Silverman
>#include <std.disclaimer>
>Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
>Mitre Corporation, Bedford, MA 01730

Sorry Bob, I don't have all the answers. I guess you do. I was just trying to propose some solutions. I guess you're perfectly right. The SPARC is a joke. Any machine that doesn't have integer multiply and divide is crippled for life. Whoops, sorry: any machine that doesn't meet your performance requirements in any way is crippled for life, and all the people who designed it are sick in the head. You're right, Bob. How could I have been so stupid? I guess all those people who are building chips without those functions you seem to require are just silly, never mind that they meet the needs of thousands of people. Honestly, what was I thinkin', huh?

I just thought that since you seem to have bought the wrong machine (in your opinion), I'd try and help you make it work. I'm sorry I even suggested you doing any extra work.
I guess I should have included another option:

(7) Sell the damn thing since you seem to hate it so much and buy
    something else. Of course, this time try running some of your
    code on it first so you'll know that it performs up to your
    standards.

Sorry people. No smiley faces on this one. I'm just too tired of trying to deal with a bunch of whiners who'll never be satisfied with anything I suggest. People, RISC is not the cat's meow to everybody. If you're one of those for whom it doesn't work, feel free to try something else, but don't tell me that it doesn't work for me either.

- Dave Roberts
  "Just another `brain dead' designer of RISC machines"
mike@umn-cs.CS.UMN.EDU (Mike Haertel) (01/11/90)
In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>Even if it is not a new instruction but an integer co-processor, it
>still needs to be included in the ABI.
>
>Without an official yeh or nay on this one, nobody can have confidence in
>Suns commitment to open systems and the ABI.

I don't see how the ABI is related to "open systems." Source-level compatibility ought to be enough for anyone. With the emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the X window system, it will be substantially easier to write portable programs, and these standards are likely to have a far longer viable life than any ABI.

I see source-level standards as being far less likely than ABIs to kill future innovation. RISC architectures, and surely the SPARC, are far from the last word in computer architecture! How will your ABI relate to the massively parallel machines that are likely to become increasingly prevalent? And yet the source-level standards will still be useful, particularly as even some of today's programs might conceivably benefit from parallelizing compilers.
--
Mike Haertel <mike@ai.mit.edu>
"Everything there is to know about playing the piano can be taught in
half an hour, I'm convinced of it." -- Glenn Gould
andrew@frip.WV.TEK.COM (Andrew Klossner) (01/12/90)
[]

"What about the ABI (application binary interface) standard you guys
are pushing ... Even if it is not a new instruction but an integer
co-processor, it still needs to be included in the ABI."

This is backward. There is no need to include a new, commonly implemented instruction in the ABI, and there are good reasons not to do it. The ABI details a common subset, the intersection of conformant systems, not the union.

The SPARC ABI describes the virtual machine which is (will be?) implemented on all SPARC systems. It gives the interface which an ABI-conformant application can depend on. If a machine implements both the ABI and the new multiply instruction, programs which conform to the ABI (and therefore do not use multiply) will still execute correctly. This is all that matters.

  -=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP]
                      (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
henry@utzoo.uucp (Henry Spencer) (01/12/90)
In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>OK Sun Microsystems! we know you out there listening. How about an
>official comment on this???? What about the ABI (application binary
>interface) standard you guys are pushing. Signing up all those
>clone manufacturers etc etc. So your going to spring a multiply
>instruction on them and make it all out dated (when compared to the sun5???)

Why are you so surprised? Some of us have been wondering about this all along. Sun as a company is *not* committed to open systems, despite their marketing hype; just try to get hardware documentation out of them for the Sun 3 line.

When the party line on SPARC, S-bus, etc. is "well, we're not really a workstation company, we see no problem in everybody being able to build compatible hardware," and the party line on the Sun 3 is "the innards of that workstation are secret, and we can't let out any information on it, and we can't understand why you persist in making such ridiculous requests for what is obviously necessarily Top Secret stuff," then either the left hand doesn't know what the right hand is doing or there are some really heavy-duty ulterior motives involved.
--
1972: Saturn V #15 flight-ready| Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
kck@mips.COM (Ken Klingman) (01/12/90)
From article <18132@umn-cs.CS.UMN.EDU>, by mike@umn-cs.CS.UMN.EDU (Mike Haertel):
> I don't see how ABI is related to "open systems."
>
> Source level compatibility ought to be enough for anyone. With the

Source-level compatibility is not enough. An applications developer wants to minimize the number of ports and different copies of the same package that must be produced. Most applications developers do not distribute source and would like to distribute only one version per architecture. That's the purpose of an Application Binary Interface: standardize the binary interface to a system so that an application developer need only distribute one version per architectural version of a system.

This means that if a processor supplier adds some whizzy new feature, or adds some hardware performance enhancement in an upward-compatible fashion which nonetheless requires recompilation to take advantage of, then the chip vendor is effectively creating a new architecture. An application developer has to weigh the cost of producing a new version of an application for a new architecture. If the new features or the improved performance warrant it, then the developer will take advantage of it. This doesn't stifle creativity, but it tends to require much more substantial architectural improvements than just minor tweaks.

Ken Klingman                             Mips Computer Systems
kck@mips.com                             928 Arques Avenue
{uunet,decwrl,pyramid,ames}!mips!kck     Sunnyvale, CA 94086   (408)991-7826
khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/12/90)
In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:

   in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
   > Now for some comments:
   > (1) SPARCs will get multiply and divide. This is from a guy at
   >     Sun. Coming soon to a SPARC station near you...

No, it was a guy at TI.

   OK Sun Microsystems! we know you out there listening. How about ...

Yes. Didn't you notice my posting BEFORE the fellow from TI's? I explained a perfectly ABI-compliant way to use such instructions without mandating ABI noncompliance (though one might want to go for the full speedup and not care about ABI compliance for some applications).

   > interface) standard you guys are pushing. Signing up all those
   > clone manufacturers etc etc. So your going to spring a multiply
   > instruction on them and make it all out dated (when compared to

Ours?? TI doesn't belong to Sun. Chip vendors sign up to build chips; as long as they don't break specs (and there are compliant ways to add instructions ... buy a license and learn how :>), who are we to stop them?

   > Even if it is not a new instruction but an integer co-processor, it
   > still needs to be included in the ABI.

ABI-compliant codes can get use of additional instructions ... shared libraries have their uses :>

   > Without an official yeh or nay on this one, nobody can have confidence in
   > Suns commitment to open systems and the ABI.

I'm no official. But I don't recall seeing the formal ABI publication ... there is a de facto ABI (Solb works, or haven't you noticed). Unless you have seen an ABI spec which asserts that there is such-and-such an instruction, you'd better not generate that instruction directly (if you care about complying with the ABI). If you use the current Sun compilers, you get .mul, .div, .rem, etc. ... and if you have a chip which had those as hw instructions, you (as an end user, or software author, or what have you) could replace those library codes with ones which employed your nifty instruction(s).
No ABI breakage.

You jump to some amazing conclusions from a posting from _TI_ about Sun's plans and business practices.
--
Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   | MTS --Only my work belongs to Sun*
I Voted for Bill &  | Advanced Languages/Floating Point Group
Opus                | "When the going gets Weird .. the Weird turn PRO"
"There is NO defense against the attack of the KILLER MICROS!"
                        Eugene Brooks
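[The .mul-style support routines mentioned above are ordinary software. As a rough sketch in modern C -- not Sun's actual library code, which operates on registers directly -- the 32x32 -> 64-bit unsigned multiply Bob originally asked for can be composed from four 16x16 partial products:]

```c
#include <stdint.h>

/* Sketch of a portable software multiply: two unsigned 32-bit
 * integers in, a 64-bit product out as two 32-bit halves.  Split each
 * operand into 16-bit halves, form the four partial products, and add
 * them with explicit carry propagation through the middle 32 bits. */
static void umul32x32(uint32_t a, uint32_t b,
                      uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* Sum the three terms that land in bits 16..31; carries show up
     * in the upper half of mid. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (p0 & 0xFFFFu) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

A hardware multiply could later replace such a routine behind the same interface, which is exactly the shared-library substitution being discussed.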
bs@linus.UUCP (Robert D. Silverman) (01/12/90)
In article <8840008@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
:Bob Silverman writes,
:>
:>In article <8840005@hpfcso.HP.COM> dgr@hpfcso.HP.COM (Dave Roberts) writes:
:
:Sorry Bob,
:  I don't have all the answers. I guess you do. I was just trying
:to propose some solutions. I guess you're perfectly right. The SPARC
:is a joke. Any machine that doesn't have integer multiply and divide is
:crippled for life. Whoops, sorry, any machine that doesn't meet your

(1) I never said the SPARC was crippled for life. Here we have the old slippery-slope argument: I criticize one aspect of SPARC, one which makes it unsuitable for a class of applications, and this clown immediately extrapolates the criticism into my saying that the SPARC is worthless. Your arguments also lack credibility when you start attacking me personally.

:performance requirements in any way is crippled for life, and all the
:people who designed it are sick in the head. You're right Bob. How
:could I have been so stupid. I guess all those people who are building
:chips without those functions you seem to need require are just silly,
:never mind that they meet the needs of thousands of people. Honestly,
:what was I thinkin', huh?
:
:I just thought that since you seem to have bought the wrong machine (in
:your opinion), I'd try and help you make it work. I'm sorry I even suggested
:you doing any extra work. I guess I should have included another option:

(2) I didn't buy any SPARCs. I am fully aware of their defects for my type of applications. However, others with whom I work are gradually acquiring SPARC-based workstations and replacing other (68020/80386) workstations that are better suited. What are those of us to do who want the greater speed of newer processors, yet who find the architecture of the new products deficient for their application?

: (7) Sell the damn thing since you seem to hate it so much and buy
:     something else.
: Of course this time, try running some of your
: code on it first so you'll know that it performs up to your

Try getting the facts before shooting your mouth off. You'll look less like a fool that way. I do not own a SPARC, I have not bought one, nor have I ever said I bought one. Yet you jump to the conclusion that since I have a criticism of a computer architecture, it must be because I am dissatisfied with one I bought.

:Sorry people. No smiley faces on this one. I'm just too tired of trying
:to deal with a bunch of whiners who'll never be satisfied with anything

Incredible. I try to carry on a conversation about the scientific merits of a processor, and point out that the market doesn't have an answer to the criticism, and this retarded dweeb takes it personally!! He calls me a 'whiner' when I simply point out that the SPARC is deficient in a particular area. This is whining? It sounds to me like a statement of fact.
--
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730
gerry@zds-ux.UUCP (Gerry Gleason) (01/12/90)
In article <18132@umn-cs.CS.UMN.EDU> mike@umn-cs.cs.umn.edu (Mike Haertel) writes:
>In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
>>Even if it is not a new instruction but an integer co-processor, it
>>still needs to be included in the ABI.
>>Without an official yeh or nay on this one, nobody can have confidence in
>>Suns commitment to open systems and the ABI.
>I don't see how ABI is related to "open systems."

Well, it is and it isn't.

>Source level compatibility ought to be enough for anyone. With the
>emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the
>X window system, it will be substantially easier to write portable
>programs, and these standards are likely to have a far longer viable
>life than any ABI.

While I agree that source-level standards are what allow for open systems in the broadest sense, "enough for anyone"? No. ABIs are important in relation to "shrink-wrap" selling of software: the selling of software in pre-packaged executable form for standard platforms. There must be enough machines out there compliant with any particular ABI for the market to be big enough for distributors to stock products for it.

A related issue is how many different ABIs the market has room for. I suspect the number is quite small (maybe only 2, certainly not more than 5), so does anyone care to guess which architectures will win? For my money, I don't see how Intel 386/486 could not be on the list -- there are simply too many roughly compatible machines out there already -- and then I would bet on one or two of SPARC, MIPS, or 88000.

Greg is right that it is critical to lock down all aspects of the hardware (and systems software standards) that affect the ABI. If Sun or any of the others falter on these points, the marketplace gets divided and will never develop.

>I see source level standards as being far less likely than ABIs
>to kill future innovation.
>RISC architectures, and surely the
>SPARC, are far from the last word in computer architecture!

Standards that do not evolve to include future innovation die off, but that's not the point. Source standards do a lot to preserve and extend software investment, particularly over time, but all of us know that there is no such thing as just compiling any source over about 10K (and few smaller) on a different family of machine (even on different '386 PCs you must recite the proper incantations for things to work). If there is not a commodity market in software for a particular architecture, the only company guaranteed to have an incentive to compile and package binaries is the manufacturer.

Admittedly, the situation is different in the upper ranges of the market; MIPS R6000 machines, as an example, won't likely be in environments where the user is the system administrator and package installer, so installation and maintenance doesn't have to be simple. But for the high-volume low end of the market, the requirements for strict standards that support drop-in compatibility will become ever greater.

What I am curious about is whether there have been any reasonable proposals for the RFP that OSF put out last year -- the one for an architecture-independent distribution standard, or whatever they called it. To me it sounded like a great idea that might be impossible in practice.

>How will your ABI relate to the massively parallel machines
>that are likely to become increasinbly prevalent? And yet the
>source level standards will still be useful, particularly
>as even some of today's programs might conceivably benefit from
>parallelizing compilers.

For the short term, the ABI is not that relevant to parallel machines, because they are mid- to high-end machines at this point. Also, to my knowledge, UNIX maps well to only a narrow range of parallel architectures (various closely and loosely coupled MIMD architectures).
So for mainstream machines (where ABIs are relevant) I expect that parallelism will be supported in standards by MACH features or MACH-like features; in other words, parallelism isn't really a problem now.

Gerry Gleason
rogerk@mips.COM (Roger B.A. Klorese) (01/13/90)
In article <18132@umn-cs.CS.UMN.EDU> mike@umn-cs.cs.umn.edu (Mike Haertel) writes:
>Source level compatibility ought to be enough for anyone. With the
>emergence of, e.g., ANSI C, POSIX.1, and (less desirably) the
>X window system, it will be substantially easier to write portable
>programs, and these standards are likely to have a far longer viable
>life than any ABI.

At least 90% of computer users do not have the source of the applications they use. And most software houses will not recompile their products for an arbitrarily large number of platforms, because each compilation also requires a (long, one hopes) test cycle. APIs are useful, but ABIs are most important to open systems.
--
ROGER B.A. KLORESE     MIPS Computer Systems, Inc.
phone: +1 408 720-2939      928 E. Arques Ave.  Sunnyvale, CA 94086
rogerk@mips.COM  {ames,decwrl,pyramid}!mips!rogerk
"Two guys, one cart, fresh pasta... *you* figure it out." -- Suzanne Sugarbaker
dmocsny@uceng.UC.EDU (daniel mocsny) (01/13/90)
In article <96@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>What I am curious about is whether there have been any reasonable
>proposals for the RFP that OSF put out last year. The one for
>an architecture independent distribution standard, or whatever
>they called it. To me it sounded like a great idea that might
>be impossible in practice.

From Byte Magazine, Nov. 1989, p. 368, "DOS at RISC" (ouch!), by Colin Hunter: "...Hunter Systems has submitted the Analyzer and XDOS to the OSF as an Architecturally Neutral Distribution Format, or ANDF."

The interesting article describes XDOS, a system that the article claims generates "true binary ports" of MS-DOS programs to many UNIX platforms. I have no experience with this product. Does anyone else?

Dan Mocsny
dmocsny@uceng.uc.edu
gregw@otc.otca.oz (Greg Wilkins) (01/13/90)
In article <5837@orca.wv.tek.com> andrew@frip.wv.tek.com writes:
>The SPARC ABI describes the virtual machine which is (will be?)
>implemented on all SPARC systems. It gives the interface which an
>ABI-conformant application can depend on. If a machine implements both
>the ABI and the new multiply instruction, programs which conform to the
>ABI (and therefore do not use multiply) will still execute correctly.
>This is all that matters.
>
> -=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP]
>                     (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]

This is not all that matters. Let's pretend you have just bought a brand-new SPARCstation X, complete with integer multiply. You now want to buy some software for it. You have a choice of paying $300 for the ABI version, which cannot use the multiply instruction (which is not part of the ABI), but which is ready to run (albeit very slowly, as it is a multiply-intensive application). OR you can pay $100,000 and buy the source code, then study law for a while so you can understand the license agreement, then hire another system administrator (the existing one is too busy finding disk space to install and compile applications for every Tom, Dick and Harry).

Basically, it has to be in the ABI or it might as well not be in the machine. That is, if this ABI thing gets off the ground at all -- which it never will if every so often a new instruction/co-processor is added.

-gregw@otc.oz
mash@mips.COM (John Mashey) (01/14/90)
In article <1253@otc.otca.oz> gregw@hls0.oz.au writes:
>Basicly it has to be in the ABI or it might as well not be in the machine.
>That is if this ABI thing gets off the ground, which it will never do if
>every so often a new instruction/co-processor is added.

Well, let us observe that:

a) Most architecture families have added user-visible instructions over
   time, haven't they:
        S/360 -> S/370 -> etc.
        68000 -> 68010 -> 68020
        8086 -> 80286 -> 80386 -> 80486
   for example.

b) For microprocessors, there are PLENTY of opportunities to use new
   things while still retaining backwards binary compatibility, like:
        use them in the kernel [50% of your cycles, often....]
        (as khb mentions) shared libraries
        kernel emulation of instructions
        translation [remember that people often translate RISC
        instructions on the fly, because it's especially easy]

c) Many microprocessors are also used in embedded control, where ABIs
   are mostly irrelevant.

d) Finally, some programs are ONLY compiled and run on local machines,
   and if somebody wants to compile with the appropriate flags, so be
   it. [After all, consider the history of floating-point accelerators
   at Sun and elsewhere, and of Weitek chips that go in place of
   80387s, etc., etc.]

Clearly one wants to avoid gratuitous changes, and too many different versions of things, but it is certainly a fact of life that:

a) at any given time, you may not have enough silicon to do everything
   you'd like to;
b) successive generations will almost always have some differences
   relative to the kernel;
c) and finally, as families evolve, the optimal code generation may
   well change from machine to machine anyway, even if they're binary
   compatible in both directions. This effect has been seen for a long
   time on S/360 ... and VAXen, and 386 vs. 486, etc., etc.

Anyway, let's observe that what's IMPORTANT is getting enough volume on anything to make it interesting ....
The fact that some 386s run MSDOS, and some run UNIX, and some run OS/2 doesn't stop people from buying them.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
gregw@otc.otca.oz (Greg Wilkins) (01/15/90)
In article <KHB.90Jan11212327@chiba.kbierman@sun.com> you write:
>In article <1249@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
> in article <8840005@hpfcso.HP.COM>, dgr@hpfcso.HP.COM (Dave Roberts) says:
> > Now for some comments:
> > (1) SPARCs will get multiply and divide. This is from a guy at
> >     Sun. Coming soon to a SPARC station near you...
>
>No, it was a guy at TI.

Well, I didn't know that!

>complying with the ABI). If you use the current Sun compilers, you get
>.mul, .div, .rem etc. .... and if you have a chip which had those as
>hw instructions you (as an end user, or software author or whathave
>you) could replace those library codes with ones which employed your
>nifty instruction(s). No ABI breakage.

Well, the compiler I use on a Sun 4/110 doesn't produce .mul, .div, etc., but then it is running an ancient version of the operating system (3.2!). But assuming that newer compilers do generate them, surely they are expanded when the a.out file is generated, so they cannot be resolved at load time. I guess they can be expanded to calls to a shared-library routine (I don't know how this mechanism works), so that libC.a is not linked in until run time -- but I don't know whether shared libraries are included in the ABI. If they are not, then libC.a will be linked in before an application is distributed, and thus local libC.a routines will be ignored.

Let's assume that, via some mechanism, multiplies are performed by an undefined function that is linked in at load time. Then the best you can do is incur a function call and then a multiply instruction (possibly with moves to and from a co-processor). I guess this is not too bad, as function calls are pretty fast on a SPARC anyway.

But what about multiplies by constants? The compiler will have turned these into wonderful sequences of shifts and adds. If a mul instruction became available, these should be replaced by a load of the constant followed by a multiply.
But then I guess nothing can solve this one (without greatly slowing the performance of all machines without a multiply instruction).

>You jump to some amazing conclusions from a posting from _TI_ about
>Sun's plans and business practices.

I was not jumping to amazing conclusions, but simply inviting Sun to clarify a point of obvious public concern. Your posting has indicated that maybe it is not such a big deal, but it still would be better to get an official position from Sun on the future of SPARC and multiply. It is obvious that:

1) SPARC multiply performance can be improved.
2) There is a class of users who would greatly benefit from it.
3) There is a larger class of users who probably won't notice it, but
   would sure as hell feel nicer about their machines when reading
   comparative benchmarks.
4) There are many rumours about how/when/if the multiply will be sped up.
5) Everybody would appreciate a definitive statement from Sun.

DISCLAIMER: I do not know my employer's opinion on these matters.

Greg Wilkins   ACSnet: gregw@otc.oz.au   igc nets: igc:peg:gwilkins
Phone(w): (02)2874862   Telex: OTCAA120591
Phone(h): (02)8104592   Snail: OTC Services R&D, GPO Box 7000, Sydney 2001, Australia
Fax: (02)2874990        O/S ph: (prefix) 612 #
"To sin by silence when they should speak out makes cowards of men" - Abe Lincoln
preston@titan.rice.edu (Preston Briggs) (01/16/90)
In article <1255@otc.otca.oz> gregw@otc.UUCP (Greg Wilkins) writes:
>But what about multiplies by constants, the compiler will have turned these
>into wonderful sequences of shifts and adds. If a mul instruction became
>available, these should be replaced by a load with constant followed by a
>multiply. But then I guess nothing can solve this one (without greatly
>slowing the performance of all machines without a multiply instruction).

Converting I x C into a sequence of shifts, adds, and subtracts is really a machine-dependent optimization -- it's not always profitable. In a short paper,

    Robert Bernstein, "Multiplication by Integer Constants,"
    Software -- Practice and Experience, July 1986

Bernstein describes the method used by the PL.8 compiler to handle 32x32 signed multiplies. The cost of a multiply on an 801, using multiply-step instructions, is 19 cycles. The cost using his scheme is summarized below:

    constant    # instructions
    --------------------------
          2           1
          3           2
          6           3
         11           4
         22           5
         43           6
         86           7
        171           8
        342           9
        683          10
       1366          11
       2731          12
       5462          13
        ...

that is, multiplications by 2731 require 12 instructions, and multiplications by C < 2731 require 11 or fewer instructions. So if you've got a multiplier that's quicker than 10 cycles, you'd rather not convert some constants >= 683. On the other hand, if you have lots of multiplies by strange integer constants, converting them all might expose opportunities for common subexpression elimination.

Note that the above numbers are misleading for some machines. If you don't have a barrel shifter or 3-address instructions or adequate registers, the cost will be higher.

Preston Briggs
preston@titan.rice.edu
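[The baseline that Bernstein's method improves on is the simple binary method: one shifted copy of x per set bit of the constant. A sketch in C (illustrative only -- Bernstein's actual algorithm also uses subtraction and factoring, so that, e.g., x*15 can become (x<<4) - x in two instructions instead of the binary method's six):]

```c
#include <stdint.h>

/* Binary-method multiply by a constant: scan the bits of c and
 * accumulate one shifted copy of x per set bit.  This mirrors the
 * shift/add sequence a compiler would emit inline for a known
 * constant on a machine without a multiply instruction. */
static uint32_t mul_by_const(uint32_t x, uint32_t c)
{
    uint32_t acc = 0;
    int s = 0;
    for (; c != 0; c >>= 1, s++)
        if (c & 1u)
            acc += x << s;   /* one shift and one add per set bit */
    return acc;
}
```

For example, c = 11 (binary 1011) costs two shifts and two adds: x + (x<<1) + (x<<3).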
khb@chiba.kbierman@sun.com (Keith Bierman - SPD Advanced Languages) (01/16/90)
In article <1253@otc.otca.oz> gregw@otc.otca.oz (Greg Wilkins) writes:
....
>brand new sparc station X, complete with integer multiply.  You now want to
>buy some software for it: You have a choice of paying $300 for the ABI
>version, which cannot use the multiply instruction (which is not part of
>the ABI), but which is ready to run (be it very slowly as it is a multiply
>intensive application).
...
No. As I pointed out in an earlier posting, it is possible to add
instructions AND to get the benefit delivered to ABI compliant users.
Consider the following gedanken experiment:
ACME HiTech Corp's VP of engineering (Wily E. Coyote) concludes that
some nifty mass market project of theirs (say HDTV/Workstation combo,
or part of a navigation system which gets its data 1 point at a time)
absolutely must have a general purpose CPU ... which can execute the
following function at full hw speed
	f(ix,ifac)  returns  int(ifac*cos(x))
	                     int(ifac*sin(x))
	                     int(ifac*cos(-x))
	                     int(ifac*sin(-x))
(viz. a mutant Givens transformation)
Clearly no merchant chip (well, aside from some nifty CORDIC chips)
has this now.
Coyote contacts the SPARC licensing board (or its real life equivalent
:>) and negotiates an opcode (perhaps in supervisor space ... where it
doesn't affect anyone else) to be available only in their
chips/systems (I have no idea how this is done, but assume for money
it can be arranged).
Consider the following cases:
1) ACME engineers build their software on Sun 4/60's which lack the
instruction.
2) First spin of the chip can't fit the whole thing ... so they
compute sincos(x) in hw.
3) Second spin does it all .. but has a bug which requires that
arg reduction be performed before the rest of the instruction.
4) Third spin does it all ... correctly.
Quiz:
1) How many a.out's (to use SunOS 4.x lingo) are necessary ?
2) How many can be ABI compliant ?
Answers:
1) Only 1 a.out is required .. and can benefit from the hw.
2) It can be ABI compliant.
3) Yes, being non-ABI compliant might improve performance.
The solution is via shared libraries. The a.out only knows that at
runtime there will be a routine called wily_givens.
At runtime the runtime loader links in the "right" shared library ...
where right means the one which matches the local hardware.
1 ABI compliant a.out runs on 4, and goes faster as the hw improves.
Best performance (probably 1-5% faster in this case) would be obtained
by generating the wily_givens instruction directly in the 4th case.
If it were to happen that SPARC's with wily_givens captured a huge
chunk of the market (say 90% for grins) perhaps the ABI would be
altered .... causing the 10% to trap to the OS for emulation. This
would only adversely impact the 10% of users who lacked the
hardware, but employed codes rich in the instruction .... most of the
10% probably wouldn't notice or care.
Clearly integer multiply would be somewhat more popular than
wily_givens and is less computationally intense .... thus the
tradeoffs aren't so obvious. But the general case of how to add an
instruction, yet be ABI compliant, is easily handled.
--
Keith H. Bierman |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault | MTS --Only my work belongs to Sun*
I Voted for Bill & | Advanced Languages/Floating Point Group
Opus | "When the going gets Weird .. the Weird turn PRO"
"There is NO defense against the attack of the KILLER MICROS!"
Eugene Brooks
guy@auspex.auspex.com (Guy Harris) (01/16/90)
>But assuming that newer compilers will generate them, surely they are
>expanded when the a.out file is generated, so they cannot be evaluated at
>load time.

Newer compilers will presumably include a command-line flag instructing
them to either produce the multiply/divide instructions themselves or
calls to ".mul"/".div" and company.  And such calls are surely *not*
expanded when the "a.out" file is generated, unless you linked with
"-Bstatic" - shared libraries, remember?

>I guess they can be expanded to a call to a shared library routine (I
>don't know how this mechanism works), so libC.a is not linked in until
>run time,

Exactly.

>But I don't know if shared libraries are included in the ABI.

If the ABI is anything like the ones for which I've seen drafts, not only
are they included, they are *required* - i.e., the way you do a "stat()"
call in an ABI-conforming application is you make a call to the
dynamically-linked routine "stat()" in the appropriate library, passing
it certain arguments.  You don't shove specified stuff into registers and
execute trap # N.

The same could apply to ".mul"/".div", and probably *would* apply (the
drafts I saw hadn't gotten around to specifying those particular routines
yet; they came in the processor-specific part of the ABI).

(It would also apply to, e.g., "getpwnam()" and company, so ABI
implementations will pick up the local "getpwnam()"-and-company
implementation, whether it be a linear scan through "/etc/passwd", a
"dbm"-based implementation like 4.3BSD's, a Hesiod-based implementation,
an Apollo Registry-based implementation, a YP-based implementation,
etc.)

>Well lets assume that via some mechanism, multiplies are performed by an
>undefined function that is linked in at load time.  Then the best you can do
>is cop a function call then a multiply instruction (possibly with moves to
>and from a co processor).  I guess this is not too bad as function calls are
>pretty fast on a SPARC anyway.
Yes, if you want an ABI-conforming program.

If speed, rather than shrink-wrap portability, is important, you could
build the program with whatever the "generate multiply/divide in line"
flag is.  If both are important, either:

  1) portability to *all* SPARC-based machines *isn't* important (e.g.,
     an application that won't work fast enough if you don't have
     multiply/divide instructions), in which case you might be able to
     build your program with the "generate multiply/divide in line"
     flag, and label it "this should run on any ABI-conforming machine
     that *also* has the standard multiply/divide instructions".

  2) portability to all SPARC-based machines is important, but so is
     getting the extra performance from the instructions if they're
     there, in which case you may build two versions.

These do somewhat conflict with the principle of the ABI, but life isn't
always perfect (are there PC applications that demand floating-point
hardware, or applications that come in multiple versions, one of which
does and one of which doesn't?).

>But what about multiplies by constants, the compiler will have turned these
>into wonderful sequences of shifts and adds.  If a mul instruction became
>available, these should be replaced by a load with constant followed by a
>multiply.

Why?  Do you have hard evidence (not guesses) to suggest that a load with
constant followed by a multiply will be faster than the sequence of
shifts and adds?

Note that the Sun 68020 compiler generates sequences of shifts and adds
for multiplies by constants *even though the 68020 has a 32x32 multiply
instruction*; I don't think this was done because the compiler writers
were too lazy to change the compiler, I think it was done because it was
*still* faster to do the shifts and adds.  I can't speak for MIPS, but I
wouldn't be surprised to hear that even though they had a multiply
instruction since Day 1 they still did shifts and adds for multiplies by
constants.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/16/90)
In article <2819@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:

| Newer compilers will presumably include a command-line flag instructing
| them to either produce the multiply/divide instructions themselves or
| calls to ".mul"/".div" and company.  And such calls are surely *not*
| expanded when the "a.out" file is generated, unless you linked with
| "-Bstatic" - shared libraries, remember?

  Has anyone measured the time taken to just generate the mpy and trap it
vs the time for a procedure call?  We used to trap some instructions on
the old GE series 20 years ago, and the time to trap and decode (table
lookup for decode) was only a few % slower than a call, when the total
time to execute the "instruction" was taken into account.

  Would it be better to just generate the instruction all the time and
trap it, rather than use the various libraries?  It would certainly give
better performance on the machines with the mpy hardware, and based on
the very slow times reported here it might not be a notable loss on a
standard ABI SPARC.

  Has anyone measured these numbers to get a ballpark figure?  I don't
have a good feel for how long the partial context change would take on
the trap.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Stupidity, like virtue, is its own reward" -me
mash@mips.COM (John Mashey) (01/17/90)
In article <2819@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
....
>the compiler writers were too lazy to change the compiler, I think it
>was done because it was *still* faster to do the shifts and adds.  I
>can't speak for MIPS, but I wouldn't be surprised to hear that even
>though they had a multiply instruction since Day 1 they still did shifts
>and adds for multiplies by constants.

Yes, of course: shifts, adds, and subtracts, up to the point where it
takes about as long as a multiply.  (I think there's a bias to use
multiply if it's equal or close, since the code sequence is shorter, and
there may be code scheduling opportunities in the shadow of the multiply,
although not many.)
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
sjc@key.COM (Steve Correll) (01/17/90)
In article <2819@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
> ...I can't speak for MIPS, but I wouldn't be surprised to hear that even
> though they had a multiply instruction since Day 1 they still did shifts
> and adds for multiplies by constants.

Right.  (As of 2 years ago, anyway) the MIPSCo compilers compare the
number of cycles for a hardware multiply against the cycles for the
necessary series of shifts and adds, and emit the cheaper alternative.
They also shift and mask (adjusting for the sign bit as needed) in lieu
of div and rem by powers of 2.

This happens in the non-ASCII back end of the assembler, which the
compilers employ to create the object file, so compilers and humans alike
can code "mul" and let the assembler figure out the tradeoff.
--
...{sun,pyramid}!pacbell!key!sjc 		Steve Correll
whit@milton.acs.washington.edu (John Whitmore) (01/24/90)
In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
>In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>>Does any have, of know of software for the SPARC [SUN-4] that will
>>perform the following:
>>
>> [standard multiply and divide]
> . . .
>There should be instructions on the order of "multiply step" and "divide
>step", each of which will do one of the 32 adds/subtracts and then shift.

  I strongly disagree.  Smarter routines can optimize a lot of that kind
of "microcode" away, and should do so given the opportunity.

>Thus they provide you with the tools to do your own multiply and divide.
>One of the benefits is that a compiler can optimize small multiplies and
>divides to make them execute quicker (i.e. multiply by 10 takes 4 steps
>instead of 32).

  Yes, EXACTLY.  So extend the principle; take four-bit nibbles of the
argument and do a 16-way JUMP (whatever the equivalent is on a SPARC)
to sixteen cases like:

	CASE x0000:	RTN
	CASE x0001:	add (to accumulator)
	CASE x0010:	shift +1, add
	CASE x0011:	subtract, shift +3, add
	CASE x0100:	shift +3, add
	CASE x0101:	add, shift +3, add
	CASE x0110:	shift +2, add, shift +3, add
	CASE x0111:	subtract, shift +4, add

and if I can figure it out, you experts are getting bored by now.  Four
operations MAX for a four-bit multiplicand, opposed to 12 operations
(estimated) for the one-bit "MULSTEP" approach.

>P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
>      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
>      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.

  Hm.  The principle is good, but don't think small.  Unroll it to really
large chunks of code.  The "BNZ" is a bottleneck that shouldn't be
employed when really large fanout of the code path can be done in ONE
step.

  I seem to recall that the trick (above) was employed by a hardware
multiplier IBM made, some decade or more ago.  It still works.
I am known for my brilliance,			John Whitmore
 by those who do not know me well.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/90)
  Actually memory is cheap enough to use 64k for a lookup table.  Make a
16 bit address from the two bytes to multiply as:

	 0       7 8      15
	|_________|_________|
	|  byte1  |  byte2  |
	|_________|_________|

and just pull the answer out of memory.

  Actually, the Intel practice of treating a 32 bit register as lots of
little registers, with the 8 and 16 bit portions addressable by name,
would be handy for some of this stuff, eliminating a shift.  It should
still be quite fast.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Stupidity, like virtue, is its own reward" -me