mark@mips.COM (Mark G. Johnson) (12/15/89)
In article <241@dg.dg.com> uunet!dg!chris (Chris Moriondo) writes:
>
>Anyone care to speculate as to how much pipelining add would win/lose
>in terms of useful overlap in FP codes versus increased latency?
>
Several articles were posted to comp.arch this spring, talking about the
perceived benefits of a pipelined FP adder in the Motorola MC88K risc.
Perhaps the Moto folks can say more now.  [ after the departure of Ross ]

I dimly recall that the last time around, they were beginning to feel
that the 88K's shared register file -- same regs for integer and FP
operands -- required large numbers of read and write ports to make
FP programs run quickly.  The other design alternative, separate
integer regs from the FP regs, has lots of ports already since it's
2 copies of the hardware.  I'm sure if my memory is faulty somebody
will gently correct me.

The earlier MC88K articles were posted by wca@oakhill.UUCP and
mpaton@oakhill.UUCP; your site archives may contain comp.arch
article <1362@oakhill.UUCP>.
--
 -- Mark Johnson
    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
    (408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}
preston@titan.rice.edu (Preston Briggs) (12/15/89)
In article <33570@hal.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <241@dg.dg.com> uunet!dg!chris (Chris Moriondo) writes:
> >
> >Anyone care to speculate as to how much pipelining add would win/lose
> >in terms of useful overlap in FP codes versus increased latency?
> >
>
>Several articles were posted to comp.arch this spring, talking about the
>perceived benefits of a pipelined FP adder in the Motorola MC88K risc.

Well, long pipelines will usually increase latency.  Additionally,
they'll increase FP performance, given some effort by the compiler.
The tradeoff will just be a design decision, probably based on the
target applications.

For a workstation that *I* would use (to compile and edit things like
compilers and editors), FP performance isn't very important.  But when
people really want FP performance, they can usually manage to avoid
task switching too often.  What'll happen when people get more used to
cheap FP?  Probably more FP in everyday code, particularly graphics
display.

>I dimly recall that the last time around, they were beginning to feel
>that the 88K's shared register file -- same regs for integer and FP
>operands -- required large numbers of read and write ports to make
>FP programs run quickly.  The other design alternative, separate
>integer regs from the FP regs, has lots of ports already since it's
>2 copies of the hardware.

The point about read/write ports is good.  I used to believe in a
single (large) set of multi-purpose registers controlled by the
compiler.  This scheme would seem to be less wasteful of resources
(when you need more FP regs, spill more integers, and vice-versa,
whichever works out cheapest for particular programs).  Nowadays,
I've come around to the view that extra resources for the moments of
peak demand (triply-nested loops and such) are better.  In particular,
I prefer separate FP and integer register sets.
Locally, friends are working on loop transformations that will use
*all* the available FP registers profitably.  If we've got to balance
the FP register pressure against the integer pressure, then we've
added another hard issue to optimization.

Of course, it's extra hardware and it'll cost money.

Preston Briggs
preston@titan.rice.edu
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (12/15/89)
In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <33570@hal.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>>that the 88K's shared register file -- same regs for integer and FP
>>operands -- required large numbers of read and write ports to make
>The point about read/write ports is good.  I used to believe
>in a single (large) set of multi-purpose registers controlled
>by the compiler.  This scheme would seem to be less wasteful
>In particular, I prefer separate FP and integer register sets.

There are alternatives.  For example, instead of creating separate
register files to increase the available ports, there is a technique
that uses two identical copies of one register file for the same
purpose.  The technique was used on the Cyber 205, and is described
in an article in IEEE Transactions on Computers [May 1982] by Neil
Lincoln.  This redundancy doubles the real estate per register but,
on the other hand, has advantages over separate fp register files:
codes which need to move data between the integer registers (where
the logical and bit-crunching instructions typically live) and the fp
register file don't require any actual data movement.

(I hope nobody still uses the IBM 360 setup, where you have to push
the data to *memory* to get it back and forth between integer and fp
units...)

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)694-6117
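The duplicated-register-file trick above can be sketched in a few
lines.  This is a toy Python model invented here for illustration
(real hardware would use two single-port RAM arrays cycled together):
every write updates both copies, so two source operands can be read
in the same cycle, one from each copy.

```python
class DuplicatedRegFile:
    """Two identical single-port copies of one logical register file.

    Writes go to both copies, so the contents always agree; reads can
    then be satisfied two at a time, one from each copy, without any
    multi-ported RAM cells.
    """

    def __init__(self, nregs):
        self.copy_a = [0] * nregs
        self.copy_b = [0] * nregs

    def write(self, reg, value):
        # One logical write port: both copies see the same data.
        self.copy_a[reg] = value
        self.copy_b[reg] = value

    def read2(self, r1, r2):
        # Two simultaneous reads, one served by each copy.
        return self.copy_a[r1], self.copy_b[r2]


rf = DuplicatedRegFile(32)
rf.write(3, 42)
rf.write(5, 7)
print(rf.read2(3, 5))
```

The cost is exactly the "doubled real estate" LaMaster mentions; the
payoff is that both operands of a two-source instruction arrive in
one cycle.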
mark@mips.COM (Mark G. Johnson) (12/15/89)
In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>(I hope nobody still uses the IBM 360 setup, where you have to push
>the data to *memory* to get it back and forth between integer and fp
> units...)

I believe SPARC does this, because there aren't FPR <-> GPR move
instructions.  However, "*memory*" is just a few cycles away (3 for
the store, 2 more for the load) thanks to cache.
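The round trip described above -- no FPR <-> GPR move instruction, so
the compiler emits a store from one register file and a load into the
other -- is just a reinterpretation of the same 32 bits from the other
side.  A rough Python sketch of the mechanism, with a bytes buffer
standing in for the memory slot (the name float_bits is invented here,
not part of any SPARC toolchain):

```python
import struct

def float_bits(f):
    # "Store" the single-precision float to memory...
    mem = struct.pack('<f', f)
    # ...then "load" those same four bytes back as an integer.
    # No direct register-to-register path is used, mirroring the
    # store/load pair a SPARC compiler would emit.
    (u,) = struct.unpack('<I', mem)
    return u

print(hex(float_bits(1.0)))   # IEEE 754 single for 1.0
```

The cycle counts quoted in the article (store plus dependent load) are
the price of exactly this detour through the cache.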
davec@proton.amd.com (Dave Christie) (12/15/89)
In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
| In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
| [discussion about separate vs. multi-port shared fp/int registers
|  and integer/fp register allocation tradeoffs]
|
| >In particular, I prefer separate FP and integer register sets.
|
| There are alternatives.  For example, instead of creating
| separate register files to increase the available ports, there is a
| technique to use two identical register files for the same purpose.
| The technique was used on the Cyber 205, and is described in an article
| in IEEE Transactions on Computers [May 1982] by Neil Lincoln.  This redundancy
| doubles the real estate per register, but, on the other hand, there
| are advantages to doing this over using separate fp register files.

This technique is fairly common in machines from the '70s (at least
CDC machines) that used small off-the-shelf single-port SRAMs for the
register file.  Doubling the register file real estate is well worth
it for the improved performance that comes with accessing two operands
at once.  (Small off-the-shelf multi-ported SRAMs didn't quite cut it
for various reasons.)

The extra ports that come free with separate register sets aren't the
only benefit: Preston alluded to the removal from the compiler of a
difficult optimization tradeoff (int vs. fp register allocation).
(Though one could argue about whether taking away the opportunity to
make that tradeoff is really an enhancement.)  Another benefit appears
with multiple instruction issue: issuing simultaneous fp/int
instructions is simplified if you don't have to worry about dependency
checking between the two (witness the IBM 360/91, the i860, and, I
believe, the Apollo Prism).  It also makes it easier to put all fp
operations on a separate chip and still maintain high performance
(a la MIPS).
| Codes which require movement of data between
| integer registers, where the logical and bit-crunching instructions work,
| typically, and the fp register file, don't require data movement.
[in a shared register set]

Quite true, but (IMHO) for most applications, int<->fp traffic is not
very high, so having to explicitly move some data around is no big
deal ... as long as it is done reasonably efficiently so that you
don't blow one or two applications out of the water.

------------
Dave Christie      My humble opinions, not my employer's.
slackey@bbn.com (Stan Lackey) (12/16/89)
In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>>In particular, I prefer separate FP and integer register sets.
>
>There are alternatives.  For example, instead of creating
>separate register files to increase the available ports, there is a
>technique to use two identical register files for the same purpose.

It's also possible to build the register file with both multiple read
and multiple write ports.  I'm pretty sure the Multiflow Trace has
this.  It's expensive, especially for multiple write ports, but it can
certainly be done; there is basically a mux in front of each bit.  In
fact, I'd be surprised if any of the fast RISCs had fewer than two
read ports.
-Stan
ram@shukra.Sun.COM (Renu Raman) (12/16/89)
In article <33623@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> >(I hope nobody still uses the IBM 360 setup, where you have to push
> >the data to *memory* to get it back and forth between integer and fp
> > units...)
>
>I believe SPARC does this, because there aren't FPR <-> GPR move instructions.
>However, "*memory*" is just a few cycles away (3 for the store, 2 more for
>the load) thanks to cache.

A minor addendum: what mark@mips said holds for the Fujitsu/LSI/Cypress
parts.  You can at best crunch it down to 3 cycles (2 cycles for the
store and 1 cycle for the load -- for doubles, it would be 4) if you
can design a good cache system using the BIT ECL parts (which is left
as an exercise to the reader :-)).  So it is very implementation
dependent.  The best case, of course, is 2 cycles (if you can do
single-cycle stores; you can do single-cycle load doubles to FP using
the BIT parts).

renu raman
pmontgom@sonia.math.ucla.edu (Peter Montgomery) (12/17/89)
In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
>
>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
>high, so having to explicitly move some data around is no big deal ... as
>long as it is done reasonably efficiently so that you don't blow one or two
>applications out of the water.
>
	I have written a multiple precision integer arithmetic package.
Its time-critical routines are conditionally compiled so the
algorithms can be individually optimized for different machines (yes,
I resemble Herman Rubin in wanting full access to machine instructions
from high-level languages).  On some machines, almost all arithmetic
is integer.  Recently, while porting my program to an Alliant, I
vectorized some of these codes.

	A frequent primitive operation requires finding integers X, Y
such that X*2**30 + Y = A*B + C, where A, B, C are given (vectors of)
integers up to 2**30 (2**30 is the radix for the multiple-precision
arithmetic).  The Alliant has a vectorized integer multiply
instruction, but only for the lower 32 bits of a product.  Hence I
can get Y = IAND(A*B + C, 2**30 - 1) with vectorized integer
operations.  To get X (the upper half of the product) using vector
operations, I use floating point, such as

	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)

(DBLE = convert integer to double, NINT = convert double to nearest
integer).  This statement uses 4 conversions between integer and
floating point while doing only 3 floating point operations.  The
vectorized DBLE is compiled inline, but the vectorized NINT is done
in a subroutine; the NINT subroutine uses 10% of my program's time
(though total program time is down from before vectorization).

	BTW, I have asked the X3J3 committee to add this operation
(i.e., given A, B, C, D, all nonnegative, and either A < D or B < D,
find X, Y such that X*D + Y = A*B + C) to Fortran 8x.
--------
	Peter Montgomery
	pmontgom@MATH.UCLA.EDU
	Department of Mathematics, UCLA, Los Angeles, CA 90024
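Montgomery's split can be checked in a few lines.  Below is a scalar
(non-vectorized) Python sketch of the same computation with the 2**30
radix from the article; Python's round() stands in for NINT and
float() for DBLE.  The key point the floating-point step relies on is
that A*B + C - Y is an exact multiple of 2**30, so the rounding error
in the 53-bit double arithmetic (well under 0.5 after scaling) is
removed by the final round-to-nearest.

```python
RADIX_BITS = 30
RADIX = 1 << RADIX_BITS          # 2**30, the multiple-precision radix

def split(a, b, c):
    """Find (x, y) with x*2**30 + y == a*b + c and 0 <= y < 2**30,
    using only a low-half integer multiply plus floating point,
    as in the Alliant code."""
    # Y = IAND(A*B + C, 2**30 - 1): the low half needs only the low
    # 32 bits of the product, which the integer unit provides.
    y = (a * b + c) & (RADIX - 1)
    # X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y)) * 0.5**30): the high
    # half via doubles; a*b+c-y is exactly x*2**30, and the double
    # rounding error is far below 0.5 after the 0.5**30 scaling.
    x = round((float(a) * float(b) + float(c - y)) * 0.5**RADIX_BITS)
    return x, y
```

For example, split(2**30 - 1, 2**30 - 1, 2**30 - 1) recovers the
exact double-width product, even though (2**30 - 1)**2 itself is not
representable in a 53-bit double.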
davec@proton.amd.com (Dave Christie) (12/18/89)
In article <2100@sunset.MATH.UCLA.EDU> pmontgom@math.ucla.edu (Peter Montgomery) writes:
#In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
#>
#>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
#>high, so having to explicitly move some data around is no big deal ... as
#>long as it is done reasonably efficiently so that you don't blow one or two
#>applications out of the water.
#>
#	I have written a multiple precision integer arithmetic package.
#Its time-critical routines are conditionally compiled so
#the algorithms can be individually optimized for different machines
#(yes, I resemble Herman Rubin in wanting full access to machine
#instructions from high-level languages).  On some machines,

Yes, multiple precision seems to be one of the applications you want
to be careful about (Herman Rubin replied as well!).  My rash
generalization comes from a former life at a TLA mainframe company,
where I did detailed instruction frequency measurements of a wide
range of real-world Fortran programs -- such things as structural
analysis, flight simulation, ECAD, and many other flavors (but likely
not multiple precision -- notice I'm not using the words
"representative" or "comprehensive" ;-).  This indicated (among many
other things) that the frequency of converts would not be a big factor
in deciding whether to go with separate or shared registers, were one
to design a new general-purpose ISA -- other factors carry more
weight.  Of course, your mileage may vary.  What one sees fit to do
generally depends on what one thinks will bring in the most bucks,
right?

#A frequent primitive operation requires finding integers X, Y such that
#X*2**30 + Y = A*B + C where A, B, C are given (vectors of) integers
#up to 2^30 (2^30 is the radix for multiple-precision arithmetic).
#The Alliant has an vectorized integer multiply instruction, but
#only for the lower 32 bits of a product.
#Hence I can get
#Y = IAND(A*B + C, 2**30 - 1) with vectorized integer operations.
#To get X (upper half of product) using vector operations,
#I use floating point, such as
#
#	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)
#
#(DBLE = convert integer to double, NINT = convert double to nearest integer).
#This statement uses 4 conversions between integer and floating point
#while doing only 3 floating point operations.

This sounds like a nice extreme case.  Now I'm curious -- does anyone
know, or care to figure out (I'm not curious enough to take the time
myself :-), whether the simultaneous integer/fp issue capability of
the i860 would more than compensate for the penalty on converts caused
by the separate int/fp registers, even on this extreme piece of code?

#	Peter Montgomery
#	pmontgom@MATH.UCLA.EDU
#	Department of Mathematics, UCLA, Los Angeles, CA 90024

------------
Dave Christie      The opinions are mine, the facts belong to everyone
                   .... oops - sorry, the facts are proprietary.....