hamrick@convex.COM (Ed Hamrick) (12/10/89)
A few questions about the Intel 860 architecture:

1) The 860 doesn't seem to have any divide instructions, either integer or floating point. It seems to depend on a floating point reciprocal instruction followed by a floating point multiplication.
   - What other architectures use reciprocal -> multiply for divide?
   - What are the numerical accuracy tradeoffs?
   - How many cycles does the reciprocal instruction take? Can it be pipelined?

2) How deep is the pipeline for 64 bit adds / multiplies? 32 bit?

3) What happens to the pipeline if there are page faults / exceptions during dual operation mode? Does the pipeline advance one step per clock cycle, or one step per floating instruction?

4) Is it possible to do pipelined FP loads with non-unit stride?

5) Is it possible to do pipelined scatter/gather operations?

6) The 860 doesn't seem to have integer multiplication instructions, and also doesn't seem to have any integer to floating conversion instructions. What are the best ways to do efficient integer multiplication with the 860? Does this have something to do with the fmlow instruction?

All in all, it looks like a well thought out chip, with a lot of clever architectural trade-offs to get everything on one chip.

Regards,
Ed Hamrick
ccplumb@rose.waterloo.edu (Colin Plumb) (12/10/89)
In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
>2) How deep is the pipeline for 64 bit adds / multiplies? 32 bit?

It's 3 stages for most things, and 2 for d.p. multiplies. However, in the latter case, each stage takes 2 cycles, so you only get one result per 2 clocks.

>3) What happens to the pipeline if there are page faults / exceptions
>   during dual operation mode? Does the pipeline advance one step
>   per clock cycle, or one step per floating instruction?

I don't quite understand. The pipeline advances one stage per floating instruction. The instruction's dest specification specifies where to put the current result, not the result of the operation you're currently starting.

The i860's exception handling is seriously weird. It saves just barely enough information for an exception handling routine to figure out what went wrong and fix it. No fast context switches on this puppy! And even then, there are code constructs you have to avoid, like branching to the shadow of a delayed branch. It only saves one address, so the exception handler has to look back one instruction to see where it should resume... ugh.

>4) Is it possible to do pipelined FP loads with non-unit stride?

Certainly. The pipelined load business just makes the latency visible to the programmer; you still supply one address per load. There is no auto-increment feature. A pipelined load is just a load that doesn't get satisfied until after you've issued the next pipelined load; other than that it's normal.

>5) Is it possible to do pipelined scatter/gather operations?

Again, sure, if you want to write the software to compute the scatter/gather business. I believe the load pipeline is 2 deep (I may have forgotten). This means the first two instructions you issue supply addresses and bogus destination registers. The third pipelined load supplies the third address and the destination for the first load (which hopefully has completed by now).
There's nothing you couldn't do with aggressive scoreboarding and ordinary loads, except that not having to supply a destination register until the data is ready gives you another register for those few clocks.

>6) The 860 doesn't seem to have integer multiplication instructions,
>   and also doesn't seem to have any integer to floating conversion
>   instructions. What are the best ways to do efficient integer
>   multiplication with the 860? Does this have something to do with
>   the fmlow instruction?

Ugh... I'm forgetting. I believe the fmlow instruction can do an integer multiply, and I'm pretty sure there are int<->fp conversion instructions.

>All in all, it looks like a well thought out chip, with a lot of clever
>architectural trade-offs to get everything on one chip.

To be honest, I wasn't too impressed when I saw it. Lots of weird non-orthogonalities, and I still think the interrupt handling is a pig. But I believe some of the design team reads comp.arch; let them refute.

(Note that I believe an interrupt take/return should take about twice a function call/return. The 29000 is still too slow, but shows how simple an interrupt handling structure can be. I still wonder what the chip is doing for all those cycles. Freeze status registers, set supervisor mode, clear pipeline, and start fetching from a new address. A non-delayed jump with a little bit of fiddling.)
--
	-Colin
brooks@maddog.llnl.gov (Eugene Brooks) (12/10/89)
In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
> - What other architectures use reciprocal -> multiply for divide?

The machines made at CRI do this.

> - What are the numerical accuracy tradeoffs?

As good as Newton-Raphson can be.

> - How many cycles does the reciprocal instruction take? Can it be
>   pipelined?

Yes, it can be pipelined; on the Cray it is pipelined. I don't remember about the 860.

brooks@maddog.llnl.gov, brooks@maddog.uucp
mark@mips.COM (Mark G. Johnson) (12/11/89)
Two ">>" for hamrick@convex.COM (Ed Hamrick);
One ">" for ccplumb@rose.waterloo.edu (Colin Plumb);

>>All in all, it looks like a well thought out chip, with a lot of clever
>>architectural trade-offs to get everything on one chip.
>
>To be honest, I wasn't too impressed when I saw it. Lots of weird
>non-orthogonalities and I still think the interrupt handling is
>a pig. But I believe some of the design team reads comp.arch; let
>them refute.

I'd suggest that the Solbourne/Mitsubishi _Million_Transistor_SPARC_ chip (having CPU, caches, and floating point on one die, very much like the 860) is a lot better thought out, with lots more architectural cleverness. Including the idea that the computer needs to run an operating system efficiently, and that user programs written in high-level languages should run quickly.
--
 -- Mark Johnson
 MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
 (408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}
mark@mips.COM (Mark G. Johnson) (12/11/89)
In article <33258@hal.mips.COM> I made a blunder:
> ... the Solbourne/Mitsubishi _Million_Transistor_SPARC_ chip ...

Oops, make that "Matsushita", not "Mitsubishi". Sorry.
afgg6490@uxa.cso.uiuc.edu (12/12/89)
..> Reciprocal approximation

Extracting from the comparative tables in my recently completed survey of computer arithmetic (yeah, I know, I wish they were much more complete too):

Cray machines use Newton-Raphson reciprocal approximation.

Reciprocal approximation was designed into the Advanced Astronautics ZS-1, but was replaced in the final design by commodity chips. IEEE format, but not IEEE accuracy. This algorithm's special feature was omitting mantissa bits that were fixed in value in the intermediate stages, to obtain a 1/x approximately 20% more accurate than Cray's algorithm.

Reciprocal approximation was used in the Gould NP1, with hexadecimal floating point. Accuracy problems (in large part caused by the hex format) caused a real divide to be backpatched in.

The AT&T DSP32C uses NR with a seed instruction and SW iteration for its special 40-bit FP format.

The Motorola DSP96002 also uses Newton-Raphson with a seed instruction.

(Cyrix, as mentioned in an earlier post, uses NR to compute an approximate 1/X for use in its 17-bit digit selection divide.)

COMMENTS
--------

Use of N-R reciprocal algorithms was set back quite a bit by IEEE FP's exactness requirements: a true remainder is needed for correct rounding. Also, in non-binary floating point (decimal or hex), accuracy problems when division is implemented by reciprocal multiplication can be extreme. Note that NR reciprocals typically converge from below.

Most other problems with reciprocal are caused by laziness - implementations that do not bring the result out to the last possible bit. With a fast enough multiplier, however, the true remainder can be computed with only a little extra work (for Y/X: Q = Y*(1/X); R = Y - Q*X; QQ = Q +/- delta depending on comparison of R and X).

I have already heard compiler guys say that reciprocal instructions opened up possibilities for optimization that are hidden by the asymmetry of divide.
I see some possibility of using both an explicit reciprocal and a remainder correction to advantage, as separable operations: (1) a reciprocal instruction; (2) a remainder correction instruction to achieve a true divide result. (2) can be optimized away, or combined when redundant.
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (12/14/89)
In article <112400013@uxa.cso.uiuc.edu> afgg6490@uxa.cso.uiuc.edu writes:
>	Use of N-R reciprocal algorithms was set back quite a bit by
>IEEE FP's exactness requirements: a true remainder is needed for
>correct rounding.

The i860 supports some IEEE things in software. That is, the logic is in trap handlers, rather than on the chip. This software is not yet available for divide. So, multiplies and adds can underflow, but divides can't.
--
Don		D.C.Lindsay 	Carnegie Mellon Computer Science
jesup@cbmvax.commodore.com (Randell Jesup) (12/14/89)
In article <112400013@uxa.cso.uiuc.edu> afgg6490@uxa.cso.uiuc.edu writes:
>..> Reciprocal approximation
[list of machines that use NR approx to do divides...]

You can add the RPM-40 (with its FPU) to the list. A FAST ?56x56? multiplier array made it fairly easy to do that way. (I think it was 3 40MHz cycles per pass through the array.)
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"