lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/12/90)
Various defunct (ETA, Cydrome) and extant (Cray) machines have a trick, whereby a floating point pipe can deliver one double precision result per clock, or else deliver two single precision results per clock. This always sounded vaguely reasonable, at least at the pin-count level, but I've never read an analysis of the circuit issues. I can't think of a single micro offering this feature. I understand why it's not in the unpipelined FPUs, and I understand why it's not in the FPUs that emphasize 32-bit data paths. But why isn't it in any micro? Is the idea dead for good, or about to come back?
--
Don D.C.Lindsay
dik@cwi.nl (Dik T. Winter) (11/12/90)
In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
> Various defunct (ETA, Cydrome) and extant (Cray) machines have a
What Cray are you referring to?
> trick, whereby a floating point pipe can deliver one double precision
> result per clock, or else deliver two single precision results per
> clock.
> I can't think of a single micro offering this feature. I understand
> why it's not in the unpipelined FPUs, and I understand why it's not
> in the FPUs that emphasize 32-bit data paths. But why isn't it in any
> micro? Is the idea dead for good, or about to come back?
Offhand I would say that the idea requires a doubling of all hardware to do the arithmetic. This appeared to be feasible in the 205/ETA: the machine costs enough, and the wires for the 64-bit-wide datapath (128 bits wide internally) were there anyway. Another thing to note is that the 205/ETA offers this facility in the vector instructions (those are memory to memory) but not in the scalar instructions (those are register to register).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
pmk@craycos.com (Peter Klausler) (11/12/90)
In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
>Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>trick, whereby a floating point pipe can deliver one double precision
>result per clock, or else deliver two single precision results per
>clock.
No CRI or CCC machine supports this "feature".
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/12/90)
In article <2511@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> > Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>What Cray are you referring to?
I'd tell you what I was thinking of, except that apparently I wasn't thinking at all. The Cydrome didn't do it either: reading further into the article, the 2:1 rate comes from double-cycling a 32-bit data path.
--
Don D.C.Lindsay
msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (11/13/90)
I don't know how hard it would be to do, but I wonder how useful it would be. Does anyone know how useful it was on the Cyber 205? I wonder how you would keep such a unit busy on a micro. Keeping a fully pipelined version busy would require 1 instruction, 128 bits of data, and a place to put 64 bits of results per cycle. This would be easier as a vector instruction in a vector machine (as in the CDC machines). A non-fully pipelined version would be easier to feed, and might have advantages over a more fully pipelined conventional single-precision unit, in that you could do the same work with half the floating-point instructions. If instruction issue is a bottleneck, then this could help, but there are probably easier ways. (How about a set of fp instructions that code for two successive fp operations? Then you get the same thing, and complicate control and issue instead of the fp functional units. This may or may not be better.)
--
Michael Pereckas * InterNet: m-pereckas@uiuc.edu *
just another student... (CI$: 72311,3246)
*Jargon Dept.: Decoupled Architecture--sounds like the aftermath of a tornado*
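[Editorial sketch, not part of the original post.] The operand budget above (128 bits in, 64 bits out per cycle) is the same whether the unit does one double-precision add or two single-precision adds, which is what makes the trick plausible at the pin-count level. A minimal Python sketch of the paired-single case follows, assuming two IEEE single-precision values packed side by side in one 64-bit word. The function names are hypothetical, IEEE format is an assumption (the Cyber 205 used its own format), and this models only the data layout, not any real pipeline.

```python
import struct

def pack_pair(a: float, b: float) -> int:
    """Pack two single-precision values into one 64-bit word (assumed layout)."""
    # struct rounds each value to IEEE single precision; it raises
    # OverflowError if a value is out of single-precision range.
    return int.from_bytes(struct.pack("<ff", a, b), "little")

def unpack_pair(word: int) -> tuple:
    """Split a 64-bit word back into its two single-precision lanes."""
    return struct.unpack("<ff", word.to_bytes(8, "little"))

def add_paired_single(x: int, y: int) -> int:
    """One 'issue': two 64-bit operand words in, one 64-bit result word out.

    Each word carries two independent single-precision lanes, so a single
    operation produces two single-precision sums.
    """
    a0, a1 = unpack_pair(x)
    b0, b1 = unpack_pair(y)
    return pack_pair(a0 + b0, a1 + b1)

if __name__ == "__main__":
    result = add_paired_single(pack_pair(1.5, 2.25), pack_pair(0.5, 0.25))
    print(unpack_pair(result))
```

In this layout the instruction count is halved relative to issuing one single-precision add per operation, which is the instruction-issue saving mentioned above.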
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/13/90)
>On 12 Nov 90 16:00:09 GMT, msp33327@uxa.cso.uiuc.edu (Michael Pereckas) said:
-> I don't know how hard it would be to do, but I wonder how useful it
-> would be. Does anyone know how useful it was on the Cyber 205?
Because the memory bandwidth was available, this feature (double-speed
half-precision arithmetic) was very useful. Long vector codes
(N>>1000) really did get 2:1 speed-ups when run in 32-bit precision.
Short vector codes got less speedup because the startup overhead was
effectively twice as large (although it was the same number of cycles).
A more significant problem was that the perverse floating-point
arithmetic on the machine was surprisingly inaccurate and many
calculations that were OK in 32 bits on IEEE machines would fail
miserably on the 205/ETA-10.
For more on both the speed and accuracy, see my discussion in the paper
entitled "Some Notes on 32-bit Arithmetic on the Cyber 205 and ETA-10" in
Supercomputer (the Dutch journal), volume 5, number 2 (issue #24),
March, 1988.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@vax1.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
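[Editorial sketch, not part of the original post.] The long-vector versus short-vector behavior described above can be illustrated with a simple timing model, assuming a vector operation costs a fixed startup plus one cycle per result in 64-bit mode, or one cycle per two results in 32-bit mode. The function names and the 50-cycle startup are placeholders for illustration, not measured Cyber 205 or ETA-10 figures.

```python
def vector_cycles(n: int, startup: float, results_per_clock: int) -> float:
    """Cycles for an n-element vector operation under the assumed model."""
    return startup + n / results_per_clock

def speedup_32_over_64(n: int, startup: float) -> float:
    """Ratio of 64-bit time to 32-bit time for the same vector length."""
    return vector_cycles(n, startup, 1) / vector_cycles(n, startup, 2)

def half_performance_length(startup: float, results_per_clock: int) -> float:
    """Vector length at which half of the peak result rate is reached.

    Follows from the model: the rate n / (startup + n / r) equals r / 2
    when n = startup * r, so doubling r doubles this length even though
    the startup is the same number of cycles.
    """
    return startup * results_per_clock

if __name__ == "__main__":
    STARTUP = 50  # hypothetical startup cost in cycles
    for n in (10, 100, 1000, 10000):
        print(n, round(speedup_32_over_64(n, STARTUP), 3))
```

Under these assumptions the speedup approaches 2:1 only when n is much larger than the startup cost, and the half-performance length doubles in 32-bit mode, which is one way to read "the startup overhead was effectively twice as large."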
seanf@sco.COM (Sean Fagan) (11/13/90)
In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>trick, whereby a floating point pipe can deliver one double precision
>result per clock, or else deliver two single precision results per
>clock.
Uhm... not a Cray. Crays have one size of fp: word. 64 bits. There are some things you can do (and the hardware helps a little bit, I believe) to get 128-bit fp numbers, but it's done mostly in software.
--
-----------------+
Sean Eric Fagan  | "*Never* knock on Death's door: ring the bell and
seanf@sco.COM    |  run away! Death hates that!"
uunet!sco!seanf  |     -- Dr. Mike Stratford (Matt Frewer, "Doctor, Doctor")
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/14/90)
> trick, whereby a floating point pipe can deliver one double precision
> result per clock, or else deliver two single precision results per
> clock.
Private email has turned up:
>The <..> processor does this. The next generation <..> will not; it
>costs more silicon and more critical paths to do it than to build
>separate units.
--
Don D.C.Lindsay