[comp.arch] 1 double or 2 singles

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/12/90)

Various defunct (ETA, Cydrome) and extant (Cray) machines have a
trick, whereby a floating point pipe can deliver one double precision
result per clock, or else deliver two single precision results per
clock.

This always sounded vaguely reasonable, at least at the pin-count
level, but I've never read an analysis of the circuit issues.

I can't think of a single micro offering this feature. I understand
why it's not in the unpipelined FPUs, and I understand why it's not
in the FPUs that emphasize 32-bit data paths. But why isn't it in any
micro? Is the idea dead for good, or about to come back?
-- 
Don		D.C.Lindsay

dik@cwi.nl (Dik T. Winter) (11/12/90)

In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
 > 
 > Various defunct (ETA, Cydrome) and extant (Cray) machines have a
What Cray are you referring to?
 > trick, whereby a floating point pipe can deliver one double precision
 > result per clock, or else deliver two single precision results per
 > clock.

 > I can't think of a single micro offering this feature. I understand
 > why it's not in the unpipelined FPUs, and I understand why it's not
 > in the FPUs that emphasize 32-bit data paths. But why isn't it in any
 > micro? Is the idea dead for good, or about to come back?
Offhand I would say that the idea requires a doubling of all hardware to
do the arithmetic.  This appeared to be feasible in the 205/ETA: the
machine costs enough, and the wires for the 64-bit-wide datapath (and
128 bits wide internally) were there anyway.  Another thing to note is
that the 205/ETA offers this facility in the vector instructions (those
are memory to memory) but not in the scalar instructions (those are
register to register).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

pmk@craycos.com (Peter Klausler) (11/12/90)

In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
>Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>trick, whereby a floating point pipe can deliver one double precision
>result per clock, or else deliver two single precision results per
>clock.

No CRI or CCC machine supports this "feature".

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/12/90)

In article <2511@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> > Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>What Cray are you referring to?

I'd tell you what I was thinking of, except that apparently I wasn't
thinking at all. The Cydrome didn't do it either: reading further
into the article, the 2:1 rate comes from double-cycling a 32-bit
data path.
-- 
Don		D.C.Lindsay

msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (11/13/90)

I don't know how hard it would be to do, but I wonder how useful it
would be.  Does anyone know how useful it was on the Cyber 205?  

I wonder how you would keep such a unit busy on a micro.  Keeping a
fully pipelined version busy would require 1 instruction, 128 bits of
data, and a place to put 64 bits of results per cycle.  This would be
easier as a vector instruction in a vector machine (as in the CDC
machines).  
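
The per-cycle traffic quoted above can be checked with a little
arithmetic.  This is only a sketch, assuming two-operand, one-result
operations (an add or a multiply); the function name is made up for
illustration and does not come from any of the machines discussed.

```python
def bits_per_cycle(operand_bits, ops_per_cycle, inputs_per_op=2):
    """Return (input_bits, result_bits) needed each cycle to keep a pipe full.

    Assumes every operation reads `inputs_per_op` operands and writes one
    result, all of them `operand_bits` wide.
    """
    input_bits = operand_bits * inputs_per_op * ops_per_cycle
    result_bits = operand_bits * ops_per_cycle
    return input_bits, result_bits


# One double-precision result per clock.
double_mode = bits_per_cycle(operand_bits=64, ops_per_cycle=1)
# Two single-precision results per clock.
paired_single_mode = bits_per_cycle(operand_bits=32, ops_per_cycle=2)

print("double:", double_mode, "paired single:", paired_single_mode)
```

Under these assumptions both modes move the same number of bits per
cycle, so the 128-bits-in, 64-bits-out figure applies either way.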

A non-fully pipelined version would be easier to feed, and might have
advantages over a more-fully pipelined conventional single-precision
unit, in that you could do the same work with half the floating-point
instructions.  If instruction issue is a bottleneck, then this could
help, but there are probably easier ways.  (How about a set of fp
instructions that code for two successive fp operations?  Then you get
the same thing, and complicate control and issue instead of the fp
functional units.  This may or may not be better.) 


--
Michael Pereckas               * InterNet: m-pereckas@uiuc.edu *
just another student...          (CI$: 72311,3246)
*Jargon Dept.: Decoupled Architecture--sounds like the aftermath of a tornado*

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/13/90)

>On 12 Nov 90 16:00:09 GMT, msp33327@uxa.cso.uiuc.edu (Michael Pereckas) said:

-> I don't know how hard it would be to do, but I wonder how useful it
-> would be.  Does anyone know how useful it was on the Cyber 205?  

Because the memory bandwidth was available, this feature (double-speed
half-precision arithmetic) was very useful.  Long vector codes
(N>>1000) really did get 2:1 speed-ups when run in 32-bit precision.
Short vector codes got less speedup because the startup overhead was
effectively twice as large (although it was the same number of cycles).
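
The dependence on vector length can be sketched with the usual
"startup plus streaming" timing model.  This is a rough illustration,
not a description of the real pipelines: it assumes a fixed startup
cost in cycles, one 64-bit or two 32-bit results per cycle after that,
and a startup value (50 cycles) that is purely hypothetical.

```python
def cycles(n, startup, results_per_cycle):
    """Cycles to produce n results: fixed startup plus streaming time."""
    return startup + n / results_per_cycle


def speedup_32_vs_64(n, startup):
    """Ratio of 64-bit time to 32-bit time for a vector of length n."""
    return cycles(n, startup, 1) / cycles(n, startup, 2)


STARTUP = 50  # hypothetical startup cost in cycles, not a measured figure

for n in (10, 100, 1000, 100000):
    print(n, round(speedup_32_vs_64(n, STARTUP), 2))
```

In this model the startup cycles are the same in both modes, but in
32-bit mode they are spread over half as many streaming cycles, so the
ratio approaches 2 only as n grows large.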

A more significant problem was that the perverse floating-point
arithmetic on the machine was surprisingly inaccurate and many
calculations that were OK in 32 bits on IEEE machines would fail
miserably on the 205/ETA-10.  

For more on both the speed and accuracy, see my discussion in the paper
entitled "Some Notes on 32-bit Arithmetic on the Cyber 205 and ETA-10" in
Supercomputer (the Dutch journal), volume 5, number 2 (issue #24),
March, 1988.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

seanf@sco.COM (Sean Fagan) (11/13/90)

In article <11054@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>Various defunct (ETA, Cydrome) and extant (Cray) machines have a
>trick, whereby a floating point pipe can deliver one double precision
>result per clock, or else deliver two single precision results per
>clock.

Uhm... not a Cray.  Crays have one size of fp:  word.  64 bits.  There are
some things you can do (and the hardware helps a little bit, I believe) to
get 128-bit fp numbers, but it's done mostly in software.

-- 
-----------------+
Sean Eric Fagan  | "*Never* knock on Death's door:  ring the bell and 
seanf@sco.COM    |   run away!  Death hates that!"
uunet!sco!seanf  |     -- Dr. Mike Stratford (Matt Frewer, "Doctor, Doctor")
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (11/14/90)

> trick, whereby a floating point pipe can deliver one double precision
> result per clock, or else deliver two single precision results per
> clock.

Private email has turned up:

>The <..> processor does this. The next generation <..> will not; it
>costs more silicon and more critical paths to do it than to build
>separate units.

-- 
Don		D.C.Lindsay