[comp.arch] hardware complex arithmetic support

jk3k+@andrew.cmu.edu (Joe Keane) (08/15/89)

Here's a possibly stupid question, but i've never seen anything about this.

Are there any architectures that have hardware support for complex arithmetic?
For example, dividing two complex numbers (in rectangular form) takes 6
multiplications, 2 divisions, and 3 add/subtracts, if i count correctly.  With
appropriate functional units that can be hooked up in the right way, this could
be done in a couple of clock cycles.

Are there machines with complex arithmetic instructions?  In a more RISC vein,
what instructions are out there that can be used for fast complex arithmetic?
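
(For concreteness, a plain C sketch of the rectangular division being
counted above; the type and function names are illustrative only, and it
ignores the overflow/underflow issues raised later in the thread.)

typedef struct { double re, im; } cplx;

/* (a+bi)/(c+di): 6 multiplies, 2 divides, 3 add/subtracts */
static cplx cdiv(cplx x, cplx y)
{
    double denom = y.re * y.re + y.im * y.im;      /* 2 mul, 1 add */
    cplx q;
    q.re = (x.re * y.re + x.im * y.im) / denom;    /* 2 mul, 1 add, 1 div */
    q.im = (x.im * y.re - x.re * y.im) / denom;    /* 2 mul, 1 sub, 1 div */
    return q;
}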

beyer@cbnewsh.ATT.COM (jean-david.beyer) (08/15/89)

In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu>, jk3k+@andrew.cmu.edu (Joe Keane) writes:
> Here's a possibly stupid question, but i've never seen anything about this.

I do not believe that there are NOW any machines that have complex number
operations in the hardware. At Bell Labs, either just before, during, or
slightly after World War II they built a series of computers, using electro-
mechanical relays as the logical elements. These machines were known as the
Model I, II, ... VI. I think they built 2 of one of them. Since they were
interested in calculating electric wave filters, and such things, they had
to do a lot of complex number operations. At least one of these machines
supported complex arithmetic directly. I do not suppose these machines
exist any more. They were probably faster than the desk calculators
(also mechanical) of the day, but were nowhere near as fast as the vacuum
tube machines that replaced them.

-- 
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer

davidsen@sungod.crd.ge.com (ody) (08/15/89)

In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:

| Are there any architectures that have hardware support for complex arithmetic?
| For example, dividing two complex numbers (in rectangular form) takes 6
| multiplications, 2 divisions, and 3 add/subtracts, if i count correctly.  With
| appropriate functional units that can be hooked up in the right way, this could
| be done in a couple of clock cycles.

  Actually I didn't write the pseudo-microcode, but I *think* the
minimum time is the sum of mpy,div,add for the longest path. And none of
those happen in a few cycles unless you pipeline.

  Now, assuming that you did build a machine which had the right
hardware, if you have a vector long enough you can do anything in just
over one clock cycle per element.

  Why hasn't anyone built a complex FPU? This seems like a reasonable
thing to do, in terms of being common. It could probably be built into
one of the existing micro FPUs without too much trouble (obviously needs
more microcode), but I doubt that you win much because the chip doesn't
have the power to do a lot in parallel.
	bill davidsen		(davidsen@crdos1.crd.GE.COM)
  {uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/15/89)

In article <1672@crdgw1.crd.ge.com> davidsen@crdos1.UUCP 
	(bill davidsen) writes:
>Why hasn't anyone built a complex FPU? This seems like a reasonable
>thing to do, in term of being common.

Complex-arithmetic hardware is extremely common, actually. The
most-used Fourier Transform algorithm involves complex arithmetic.
So, FFT boxes have direct hardware support.

The inner loop of the usual FFT involves something called a
"butterfly", which is mostly just a complex multiplication.  The
high-end boxes have a pipelined butterfly unit, containing multiple
floating point units.
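
(To make the "butterfly" concrete, here is a plain C sketch of the radix-2
decimation-in-time step, assuming a simple struct for complex values; real
FFT boxes implement this as a pipelined hardware unit, not as C.)

typedef struct { double re, im; } cplx;

/* One butterfly: t = w*b, then b = a - t and a = a + t.  The complex
   multiply dominates: 4 multiplies and 2 add/subtracts, plus 4 more
   add/subtracts for the two complex additions. */
static void butterfly(cplx *a, cplx *b, cplx w)
{
    cplx t;
    t.re = w.re * b->re - w.im * b->im;
    t.im = w.re * b->im + w.im * b->re;
    b->re = a->re - t.re;
    b->im = a->im - t.im;
    a->re += t.re;
    a->im += t.im;
}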


-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science

cik@l.cc.purdue.edu (Herman Rubin) (08/16/89)

In article <1672@crdgw1.crd.ge.com>, davidsen@sungod.crd.ge.com (ody) writes:
> In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
> 
> | Are there any architectures that have hardware support for complex arithmetic?
> | For example, dividing two complex numbers (in rectangular form) takes 6
> | multiplications, 2 divisions, and 3 add/subtracts, if i count correctly.  With
> | appropriate functional units that can be hooked up in the right way, this could
> | be done in a couple of clock cycles.
> 
>   Actually I didn't write the pseudo-microcode, but I *think* the
> minimum time is the sum of mpy,div,add for the longest path. And none of
> those happen in a few cycles unless you pipeline.

With the maximum amount of parallelism, the multiplications can be done in
parallel, the adds in parallel, and the divides in parallel.  One could
even combine the multiplies and adds.  But that is it.

Hardware support for complex add would not be too difficult, nor complex
multiply.  It would be more expensive than hardware support for good
integer arithmetic, which is almost nonexistent now.

The fastest time for complex multiplication would be slightly worse than
double precision multiplication, but division would slow down as usual.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

rtrauben@cortex.Sun.COM (Richard Trauben) (08/16/89)

>In article <1672@crdgw1.crd.ge.com> davidsen@crdos1.UUCP 
>	(bill davidsen) writes:
>>Why hasn't anyone built a complex FPU? This seems like a reasonable
>>thing to do, in term of being common.
>

The "current/previous" generation of ATT and Texas Instruments digital 
signal processor (DSP) chips provide complex floating point arithmetic
primitives to speed up the inner-loop multiply/accumulate butterfly 
operation of a fast fourier transform. Such things are essential for
spectral analysis, filtering and pattern recognition (radar/speech).

-Richard

tak@eecg.toronto.edu (Mike Takefman) (08/17/89)

> In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
> Are there any architectures that have hardware support for complex arithmetic?
> For example, dividing two complex numbers (in rectangular form) takes 6
> multiplications, 2 divisions, and 3 add/subtracts, if i count correctly.  With
> appropriate functional units that can be hooked up in the right way, this could
> be done in a couple of clock cycles.

The TASP (Teamed Array Signal Processor, I think) from Motorola
Information Systems (Toronto, I think) has complex floating
point pipelines. The box is used by the military for radar work.
-- 
Michael Takefman	"I'll have whatever she ordered!"
University of Toronto		When Harry Met Sally
E.E. Computer Group
tak@eecg.toronto.edu

jbuck@epimass.EPI.COM (Joe Buck) (08/18/89)

In article <121828@sun.Eng.Sun.COM> rtrauben@sun.UUCP (Richard Trauben) writes:
>The "current/previous" generation of ATT and Texas Instruments digital 
>signal processor (DSP) chips provide complex floating point arithmetic
>primitives to speed up the inner-loop multiply/accumulate butterfly 
>operation of a fast fourier transform.

There is no support for complex arithmetic per se in the TI TMS320
family, and I don't believe there is any in the AT&T chips either.
What is supported is a magic "bit reversal addressing" mode, which
makes it easy to handle the shuffling operation required in an FFT.

It requires four multiplies, an addition, and a subtraction to do a
complex multiply; you can do this in four cycles on the TMS320C30 by
taking advantage of parallel multiplication and addition/subtraction.
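
(The operation being counted, as a plain C sketch rather than 'C30 code;
the four-cycle figure comes from issuing the add/subtract in parallel with
a multiply, as described above.)

typedef struct { double re, im; } cplx;

/* (a+bi)(c+di) = (ac - bd) + (ad + bc)i: 4 multiplies, 1 add, 1 subtract */
static cplx cmul4(cplx x, cplx y)
{
    cplx p;
    p.re = x.re * y.re - x.im * y.im;
    p.im = x.re * y.im + x.im * y.re;
    return p;
}
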
-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

njk@freja.diku.dk (Niels J|rgen Kruse) (08/18/89)

davidsen@sungod.crd.ge.com (ody) writes:

>  Why hasn't anyone built a complex FPU? This seems like a reasonable
>thing to do, in term of being common. It could probably be built into
>one of the existing micro FPUs without too much trouble (obviously needs
>more microcode), but I doubt that you win much because the chip doesn't
>have the power to do a lot in parallel.

As far as i can tell, the main advantage of hardware support
for complex arithmetic is the greater encoding density allowed
by a dedicated storage format for complex numbers.

Consider that it is meaningless from a numerical viewpoint to
represent one component of a complex number with greater
accuracy than the other.

This means that a dedicated storage format need only have *one*
exponent. Comparing such a double precision format to a conventional
representation as 2 double precision ieee numbers, 11 exponent
bits are saved and 2 hidden fraction bits are lost (because
only one fraction will be normalized in general and it may be
any of them).

This leaves 9 extra bits which can be used to extend the range
and precision, for instance 4 extra bits in each fraction and 1
extra exponent bit. This translates to more than a full decimal
digit of extra precision and larger range too.
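
(A minimal sketch of the idea in C, assuming IEEE doubles underneath; the
packed layout is hypothetical, and only the shared exponent is modelled,
not the extra fraction and exponent bits.)

#include <math.h>

/* Hypothetical shared-exponent storage: one exponent for both components.
   The larger-magnitude component sets the exponent; the smaller one is
   stored unnormalized relative to it. */
struct packed_complex {
    int    exp;                /* shared binary exponent */
    double re_frac, im_frac;   /* the larger fraction lies in [0.5, 1) */
};

static struct packed_complex pack(double re, double im)
{
    struct packed_complex p;
    int e;
    frexp(fabs(re) >= fabs(im) ? re : im, &e);
    p.exp = e;
    p.re_frac = ldexp(re, -e);
    p.im_frac = ldexp(im, -e);
    return p;
}

static void unpack(struct packed_complex p, double *re, double *im)
{
    *re = ldexp(p.re_frac, p.exp);
    *im = ldexp(p.im_frac, p.exp);
}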

The hardware cost of support for such a storage format need not
be excessive, when speed is not the main concern. A low cost
implementation could get away with only 2 extra instructions
for each complex format (single, double precision) :
a load instruction that would load the components of a complex
number into 2 regular fp registers and the reverse store
instruction. Complex arithmetic would then be done with a
sequence of the regular instructions. The main cost of this is
making registers and arithmetic units wider to accommodate the
extra precision of the complex format. However, a few bits of
extra precision on in-register arithmetic is useful for plain
fp arithmetic too, especially for authors of math libraries.

On some micro FPUs, the cost of (much) wider registers and
arithmetic units has already been paid, so i really don't see
a good reason why they don't support complex numbers.

Didn't they think of it? It seems an obvious thing to do.

Perhaps someone out there could enlighten us.
-- 
         Niels J|rgen Kruse
Email    njk@diku.dk
Mail     Tustrupvej 7, 2 tv, 2720 Vanlose, Denmark

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/18/89)

In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>As far as i can tell, the main advantage of hardware support
>for complex arithmetic is the greater encoding density allowed
>by a dedicated storage format for complex numbers.
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.
>
>This means that a dedicated storage format need only have *one*
>exponent. 

On a general-purpose machine, a special format has less value, since
some conventional floating point format must also be supported.

However, on special-purpose machines, there is a long history of this
sort of thing. In particular, many FFT boxes have been built with
"block floating point". This takes your suggestion - a single
exponent per complex number - and extends it, so that there is a
single exponent for an entire data vector.

I believe that this worked adequately for FFTs, where the data is
relatively homogeneous. The block exponent is essentially "automatic
gain control", as the analog people used to say. Mostly, this was a
halfway house between the speed/price of integer boxes, and the
convenience/generality of floating point boxes.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science

jpainter@tjp.East.Sun.COM (John Painter - Sun BOS Hardware) (08/19/89)

In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>davidsen@sungod.crd.ge.com (ody) writes:
>
>>  Why hasn't anyone built a complex FPU? This seems like a reasonable
>>thing to do, in term of being common. It could probably be built into
>>one of the existing micro FPUs without too much trouble (obviously needs
>>more microcode), but I doubt that you win much because the chip doesn't
>>have the power to do a lot in parallel.
>
>As far as i can tell, the main advantage of hardware support
>for complex arithmetic is the greater encoding density allowed
>by a dedicated storage format for complex numbers.
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.
>

Meaningless in what applications?  I might want to know phase relationships
with more precision than that offers, for example when the voltage and
current components are nearly 90 degrees out of phase.

/Tjp
-disclaimer disclaimer II SSeeee eevveerryytthhiinngg ttwwiiccee..
- (Catch it 22 times at a store near you) 


bs@linus.UUCP (Robert D. Silverman) (08/19/89)

In article <3532@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
:In article <121828@sun.Eng.Sun.COM> rtrauben@sun.UUCP (Richard Trauben) writes:
:>The "current/previous" generation of ATT and Texas Instruments digital 
:>signal processor (DSP) chips provide complex floating point arithmetic
:>primitives to speed up the inner-loop multiply/accumulate butterfly 
:>operation of a fast fourier transform.
:
:There is no support for complex arithmetic per se in the TI TMS320
:family, and I don't believe there is any in the AT&T chips either.
:What is supported is a magic "bit reversal addressing" mode, which
:makes it easy to handle the shuffling operation required in an FFT.
:
:It requires four multiplies, an addition, and a subtraction to do a
:complex multiply; you can do this in four cycles on the TMS320C30 by
:taking advantage of parallel multiplication and addition/subtraction.
 
This is not quite correct. Usually [on most hardware] multiplies take
significantly longer than additions. Two complex numbers can be
multiplied in THREE multiplications (not four) at the expense of a 
couple extra add/subtracts:

(a+bi)(c+di) = ac - bd + ((a+b)(c+d) - ac - bd)i

One computes a+b and c+d, then does 3 multiplications: ac, bd and (a+b)(c+d);
one subtraction gives the real part, and two more finish the imaginary part.

Total: 3 multiplies, 2 adds, 3 subtracts. This can be faster depending
on the relative speed of multiplication and addition. It is almost
always faster if working in double precision.
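
(The same scheme as a plain C sketch; this is the Gauss/Karatsuba-style
trick, and in floating point the extra cancellation in the imaginary part
can cost a little accuracy.)

typedef struct { double re, im; } cplx;

/* 3 multiplies, 2 adds, 3 subtracts */
static cplx cmul3(cplx x, cplx y)
{
    double ac = x.re * y.re;
    double bd = x.im * y.im;
    double t  = (x.re + x.im) * (y.re + y.im);    /* (a+b)(c+d) */
    cplx p;
    p.re = ac - bd;
    p.im = t - ac - bd;                           /* = ad + bc */
    return p;
}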

Bob Silverman

davidsen@sungod.crd.ge.com (ody) (08/19/89)

In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:

| Consider that it is meaningless from a numerical viewpoint to
| represent one component of a complex number with greater
| accuracy than the other.
| 
| This means that a dedicated storage format need only have *one*
| exponent. Comparing such a double precision format to a conventional

  Whoa! I'm missing something here, what has the exponent got to do with
accuracy? The exponent specifies the magnitude and the mantissa provides
the accuracy. If you have one exponent and the real component is large
while the imaginary component is small you will have fewer significant
digits (that's what I mean by accuracy) in one than the other.

  Could you 'splain this to me? It sounds as if you are saying that if
one component is large in magnitude we can afford to have less precision
on the other. Hope I misunderstand what you're telling me.
	bill davidsen		(davidsen@crdos1.crd.GE.COM)
  {uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

swarren@eugene.uucp (Steve Warren) (08/19/89)

In article <1758@crdgw1.crd.ge.com> davidsen@crdos1.UUCP(bill davidsen) writes:
>  Could you 'splain this to me? It sounds as if you are saying that if
>one component is large in magnitude we can afford to have less precision
>on the other. Hope I misunderstand what you're telling me.
>	bill davidsen		(davidsen@crdos1.crd.GE.COM)
>  {uunet | philabs}!crdgw1!crdos1!davidsen
>"Stupidity, like virtue, is its own reward" -me

Think of it as a vector.  Changing the mantissa of the component that is
orders of magnitude smaller is not going to move the vector significantly.

For example, if the magnitude of the smaller (call it V1) of the two
components is less than the least significant bit of the larger component
(call it V2), then the truncation error introduced by V2 is greater than
the error introduced by eliminating V1 entirely.  V1 therefore should be
truncated at the least significant position in V2.
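
(A small numerical illustration of the point, assuming IEEE double
precision; the values are arbitrary.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    double v2 = 1.0;                        /* larger component */
    double ulp = nextafter(v2, 2.0) - v2;   /* one ulp of v2, about 2.2e-16 */
    double v1 = 1.0e-17;                    /* smaller component, below ulp(v2) */

    /* v2's own rounding error can be as large as ulp/2, which already
       exceeds |v1|, so dropping v1 perturbs the pair by less than the
       rounding of v2 already may have. */
    printf("ulp(v2)/2 = %g  |v1| = %g\n", ulp / 2.0, v1);
    return 0;
}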

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM

cik@l.cc.purdue.edu (Herman Rubin) (08/19/89)

In article <1549@convex.UUCP>, swarren@eugene.uucp (Steve Warren) writes:
> In article <1758@crdgw1.crd.ge.com> davidsen@crdos1.UUCP(bill davidsen) writes:
< <  Could you 'splain this to me? It sounds as if you are saying that if
< <one component is large in magnitude we can afford to have less precision
< <on the other. Hope I misunderstand what you're telling me.
< <	bill davidsen		(davidsen@crdos1.crd.GE.COM)
< <  {uunet | philabs}!crdgw1!crdos1!davidsen
< <"Stupidity, like virtue, is its own reward" -me
> 
> Think of it as a vector.  Changing the mantissa of the component that is
> orders of magnitude smaller is not going to move the vector significantly.
> 
> For example, if the magnitude of the smaller (call it V1) of the two
> components is less than the least significant bit of the larger component
> (call it V2), then the truncation error introduced by V2 is greater than
> the error introduced by eliminating V1 entirely.  V1 therefore should be
> truncated at the least significant position in V2.

There are situations, like computing FTs (fast or slow) where this is the case.
There are other situations where this is not the case.  I needed the complex
error function with good relative error in both the real and imaginary parts
for the case in which the real part of the argument was small.  It was 
necessary for me to develop new algorithms to accomplish this; the existing
algorithms with good relative error were quite poor at this.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

pmontgom@sonia.math.ucla.edu (Peter Montgomery) (08/22/89)

In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.
>
>This means that a dedicated storage format need only have *one*
>exponent. Comparing such a double precision format to a conventional
>representation as 2 double precision ieee numbers, 11 exponent
>bits are saved and 2 hidden fraction bits are lost (because
>only one fraction will be normalized in general and it may be
>any of them).
>
>This leaves 9 extra bits which can be used to extend the range
>and precision, for instance 4 extra bits in each fraction and 1
>extra exponent bit. This translates to more than a full decimal
>digit of extra precision and larger range too.
>
>The hardware cost of support for such a storage format need not
>be excessive, when speed is not the main concern. A low cost
>implementation could get away with only 2 extra instructions
>for each complex format (single, double precision) :
>a load instruction that would load the components of a complex
>number into 2 regular fp registers and the reverse store
>instruction. Complex arithmetic would then be done with a
>sequence of the regular instructions. The main cost of this is
>making registers and arithmetic units wider to accommodate the
>extra precision of the complex format. However, a few bits of
>extra precision on in-register arithmetic is useful for plain
>fp arithmetic too, especially for authors of math libraries.

	If the application environment is expected to make heavy use of
complex arithmetic, then it is reasonable to supply the fundamental
arithmetic operations (e.g., load/store, load conjugate, indexing into 
complex arrays, addition, subtraction, multiplication, 
multiplication by conjugate, multiplication by real number,
absolute value, equality/inequality test) directly in hardware.  
Doing division in hardware will be especially nice if it frees
the software from concern about exponent overflow (e.g.,
a quotient z1/z2 may be well-defined even though the intermediate results
z1*CONJUGATE(z2) and z2*CONJUGATE(z2) overflow or underflow).
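
(For comparison, the usual software workaround is to scale by the
larger-magnitude component of the divisor before forming any products, as
in the classic scheme often attributed to R. Smith (1962); a plain C
sketch, ignoring a zero divisor.)

#include <math.h>

typedef struct { double re, im; } cplx;

/* z1/z2 without forming z2*conj(z2) explicitly, so the intermediate
   values stay near the magnitude of the result */
static cplx cdiv_scaled(cplx x, cplx y)
{
    cplx q;
    if (fabs(y.re) >= fabs(y.im)) {
        double r = y.im / y.re;            /* |r| <= 1 */
        double d = y.re + y.im * r;
        q.re = (x.re + x.im * r) / d;
        q.im = (x.im - x.re * r) / d;
    } else {
        double r = y.re / y.im;
        double d = y.im + y.re * r;
        q.re = (x.re * r + x.im) / d;
        q.im = (x.im * r - x.re) / d;
    }
    return q;
}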

	However, I cannot approve of using different storage conventions for
the components of a complex number than those used for ordinary floating point 
data.  Some of my reasons are:

    i)  I dispute that the real and imaginary components of a 
	complex number will always have comparable exponents.
        For example, after taking the principal logarithm of a 
        complex number, the real part of the result can have any
        magnitude while the imaginary part will be between -PI and +PI.  
	As another example, some of my fellow number theorists are
	studying the zeros of the Riemann zeta function; these zeros
	are known to have real part between 0 and 1 (and believed
	to always be exactly 0.5), yet the imaginary parts can get
	arbitrarily large.

   ii)  The FORTRAN language allows equivalencing of real and
        complex arrays, and requires the numbers have identical format.
        Many existing programs using complex arithmetic are coded in FORTRAN.
	Programs in other languages may also be dependent upon this when
	they pass a pointer to a component of a complex number.

 iii)   If the components of a complex number have more precision
        than regular numbers, it will severely complicate operations which
        look at the individual components.  For example, given a statement

               if (REAL(z1) > REAL(z2)) then ...

        will the compiler need to reduce the precision of z1 and z2 as it
        does the compares?  Niels suggests extra precision for in-register
	arithmetic, but that is inadequate unless those registers can be
	copied to memory and restored as needed.  For example, when converting 
	from ASCII to complex, how will the compiler (or run time library)
	arrange to convert one component from ASCII to extended precision, 
        save this value, convert the other component, and then merge the 
        two components into a complex number?  Different precisions
	for register arithmetic and storage arithmetic can also confuse
	the programmer; for example

		real x, y, sum

		sum = x+y
		if (sum .eq. x+y) then
		    ...

	may skip the "..." code if the compiler stores "sum" in memory
	and reloads it, because of truncation during the store.

   iv)  If the representations are identical, then a smart compiler 
        can use the complex arithmetic instructions even when the 
	original source code uses non-complex data.  For example, given

		real array1(10), array2(10), scalar

		array1 = array1 + scalar*array2      (array assignment)
	
        a machine with complex arithmetic support but no vector support
        can view each array as five complex numbers rather than ten real 
        numbers, and use the complex arithmetic instructions for the operation.
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 

bmw@isgtec.UUCP (Bruce Walker) (08/29/89)

In article <64429@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
>In article <3532@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>:It requires four multiplies, an addition, and a subtraction to do a
>:complex multiply; you can do this in four cycles on the TMS320C30 by
>:taking advantage of parallel multiplication and addition/subtraction.
> 
>This is not quite correct. Usually [on most hardware] multiplies take
>significantly longer than additions.

Ah, but this is what makes DSP's special: if the registers are loaded,
multiplies take one cycle.  And since most instructions allow you to
load/store/accumulate in the same cycle (as the multiply), dense
instruction sequences can be created (not usually by compilers :-)
where adds, subtracts and multiplies all cost the same.

At least, such is the case with the tms320c25, with which I am most
familiar.

-- 
Bruce Walker                          ...uunet!mnetor!lsuc!isgtec!bmw
"Better Living Through Connectivity"         ...utzoo!lsuc!isgtec!bmw
ISG Technologies Inc. 3030 Orlando Dr. Mississauga. Ont. Can. L4V 1S8