jk3k+@andrew.cmu.edu (Joe Keane) (08/15/89)
Here's a possibly stupid question, but I've never seen anything about this. Are there any architectures that have hardware support for complex arithmetic? For example, dividing two complex numbers (in rectangular form) takes 6 multiplications, 2 divisions, and 3 add/subtracts, if I count correctly. With appropriate functional units that can be hooked up in the right way, this could be done in a couple of clock cycles. Are there machines with complex arithmetic instructions? In a more RISC vein, what instructions are out there that can be used for fast complex arithmetic?
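To make the operation count concrete, here is a minimal C sketch of textbook rectangular-form division; the function and variable names are illustrative, not taken from any machine mentioned in this thread:

    /* Divide (a+bi) by (c+di) in rectangular form:
     * 6 multiplies, 2 divides, 3 add/subtracts. */
    void cdiv(double a, double b, double c, double d,
              double *re, double *im)
    {
        double denom = c*c + d*d;      /* 2 mul, 1 add        */
        *re = (a*c + b*d) / denom;     /* 2 mul, 1 add, 1 div */
        *im = (b*c - a*d) / denom;     /* 2 mul, 1 sub, 1 div */
    }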
beyer@cbnewsh.ATT.COM (jean-david.beyer) (08/15/89)
In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu>, jk3k+@andrew.cmu.edu (Joe Keane) writes:
> Here's a possibly stupid question, but I've never seen anything about this.

I do not believe that there are NOW any machines that have complex number operations in the hardware.

At Bell Labs, either just before, during, or slightly after World War II, they built a series of computers using electromechanical relays as the logical elements. These machines were known as the Model I, II, ... VI. I think they built 2 of one of them. Since they were interested in calculating electric wave filters and such things, they had to do a lot of complex number operations. At least one of these machines supported complex arithmetic directly.

I do not suppose these machines exist any more. They were probably faster than the desk calculators (also mechanical) of the day, but were nowhere near as fast as the vacuum tube machines that replaced them.
-- 
Jean-David Beyer
AT&T Bell Laboratories
Holmdel, New Jersey, 07733
attunix!beyer
davidsen@sungod.crd.ge.com (ody) (08/15/89)
In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
| Are there any architectures that have hardware support for complex arithmetic?
| For example, dividing two complex numbers (in rectangular form) takes 6
| multiplications, 2 divisions, and 3 add/subtracts, if I count correctly. With
| appropriate functional units that can be hooked up in the right way, this could
| be done in a couple of clock cycles.

Actually I didn't write the pseudo-microcode, but I *think* the minimum time is the sum of mpy, div, and add for the longest path. And none of those happen in a few cycles unless you pipeline. Now, assuming that you did build a machine which had the right hardware: if you have a vector long enough, you can do anything in just over one clock cycle per element.

Why hasn't anyone built a complex FPU? This seems like a reasonable thing to do, in terms of being common. It could probably be built into one of the existing micro FPUs without too much trouble (obviously it needs more microcode), but I doubt that you win much, because the chip doesn't have the power to do a lot in parallel.

bill davidsen (davidsen@crdos1.crd.GE.COM)
{uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/15/89)
In article <1672@crdgw1.crd.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>Why hasn't anyone built a complex FPU? This seems like a reasonable
>thing to do, in terms of being common.

Complex-arithmetic hardware is extremely common, actually. The most-used Fourier Transform algorithm involves complex arithmetic, so FFT boxes have direct hardware support. The inner loop of the usual FFT involves something called a "butterfly", which is mostly just a complex multiplication. The high-end boxes have a pipelined butterfly unit, containing multiple floating point units.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
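For readers who haven't seen one, a radix-2 butterfly is roughly the following. This is a plain C sketch of the textbook operation (names are mine), not the microcode of any of the boxes mentioned:

    /* Radix-2 FFT butterfly: one complex multiply by the
     * twiddle factor w, then a complex add and subtract. */
    void butterfly(double *ar, double *ai,   /* a, updated in place */
                   double *br, double *bi,   /* b, updated in place */
                   double wr, double wi)     /* twiddle factor w    */
    {
        double tr = *br * wr - *bi * wi;     /* t = w * b  */
        double ti = *br * wi + *bi * wr;
        *br = *ar - tr;                      /* b' = a - t */
        *bi = *ai - ti;
        *ar = *ar + tr;                      /* a' = a + t */
        *ai = *ai + ti;
    }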
cik@l.cc.purdue.edu (Herman Rubin) (08/16/89)
In article <1672@crdgw1.crd.ge.com>, davidsen@sungod.crd.ge.com (ody) writes:
> In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
> | Are there any architectures that have hardware support for complex arithmetic?
> | For example, dividing two complex numbers (in rectangular form) takes 6
> | multiplications, 2 divisions, and 3 add/subtracts, if I count correctly. With
> | appropriate functional units that can be hooked up in the right way, this could
> | be done in a couple of clock cycles.
>
> Actually I didn't write the pseudo-microcode, but I *think* the
> minimum time is the sum of mpy, div, and add for the longest path. And none of
> those happen in a few cycles unless you pipeline.

With the maximum amount of parallelism, the multiplications can be done in parallel, the adds in parallel, and the divides in parallel. One could even combine the multiplies and adds. But that is it.

Hardware support for complex add would not be too difficult, nor complex multiply. It would be more expensive than hardware support for good integer arithmetic, which is almost nonexistent now. The fastest time for complex multiplication would be slightly worse than double precision multiplication, but division would slow down as usual.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
rtrauben@cortex.Sun.COM (Richard Trauben) (08/16/89)
>In article <1672@crdgw1.crd.ge.com> davidsen@crdos1.UUCP
> (bill davidsen) writes:
>>Why hasn't anyone built a complex FPU? This seems like a reasonable
>>thing to do, in terms of being common.

The "current/previous" generation of AT&T and Texas Instruments digital signal processor (DSP) chips provide complex floating point arithmetic primitives to speed up the inner-loop multiply/accumulate butterfly operation of a fast Fourier transform. Such things are essential for spectral analysis, filtering and pattern recognition (radar/speech).

-Richard
tak@eecg.toronto.edu (Mike Takefman) (08/17/89)
> In article <kYtuglW00V4G01l=ZK@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
> Are there any architectures that have hardware support for complex arithmetic?
> For example, dividing two complex numbers (in rectangular form) takes 6
> multiplications, 2 divisions, and 3 add/subtracts, if I count correctly. With
> appropriate functional units that can be hooked up in the right way, this could
> be done in a couple of clock cycles.

The TASP (Teamed Array Signal Processor, I think) from Motorola Information Systems (Toronto, I think) has complex floating point pipelines. The box is used by the military for radar work.
-- 
Michael Takefman             "I'll have whatever she ordered!"
University of Toronto         When Harry Met Sally
E.E. Computer Group           tak@eecg.toronto.edu
jbuck@epimass.EPI.COM (Joe Buck) (08/18/89)
In article <121828@sun.Eng.Sun.COM> rtrauben@sun.UUCP (Richard Trauben) writes:
>The "current/previous" generation of AT&T and Texas Instruments digital
>signal processor (DSP) chips provide complex floating point arithmetic
>primitives to speed up the inner-loop multiply/accumulate butterfly
>operation of a fast Fourier transform.

There is no support for complex arithmetic per se in the TI TMS320 family, and I don't believe there is any in the AT&T chips either. What is supported is a magic "bit reversal addressing" mode, which makes it easy to handle the shuffling operation required in an FFT.

It requires four multiplies, an addition, and a subtraction to do a complex multiply; you can do this in four cycles on the TMS320C30 by taking advantage of parallel multiplication and addition/subtraction.
-- 
Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck
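The bit reversal addressing mode replaces a software loop like the one below; the hardware produces the reversed index for free as a side effect of address arithmetic. A C sketch (the function name is mine):

    /* Bit-reversed index for an n = 2^bits point FFT: the
     * input shuffle pairs element i with bit_reverse(i, bits). */
    unsigned bit_reverse(unsigned i, int bits)
    {
        unsigned r = 0;
        while (bits-- > 0) {
            r = (r << 1) | (i & 1);   /* move low bit of i...  */
            i >>= 1;                  /* ...to the bottom of r */
        }
        return r;
    }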
njk@freja.diku.dk (Niels J|rgen Kruse) (08/18/89)
davidsen@sungod.crd.ge.com (ody) writes:
> Why hasn't anyone built a complex FPU? This seems like a reasonable
>thing to do, in terms of being common. It could probably be built into
>one of the existing micro FPUs without too much trouble (obviously needs
>more microcode), but I doubt that you win much because the chip doesn't
>have the power to do a lot in parallel.

As far as I can tell, the main advantage of hardware support for complex arithmetic is the greater encoding density allowed by a dedicated storage format for complex numbers.

Consider that it is meaningless from a numerical viewpoint to represent one component of a complex number with greater accuracy than the other. This means that a dedicated storage format need only have *one* exponent. Comparing such a double precision format to a conventional representation as 2 double precision IEEE numbers, 11 exponent bits are saved and 2 hidden fraction bits are lost (because only one fraction will be normalized in general, and it may be either of them). This leaves 9 extra bits which can be used to extend the range and precision, for instance 4 extra bits in each fraction and 1 extra exponent bit. This translates to more than a full decimal digit of extra precision, and larger range too.

The hardware cost of support for such a storage format need not be excessive, when speed is not the main concern. A low cost implementation could get away with only 2 extra instructions for each complex format (single, double precision): a load instruction that would load the components of a complex number into 2 regular fp registers, and the reverse store instruction. Complex arithmetic would then be done with a sequence of the regular instructions. The main cost of this is making registers and arithmetic units wider to accommodate the extra precision of the complex format. However, a few bits of extra precision on in-register arithmetic is useful for plain fp arithmetic too, especially for authors of math libraries.

On some micro FPUs, the cost of (much) wider registers and arithmetic units has already been paid, so I really don't see a good reason why they don't support complex numbers. Didn't they think of it? It seems an obvious thing to do. Perhaps someone out there could enlighten us.
-- 
Niels J|rgen Kruse
Email: njk@diku.dk
Mail:  Tustrupvej 7, 2 tv, 2720 Vanlose, Denmark
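As a rough illustration of the proposal, here is what a shared-exponent load/store pair might do, mimicked in portable C with frexp/ldexp. Only the exponent sharing is modeled (not the narrower fraction fields); the type and names are hypothetical, and zeros and infinities are ignored:

    #include <math.h>

    /* Hypothetical shared-exponent complex value: one exponent,
     * two fractions.  Only the larger component's fraction is
     * normalized; the smaller one loses low-order bits. */
    typedef struct { int exp; double f_re, f_im; } packed_cx;

    packed_cx pack(double re, double im)
    {
        packed_cx p;
        int er, ei;
        frexp(re, &er);                 /* exponent of each part    */
        frexp(im, &ei);
        p.exp = (er > ei) ? er : ei;    /* keep the larger one      */
        p.f_re = ldexp(re, -p.exp);     /* fractions in (-1,1); the */
        p.f_im = ldexp(im, -p.exp);     /* smaller is unnormalized  */
        return p;
    }

    double unpack_re(packed_cx p) { return ldexp(p.f_re, p.exp); }
    double unpack_im(packed_cx p) { return ldexp(p.f_im, p.exp); }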
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (08/18/89)
In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>As far as I can tell, the main advantage of hardware support
>for complex arithmetic is the greater encoding density allowed
>by a dedicated storage format for complex numbers.
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.
>
>This means that a dedicated storage format need only have *one*
>exponent.

On a general-purpose machine, a special format has less value, since some conventional floating point format must also be supported. However, on special-purpose machines, there is a long history of this sort of thing.

In particular, many FFT boxes have been built with "block floating point". This takes your suggestion - a single exponent per complex number - and extends it, so that there is a single exponent for an entire data vector. I believe that this worked adequately for FFTs, where the data is relatively homogeneous. The block exponent is essentially "automatic gain control", as the analog people used to say.

Mostly, this was a halfway house between the speed/price of integer boxes, and the convenience/generality of floating point boxes.
-- 
Don		D.C.Lindsay 	Carnegie Mellon School of Computer Science
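Block floating point amounts to something like this after each FFT pass. An integer sketch, assuming a 16-bit word, an arithmetic right shift, and a headroom threshold picked for illustration; real boxes do this in the pipeline:

    /* One exponent for the whole block: rescale the vector when
     * any element nears overflow, counting the shifts. */
    void block_normalize(short *x, int n, int *block_exp)
    {
        int i, max = 0;
        for (i = 0; i < n; i++) {
            int a = x[i] < 0 ? -x[i] : x[i];
            if (a > max) max = a;
        }
        while (max >= 0x2000) {        /* leave headroom for the  */
            for (i = 0; i < n; i++)    /* next pass's additions   */
                x[i] >>= 1;
            max >>= 1;
            ++*block_exp;              /* the "gain control" term */
        }
    }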
jpainter@tjp.East.Sun.COM (John Painter - Sun BOS Hardware) (08/19/89)
In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>davidsen@sungod.crd.ge.com (ody) writes:
>
>> Why hasn't anyone built a complex FPU? This seems like a reasonable
>>thing to do, in terms of being common. It could probably be built into
>>one of the existing micro FPUs without too much trouble (obviously needs
>>more microcode), but I doubt that you win much because the chip doesn't
>>have the power to do a lot in parallel.
>
>As far as I can tell, the main advantage of hardware support
>for complex arithmetic is the greater encoding density allowed
>by a dedicated storage format for complex numbers.
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.

Meaningless in what applications? I might want to know phase relationships with more precision than a shared exponent offers, when the voltage and current components are nearly 90 degrees out of phase.

/Tjp
-disclaimer disclaimer II
SSeeee eevveerryytthhiinngg ttwwiiccee.. - (Catch it 22 times at a store near you)
The following lines are to satisfy Mr. Mailer:
<n> now You peeked! Bad ... Bad news reader Baa....dddd!
bs@linus.UUCP (Robert D. Silverman) (08/19/89)
In article <3532@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
:In article <121828@sun.Eng.Sun.COM> rtrauben@sun.UUCP (Richard Trauben) writes:
:>The "current/previous" generation of AT&T and Texas Instruments digital
:>signal processor (DSP) chips provide complex floating point arithmetic
:>primitives to speed up the inner-loop multiply/accumulate butterfly
:>operation of a fast Fourier transform.
:
:There is no support for complex arithmetic per se in the TI TMS320
:family, and I don't believe there is any in the AT&T chips either.
:What is supported is a magic "bit reversal addressing" mode, which
:makes it easy to handle the shuffling operation required in an FFT.
:
:It requires four multiplies, an addition, and a subtraction to do a
:complex multiply; you can do this in four cycles on the TMS320C30 by
:taking advantage of parallel multiplication and addition/subtraction.

This is not quite correct. Usually [on most hardware] multiplies take significantly longer than additions. Two complex numbers can be multiplied in THREE multiplications (not four) at the expense of a couple of extra add/subtracts:

    (a+bi)(c+di) = ac - bd + ((a+b)(c+d) - ac - bd) i

One computes a+b and c+d (2 adds), then does 3 multiplications: ac, bd, and (a+b)(c+d); 3 subtracts finish it. Total: 3 multiplies, 2 adds, 3 subtracts. This can be faster depending on the relative speed of multiplication and addition. It is almost always faster if working in double precision.

Bob Silverman
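Transcribed directly into C, the identity becomes the sketch below (names mine). Note the trade is not entirely free numerically: the (a+b)(c+d) term can cancel badly, so this form can be slightly less accurate than the four-multiply form.

    /* (a+bi)(c+di) with 3 multiplies, 2 adds, 3 subtracts. */
    void cmul3(double a, double b, double c, double d,
               double *re, double *im)
    {
        double t1 = a * c;
        double t2 = b * d;
        double t3 = (a + b) * (c + d);
        *re = t1 - t2;          /* real part: ac - bd      */
        *im = t3 - t1 - t2;     /* imaginary part: ad + bc */
    }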
davidsen@sungod.crd.ge.com (ody) (08/19/89)
In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
| Consider that it is meaningless from a numerical viewpoint to
| represent one component of a complex number with greater
| accuracy than the other.
|
| This means that a dedicated storage format need only have *one*
| exponent. Comparing such a double precision format to a conventional

Whoa! I'm missing something here; what has the exponent got to do with accuracy? The exponent specifies the magnitude and the mantissa provides the accuracy. If you have one exponent, and the real component is large while the imaginary component is small, you will have fewer significant digits (that's what I mean by accuracy) in one than the other.

Could you 'splain this to me? It sounds as if you are saying that if one component is large in magnitude we can afford to have less precision on the other. Hope I misunderstand what you're telling me.

bill davidsen (davidsen@crdos1.crd.GE.COM)
{uunet | philabs}!crdgw1!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
swarren@eugene.uucp (Steve Warren) (08/19/89)
In article <1758@crdgw1.crd.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
> Could you 'splain this to me? It sounds as if you are saying that if
>one component is large in magnitude we can afford to have less precision
>on the other. Hope I misunderstand what you're telling me.

Think of it as a vector. Changing the mantissa of the component that is orders of magnitude smaller is not going to move the vector significantly.

For example, if the magnitude of the smaller of the two components (call it V1) is less than the least significant bit of the larger component (call it V2), then the representation error already present in V2 is greater than the error introduced by eliminating V1 entirely. V1 therefore should be truncated at the least significant position in V2.

--Steve
{uunet,sun}!convex!swarren; swarren@convex.COM
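The effect is easy to demonstrate with IEEE doubles, assuming your machine has them:

    #include <stdio.h>

    int main(void)
    {
        double v2 = 9007199254740992.0; /* 2^53: doubles here are */
        double v1 = 0.75;               /* spaced 2.0 apart, so   */
                                        /* anything under half an */
                                        /* LSB simply disappears  */
        printf("%d\n", v1 + v2 == v2);  /* prints 1 */
        return 0;
    }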
cik@l.cc.purdue.edu (Herman Rubin) (08/19/89)
In article <1549@convex.UUCP>, swarren@eugene.uucp (Steve Warren) writes:
> In article <1758@crdgw1.crd.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
> > Could you 'splain this to me? It sounds as if you are saying that if
> >one component is large in magnitude we can afford to have less precision
> >on the other. Hope I misunderstand what you're telling me.
>
> Think of it as a vector. Changing the mantissa of the component that is
> orders of magnitude smaller is not going to move the vector significantly.
>
> For example, if the magnitude of the smaller of the two components (call
> it V1) is less than the least significant bit of the larger component
> (call it V2), then the representation error already present in V2 is
> greater than the error introduced by eliminating V1 entirely. V1 therefore
> should be truncated at the least significant position in V2.

There are situations, like computing FTs (fast or slow), where this is the case. There are other situations where this is not the case. I needed the complex error function with good relative error in both the real and imaginary parts for the case in which the real part of the argument was small. It was necessary for me to develop new algorithms to accomplish this; the existing algorithms with good relative error were quite poor at this.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
pmontgom@sonia.math.ucla.edu (Peter Montgomery) (08/22/89)
In article <4781@freja.diku.dk> njk@freja.diku.dk (Niels J|rgen Kruse) writes:
>
>Consider that it is meaningless from a numerical viewpoint to
>represent one component of a complex number with greater
>accuracy than the other.
>
>This means that a dedicated storage format need only have *one*
>exponent. Comparing such a double precision format to a conventional
>representation as 2 double precision IEEE numbers, 11 exponent
>bits are saved and 2 hidden fraction bits are lost (because
>only one fraction will be normalized in general and it may be
>either of them).
>
>This leaves 9 extra bits which can be used to extend the range
>and precision, for instance 4 extra bits in each fraction and 1
>extra exponent bit. This translates to more than a full decimal
>digit of extra precision and larger range too.
>
>The hardware cost of support for such a storage format need not
>be excessive, when speed is not the main concern. A low cost
>implementation could get away with only 2 extra instructions
>for each complex format (single, double precision):
>a load instruction that would load the components of a complex
>number into 2 regular fp registers and the reverse store
>instruction. Complex arithmetic would then be done with a
>sequence of the regular instructions. The main cost of this is
>making registers and arithmetic units wider to accommodate the
>extra precision of the complex format. However, a few bits of
>extra precision on in-register arithmetic is useful for plain
>fp arithmetic too, especially for authors of math libraries.

If the application environment is expected to make heavy use of complex arithmetic, then it is reasonable to supply the fundamental arithmetic operations (e.g., load/store, load conjugate, indexing into complex arrays, addition, subtraction, multiplication, multiplication by conjugate, multiplication by real number, absolute value, equality/inequality test) directly in hardware. Doing division in hardware will be especially nice if it frees the software from concern about exponent overflow (e.g., a quotient z1/z2 may be well-defined even though the intermediate results z1*CONJUGATE(z2) and z2*CONJUGATE(z2) overflow or underflow).

However, I cannot approve of using different storage conventions for the components of a complex number than those used for ordinary floating point data. Some of my reasons are:

i) I dispute that the real and imaginary components of a complex number will always have comparable exponents. For example, after taking the principal logarithm of a complex number, the real part of the result can have any magnitude while the imaginary part will be between -PI and +PI. As another example, some of my fellow number theorists are studying the zeros of the Riemann zeta function; these zeros are known to have real part between 0 and 1 (and believed to always be exactly 0.5), yet the imaginary parts can get arbitrarily large.

ii) The FORTRAN language allows equivalencing of real and complex arrays, and requires the numbers have identical format. Many existing programs using complex arithmetic are coded in FORTRAN. Programs in other languages may also be dependent upon this when they pass a pointer to a component of a complex number.

iii) If the components of a complex number have more precision than regular numbers, it will severely complicate operations which look at the individual components. For example, given a statement

        if (REAL(z1) > REAL(z2)) then ...

will the compiler need to reduce the precision of z1 and z2 as it does the comparison?
Niels suggests extra precision for in-register arithmetic, but that is inadequate unless those registers can be copied to memory and restored as needed. For example, when converting from ASCII to complex, how will the compiler (or run time library) arrange to convert one component from ASCII to extended precision, save this value, convert the other component, and then merge the two components into a complex number? Different precisions for register arithmetic and storage arithmetic can also confuse the programmer; for example

        real x, y, sum
        sum = x+y
        if (sum .eq. x+y) then ...

may skip the "..." code if the compiler stores "sum" in memory and reloads it, because of truncation during the store.

iv) If the representations are identical, then a smart compiler can use the complex arithmetic instructions even when the original source code uses non-complex data. For example, given

        real array1(10), array2(10), scalar
        array1 = array1 + scalar*array2      (array assignment)

a machine with complex arithmetic support but no vector support can view each array as five complex numbers rather than ten real numbers, and use the complex arithmetic instructions for the operation.
-- 
Peter Montgomery
pmontgom@MATH.UCLA.EDU
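The overflow-robust division Peter alludes to is usually done by scaling with whichever component of the divisor is larger; I believe the method is due to R. L. Smith. A C sketch, with names mine:

    #include <math.h>

    /* (a+bi)/(c+di) without forming c*c + d*d, so the intermediates
     * do not overflow merely because the sum of squares would. */
    void cdiv_safe(double a, double b, double c, double d,
                   double *re, double *im)
    {
        if (fabs(c) >= fabs(d)) {
            double r = d / c;           /* |r| <= 1           */
            double s = c + d * r;       /* scaled denominator */
            *re = (a + b * r) / s;
            *im = (b - a * r) / s;
        } else {
            double r = c / d;           /* |r| < 1            */
            double s = c * r + d;
            *re = (a * r + b) / s;
            *im = (b * r - a) / s;
        }
    }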
bmw@isgtec.UUCP (Bruce Walker) (08/29/89)
In article <64429@linus.UUCP> bs@linus.UUCP (Robert D. Silverman) writes:
>In article <3532@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>:It requires four multiplies, an addition, and a subtraction to do a
>:complex multiply; you can do this in four cycles on the TMS320C30 by
>:taking advantage of parallel multiplication and addition/subtraction.
>
>This is not quite correct. Usually [on most hardware] multiplies take
>significantly longer than additions.

Ah, but this is what makes DSPs special: if the registers are loaded, multiplies take one cycle. And since most instructions allow you to load/store/accumulate in the same cycle (as the multiply), dense instruction sequences can be created (not usually by compilers :-) where adds, subtracts and multiplies all cost the same. At least, such is the case with the TMS320C25, with which I am most familiar.
-- 
Bruce Walker                          ...uunet!mnetor!lsuc!isgtec!bmw
"Better Living Through Connectivity"        ...utzoo!lsuc!isgtec!bmw
ISG Technologies Inc. 3030 Orlando Dr. Mississauga. Ont. Can. L4V 1S8