[comp.arch] FFT for SPARC

ballen@convex.csd.uwm.edu (Bruce Allen) (05/24/91)

I'm an infrequent visitor to this newsgroup, but it seems like a
reasonable place to post my request.  I'm looking for a hand-coded
FFT routine for a SparcStation 2.  I posted a similar note on
comp.sys.sun and got no useful replies.  I am looking for
something that was initially coded high-level, for instance, and
then recoded by hand to wring out every last iota of performance.

Please reply directly, thanks
	Bruce Allen (ballen@dirac.phys.uwm.edu)

nather@ut-emx.uucp (Ed Nather) (05/29/91)

In article <12392@uwm.edu>, ballen@convex.csd.uwm.edu (Bruce Allen) writes:
> 
> I am looking for
> something that was initially coded high-level, for instance, and
> then recoded by hand to wring out every last iota of performance.

Then you are looking for a procedure that is probably far less efficient
than it could be.

When a programmer codes a known algorithm, he thinks in terms of the
operations available to him -- in Fortran or C, for example, there is
a set of operations to choose from, and he chooses from them.  These are
then translated into machine code by a compiler/assembler/linker.  But
when he codes directly in machine code (or in assembler) he thinks in
terms of the basic machine operations available, and chooses from those.
He can often find quicker ways to get the same result using this richer
set of operations than when he is limited to those available at a higher
level.  This effect was probably minimal in C when the target was a PDP-11,
but can be much greater for other target machines.  Wringing out the last
iota of performance, when the whole coding approach is not optimal, can
give the illusion of economy, but there may be greater economies to be
found by starting with a different set of available operations.

As an example, several years ago I ran a published version of the Sieve of
Eratosthenes written in Fortran and timed it on a machine not noted for its
Fortran compiler efficiency, then recoded the same algorithm in assembler
and ran it.  The assembler version ran 22 times faster.  It made use of an
obscure operation in the inner loop, where the combination "i + i + 1" was
accomplished in a single instruction -- not something a compiler would be
likely to generate.

I believe in compilers, use them most of the time, but can still write
faster code in assembler when performance is otherwise unacceptable.  I
keep waiting for the day when CPU speed is so fast I don't have to do this,
but the day hasn't arrived yet -- at least not for computers I can afford.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin