[comp.arch] What about hardware implementations of optimized algorithms?

cleary@husc9.harvard.edu (Kenneth Cleary) (08/31/90)

While I find discussions of hardware optimizations achieved through improved
electronic circuitry (such as GaAs and higher-density VLSI) and through design
improvements like pipelining and instruction caching interesting, I would also
like to hear about examples of optimized algorithms implemented as digital
circuits.

It is not difficult to find a few books on optimized algorithms implemented
as software, but I'm having difficulty finding out about the hardware aspects.
(I'm not specifically referring to digital circuit optimizations like
gate-count reduction, though that also interests me.)  Most hardware
optimizations I've seen assume a fixed algorithm which is to be implemented
as efficiently as possible, rather than starting from the desired input and
desired output and attempting to find the most efficient algorithm.

SOFTWARE CASE IN POINT:
	Donald E. Knuth, The Art of Computer Programming, 
	vol.2, Seminumerical Algorithms
	page 278, section 4.3.3. How Fast Can We Multiply?
Knuth talks about getting multiplication faster than order-n**2 behavior.
He describes some methods achieving behavior of n**(lg 3), i.e. about
n**1.585, but this only starts to provide a significant boost as the size
of the numbers being multiplied becomes quite large.  (e.g. it might not be
worth it for 8-bit & 16-bit numbers, but may be competitive for 32-bit &
64-bit.)

	HEAVY PARAPHRASING OF KNUTH BEGINS...
If we have two n-bit [let's say 32-bit] numbers {underscores for subscripts}
u=(u_31,u_30,...,u_1,u_0) and v=(v_31,v_30,...,v_1,v_0)  {individual bits}
we can write       u=(2**16)U_1 + U_0        v=(2**16)V_1 + V_0
basically breaking each number into its most-sig half & least-sig half, so
        uv= ((2**32)+(2**16))(U_1)(V_1) + 
            (2**16)((U_1)-(U_0))((V_0)-(V_1))+((2**16)+1)(U_0)(V_0)
"This formula reduces the problem of multiplying [32-bit] numbers to three
multiplications of [16-bit] numbers... plus some simple shifting and adding
operations."
This breaking up into two parts is done recursively, breaking the number
into smaller and smaller chunks. {reminds me of quicksort}
	PARAPHRASING ENDS. {My apologies for typos}
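
To make the recursion concrete, here is a minimal C sketch of the splitting
scheme above (my own illustration, not Knuth's code), assuming 32-bit
operands, a 64-bit result, and a power-of-two width, with the recursion
bottoming out in an ordinary 8-bit multiply:

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of the recursive splitting described above.
 * "bits" is the operand width at the current level (assumed a power
 * of two); at 8 bits and below we just use an ordinary multiply. */
static uint64_t karatsuba(uint32_t u, uint32_t v, int bits)
{
    if (bits <= 8)
        return (uint64_t)u * v;                 /* base case */

    int h = bits / 2;                           /* split point */
    uint32_t u1 = u >> h, u0 = u & ((1u << h) - 1);
    uint32_t v1 = v >> h, v0 = v & ((1u << h) - 1);

    uint64_t p1 = karatsuba(u1, v1, h);         /* U_1 * V_1 */
    uint64_t p0 = karatsuba(u0, v0, h);         /* U_0 * V_0 */

    /* middle term (U_1 - U_0)(V_0 - V_1): the differences may be
     * negative, so multiply magnitudes and restore the sign after */
    int64_t du = (int64_t)u1 - u0;
    int64_t dv = (int64_t)v0 - v1;
    int64_t pm = (int64_t)karatsuba((uint32_t)(du < 0 ? -du : du),
                                    (uint32_t)(dv < 0 ? -dv : dv), h);
    if ((du < 0) != (dv < 0))
        pm = -pm;

    /* uv = 2**(2h) U_1 V_1 + 2**h (U_1 V_0 + U_0 V_1) + U_0 V_0,
     * and the middle coefficient is exactly p1 + pm + p0 */
    uint64_t mid = (uint64_t)((int64_t)p1 + pm + (int64_t)p0);
    return (p1 << (2 * h)) + (mid << h) + p0;
}

int main(void)
{
    uint32_t u = 0xDEADBEEFu, v = 0xCAFEBABEu;
    printf("%llu\n", (unsigned long long)karatsuba(u, v, 32));
    printf("%llu\n", (unsigned long long)((uint64_t)u * v));  /* same answer */
    return 0;
}

Three half-width multiplications per level instead of four is what gives the
n**(lg 3) behavior: the recurrence is T(n) = 3T(n/2) + O(n).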

So, while there may not have been demand for this sort of approach with
smaller 8-bit chunks, it may provide enough of a boost at 64 bits to make
it even more worthwhile to build 64-bit machines for handling 64-bit
numbers, instead of chopping them into two 32-bit chunks to be handled by
a 32-bit processor.
(wait a minute... "breaking into two chunks" sounds familiar...)

Oh Well!  It's late, and I think I should stop typing right n............. 

Publius@dg.dg.com (Publius) (09/01/90)

In article <4047@husc6.harvard.edu> cleary@husc9.UUCP (Kenneth Cleary) writes:
>While I find discussions of hardware optimizations achieved through
>improvement of the electronic circuitry (such as GaAs, and higher-density
>VLSI) and design improvements like pipelining or instruction caching, I
>would like to hear about examples of optimized algorithms implemented as
>digital circuits.


I would second your call for discussion in this area.


>	Donald E. Knuth, The Art of Computer Programming, 
>	vol.2, Seminumerical Algorithms
>	page 278, section 4.3.3. How Fast Can We Multiply?
>Knuth talks about getting faster multiplication than order-n**2 behavior.
>He describes some methods of getting behavior of n**(lg3) or n**1.585


There is an algorithm that is even faster than O(n**1.585).
I don't have the reference in front of me now and I don't remember
the exact time complexity, but it can be found in Aho, Ullman, and Hopcroft's
book on algorithms.  That algorithm is based on a number-theoretic FFT.
It takes advantage of the similarity between multiplication and convolution
in signal processing, and of the fact that the classic Convolution Theorem
can be extended to abstract algebraic structures.
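
For concreteness, here is a tiny C sketch (mine, not from the book) of that
connection: the digit sequence of a product is just the convolution of the
operands' digit sequences, followed by carry propagation.  The fast
algorithms compute that same convolution with an FFT or number-theoretic
transform in roughly O(n log n) operations instead of the O(n**2) double
loop shown here.

#include <stdio.h>

#define N 4   /* digits per operand, base 10, least significant first */

int main(void)
{
    int u[N] = {6, 5, 4, 3};          /* 3456 */
    int v[N] = {2, 1, 0, 9};          /* 9012 */
    int w[2 * N] = {0};

    /* the product's digit vector, before carries, is the convolution
     * of the two digit vectors */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            w[i + j] += u[i] * v[j];

    /* carry propagation turns the convolution back into base-10 digits */
    for (int k = 0; k < 2 * N - 1; k++) {
        w[k + 1] += w[k] / 10;
        w[k] %= 10;
    }

    for (int k = 2 * N - 1; k >= 0; k--)   /* prints 31145472 = 3456 * 9012 */
        printf("%d", w[k]);
    printf("\n");
    return 0;
}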






-- 
Disclaimer: I speak (and write) only for myself, not my employer.

preston@titan.rice.edu (Preston Briggs) (09/04/90)

In article <889@dg.dg.com> publius@dg-gw.dg.com (Publius) writes:
>In article <4047@husc6.harvard.edu> cleary@husc9.UUCP (Kenneth Cleary) writes:
>>	Donald E. Knuth, The Art of Computer Programming, 
>>	vol.2, Seminumerical Algorithms
>>	page 278, section 4.3.3. How Fast Can We Multiply?
>>Knuth talks about getting faster multiplication than order-n**2 behavior.
>>He describes some methods of getting behavior of n**(lg3) or n**1.585

>There is an algorithm that can do even faster than O(n**1.585).
>I don't have the reference in front of me now and I don't remember
>the exact time complexity.  But it can be found in Aho, Ullman, and Hopcroft's
>book on algorithms.  That algorithm is based on number-theoretic FFT.
>It takes advantage of the similarity between multiplication and convolution in
>signal processing and the fact that the classic Convolution Theorem can be
>extended to the abstract algebraic structures.

It's called Sch\"{o}nhage-Strassen.
You can use it to multiply two $n$-bit numbers in
$O_{B}(n \log n \log\log n)$ steps.

Unfortunately, this is the asymptotic complexity (as $n$ becomes
very large).  The key point is made in the following sentence:

	"The value of $n\/$ for which many of the algorithms
	 described here become practical is enormous."

That is, they aren't going to help us with 64-bit multiplication.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

hays@isc.intel.com (Kirk Hays) (09/07/90)

In article <889@dg.dg.com> publius@dg-gw.dg.com (Publius) writes:
>In article <4047@husc6.harvard.edu> cleary@husc9.UUCP (Kenneth Cleary) writes:
>>Knuth talks about getting faster multiplication than order-n**2 behavior.
>>He describes some methods of getting behavior of n**(lg3) or n**1.585
>
>
>There is an algorithm that can do even faster than O(n**1.585).
>I don't have the reference in front of me now and I don't remember
>the exact time complexity.  But it can be found in Aho, Ullman, and Hopcroft's
>book on algorithms.  That algorithm is based on number-theoretic FFT.

Looking in my copy of Aho, Ullman, and Hopcroft (ISBN 0-201-0023-7),
I see that they discuss multiplication on pages 307-310 and claim the
presented algorithm is O(n**1.59), which is sufficiently close to
O(n**1.585) to lead me to believe that they are discussing the same
algorithm as Knuth.  I don't have my Knuth handy to verify this.

Have you a different reference for multiplication faster than O(n**1.59)?

-- 
Kirk Hays - NRA Life.  Dulce et Decorum est, pro patria mori.
"The way of the samurai is found in death.  It is this simple.  If you
 can accept it then you will fight as though you are already dead."
Yamamoto Tsunetomo, _Hagakure_

cik@l.cc.purdue.edu (Herman Rubin) (09/07/90)

In article <908@intelisc.isc.intel.com>, hays@isc.intel.com (Kirk Hays) writes:
> In article <889@dg.dg.com> publius@dg-gw.dg.com (Publius) writes:
> >In article <4047@husc6.harvard.edu> cleary@husc9.UUCP (Kenneth Cleary) writes:
> >>Knuth talks about getting faster multiplication than order-n**2 behavior.
> >>He describes some methods of getting behavior of n**(lg3) or n**1.585

			...................

> Have you a different reference for multiplication faster than O(n**1.59)?

Knuth describes faster ones.  The fastest known algorithm, asymptotically,
is the Sch"onhage-Strassen algorithm, which is O(n log n log log n).  It is
known that this is within a few powers of log log n of the best possible
asymptotics (Cook and Aanderaa).

But this says nothing about what to do with a given precision and a
given architecture.  Even the apparently simple Karatsuba-Ofman method
runs into questions of overflow and/or sign, which can make it slower
than the usual method for a given precision.  The n**1.585 algorithm is
easily programmed, and I doubt that hardware could do much better than
a microcoded program, except for adding one bit to the numbers before
multiplication to take care of the above problem.
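
To spell out the sign/overflow point (my example, not from Rubin's post):
with 16-bit halves, the factors of the Karatsuba middle term are differences
that can be negative and span 17 bits in signed form, which is the one extra
bit mentioned above.  A minimal C sketch:

#include <stdint.h>
#include <stdio.h>

/* For 16-bit halves, U_1 - U_0 and V_0 - V_1 lie in the range
 * -(2**16 - 1) .. (2**16 - 1), i.e. 17 bits in signed form, so the
 * middle product needs sign handling (or an extra bit) that the other
 * two partial products do not. */
static int64_t middle_term(uint16_t u1, uint16_t u0, uint16_t v1, uint16_t v0)
{
    int32_t du = (int32_t)u1 - (int32_t)u0;   /* may be negative */
    int32_t dv = (int32_t)v0 - (int32_t)v1;   /* may be negative */
    return (int64_t)du * dv;
}

int main(void)
{
    /* an example where the middle term comes out negative */
    printf("%lld\n", (long long)middle_term(0, 65535, 0, 65535));
    return 0;
}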
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

mshute@cs.man.ac.uk (Malcolm Shute) (09/14/90)

Going back to the original question in this discussion (which has since
veered towards one specific multiplication example):

In article <4047@husc6.harvard.edu> cleary@husc9.UUCP (Kenneth Cleary) writes:
>It is not difficult to find a few books on optimized algorithms implemented
>as software, but I'm having difficulty finding out about hardware aspects.

Can I express some degree of surprise at this request... I thought there
was already a massive cross-traffic of ideas from hardware to software
and back again.

Normally, if something is implemented in one medium (hardware or software)
first, then when it comes to re-implementing it in the other, the initial
approach is to mimic the original implementation (complete with any lessons
learned about algorithmic improvements in that first medium).

For instance, the 'obvious' way to implement pseudo-random number
generators in software is to use the '*2' (multiply-by-two) operation
to set up shift registers in the program, just like the ring counters
in the original hardware.  Or, look at the way the original unix
encryption program 'simulated' the wheels on the Enigma machine.
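
As a concrete illustration of that software-mimics-hardware pattern (my own
sketch, not any particular historical program), here is a 16-bit
shift-register generator in C: the left shift is the '*2', and the feedback
bit is the XOR of a few tap positions, just as in a hardware shift register.
The taps and seed are chosen only for illustration; a real generator would
use taps known to give a maximal-length sequence.

#include <stdint.h>
#include <stdio.h>

/* One step of a software shift-register generator: shift left ("*2")
 * and feed back the XOR of a few tapped bits into bit 0, mimicking a
 * hardware shift register / ring counter. */
static uint16_t lfsr_next(uint16_t s)
{
    uint16_t bit = (uint16_t)(((s >> 15) ^ (s >> 13) ^ (s >> 12) ^ (s >> 10)) & 1u);
    return (uint16_t)((s << 1) | bit);
}

int main(void)
{
    uint16_t s = 0xACE1u;                     /* any nonzero seed */
    for (int i = 0; i < 8; i++) {
        s = lfsr_next(s);
        printf("%04x\n", s);
    }
    return 0;
}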

In the other direction, look at how floating point hardware started
by implementing what was originally performed in subroutines.

If someone subsequently publishes a paper on how to improve or replace
the algorithm in the original implementation... this will have direct
relevance to the subsequent implementation.

What I think you are more probably asking about, however, is
whether there is a body of mathematics to manipulate hardware
descriptions, just as computer science has already built up
theorems to manipulate software descriptions.

I am most familiar with the work of Dr. Sheeran (now at Glasgow
University) on designing finite state machine hardware in general,
and systolic array hardware in particular, using a language which
is an extension of Backus' FFP:

Sheeran, M. (1984).
uFP, An Algebraic VLSI Design Language,
Programming Research Group, Technical Monograph PRG-39,
Oxford University Computing Laboratory,
150 pp.

Sheeran, M. (1985).
Designing regular array architectures using higher order functions,
in J.P. Jouannaud, ed.,
Functional Programming Languages and Computer Architecture,
LNCS,
Vol. 201, pp 220-237,
Springer-Verlag, Berlin.

I know that others are working on languages in which theorems can be
proved about other types of hardware.  Prof. Sutherland refers to such
work in his Turing Award lecture:

Sutherland, I.E. (1989).
Micropipelines,
CACM,
Vol. 32, No. 6, pp 720-738.
--

Malcolm SHUTE.         (The AM Mollusc:   v_@_ )        Disclaimer: all