[comp.arch] standard extensions

preston@ariel.rice.edu (Preston Briggs) (02/16/91)

>pmontgom@euphemia.math.ucla.edu (Peter Montgomery):
>> The requested operation returns q and/or r, where
>> 
>> 		a*b + c = q*n + r   and   0 <= r < n

jlg@lanl.gov (Jim Giles) writes:
>You will, unfortunately, receive little sympathy for this kind of
>request.  The UNIX community, in particular, will accuse you of
>requesting a 'swiss army knife' compiler.

He has my sympathy; it's free.
However, I won't spend much time inventing special syntax
or optimizations for specific problems.  There are, however,
many people working on many ways of expressing and implementing
extensible languages.  "They" don't belong to a special club;
everyone is free to join.  It's very simple and fairly popular:

    while (dissatisfied) {
	Design a language (perhaps an extension of another language)

	Implement your language (perhaps hacking an existing compiler
	or building an interpreter).

	Write many programs in your language to gain experience
	with its limitations.  Try and get others to use it.
    }

Some people repeat the process many times (e.g., Wirth).
Other people have problems they want solved and don't
wish to get pulled into an infinite loop.  They can keep
complaining; but I don't expect to see any results.
Language designers aren't satisfied with existing languages
(or they'd be out of the loop); they're busy - designing,
implementing, testing.  They've got their own agenda.  You
might be able to sneak your pet idea on someone's agenda,
but it's probably expensive.

If you want it done right, ...

hrubin@pop.stat.purdue.edu (Herman Rubin) (02/16/91)

In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> >pmontgom@euphemia.math.ucla.edu (Peter Montgomery):
> >> The requested operation returns q and/or r, where
> >> 
> >> 		a*b + c = q*n + r   and   0 <= r < n
> 
> jlg@lanl.gov (Jim Giles) writes:
> >You will, unfortunately, receive little sympathy for this kind of
> >request.  The UNIX community, in particular, will accuse you of
> >requesting a 'swiss army knife' compiler.
> 
> He has my sympathy; it's free.
> However, I won't spend much time inventing special syntax
> or optimizations for specific problems.  There are, however,
> many people working on many ways of expressing and implementing
> extensible languages.  "They" don't belong to a special club;
> everyone is free to join.  It's very simple and fairly popular:
> 
>     while (dissatisfied) {
> 	Design a language (perhaps an extension of another language)
> 
> 	Implement your language (perhaps hacking an existing compiler
> 	or building an interpreter).
> 
> 	Write many programs in your language to gain experience
> 	with its limitations.  Try and get others to use it.
>     }

This assumes that the one who needs the operations, or the
improved performance, has the resources and time to do this.
Montgomery and Silverman are number theorists; I am a professor
of Statistics and Mathematics.  We are expected to do other
things.

Mathematicians are used to the idea of an extensible language
for communicating to reasonably intelligent beings.  A computer
is not an electronic "brain"; it is an extremely fast sub-
imbecile.  Someone remarked elsewhere on the complexity of
the code in Fortran to handle x**y; there is necessarily 
separate code for each combination of types of x and y, and
even code taking into account particular values of x and y.
Producing translators of this magnitude is a major project,
and no adequate macro translators are even available.

I believe that such a macro translator could be produced,
and with that it would be easy to use machine language,
rather than the overly restrictive assembler languages
designed essentially so as to be easy for the machine to
read.

What is needed in addition is a language which allows
the introduction of operator syntax according to the
user's desires.  The above production

> >> 		a*b + c = q*n + r   and   0 <= r < n

is understood by essentially any mathematician, and
it should not be necessary to do more than provide a
template to convert this into code.  Of course, that
does nothing to get the operation into hardware.  The
RS/6000 has a*b+c in floating point, computing the product to double
width internally, but the wide intermediate is not reusable.
One of the reasons behind putting this into hardware
is that it is relatively easy, and it would be even
easier for integer arithmetic.  From an architecture
standpoint, to a mathematician there is not that much
difference between integer and floating point, floating
point being the more complicated.
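
To make the template idea concrete, here is a minimal sketch in C
(assuming gcc's 64-bit `long long' extension, nonnegative operands, and
0 < n; an illustration only, not Montgomery's code):

/* returns q and r with a*b + c == q*n + r and 0 <= r < n */
void muladddiv(long a, long b, long c, long n, long *q, long *r)
{
	long long t = (long long)a * b + c;	/* exact double-length a*b + c */

	*q = t / n;	/* one divide yields both... */
	*r = t % n;	/* ...if the compiler notices */
}

Without a double-length intermediate the product a*b silently
truncates, which is precisely the complaint.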

> Some people repeat the process many times (e.g., Wirth).
> Other people have problems they want solved and don't
> wish to get pulled into an infinite loop.  They can keep
> complaining; but I don't expect to see any results.
> Language designers aren't satisfied with existing languages
> (or they'd be out of the loop); they're busy - designing,
> implementing, testing.  They've got their own agenda.  You
> might be able to sneak your pet idea on someone's agenda,
> but it's probably expensive.
> 
> If you want it done right, ...

Wirth is being employed to design and develop languages and
related things.  In addition, he has available for this development
somewhere between a squad and a platoon of assistants.  This is
extremely important; it is difficult to catch one's own minor
errors, even in something as flexible as English, designed to
be read by an intelligent being.  Communicating to a machine is
much harder.  There is also the need for people to do the routine
work.

For Montgomery or Silverman or me to produce the language would be
a bigger task than asking an individual automotive designer to do
everything from idea to prototype without any assistance whatever.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

chip@tct.uucp (Chip Salzenberg) (02/17/91)

[ Followups to comp.misc ]

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>Montgomery and Silverman are number theorists; I am a professor
>of Statistics and Mathematics.  We are expected to do other
>things.

I'm a programmer, but that doesn't mean I've got all the time I want
to do projects for J. Random Person.  I'm expected to do other things,
too.

>I believe that such a macro translator could be produced ...

A spec, Herman.  Surely you can squeeze a few hours out of your busy
schedule to produce a spec.  Or is your desired language a mystery
even to you?
-- 
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
 "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz

dmocsny@minerva.che.uc.edu (Daniel Mocsny) (02/18/91)

In article <6049@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>This assumes that the one who needs the operations, or the
>improved performance, has the resources and time to do this.
>Montgomery and Silverman are number theorists; I am a professor
>of Statistics and Mathematics.  We are expected to do other
>things.

This is a problem I have grappled with at virtually every stage
of my own work: I see all these interesting and compelling tools
the computer people create for their own work, or their own idea
of a market. Needless to say, if one has anything resembling a
specialized need, one finds oneself shunted off the mainstream
into a sort of computer orphanage.

What are scientists and mathematicians "expected to do"? Historically,
that answer has been: "Whatever it takes". Scientific researchers
have a venerable tradition of inventing all manner of experimental
and computational devices to support their work. In the "good old
days", when a scientist had a challenging problem to solve, often
the response was to design and build machines to assist.

The decline of this tradition speaks to the success of general-purpose 
computers. Today computers are so cheap and effective that scientists
find they usually can't afford to try to build anything better suited
to their problem. (This is almost always true for hardware, and it is
often true for software.) What's more, general-purpose computers are 
improving so rapidly that even machines designed specifically to solve 
specialized problems are in great danger of rapid obsolescence. If a 
research team decides to design a specialized computational device, 
they had better get cracking, because the general-purpose performance 
they are trying to beat is a rapidly moving target. A factor-of-ten 
improvement buys you about 3--5 years. If you sink 1--4 years into 
building the specialized box, you may have wasted your time.

If I may muse a bit -- I see much hope in the recent trend toward
object-oriented languages. These are inherently extensible, making
them nice substrates for specialized development. Each of the
scientific disciplines may well evolve its own standard set of
common objects, methods, operators, etc., much as they have always
evolved their own nomenclature, curricula, and so on. Assuming, of
course, that busy specialists will find time to detour into all the
intellectual overhead involved in mastering a second discipline (which
is what learning a language well enough to extend it represents).

Unfortunately, piling all this makeup on top of uncooperative hardware
may give us a performance hit. However, at some point the computer
industry will find ways to let users cooperate in the design of their
machines. Optimizing compilers are, in fact, a first step in this
direction. The optimizing compiler permits the computer to adapt itself
to the user's "paradigm". I.e., the user thinks in a language containing
many constructs and instructions that do not directly tell the computer
how best to solve the problem at hand. The correct response is not to
tell the user to think like the computer, but to teach the computer
to think like the user.

Eventually we must extend the notion of optimization to the hardware
as well as the software. Computers need some sort of hardware
flexibility permitting them to shift resources in response to user
demands. I.e., instead of collecting average statistics on instruction-
set usage, and then imposing a one-size-fits-all cast-in-stone 
"solution" on everyone, why can't the computer have flexibility in
deciding how it should execute instructions? Buying a computer should
be like hiring an employee: the longer that employee stays on your
payroll, the more productive (s)he becomes. Computers, however,
turn in pretty much the same benchmarks the first time you run your
problem as they will after the 1,000,000th run. By the time the
computer has served you for several years, it has had ample time
to compile detailed statistics on the instruction mix that *you*
want.


--
Dan Mocsny				
Internet: dmocsny@minerva.che.uc.edu

preston@ariel.rice.edu (Preston Briggs) (02/18/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>>This assumes that the one who needs the operations, or the
>>improved performance, has the resources and time to do this.
>>Montgomery and Silverman are number theorists; I am a professor
>>of Statistics and Mathematics.  We are expected to do other
>>things.

dmocsny@minerva.che.uc.edu (Daniel Mocsny) writes:
>This is a problem I have grappled with at virtually every stage
>of my own work: I see all these interesting and compelling tools
>the computer people create for their own work, or their own idea
>of a market. Needless to say, if one has anything resembling a
>specialized need, one finds oneself shunted off the mainstream
>into a sort of computer orphanage.

I was pessimistic in an earlier posting;
here's an optimistic view.

One possibility is to try to lure computer scientists into your
area with promises of interesting new problems to work on.
That's worked well on some prominent people I've met.

One example is Eugene Myers, who's done significant work in
algorithms and optimization.  Recently though, he's been
working on gene sequencing, etc., supporting the Human Genome
project.  (I should point out that "recently" is several years.
These are long term commitments.)

A second example is Guy Steele.  He has said that he envisions
himself as a guy who builds tools to support AI research.

---

I hope I haven't misrepresented these people.
I respect them both for vision, energy, and accomplishments.

Preston Briggs

herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/23/91)

In article <6049@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
[i may have edited away all of preston's contributions to this line]
>> 
>> jlg@lanl.gov (Jim Giles) writes:
>> many people working on many ways of expressing and implementing
>> extensible languages.  "They" don't belong to a special club;
>> everyone is free to join.  It's very simple and fairly popular:
>> 
>>     while (dissatisfied) {
>> 	Design a language (perhaps an extension of another language)
>> 
>> 	Implement your language (perhaps hacking an existing compiler
>> 	or building an iterpreter).
>> 
>> 	Write many programs in your language to gain experience
>> 	with its limitations.  Try and get others to use it.
>>     }
> 
> This assumes that the one who needs the operations, or the
> improved performance, has the resources and time to do this.
> Montgomery and Silverman are number theorists; I am a professor
> of Statistics and Mathematics.  We are expected to do other
> things.
[much more edited away, this is only to remind of what went before]

You are faculty at one of the two schools with the oldest departments
of Computer Science in the country.  Go to your colleagues in CS and
tell them you have a design problem that seems to be about the right
size for a PhD.

If you can get them to cobble it together on gcc (or g++) you can
then profit from any improvements made later in the gnu code generator.
(And also portability to future computers.)

What has frosted me about the languages is that the arithmetic
operators are all single valued.  I usually want the quotient
and remainder from a division.  I have to tell the compiler to
give me the quotient and then tell it to give me the remainder.
It might be smart enough to notice that it gets both of them in
one machine operation, but if it is, it is only undoing bad language
design.

dan herrick
herrickd@iccgcc.decnet.ab.com

ph@ama-1.ama.caltech.edu (Paul Hardy) (02/23/91)

In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:

>   What has frosted me about the languages is that the arithmetic
>   operators are all single valued.  I usually want the quotient
>   and remainder from a division.  I have to tell the compiler to
>   give me the quotient and then tell it to give me the remainder.

This is annoying, but it's one thing an optimizer should look for, since it
happens a lot in multi-precision code; caught there, it won't cost much
extra time.
Alternatively there's Forth, a stack-based language, which will leave a
quotient and remainder on the stack after a divide.  And there's always
assembly language....

				--Paul

--
This is my address:         ph@ama.caltech.edu
This is UUCP:               ...!{decwrl,uunet}!
This is my address on UUCP: ...!{decwrl,uunet}!caltech.edu!ama!ph
Any questions?

"Does Emacs have the Buddha nature?"  --Paul Hardy   "Yow!" --Zippy

hrubin@pop.stat.purdue.edu (Herman Rubin) (02/24/91)

In article <3381.27c548c3@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
> In article <6049@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> [i may have edited away all of preston's contributions to this line]

> >> jlg@lanl.gov (Jim Giles) writes:
> >> many people working on many ways of expressing and implementing
> >> extensible languages.  "They" don't belong to a special club;
> >> everyone is free to join.  It's very simple and fairly popular:

			.......................

> > This assumes that the one who needs the operations, or the
> > improved performance, has the resources and time to do this.
> > Montgomery and Silverman are number theorists; I am a professor
> > of Statistics and Mathematics.  We are expected to do other
> > things.
> [much more edited away, this is only to remind of what went before]
 
> You are faculty at one of the two schools with the oldest departments
> of Computer Science in the country.  Go to your colleagues in CS and
> tell them you have a design problem that seems to be about the right
> size for a PhD.

You are almost right about its size; something of this complexity needs
at least two people working on it to avoid first class goofs.  But it is
not appropriate for a PhD thesis.  It is a contribution, but not particularly
research, which is the criterion for a thesis and for tenure.  The current
pressure on students in academia is to avoid whatever is neither required
nor directly needed for their thesis research, and to overconcentrate.

The project can be done by a faculty member whose academic reputation is
essentially assured, with the help of a graduate student or two.  Someone
like Wirth can get the funding and the students.  Someone like me has a
small chance to get the funding, and real difficulty in getting the students.

> If you can get them to cobble it together on gcc (or g++) you can
> then profit from any improvements made later in the gnu code generator.
> (And also portability to future computers.)
> 
> What has frosted me about the languages is that the arithmetic
> operators are all single valued.  I usually want the quotient
> and remainder from a division.  I have to tell the compiler to
> give me the quotient and then tell it to give me the remainder.
> It might be smart enough to notice that it gets both of them in
> one machine operation, but if it is, it is only undoing bad language
> design.

This WAS obvious when languages were first designed.  Fortran had a
limited purpose, but Algol did not.  Multiple results are syntactically
more difficult, but a purpose of a language is to make it easier for the
user to use the hardware, and the hardware of the day could do this on
the great bulk of machines.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

bs@faron.mitre.org (Robert D. Silverman) (02/25/91)

In article <PH.91Feb22180807@ama-1.ama.caltech.edu> ph@ama-1.ama.caltech.edu (Paul Hardy) writes:
:In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
:
:>   What has frosted me about the languages is that the arithmetic
:>   operators are all single valued.  I usually want the quotient
:>   and remainder from a division.  I have to tell the compiler to
:>   give me the quotient and then tell it to give me the remainder.
:
:This is annoying, but it's one thing an optimizer should look for since it
:happens a lot in multi-precision code so that it won't cost much extra time.
:Alternatively there's Forth, a stack-based language, which will leave a
:quotient and remainder on the stack after a divide.  And there's always
:assembly language....
 
Yes. There always is assembler. However, calling an assembler routine
to do a division returning both quotient and remainder is very expensive.
In-lining mechanisms for assembler code are woefully inadequate in
current languages. One can, of course, look at the intermediate assembler
code generated by an HLL compiler and then adjust your assembler code
so that register usage is correct, but this is very clumsy and must be
redone every time you make a change to the HLL code.
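
For what it's worth, gcc's extended asm goes some way here: operand
constraints let the compiler allocate the registers around the fragment,
so nothing need be redone when the surrounding HLL code changes.  A
sketch for the 68020's divsl.l, which delivers quotient and remainder
in one instruction (untested, the assembler spelling varies, and the
constraint syntax is gcc-specific):

/* q = n/d, r = n%d from a single divide instruction */
static __inline void divrem(int n, int d, int *q, int *r)
{
	int quot, rem;

	__asm__("divsl.l %2,%1:%0"
		: "=d" (quot), "=d" (rem)	/* data registers out */
		: "d" (d), "0" (n));		/* divisor; dividend ties to %0 */
	*q = quot;
	*r = rem;
}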

--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/26/91)

In article <1991Feb25.135057.23667@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes:
>Yes. There always is assembler. However, the cost of calling an assembler
>routine to do a division returning both quotient and remainder is very
>expensive. In-lining mechanisms for assembler code are woefully inadequate
>in current languages. One can, of course, look at the intermediate assembler
>code generated by the compiler of a HLL and then adjust your assembler code
>so that register usage is correct, but this is very clumsy and must be redone
>everytime you make a change to the HLL code.

I find it hard to criticize the facilities for inlining assembler code (not 
that I would ever do it myself -- well, hardly ever) provided by modern 
compilers, e.g. gcc as outlined in earlier posts to this group and elsewhere. 
At least some modern compilers would seem to provide the facilities for 
including such code in the (sometimes extensive) dataflow analysis they 
already perform for the HLL code -- even to the extent of allocating available 
registers for your use in the inline assembler code.  It would therefore seem 
low-level tweaking as described above is at least on the way out, if not 
already `passed over'.

But on the other hand -- why put assembler into a program at all? Although I 
understand there _are_ a (very few) application areas where even a 10% 
reduction in running time represents big bucks, I find it hard to accept that 
the ability to return the quotient and remainder, e.g., at the same time will 
make even _that_ much difference to code that already would almost certainly 
include more time-intensive instructions in the mix.

For example, suppose we compare two tight loops, one with separate divide & 
modulus/remainder (whichever you need); the other with some combined code. 
Suppose there are n other instructions in each loop.  Let's discard pipe 
flushing considerations and further assume all instructions take a single 
cycle. We have, in loop I, say, n+3 instructions (let's even allow another move 
instruction to get at either the quotient or remainder as they may have ended 
up in inconvenient places) and in loop II n+1 instructions. 

Iterate both loops N times. Loop I takes N(n+3) cycles: loop II takes N(n+1) 
cycles -- the ratio is obviously (n+3)/(n+1). This will exceed a 10% 
difference if n < 19.

So, provided there are no more than about 20 instructions in the loop in this 
obviously _idealized_ circumstance, the body of the loop will run 10% or 
more faster given the fancy combined quot/rem facility.  Considering our loop 
is only _part_ of a larger program, the impact on the total running time of the 
complete code will be even less marked given the facility.

But perhaps typical loops, especially those containing divide and remainder
instructions, are small? Well, after looking through my source code I don't 
happen to find _any_ loops with both divide & remainder in them -- I guess I 
tend to avoid it (blush). However, below is an _example_ that I will not say 
is _typical_, but is at least indicative of the kind of code I'm looking for 
(it's an inner loop from an FFT routine). With -O a Sun3/60 cc produces the 
following code.  Strangely, there are about 20 instructions in the loop.

------

for(xp=x,yp=y,zp=z; yp<z; ++xp,++yp) {
	if(*xp==M-1) *zp++ = M-*yp;		/* (M-1)*v == -v == M-v (mod M) */
	else if(*yp==M-1) *zp++ = M-*xp;	/* same trick, other operand */
	else *zp++ = *xp**yp % M;		/* general case: multiply, then remainder */
	}

L77003:
	cmpl	#1000,a0@
	jne	L77005
	movl	#1001,d0
	subl	a1@,d0
	movl	d0,a5@
	jra	LY00000
L77005:
	cmpl	#1000,a1@
	jne	L77007
	movl	#1001,d0
	subl	a0@,d0
	movl	d0,a5@
	jra	LY00000
L77007:
	movl	a0@,d0
	mulsl	a1@,d0
	divsll	#1001,d1:d0
	movl	d1,a5@
LY00000:
	addqw	#4,a5
	addqw	#4,a0
	addqw	#4,a1
	cmpl	a6@(-12),a1
	jcs	L77003

---

To summarize: I don't think provision of a combined divide/remainder
(or divide/modulus for that matter) instruction will necessarily
speed up the total running time of any real programs appreciably
(i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
some circumstance which contradicts this?

-kym
===
No sig is a good sig (this isn't one).

mash@mips.COM (John Mashey) (02/26/91)

In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
....
>What has frosted me about the languages is that the arithmetic
>operators are all single valued.  I usually want the quotient
>and remainder from a division.  I have to tell the compiler to
>give me the quotient and then tell it to give me the remainder.
>It might be smart enough to notice that it gets both of them in
>one machine operation, but if it is, it is only undoing bad language
>design.

I have no idea if it does it in all cases that would make sense.
However, at least the current MIPS compilers and assembler, if faced with:
	i = j / k;
	l = j % k;
conspire to execute a single division instruction.
(To see this, you have to look not at the generated .s file, which has two
divs, but at the dis-assembly of the .o or a.out, which shows only 1...)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rex@cs.su.oz (Rex Di Bona) (02/26/91)

In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

 [ Inlining HLL calls to assembly routines ]

> For example, suppose we compare two tight loops, one with separate divide & 
> modulus/remainder (whichever you need); the other with some combined code. 
> Suppose there are n other instructions in each loop.  Let's discard pipe 
> flushing considerations and further assume all instructions take a single 
> cycle.

Single cycle? For a divide instruction? On what machine? Let's take a real
example here instead (say the 68020, or the R3000, as I have the info just
behind me here)... ugh, there.... let's see, the 68020 divide, here we are:

Divide unsigned ranges from 76 cycles (best case, register to register) to
a massive 78+24 = 102 cycles for the worst case, a divide with the source
operand program-counter memory-indirect pre-indexed (PCMIP). For signed
divisions we add 12 cycles, so the best case is 88 and the worst case 114
cycles.  Now our move takes (best case) 0 cycles (it is totally overlapped
with other instructions), to a worst case of 3+49 = 51 (the 49 is for the
PCMIP-to-PCMIP addressing mode).

> We have, in loop I, say, n+3 instructions (let's even allow another move 
> instruction to get at either the quotient or remainder as they may have ended 
> up in inconvenient places) and in loop II n+1 instructions. 
> 
> Iterate both loops N times. Loop I takes N(n+3) cycles: loop II takes N(n+1) 
> cycles -- the ratio is obviously (n+3)/(n+1). This will exceed a 10% 
> difference if n < 19.

back in 68020 land...

For a move of a register to a register we can assume that it is overlapped
as we don't need the result in the next instruction, for a cost of 0 cycles.
(Add more if we move the intermediate to memory...)
For unsigned division...

For the best case, the ratio of combined-divide loop time to
separate-divide loop time is...

	(n*(average cycles per instruction) + 76)/(n*Ave + (76 * 2))

The difference exceeds 10% when:
	average cycles = 1, n < 685
	average cycles = 2, n < 343
	average cycles = 3, n < 229
	average cycles = 4, n < 172
	average cycles = 10, n < 69
	average cycles = 20, n < 35

For the worst case...

	(n*(average cycles per instruction) + 102)/(n*Ave + (102 * 2))

The difference exceeds 10% when:
	average cycles = 1, n < 919
	average cycles = 2, n < 460
	average cycles = 3, n < 307
	average cycles = 4, n < 230
	average cycles = 10, n < 92
	average cycles = 20, n < 46

If we look at the example program given in the article,
we have SIGNED division, so...

For the best case...

	(n*(average cycles per instruction) + 88)/(n*Ave + (88 * 2))

The difference exceeds 10% when:
	average cycles = 1, n < 793
	average cycles = 2, n < 397
	average cycles = 3, n < 265
	average cycles = 4, n < 199
	average cycles = 10, n < 80
	average cycles = 20, n < 40

For the worst case...

	(n*(average cycles per instruction) + 114)/(n*Ave + (114 * 2))

The difference exceeds 10% when:
	average cycles = 1, n < 1027
	average cycles = 2, n < 514
	average cycles = 3, n < 343
	average cycles = 4, n < 257
	average cycles = 10, n < 103
	average cycles = 20, n < 52

> 
> So, provided there are no more than about 20 instructions in the loop in this 
> obviously _idealized_ circumstance, the body of the loop will run 10% or 
> more faster given the fancy combined quot/rem facility.  Considering our loop 
> is only _part_ of a larger program, the impact on the total running time of
> the complete code will be even less marked given the facility.

Yes, but it is usually the case that cycle times for other, more
normally executed instructions are _far_ less than those for divide
instructions. In the above examples we find that we still require a
large number of other instructions before we arrive at the point where
the overhead is lost in the rest of the loop. I would assume (rough
guess here) that 10 cycles per 68020 instruction is about right,
taking into account cache misses and such, perhaps less. But the
lower the average cycle count for those other instructions, the better it
is to use the one instruction to do both the div and remainder...

Considering the R3000, it becomes easier (slightly); we have the
following...  divide takes, hmm... let me see, oops, the Kane book
doesn't seem to mention the time taken for a divide, OK then, let's
assume it is 19 cycles (the time for an FP divide), and also try it at
12 cycles (a lower bound, the time for a multiply).

we get (assuming 1.2 cycles per instruction)
	19 cycles for a divide n < 143
	12 cycles for a divide n < 91

The 1.2 figure comes from a hazy recollection of a talk by John Mashey, where
he was going on about why it wasn't really 1.00 cycles per instruction, and
how they (MIPS) wanted the cycles per instruction to go below 1.00 :-)

> 
> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?
> 
> -kym
> ===
> No sig is a good sig (this isn't one).

I hope that the above numbers might convince you that having the one
instruction generate both is better in those pesky real world
situations, using real world machines. I doubt (oops, rash prediction
here) that a division will become as cheap as a move instruction,
unless something changes (radically) the speed of the processing unit
without changing the speed of memory (in which case we'll be back at
super CISC anyway).
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral

herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/27/91)

In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[long and eloquent demonstration that one less divide instruction in the
loop will have no significant effect on computation time]
> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?
> 
But I still have to obfuscate the code with
        q = dividend / divisor
        r = dividend % divisor  /* even the operator is bad */
when what I want is
        (q, r) = dividend divided by divisor

Division is defined as producing a unique quotient and remainder in
the first abstract algebra course.  Every hardware divide instruction
I have ever seen produces both quotient and remainder in one operation.
(Nothing else makes any sense.)  Even the 704, the machine FORTRAN was
first built for, provided both results.

My main objection is to the obfuscation of the expression of an
algorithm in the programming language.

This is a place where the computer architects have done the right
thing and the software architects have all ignored this right thing
and done a wrong thing.  Of course, we all know that the world is
floating point; the perpetrators of FORTRAN told us that that is
what is REAL.

dan herrick                            eschew obfuscation
herrickd@iccgcc.decnet.ab.com

bs@linus.mitre.org (Robert D. Silverman) (02/27/91)

In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:To summarize: I don't think provision of a combined divide/remainder
:(or divide/modulus for that matter) instruction will necessarily
:speed up the total running time of any real programs appreciably
:(i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
:some circumstance which contradicts this?
 
You miss the point. There ARE many processors that provide double
length multiply and divide with remainder instructions. Most current
HLL's have no way of generating code that will access these instructions.

Furthermore, there ARE codes which can show much more than a 10% gain
by having such instructions. Most are number-theoretic or cryptographic
or group-theoretic in nature.

--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/27/91)

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) wrote:
>But on the other hand -- why put assembler into a program at all?

The people who really love gcc's assembler inlining features are
OS writers.  I suspect that inline assembler spl?() does wonderful
things to some Unix kernel code, as does access to test-and-set
or other atomic instructions in other kernels.

But it's also useful if I just happen to acquire some desperate
need to use the VAX's edit instruction.
-- 
	-Colin

henry@zoo.toronto.edu (Henry Spencer) (02/27/91)

In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
>You miss the point. There ARE many processors that provide double
>length multiply and divide with remainder instructions. Most current
>HLL's have no way of generating code that will access these instructions.

Nonsense.  You mean "most current compilers do not recognize situations 
where such instructions could be generated".  In most any HLL you can
write something like

	x = a / b
	y = a % b

which amply suffices as a way of requesting use of such instructions.
Blame the compiler, not the language, if the generated code doesn't use
the hardware properly.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

bs@faron.mitre.org (Robert D. Silverman) (02/27/91)

In article <1991Feb26.190242.18983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
>>You miss the point. There ARE many processors that provide double
>>length multiply and divide with remainder instructions. Most current
>>HLL's have no way of generating code that will access these instructions.
>
>Nonsense.  You mean "most current compilers do not recognize situations 
>where such instructions could be generated".  In most any HLL you can
>write something like
>
>	x = a / b
>	y = a % b
>
>which amply suffices as a way of requesting use of such instructions.
>Blame the compiler, not the language, if the generated code doesn't use
>the hardware properly.
 
Bull. Both the compiler and the language are to blame. One should have
some way of specifying, for example, that c = a*b is to be a full
32x32 -> 64 multiply (and generating the appropriate instruction or
calling a routine to do it). One might also want c = a*b to return only
the low 32 bits, since the latter operation is a lot faster. What is
needed is a facility
in the LANGUAGE ITSELF for specifying which type of multiply to perform.
short, int, and long are not enough. One also needs a 'long long' for
intermediate results.

However, if I write the following code:

long a,b,c,d;

a = (b*c + 1)/d;

The COMPILER should realize that since the product of two longs can
overflow, it needs to generate code that will perform a 32x32 multiply
followed by a 64 / 32 divide. Even on machines that have these instructions
in hardware no compilers currently do this.
Current compilers default to returning ONLY the low 32 bits and this is
clearly wrong from a mathematical standpoint.

Now if it should turn out that the true value of a is bigger than 32 bits,
then I have made an error in my code. Furthermore, there should be some
way of specifying explicitly that I ONLY want the low order bits of a
64 bit product. Currently, all compilers assume this by default.
One doesn't even have a way of getting the high 32 bits if wanted.

e.g.

a = low32(b*c + 1)/d

or

a = high32(b*c + 1)/d
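
For illustration only: low32 and high32 (proposed names from the text
above, not an existing API) can already be written with gcc's
`long long' extension, assuming 32-bit long -- a sketch:

static unsigned long low32(unsigned long long x)
	{ return (unsigned long)x; }
static unsigned long high32(unsigned long long x)
	{ return (unsigned long)(x >> 32); }

/* the mathematically exact reading of a = (b*c + 1)/d: */
long muldivexact(long b, long c, long d)
{
	long long t = (long long)b * c + 1;	/* full 32x32 -> 64 multiply */

	return (long)(t / d);			/* 64/32 divide */
}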

It is still my contention that current languages are woefully inadequate
for implementing integer arithmetic in an EXACT mathematical fashion.
 
The operations proposed here are NOT highly specialized and abstract. They
are basic arithmetic.
 
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

torek@elf.ee.lbl.gov (Chris Torek) (02/27/91)

In article <1991Feb26.172239.10089@watdragon.waterloo.edu>
ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>The people who really love gcc's assembler inlining features are OS writers.

Funny you should mention that....

/* load byte from alternate address space */
static __inline int lduba(void *loc, int asi) {
	int result;

#ifdef PARANOIA
	if ((unsigned)asi > 15)
		panic("lduba asi");
#endif
	__asm __volatile("lduba [%1]%2,%0" : "=r" (result) :
	    "r" (loc), "n" (asi));
	return (result);
}
[sparc]

or:

#define _spl(s) \
({ \
	register int _spl_r; \
\
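	/* fetch the old SR (including IPL), then install the new level */ \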
	asm __volatile ("clrl %0; movew sr,%0; movew %1,sr" : \
		"&=d" (_spl_r) : "di" (s)); \
	_spl_r; \
})

#define	spl1()	_spl(PSL_S|PSL_IPL1)
[680x0]
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

henry@zoo.toronto.edu (Henry Spencer) (02/27/91)

In article <3439.27ca4e40@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>... Every hardware divide instruction
>I have ever seen produces both quotient and remainder in one operation.

Try reading the hardware manuals for the VAX or the NS32k series.  Your
experience seems to be limited.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

gah@hood.hood.caltech.edu (Glen Herrmannsfeldt) (02/27/91)

In the common hardware implementation of the divide instruction
the remainder is generated automatically: it is what is left in
the register after the last subtraction.  Most high-level
languages throw it away.  Most machines generate it anyway,
because it is already there.
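
A toy C model of the usual shift-and-subtract scheme shows why (a
sketch of restoring division, no particular machine, assuming den != 0):

void hw_divide(unsigned num, unsigned den, unsigned *q, unsigned *r)
{
	unsigned rem = 0, quot = 0;
	int i;

	for (i = 31; i >= 0; i--) {
		rem = (rem << 1) | ((num >> i) & 1);	/* bring down next bit */
		quot <<= 1;
		if (rem >= den) {	/* divisor fits: subtract it off */
			rem -= den;
			quot |= 1;
		}
	}
	*q = quot;
	*r = rem;	/* left in the "register", for free */
}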

sef@kithrup.COM (Sean Eric Fagan) (02/27/91)

In article <3439.27ca4e40@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>But I still have to obfuscate the code with
>        q = dividend / divisor
>        r = dividend % divisor  /* even the operator is bad */
>when what I want is
>        (q, r) = dividend divided by divisor

ANSI C has a function div(int, int) which returns a vector.  In the form of
a *struct*, mind you, but gcc has, I believe, some extensions which allow
doing such things more easily.
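
For the record, the ANSI interface (div and its result type div_t are
declared in <stdlib.h>):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	div_t qr = div(17, 5);			/* one call, both results */

	printf("%d %d\n", qr.quot, qr.rem);	/* prints: 3 2 */
	return 0;
}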

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

rex@cs.su.oz (Rex Di Bona) (02/27/91)

In article <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> Since I am not entirely clear as to what the article boils down to, apart
> from ``kym, you are _WAY_ wrong''

Sorry, but yes. The time for a 'divide' in what I would consider a general
purpose machine would outweigh (by a large factor) the time taken for more
normal instructions...

> I ran a program on a Sun3/60 created from the loop given last time with and 
> without adding an extra signed divide (that I made certain was not removed by 
> the optimizer) and found there was a `slow down' of about 30% over 100 reps
> (arithmetic mean). 

So, for each iteration it was 30% for that extra divide, i.e., a divide was
taking 42% of the time for the loop???

> Considering how far a Sun was from the model I indicated in my previous post, 
> I'm a bit amazed it didn't diverge more (from Rex Di Bona's post and other 
> private email I had been expecting maybe orders of magnitude).
> 
> -kym

Oh, I don't know, having a divide be so long is about what I figured...
Your loops are not as long as you might think, as they have those if
statements. Here are the instruction counts...

				For 68K		For R3000
the initial if comparison,	6		6
the *zp++ = M-*yp		5		13
the else if compare.		6		6
the *zp++ = M-*xp		6		11
the fluff for the final else	5		18
Stalls				-		10 (? I'm not sure)

Anyway, looking at how these branches are used, we find (on a sample
run of 100 program executions) the two (non-division) branches are
executed a total of 0 (yes, zero) times, and the division is executed
each time. (Note 1, below) So we have a loop that is 6+6+5 = 17
instructions long, plus the division....  If adding a division increases
the time by 30%, we can conclude that a division is (roughly) equal to
13 instructions (if 17 instructions = 57%, then 42% = about 13)????
(If we assume the divide takes 78 cycles, this works out at an average
of 6 cycles per 68K instruction, not too unreasonable?)  On the R3000
we have (6+6+18)*1.2+10 = 46, for a divide of about 19.3 cycles...
again, not too unreasonable?

Note 1: If you look at your program the conditions to take the top two
decision traces depend _totally_ upon random numbers being equal...
You do _not_,  I repeat _not_ change the seed to this random number
generator between runs, so it _always_ produces the same sequence.
This is a bug(feature??) of rand()? So, you would always be executing
_exactly_ the same code; the differences you measured were in fact the
instabilities in the kernel timing of your program :-(

> 
> P.S. my `back of the envelope' calcs using the above result indicates a real 
> 68k signed divide in a similar loop as before will contribute 10% of the
> total at 68 instructions 
	      ^^^^^^^^^^^^ - Instructions are a bad measure for a CISC, as
the time each takes varies so much.

Um, let's see... (assumption ... an instruction is 6 cycles) we get that we
need 703 additional instructions (apart from the divide) to make adding
an extra divide only 10% more overhead...
68 instructions (68 * 6 = 408 cycles) means an extra 16% overhead.

I ran your program on a MIPS RC3230 and got very similar numbers, which
also supports my premise. The additional divide was taking 20 cycles, which
_completely_ overwhelms the rest of the loop, so adding in that extra divide
is taking a _big_ performance hit!

The important thing is the _relative_ speed of divide compared to more
normal (move, simple ALU) instructions. When the cost of the divide
instruction is much greater than the cost of a 'move' instruction you
have to have _a lot_ of move instructions to compensate for that extra
divide. Unless you have these you lose big. In your case, you have a
relatively small number of 'other' instructions (17 on the 68020, and
30 on the R3000), which meant that the divide instruction was almost
equal in time to all these instructions!  (This is why, when multiplying
by a constant, it is easier to do the shifts and adds instead of the
multiply instruction; SPARC is exempt from this example :-)
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral

peter@ficc.ferranti.com (Peter da Silva) (02/27/91)

Just a random observation:

In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
> Furthermore, there ARE codes which can show much more than a 10% gain
> by having [div/rem] instructions. Most are number-theoretic or cryptographic
> or group-theoretic in nature.

I would think the most common one would be formatting integers into ASCII
strings.
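
The classic conversion loop does one quotient and one remainder by 10
per digit; a sketch (utoa10 is a hypothetical helper that writes digits
backward from the end of a caller-supplied buffer):

char *utoa10(unsigned n, char *end)	/* end = one past the buffer */
{
	*--end = '\0';
	do {
		*--end = '0' + n % 10;	/* remainder: this digit */
		n /= 10;		/* quotient: what remains */
	} while (n != 0);
	return end;
}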
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

hrubin@pop.stat.purdue.edu (Herman Rubin) (02/27/91)

In article <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> In article <2124@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
> >Single Cycle? For a divide instruction? On what machine? If we take a real
> >example here instead (say the 68020, or the R3000 (as I have the info just
> >behind me here, ugh, there.... lets see, the 68020... , divide, here we are:
> 
> Oh, I don't know. I kinda had the impression that there were some
> high performance divide pipelines that did give a result every cycle.
> Milage may vary however. (Perhaps giving example 68000 code was a bit
> misleading).

Of the machines I know, this is not the case.  The CYBER 205, which 
pipelines just about everything, does not pipeline divide, square root
(the same unit), or convert between decimal and binary, which I believe
also uses that unit.  The CRAY does not provide a division at all, presumably
because it could not be pipelined.  It does have a pipelined approximation to
reciprocal, and a pipelined correction factor.

This is very definitely an architecture problem, and not a language problem.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

weaver@jetsun.weitek.COM (Mike Weaver) (02/28/91)

In article <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <2124@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
>Oh, I don't know. I kinda had the impression that there were some
>high performance divide pipelines that did give a result every cycle.
>Milage may vary however. (Perhaps giving example 68000 code was a bit
>misleading).
>
>-kym

Does anyone know of a pipelined divider that gives a result every cycle?
As far as I know, they exist only on paper, and for good reason:
every algorithm I have heard of for an 'array' divider starts with
an iterative algorithm, and simply repeats the hardware for each 
iteration. This means that the amount of hardware is increased by 
the same factor as the increase in throughput, and the latency 
remains the same, which is not very attractive. Also, the actual
number of transistors is horrendous -- it would take perhaps
ten times as many transistors as an array multiplier, which
is a large thing.

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/28/91)

In article <2134@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
[I say I'm not way off even considering we're moving away from my original 
assumptions]
> Sorry, but yes. The time for a 'divide' in what I would consider a general
> purpose machine would outweigh (by a large factor) the time taken for more
> normal instructions...

From my original post I kinda eliminated CISC workstations & such from
consideration, but let's run with Rex's ball anyhow.

[I measured that an extra divide in a loop caused a slow down of 30%]
> So, for each iteration it was 30% for that extra divide, ie, a divide was
> taking 42% of the time for the loop???

No -- 30%.
	1 x divide + other stuff = 1.0
	2 x divide + other stuff = 1.3
	---
	therefore 1 x divide = 0.3 of loop

> I ran your program on a MIPS RC3230 and got very similar numbers, which
> also supports my premise. The additional divide was taking 20 cycles, which
> _completely_ overwhelms the rest of the loop, so adding in that extra divide
> is taking a _big_ performance hit!

Using pixie on a Magnum I find the following with my little loop with
one divide and the whole thing repeated 100 times:

Number of cycles:		8.2e7
Number of instructions:		3.8e7 (i.e. 2.1 c/i av)
Number of divides executed:	2.6%

I also find a divide takes about 4 add times on average.

So we want to know how big the loop is such that an extra divide accounts for 
10% --

	Td
	----------	= 0.1
	2 Td + n Ta

==>	n = 32 where Td = 4 Ta (say).

I still think I'm in the ballpark wrt my previous posts even though we have 
diverged a little wrt the architectural assumptions.

As Rex rightly points out the number of instructions _executed_ in my little 
loop is actually fairly small (which only improves my claim that loops
of around 20 instructions would suffice to swamp a `divide effect'). There are 
also two other tricks -- the number of divides may be much smaller than the 
number of loop iterations (depending on the size of M), and the `divide' 
effect will be masked by an equal number of multiplies (fairly typical of the 
codes originally in question I think). Which all goes to show the importance 
of measurement.

As to what the _typical_ size of loops that contain, e.g., a divide & remainder,
IS -- I have no real idea at this time. The loop I have posted is (actually 
the smallest) from various code that I think may be typical of its type. If 
there is any interest, maybe I can do some more pixifying of other programs & 
come up with some better indication.

To summarize: we have an indication that EVEN WHERE DIVIDES ARE EXPENSIVE 
(e.g. on the 68k's as Rex intimates) the cost of an extra divide in (what I 
guess I'm claiming to be) typical loops is not very high wrt the total time 
of the loop (i.e. 30% in my simple test case of the last post) and rapidly 
less when considering the total running time of the program as a whole.

My original point was that talking about the cost of an extra instruction 
or two IN ISOLATION is (both common and) misleading. The total running cost of 
a program is fairly insensitive to the cost of an individual instruction in a 
loop. I indicated at the time that my argument was overly simplistic, but I'm 
surprised at how little it seems to `diverge' (although you _could_ say
30% is 3 times 10% -- I prefer 30% is only 20% away :-) when altering some of 
the underlying assumptions.

-kym

wca@oakhill.sps.mot.com (William Anderson) (02/28/91)

weaver@jetsun.weitek.COM (Mike Weaver) writes:

>Does anyone know of a pipelined divider that gives a result every cycle?

I was under the impression that HP had designed and implemented a multi-
chip FPU with a (partially?) pipelined divider.  It used a model division
(similar to SRT) on every row.  See the proceedings of the 1986 ISSCC (p. 34)
for a description (and die photo) of this divider as well as a clever tree
multiplier.  Judging by this article, HP could cycle the divider only
half as fast (420 ns) as the fully pipelined multiplier (210 ns).  My,
how the clock rates have changed in 5 years....

> ... Also, the actual number of transistors is horrendous -- it would
>take perhaps ten times as many transistors as an array multiplier, which
>is a large thing.

The numbers listed for the (NMOS) parts mentioned above were ~153K
transistors for the multiplier and ~160K for the divider - not an order
of magnitude by any means.

William Anderson
Motorola 88K Design Group
Motorola MMTG
Austin, TX

tk@wheat-chex.ai.mit.edu (Tom Knight) (02/28/91)

>In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:

>What has frosted me about the languages is that the arithmetic
>operators are all single valued.  I usually want the quotient
>and remainder from a division.  I have to tell the compiler to
>give me the quotient and then tell it to give me the remainder.
>It might be smart enough to notice that it gets both of them in
>one machine operation, but if it is, it is only undoing bad language
>design.

Of course Common Lisp has had just such a function since at least the
early 80's.  In fact, you get your choice of how the quotient is
rounded: floor, ceiling, or truncate.  But we all know that Lisp
isn't a real language 8-].

<obligatory garbage to make certain inferior operating system news programs happy>

ali@f.gp.cs.cmu.edu (Ali-Reza Adl-Tabatabai) (03/02/91)

In article <1991Feb27.220856.4067@oakhill.sps.mot.com> wca@oakhill.sps.mot.com (William Anderson) writes:
>weaver@jetsun.weitek.COM (Mike Weaver) writes:
>
>>Does anyone know of a pipelined divider that gives a result every cycle?
>
>I was under the impression that HP had designed and implemented a multi-
>chip FPU with a (partially?) pipelined divider.  It used a model division
>(similar to SRT) on every row.  See the proceedings of the 1986 ISSCC (p. 34)
>for a description (and die photo) of this divider as well as a clever tree
>multiplier.  Judging by this article, HP could cycle the divider only
>half as fast (420 nS) as the fully pipelined multiplier (210 nS).  My,
>how the clockrates have changed in 5 years....
>
>> ... Also, the actual number of transistors is horrendous -- it would
>>take perhaps ten times as many transistors as an array multiplier, which
>>is a large thing.
>
>The numbers listed for the (NMOS) parts mentioned above were ~153K
>transistors for the multiplier and ~160K for the divider - not an order
>of magnitude by any meams.

Let us compare a radix-4 division algorithm with residuals (partial
remainders) in carry-save form with a radix-4 multiplication algorithm
that keeps its partial products in CS form.  A division step consists
of
	(1) quotient digit selection
	(2) divisor multiple selection
	(3) carry-save addition to produce next partial remainder
A multiplication step consists of

	(1) multiplication of recoded multiplier digit with multiplicand
	(2) carry-save addition to produce next partial product

The only difference in logic between the two is the quotient
digit selection, which can be done based on an estimate of
the partial remainder.  This can be accomplished using a limited
precision adder (5-6 bits, independent of the divider's full precision)
and a PLA.  Therefore if we unwind the steps to produce an array
implementation, there should hardly be a 10 times difference
in the number of transistors.  As the precision becomes larger,
the difference diminishes since the size of the quotient digit
selection logic will remain the same.
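
Both step sequences hinge on carry-save addition, which keeps the
running value as a (sum, carry) pair so that no carry ever ripples.
A word-level toy model in C (a sketch, not gate-level hardware,
assuming 32-bit unsigned):

/* 3-to-2 carry-save add: x + y + z == sum + carry (mod 2^32) */
void csa(unsigned x, unsigned y, unsigned z,
	 unsigned *sum, unsigned *carry)
{
	*sum   = x ^ y ^ z;				/* per-bit sum, no ripple */
	*carry = ((x & y) | (y & z) | (x & z)) << 1;	/* deferred carries */
}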

The speed, however, is a different story, since in the multiplier
the recoding and multiplicand selection can be done in parallel
followed by a propagation through the adder array.  In the divider,
each step depends on the previous partial remainder, so the path
length is longer.  Therefore, in a pipelined implementation, for
a given pipeline step time, you may do much more of the multiplication
than of the division.  Hence you may require more latches for
the divider.

Note that I did not take into account the last carry-assimilation
step for the multiplier and divider, and I did not discuss
recoding of the redundant quotient digits into conventional
form.  After the last multiplication step, the most significant
half of the product is in carry-save form and needs to be 
passed through a carry-assimilate adder.  The last partial 
remainder of the divider will also be in carry-save form.  This
introduces an error in the last bit of the quotient, so that
it must also be assimilated for precise rounding.  The conversion
of the redundant quotient to conventional form may be done
on-the-fly as the digits are produced.  See "On-the-Fly Conversion of
Redundant into Conventional Representations," by Ercegovac and Lang,
IEEE Transactions on Computers, July 1987.


Ali-Reza Adl-Tabatabai	(ali@cs.cmu.edu)