[comp.arch] Divides

oconnor@sunray.steinmetz (Dennis Oconnor) (01/01/70)

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>Are there any references to the GE RISC architecture? It's a new one to me.

  There is a paper in the upcoming GOMAC which is the first general
  disclosure. The system had a blurb in AW&ST about 18 months ago.

>I said: Microcode, or nanocode, has to go through all the same
>operations that assembly level code does.
>
>>  Except fetching instructions, and operations resulting from
>>  the need to handle interrupts or exceptions at arbitrary points
>>  in the assembly code (microcode can lock exceptions out till it completes)
>
>Assembly language doesn't fetch instructions; hardware does it
>automatically. Generally it's microcode that has to fetch the assembly
>language instructions, and hardware that fetches the microcode. Is
>there something to this analogy I'm missing? You're not clear.

  Replacing a microcode sequence with an assembly language sequence
  increases the number of instructions that need to be fetched to
  perform the operation. Good cache design can minimize this penalty.

>Assembly language is perfectly capable of locking out interrupts. You
>may have to be in supervisor state or something to do it, but every
>machine I've ever seen with interrupts has a way to turn them off.
>This is far more flexible than having only specific microcode routines, which
>are locked into ROM, be able to turn off interrupts.

  Microcode doesn't need to be in supervisor state to temporarily
  disable interrupts, doesn't need the user to know interrupts need
  disabling, and doesn't prevent you from providing an interrupt-disable
  to the user. So what does microcode lose you in this respect?

  This is NOT a CISC/RISC argument : I favor RISC, but let's not
  kid ourselves that CISCs don't have some advantages, as we all know.
  That's like comparing trains to trucks : there are wins for each.

>I said: It's VERY difficult to make fixed point division run faster than
>a bit per cycle, without a LOT of hardware. By leaving out the
>special purpose speedup stuff, you can afford to include some VERY
>useful general purpose speedup stuff: More registers ... branch folding ...
>
>>  This is not really true. If you have a fast multiplier ( which is 
>>  a good idea for many applications ) you can do division very much
>>  quicker than one cycle per bit, relatively easily, especially for
>>  long word lengths. In fact, you can do division in something like
>>
>>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>>
>>  where C and K are small positive integer constants dependent on
>>  how you implement your algorithm. The technique to use is
>>  Newton-Raphson iteration with a first-guess look-up table.
>>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>>  The additional hardware needed (besides a fast multiplier) is TWIT.
>
>I am not unaware of the Newton-Raphson divide algorithm.  If you
>think that the look-up table, or the fast multiplier, or the datapath
>logic you have to put around the multiplier to do the Newton-Raphson
>iteration is trivial, then you haven't designed one. I have. It
>isn't.  It is also not necessarily faster than a bit at a time
>divider, depending on your fast multiply time, word length, look-up
>table, etc. I've been told (by someone who was there) that although
>Amdahl insisted on this approach in the 470, it was discovered
>afterwards that it ran slower than the usual serial approach would have.

I specifically stated that IF you had a fast multiplier around,
this was easy, so saying fast multipliers are non-trivial doesn't
contradict me. Now, I HAVE designed everything you say. Really.
It has the look-up table in random logic ( only need a few bits ).
Has a clever normalize-denominator instruction (needed).
Does the iteration in assembly language (it's a RISC machine)
so no additional datapath is required. And will be in silicon RSN.

I hate "you haven't designed one, I have" arguments. Well, let's
make it clear : I (and the rest of the GE team) HAVE DESIGNED A
RISC PROCESSOR WITH NR-ITERATION FOR DIVISION. Details will have
to wait for the official public release. But since YOU have designed
one (I guess) and I have designed one, and we disagree, then
we must have been designing in different contexts (i.e. speed targets,
architecture limits, implementation technology ... ). You show me
your design and context and I'll show you mine (:-).

( side note : don't you think some designers get a raw deal? I mean
here is some world-class designer at Intel, and he has to design
a new competitive processor that's upward-compatible with the 8086.
Will the CS world look at his design and say, "Wow, what a great
job he did given the constraints he was under!"? No, most of the
CS world will just say "What a Kludge!". We're talkin unfair here. )

I never said NR-Iteration was faster than the serial method, I simply
gave an equation for predicting how fast it was. Given an equation for
serial division, like :     C2 + (subtract_latency * word_length),
it's relatively easy to see that sometimes one is faster, sometimes
the other is. No surprise. For us, NR-Iteration was the winner. Your
mileage may vary. But do I care how fast you do an 8-bit divide? No.
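The trade-off between the two formulas is easy to see with concrete numbers. Here is a small Python sketch; the constants and latencies below are illustrative assumptions, not figures from any design in this thread:

```python
import math

def nr_cycles(word_length, mul_latency, C=4, K=1):
    # The NR formula from this thread:
    #   C + multiply_latency * (ceil(log2(word_length)) - K)
    return C + mul_latency * (math.ceil(math.log2(word_length)) - K)

def serial_cycles(word_length, sub_latency, C2=2):
    # The serial (bit-per-cycle) formula:
    #   C2 + subtract_latency * word_length
    return C2 + sub_latency * word_length

# 32-bit divide, 1-cycle subtract: who wins depends on the multiplier.
print(serial_cycles(32, 1))   # 34 cycles
print(nr_cycles(32, 4))       # 20 cycles: NR wins with a 4-cycle multiply
print(nr_cycles(32, 12))      # 52 cycles: serial wins with a 12-cycle multiply
```

As stated above: sometimes one is faster, sometimes the other, with the multiply latency as the deciding factor.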

I said in an article:
>>Of course, the PDP10 didn't have to reorganize code, so it did
>>not have to deal with memory-aliasing problems. 

In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>The PDP-10 didn't have to re-organize code. Neither do RISC
>architectures.  Neither do 370's. But, you might be surprised to find
>that on these machines, re-organizing the code to eliminate
>pipeline interlocks will speed up your code. [Reference deleted]
>{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

Sure, RISCs don't need to reorganize, and CRAYs don't need hard disks,
floppies will work just fine! Semantic Baloney. If by "needs" we
mean "must have in order to be commercially viable", or "must have
to show any advantage over other architecture classes", then RISCs
do need reorganizers. I never said other classes of architecture
(like CISCs) couldn't benefit from them, but they don't exhibit
as large a performance penalty when you don't use one as RISCs do.

And before you say "You haven't written a reorganizer.", well,
it's not finished, but I am on a team writing one NOW. (:-)
--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

oconnor@sunray.steinmetz (Dennis Oconnor) (09/25/87)

( All ellipses ... are mine, and indicate excluded text. DMOC )
In article <6336@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
> ... Most RISC architectures have a divide step instruction, which
> is precisely what underlying microcode would use ...

  Our RISC architecture here at GE has no divide-step or
  multiply-step. We have a better way. More later.

> ... any hardware support in excess of this will inevitably slow
> the basic cycle down (I've been through the exercise).

  No, this is not true. Cycle time is generally dependent on
  some set of critical paths. Hardware that does not interact
  with these critical paths has no effect, unless it creates new
  critical paths. Where your critical paths lie depends heavily
  on implementation technology : could be the ALU, or the
  register file, or the instruction decode ...

> ... Microcode, or nanocode, has to go through all the same
> operations that assembly level code does. 

  Except fetching instructions, and operations resulting from
  the need to handle interrupts or exceptions at arbitrary points
  in the assembly code (microcode can lock exceptions out till it completes)

>... It's VERY difficult to make fixed point division run faster than
> a bit per cycle, without a LOT of hardware. By leaving out the
> special purpose speedup stuff, you can afford to include some VERY
> useful general purpose speedup stuff: More registers ... branch folding ...

  This is not really true. If you have a fast multiplier ( which is 
  a good idea for many applications ) you can do division very much
  quicker than one cycle per bit, relatively easily, especially for
  long word lengths. In fact, you can do division in something like

   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))

  where C and K are small positive integer constants dependent on
  how you implement your algorithm. The technique to use is
  Newton-Raphson iteration with a first-guess look-up table.
  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
  The additional hardware needed (besides a fast multiplier) is TWIT.
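To make the technique above concrete, here is a minimal Python sketch of Newton-Raphson division with a few-bit first guess. This is a floating-point illustration under assumed parameters (the classical 48/17 linear seed, ~4 bits accurate), not the GE implementation, whose details were not public:

```python
import math

def nr_divide(a, b, word_length=32):
    """Approximate a/b via the reciprocal iteration x <- x*(2 - d*x),
    which doubles the number of correct bits each step."""
    # Normalize b into [0.5, 1): the 'normalize-denominator' step.
    e = math.frexp(b)[1]
    d = b / 2.0**e                      # d in [0.5, 1)

    # First guess, ~4 bits accurate: linear fit 48/17 - (32/17)*d.
    # (Stands in for the look-up table / random-logic seed.)
    x = 48.0/17.0 - (32.0/17.0) * d

    # Each pass doubles the accurate bits: ceil(log2(word_length/4)) passes.
    for _ in range(math.ceil(math.log2(word_length / 4))):
        x = x * (2.0 - d * x)           # quadratic convergence

    return a * x / 2.0**e               # undo the normalization

print(nr_divide(355.0, 113.0))   # ~3.14159...
```

Note that the iteration body is just multiplies and a subtract, which is why no extra datapath beyond the fast multiplier is required.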

>> [quote from someone else about allowing registers
>>  to be accessed as memory locations]
>The original PDP-10 from DEC allowed ...  Registers were the first
>16 locations in memory ... instructions could put into the registers
>and executed from them ...

Of course, the PDP10 didn't have to reorganize code, so it did
not have to deal with memory-aliasing problems. 

>The ATT CRISP doesn't have any registers. But, by caching the top of
>the local frame, references to locals are effectively turned into
>register references, and you get register windows as well. You can
>index into these 'registers', byte access them, and reference them
>with short 5-bit fields in the instruction.

  One of the NICE things about registers that are NOT accessible
  as memory is that you can uniquely identify references to a register
  based strictly on the bits in the instruction stream. This is
  crucial to reorganization : you must know when registers are
  modified, since that limits how uses of that register can be moved.
  Memory-alias analysis can be a difficult task, especially if post-
  reorganization linking is supported. How does the CRISP
  reorganizer address this issue ?

  Simple reorganizers ( a contradiction ) deal with memory aliasing
  by forcing serialization of loads with respect to stores. If
  your registers are accessible as memory and you use this scheme
  in your reorganization, you wind up serializing every instruction
  with respect to stores. What's the cost of this ?
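The conservative scheme described above can be sketched in a few lines of Python. This is a hypothetical toy model, not any real reorganizer:

```python
from collections import namedtuple

# Toy instruction: just a kind, e.g. "load", "store", or "alu".
Instr = namedtuple("Instr", "kind")

def may_hoist_above_store(instr, registers_are_memory):
    """Conservatively decide whether `instr` may move above a preceding
    store without alias analysis."""
    if instr.kind in ("load", "store"):
        return False        # might alias the stored location
    if registers_are_memory:
        return False        # every operand is a memory reference too
    return True             # pure register op: provably no alias

add = Instr("alu")
print(may_hoist_above_store(add, registers_are_memory=False))   # True
print(may_hoist_above_store(add, registers_are_memory=True))    # False
```

With memory-mapped registers the second test fails for everything, which is the "serialize every instruction with respect to stores" cost being asked about.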

>{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385


--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

bcase@apple.UUCP (Brian Case) (09/27/87)

In article <7460@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>( All ellipses ... are mine, and indicate excluded text. DMOC )
>In article <6336@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:

>>... It's VERY difficult to make fixed point division run faster than
>> a bit per cycle, without a LOT of hardware. By leaving out the
>> special purpose speedup stuff, you can afford to include some VERY
>> useful general purpose speedup stuff: More registers ... branch folding ...
>
>  This is not really true. If you have a fast multiplier ( which is 
>  a good idea for many applications ) you can do division very much
>  quicker than one cycle per bit, relatively easily, especially for
>  long word lengths. In fact, you can do division in something like
>
>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>
>  where C and K are small positive integer constants dependent on
>  how you implement your algorithm. The technique to use is
>  Newton-Raphson iteration with a first-guess look-up table.
>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>  The additional hardware needed (besides a fast multiplier) is TWIT.

You are neglecting to realize that for many (most?) implementations,
a fast (i.e. parallel) multiplier *is* a LOT of hardware.  If you are
fitting a sufficiently big multiplier on your chip (I want to talk only
about single-chip implementations), then great, let's all use your
technology.  Yes, Newton-Raphson is used lots of places (e.g. Am29300
family, Am29027 floating-point accelerator that I know of personally).

>>The ATT CRISP doesn't have any registers. But, by caching the top of
>>the local frame, references to locals are effectively turned into
>>register references, and you get register windows as well.
>
>  One of the NICE things about registers that are NOT accessible
>  as memory is that you can uniquely identify references to a register
>  based strictly on the bits in the instruction stream. This is
>  crucial to reorganization : you must know when registers are
>  modified, since that limits how uses of that register can be moved.
>  Memory-alias analysis can be a difficult task, especially if post-
>  reorganization linking is supported. How does the CRISP
>  reorganizer address this issue ?

Registers in the CRISP are *completely* transparent to software (well, I
bet there is some way in which they are not completely, but for the
purposes of this discussion, they are *completely*).  To a compiler,
reorganizer, linker, etc. a register reference is a memory reference.
It *does* behoove the compiling system components to know that memory
references with small offsets (less than 32 words away in the implementation
that I know of) from the Stack Pointer will run much faster (so that
local variables can be allocated to have fast access, etc.), but this
is not necessary.  Reorganizers have no artificial constraints placed
on them by the CRISP.  Personally, I think the implementation cost of
the CRISP model is too high, but the transparency achieved is wonderful
(the register file can be expanded/shrunk at will, even a 0 size will
work just fine).

baum@apple.UUCP (Allen J. Baum) (09/29/87)

--------
[]

In article <6336@apple.UUCP> I wrote:
 ... Most RISC architectures have a divide step instruction, which
 is precisely what underlying microcode would use ...

In article <7460@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
>  Our RISC architecture here at GE has no divide-step or
>  multiply-step. We have a better way. More later.

Well, I never said all, I said most. I know of machines that have neither, too.
Are there any references to the GE RISC architecture? It's a new one to me.

I also said: any hardware support in excess of this will inevitably slow
the basic cycle down (I've been through the exercise).

 Yes, I was exaggerating a bit there. Logic which is not on a critical
path will not inevitably slow down the cycle time. But it might all
the same- extra loading, having to fit the logic in somewhere,
causing wires to lengthen... These are all second order effects, but
they exist, and they are real and measurable.

I said: Microcode, or nanocode, has to go through all the same
operations that assembly level code does.

>  Except fetching instructions, and operations resulting from
>  the need to handle interrupts or exceptions at arbitrary points
>  in the assembly code (microcode can lock exceptions out till it completes)

Assembly language doesn't fetch instructions; hardware does it
automatically. Generally it's microcode that has to fetch the assembly
language instructions, and hardware that fetches the microcode. Is
there something to this analogy I'm missing? You're not clear.

Assembly language is perfectly capable of locking out interrupts. You
may have to be in supervisor state or something to do it, but every
machine I've ever seen with interrupts has a way to turn them off.
This is far more flexible than having only specific microcode routines, which
are locked into ROM, be able to turn off interrupts.

I said: It's VERY difficult to make fixed point division run faster than
a bit per cycle, without a LOT of hardware. By leaving out the
special purpose speedup stuff, you can afford to include some VERY
useful general purpose speedup stuff: More registers ... branch folding ...

>  This is not really true. If you have a fast multiplier ( which is 
>  a good idea for many applications ) you can do division very much
>  quicker than one cycle per bit, relatively easily, especially for
>  long word lengths. In fact, you can do division in something like
>
>   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
>
>  where C and K are small positive integer constants dependent on
>  how you implement your algorithm. The technique to use is
>  Newton-Raphson iteration with a first-guess look-up table.
>  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
>  The additional hardware needed (besides a fast multiplier) is TWIT.

I am not unaware of the Newton-Raphson divide algorithm.  If you
think that the look-up table, or the fast multiplier, or the datapath
logic you have to put around the multiplier to do the Newton-Raphson
iteration is trivial, then you haven't designed one. I have. It
isn't.  It is also not necessarily faster than a bit at a time
divider, depending on your fast multiply time, word length, look-up
table, etc. I've been told (by someone who was there) that although
Amdahl insisted on this approach in the 470, it was discovered
afterwards that it ran slower than the usual serial approach would
have.

>>> [quote from someone else about allowing registers
>>>  to be accessed as memory locations]
>>The original PDP-10 from DEC allowed ...  Registers were the first
>>16 locations in memory ... instructions could put into the registers
>>and executed from them ...
>
>Of course, the PDP10 didn't have to reorganize code, so it did
>not have to deal with memory-aliasing problems. 

The PDP-10 didn't have to re-organize code. Neither do RISC
architectures.  Neither do 370's. But, you might be surprised to find
that on these machines, re-organizing the code to eliminate
pipeline interlocks will speed up your code. See "Coding guidelines for
Pipelined Processors" by Rymarczyk of IBM in the ASPLOS I Proceedings.

>>The ATT CRISP doesn't have any registers. But, by caching the top of
>>the local frame, references to locals are effectively turned into
>>register references, and you get register windows as well. You can
>>index into these 'registers', byte access them, and reference them
>>with short 5-bit fields in the instruction.
>
>  One of the NICE things about registers that are NOT accessible
>  as memory is that you can uniquely identify references to a register
>  based strictly on the bits in the instruction stream. This is
>  crucial to reorganization : you must know when registers are
>  modified, since that limits how uses of that register can be moved.
>  Memory-alias analysis can be a difficult task, especially if post-
>  reorganization linking is supported. How does the CRISP
>  reorganizer address this issue ?
>
>  Simple reorganizers ( a contradiction ) deal with memory aliasing
>  by forcing serialization of loads with respect to stores. If
>  your registers are accessible as memory and you use this scheme
>  in your reorganization, you wind up serializing every instruction
>  with respect to stores. What's the cost of this ?

Any ATT people out there that can answer this?

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

fay@encore.UUCP (10/01/87)

In article <7481@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
*In article <6370@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
**I said: It's VERY difficult to make fixed point division run faster than
**a bit per cycle, without a LOT of hardware. By leaving out the
**special purpose speedup stuff, you can afford to include some VERY
**useful general purpose speedup stuff: More registers ... branch folding ...
**
***  This is not really true. If you have a fast multiplier ( which is 
***  a good idea for many applications ) you can do division very much
***  quicker than one cycle per bit, relatively easily, especially for
***  long word lengths. In fact, you can do division in something like
***
***   C + (multiply_latency * (Int_Round_up( log_base2( word_length )) - K))
***
***  where C and K are small positive integer constants dependent on
***  how you implement your algorithm. The technique to use is
***  Newton-Raphson iteration with a first-guess look-up table.
***  "The official divide algorithm of the IBM-360/95 and Cray-1 (I think :-)"
***  The additional hardware needed (besides a fast multiplier) is TWIT.

If the Newton-Raphson is not expensive in hardware, and is faster than
one cycle per bit (~O(log2)), why not always design it into CPUs? (That is,
into CPUs with wide words.)

*It has the look-up table in random logic ( only need a few bits ).
*Has a clever normalize-denominator instruction (needed).
*Does the iteration in assembly language (it's a RISC machine)
*so no additional datapath is required. And will be in silicon RSN.
*

Hey, wait a minute! Are you saying you do almost the whole operation in
assembler? The only extras you need are normalize-denominator instruction
(neat idea) and the look-up table for the initial approximation of the
divisor reciprocal (do you mean RAM or ROM when you say "random
logic"?)???? Wow!!! Whatever happened to doing things the hard way? (:-)

$64 Question: Since NR-Iteration doesn't give accurate remainder, how do
you produce one? Or do you skip remainders altogether?

*... Details will have
*to wait for the official public release...
*

When will this be?

-- 
			peter fay
			fay@multimax.arpa
{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay

tim@amdcad.AMD.COM (Tim Olson) (10/02/87)

In article <1990@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
| In article <7481@steinmetz.steinmetz.UUCP> oconnor@sunray.UUCP (Dennis Oconnor) writes:
| ***  This is not really true. If you have a fast multiplier ( which is 
| ***  a good idea for many applications ) you can do division very much
| ***  quicker than one cycle per bit, relatively easily, especially for
| ***  long word lengths. In fact, you can do division in something like
|
| If the Newton-Raphson is not expensive in hardware, and is faster than
| one cycle per bit (~O(log2)), why not always design it into CPUs? (That is,
| into CPUs with wide words.)

Note that Dennis was talking about CPUs which already have a fast
(parallel) multiplier.  The *extra* expense of Newton-Raphson for this
is small, but the cost of the multiplier is large.  That is why it is
not always designed into CPUs (at least single-chip ones).

| $64 Question: Since NR-Iteration doesn't give accurate remainder, how do
| you produce one? Or do you skip remainders altogether?

Since you have a fast multiplier, remainder is just a multiply and a
subtract, once you have the quotient:

	a = (a/b)*b + a%b  --> a%b = a - (a/b)*b

	--> rem = dividend - quotient*divisor
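Checking the identity in Python (integer quotient assumed truncated toward zero, as in C):

```python
def remainder(dividend, divisor, quotient):
    # One multiply and one subtract, per the identity above.
    return dividend - quotient * divisor

q = 355 // 113                  # pretend this came from the NR iteration
print(remainder(355, 113, q))   # 16, which equals 355 % 113
```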


	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

oconnor@sunray.steinmetz (Dennis Oconnor) (10/06/87)

In article <1990@encore.UUCP> fay@encore.UUCP (Peter Fay) writes:
>If the Newton-Raphson is not expensive ... faster than one cycle per
>bit (~O(log2)), why not always design ... into CPUs with wide words.)

Because you need a fast multiplier first, as has been stated. These
are generally expensive ... BUT WORTH IT (depending on application).

[ quoted from my ( Dennis O'Connor's ) earlier article ]
>*It has the look-up table in random logic ( only need a few bits ).
>*Has a clever normalize-denominator instruction (needed).
>*Does the iteration in assembly language (it's a RISC machine)
>*so no additional datapath is required. And will be in silicon RSN.
>
>Hey, wait a minute! Are you saying you do almost the whole operation in
>assembler? The only extras you need are normalize-denominator instruction
>(neat idea) and the look-up table for the initial approximation of the
>divisor reciprocal (do you mean RAM or ROM when you say "random
>logic"?)???? Wow!!! Whatever happened to doing things the hard way? (:-)

What most people mean by random logic : NAND and NOR gates. After all
the first guess only needs to be some small number ( implementation
dependent ) of bits accurate, and is therefore only dependent on some
small number of bits in the input. It could be done in RAM or ROM, but is
(sometimes) more easily done in "random logic". Constructing random logic
to duplicate the effect of a ROM is left as an exercise for the reader.

The normalization instruction is very neat, only deals with exponents,
and is applied to both the numerator and denominator. It was cheap
to build. (Exponent? But that's a floating point thingy! (hint hint))

There is nothing hard about the way we do division. Next time we'll
do it faster and easier, but there's nothing hard about it now.

>$64 Question: Since NR-Iteration doesn't give accurate remainder, how do
>you produce one? Or do you skip remainders altogether?

Given a numerator, a denominator, and a quotient, obtaining a
remainder is left as an exercise for the reader. The only tools
you can use in this exercise are the ones you needed to do the
division in the first place : a multiplier and a subtractor.

>*... Details will have to wait for the official public release...
>When will this be?

Details on when the official public release will occur will have to wait
for the public release of details on when the public release will occur... :-)

>			peter fay
>			fay@multimax.arpa
>{allegra|compass|decvax|ihnp4|linus|necis|pur-ee|talcott}!encore!fay


--
	Dennis O'Connor 	oconnor@sungoddess.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
        "If I have an "s" in my name, am I a PHIL-OSS-IF-FER?"

baum@apple.UUCP (Allen J. Baum) (10/10/87)

--------
[]
This has turned into a really interesting argument, so here is my contribution
to prolong it.

>[ paraphrased from ( Dennis O'Connor's ) earlier article ]
*A Newton-Raphson iteration divider needs:
*A fast multiplier
*It has the look-up table in random logic ( only need a few bits ).
*Has a clever normalize-denominator instruction (needed).
*Does the iteration in assembly language (it's a RISC machine)
*so no additional datapath is required. And will be in silicon RSN.

The obvious ways to do multi-bit/cycle integer divide require that at
least one of the operands be normalized (like floating point- MSB=1).
I believe that N-R also requires the divisor to be normalized, which
is what the normalizer/denormalizer is used for (as well as floating
point in software).

The N-R algorithm doubles the number of quotient bits on
each iteration. To get an initial seed of N bits, you need a ROM (or
gate equivalent) of 2^(N+2) x N. If you want to do this with gates, N is
going to be less than 8. Suppose it is 4. Then, on the first iteration,
you need a 32 x 4 multiply, followed by 32 x 8, followed by 32x16. You
may need a final 32x32 multiply to be accurate to the last bit; I can't
remember the details.
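The doubling of quotient bits per iteration can be tabulated with a short Python sketch, using the word length and seed width assumed above:

```python
def iteration_widths(word_length=32, seed_bits=4):
    """List the (multiplicand x multiplier) width of each NR iteration,
    given that the accurate bits double each pass."""
    widths, bits = [], seed_bits
    while bits < word_length:
        widths.append((word_length, bits))
        bits *= 2
    return widths

print(iteration_widths())   # [(32, 4), (32, 8), (32, 16)]
```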

This looks pretty good, but I have to expound Baum's approximation
(or lemma, or rule, or something). You can't build a multiplier with
a multiplier (as opposed to multiplicand) of more than 8 bits that
runs faster than a full adder in the same technology (and this is a
carry-save multiplier- obviously, the full add at the end takes a
full-add time). And, your cycle time should not be much longer than a
full-add time; if it is longer, you have an extremely inefficient
design (I'm willing to make 'much longer' be up to a factor of two--
I'm easy), since all the common operations run slower. Even a 'fast'
multiply is a multi-cycle operation with these rules. You can, of
course, pipeline such a multiplier so you can start a new multiply
every cycle. You might even be able to build a full Wallace tree, but
the 'fast' multiplier is likely to exceed the size of the rest of the
processor. Note that all these arguments apply only to general
purpose processors.

Using these guidelines (which certainly could be questioned, but...),
I think the latter stages of the divide iteration will be multi-cycle
operations, or that the cycles will be much longer than they would be
otherwise. This is a little generous, in that I assume that the early
iterations are faster than the later ones, since they need to
multiply fewer bits. This takes a bit of control logic to implement,
and possibly some input multiplexing to the multiplier. On the other
hand, each multiply iteration does need a full carry-propagate adder at its
end, so the divide step must be a lot longer than a simple add.

By the same kinds of reasoning, I don't think that a normalize is a one
cycle operation (for the kinds of short cycles you OUGHT to have). Not only
do you essentially need a 'count leading zeroes' function, but the output
of this has to feed a shifter. It can be done as a five-step process
(are the first 16 bits=0? If so, <<16. Now, are the first 8 bits=0? .....).
This is getting perilously close to full-add speeds. Of course, I suppose
you might have a slow adder....
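The five-step process described above amounts to a binary-search count-leading-zeros; here is a Python sketch for 32 bits (illustrative only, not any particular hardware):

```python
def count_leading_zeros_32(x):
    """Count leading zeros of a nonzero 32-bit value by binary search:
    test the top half, conditionally shift, halve the test width."""
    assert 0 < x < 2**32
    n = 0
    for width in (16, 8, 4, 2, 1):
        if x < 2**(32 - width):   # are the top `width` bits all zero?
            x <<= width           # shift them out
            n += width
    return n

print(count_leading_zeros_32(1))        # 31
print(count_leading_zeros_32(2**31))    # 0
```

Each of the five steps is a compare feeding a conditional shift, which is why the sequence gets close to full-add speeds in hardware.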

In an earlier posting, the statement was made that replacing a
microcode sequence with assembly language increases the number of
instructions that must be fetched.

I suppose that if you have microcode that is sufficiently tailored to
a particular problem (i.e. emulating some assembly language), then it
would probably be shorter. For general sorts of operations (like
Loading, Storing, Branching, the kinds of things computers generally
do) this is not the case. If a fixed microcode sequence is used to
emulate some assembly language which implements some operation, then the
microcode sequence will likely take longer.

If, in fact, your microcode sequences are always shorter than assembly
language, then you should do all your programming in microcode and call
it assembly language. Just like the RISCs do.

Finally, I agree that microcode doesn't need to be in supervisor
state to turn off interrupts (as opposed to most assembly language
level machines), so microcode doesn't lose you anything. But the question is:
what does microcode GAIN you? I will, of course, concede that there are places
outside an operating system kernel that need to do it. I don't know that
I would concede that there are enough places outside a kernel that need it
badly enough to make any performance difference, and that's the bottom line.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385