[comp.arch] RISC v. CISC --more misconceptions

baum@Apple.COM (Allen J. Baum) (10/18/88)

[]
>In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
>A while back, I was really hot on the idea of RISC.  Then a friend 
>pointed out a few things that set me straight...
>
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

  Some are trying to do it, but they can't do it as well as a RISC processor
  because they don't have room on the chip: they're busy supporting
  instructions and addressing modes that hardly ever get used.

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

  First of all, not necessarily true. Second of all, so what?  The number
  of instructions used by RISCS has been greatly exaggerated. First generation
  RISCS may have a code expansion of 50% in bytes. However, this is a static
  measurement. It doesn't mean that the inner loops will use twice as many
  cycles. And, since the datapath of a RISC machine is simpler, its cycles
  are likely faster.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

  Perhaps, but the microcode is only good for one thing: interpreting macro-
  code. The code in a cache is good for whatever I happen to be executing
  at the time.

>As an aside, the 68030 can do a 32 bit multiply in about (If I remember 
>correctly -- I don't have the book in front of me) 40 cycles.  A while
>back, I tried to write a 32 bit multiply macro that would take less 
>than the 40 or so that the '030 took.  I didn't even come close (even 
>assuming lots of registers and a 32 bit word size (which the 6502 
>doesn't have)).  

Check out the ASPLOS II proceedings. There is an article by Dan Magenheimer
of HP explaining how they do 11-cycle average multiplies, without a multiplier,
and without a multiply-step instruction. Besides, if you have a RISC machine,
and multiply is truly important, then you have room for multiply support, up
to having a full multiplier. You may find, however, that it won't make any
difference in your performance, because no one needs an integer multiplier very
often, like a lot of the things that RISC designers have left out.

--
{decwrl,hplabs}!nsc!baum@apple.com		(408)973-3385

snoopy@sopwith.UUCP (Snoopy T. Beagle) (10/31/88)

In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
| You may find, however, that it won't make any difference in your
| performance because no one needs an integer multiplier very
| often, like a lot of the things that RISC designers have left out.

I humbly disagree.  Just because *you* never use an integer multiply
does not imply that no one else ever does.

    _____     
   /_____\    Snoopy
  /_______\   
    |___|     tektronix!tekecs!sopwith!snoopy
    |___|     sun!nosun!illian!sopwith!snoopy

cik@l.cc.purdue.edu (Herman Rubin) (10/31/88)

In article <40@sopwith.UUCP>, snoopy@sopwith.UUCP (Snoopy T. Beagle) writes:
> In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
> | You may find, however, that it won't make any difference in your
> | performance because no one needs an integer multiplier very
> | often, like a lot of the things that RISC designers have left out.
> 
> I humbly disagree.  Just because *you* never use an integer multiply
> does not imply that no one else ever does.
>
There are many other operations which are cheap in hardware and expensive in
software.  Maybe we should have a "competition" to see how many operations
we can come up with for which silicon can do a quick, efficient, accurate job,
but which are clumsy and expensive in software.

I will toss in a few for starters.

	Find the distance to the next one in a bit stream FAST.  It would
be good to have an exception handler if one is not found.  I am considering
algorithms not worth implementing if the operation is slow.

	Divide an integer by an integer, obtaining a quotient and a remainder,
the choice of remainder depending on the signs of the dividend and the divisor.

	Divide a floating point number by a floating point number, obtaining
an integer quotient and a floating point remainder.  It would be nice to have
the choice of remainder depending on the signs here also.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

david@ms.uky.edu (David Herron -- One of the vertebrae) (10/31/88)

Yea really ... and compilers use integer multiplies all the
time when referencing into arrays...

I for one would like to see floating point tossed out, since I
never use it (and I happen to dislike my numerical analysis class :-)
-- 
<-- David Herron; an MMDF guy                              <david@ms.uky.edu>
<-- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<--
<-- Controlled anarchy -- the essence of the net.

david@sun.uucp (David DiGiacomo) (11/01/88)

In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software. ... I will toss in a few for starters.

I think you're going to have to find some better examples...

>	Find the distance to the next one in a bit stream FAST.  It would
>be good to have an exception handler if one is not found.  I am considering
>algorithms not worth implementing if the operation is slow.

This is very cheap in software (runs at close to memory bandwidth).  Find
the first nonzero word, then the first nonzero byte in that word, then do
a table lookup.  If it really has to be FAST, use a 16 bit table.
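In C, the word-scan-plus-table scheme looks something like this (a sketch I'm
adding, not the poster's code; the 8-bit table and LSB-first bit order within
a word are my assumptions):

```c
#include <stdint.h>

/* ffs_table[b] = position (0-7) of the lowest set bit of byte b;
   entry 0 is never consulted. */
static int ffs_table[256];

static void init_table(void) {
    for (int b = 1; b < 256; b++) {
        int pos = 0;
        while (!((b >> pos) & 1))
            pos++;
        ffs_table[b] = pos;
    }
}

/* Distance in bits to the first set bit of the stream, or -1 if none
   (the "exception handler" case). */
int find_first_one(const uint32_t *words, int nwords) {
    for (int i = 0; i < nwords; i++) {
        uint32_t w = words[i];
        if (w == 0)
            continue;  /* this scan is what runs at memory bandwidth */
        for (int byte = 0; byte < 4; byte++) {
            unsigned b = (w >> (8 * byte)) & 0xff;
            if (b)
                return i * 32 + byte * 8 + ffs_table[b];
        }
    }
    return -1;
}
```

A 16-bit table replaces the inner byte loop with one or two lookups, at the
cost of a 64K-entry table.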

>	Divide an integer by an integer, obtaining a quotient and a remainder,
>the choice of remainder depending on the signs of the dividend and the divisor.
>
>	Divide a floating point number by a floating point number, obtaining
>an integer quotient and a floating point remainder.  It would be nice to have
>the choice of remainder depending on the signs here also.

These operations are *extremely* expensive in hardware.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com

baum@Apple.COM (Allen J. Baum) (11/01/88)

[]
>In article <40@sopwith.UUCP> snoopy@sopwith.UUCP (Snoopy T. Beagle) writes:
>In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>| You may find, however, that it won't make any difference in your
>| performance because no one needs an integer multiplier very
>| often, like a lot of the things that RISC designers have left out.
>
>I humbly disagree.  Just because *you* never use an integer multiply
>does not imply that no one else ever does.

Oh, please!
I didn't say no one.
I didn't say ever.
Of course there are applications that are integer multiplication
intensive (as opposed to floating point).
What I did say is that they are quite rare.

Integer-multiply intensive is defined (here and now, by me) to
be an application that will suffer a performance degradation of more
than 3% without a fast hardware multiplier (2-3 cycles, vs. the
average 11 cycles that HP can do in pure software).  A back-of-the-envelope
calculation will show that means .3% of instructions are multiplies,
which is pretty high for multiply.  Most integer multiplies that I am
aware of are used for index scaling and other address calculations.
Good optimizing compilers will strength-reduce these away.
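As an illustration of the strength reduction being claimed, here is the usual
array-walk transformation, sketched in C (my illustration, not from the
posting; the function names are made up):

```c
/* Before: each iteration implies a multiply (i * sizeof(int)) in the
   address calculation a + i*4. */
long sum_indexed(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After strength reduction: the per-iteration multiply becomes a
   running pointer increment, i.e. an add. */
long sum_reduced(const int *a, int n) {
    long s = 0;
    const int *p = a;
    const int *end = a + n;
    while (p < end)
        s += *p++;
    return s;
}
```

An optimizing compiler performs this rewrite automatically, which is why the
"obvious" multiply never appears in the generated inner loop.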

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

kyriazis@rpics (George Kyriazis) (11/01/88)

In article <10471@s.ms.uky.edu> you write:
>Yea really ... and compilers use integer multiplies all the
>time when referencing into arrays...
>
>I for one would like to see floating point tossed out, since I
>never use it (and I happen to dislike my numerical analysis class :-)
>-- 

Floating point is getting faster and faster every day.  One example that
I know of is the AT&T DSP32 Digital Signal Processor.  It needs 1 CPU
cycle to do a floating point multiplication and accumulation.  An
interesting side-effect is in the calculation of array referencing.
It expands the array index to FP, does the array indexing, and then
truncates the resulting address to an integer.  (The compiler I am
talking about is the one that AT&T supplies for their
Pixel Machine.)


-- 

  George Kyriazis
  kyriazis@turing.cs.rpi.edu
------------------------------


ok@quintus.uucp (Richard A. O'Keefe) (11/01/88)

In article <10471@s.ms.uky.edu> david@ms.uky.edu (David Herron -- One of the vertebrae) writes:
>I for one would like to see floating point tossed out, since I
>never use it (and I happen to dislike my numerical analysis class :-)

But it _has_ been tossed out of many designs:  the IBM RT PC (at any
rate the old ones) had no floating-point instructions, but used a
National chip (32081?). The AMD 29000 manual says that floating-point
ops are currently extracodes emulated by a trap handler.  And of course
there are all the 680x0, 320xx, 80*86 and so on, with floating-point
done by a co-processor.

cik@l.cc.purdue.edu (Herman Rubin) (11/01/88)

In article <19762@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> []
> >In article <40@sopwith.UUCP> snoopy@sopwith.UUCP (Snoopy T. Beagle) writes:
> >In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
> >| You may find, however, that it won't make any difference in your
> >| performance because no one needs an integer multiplier very
> >| often, like a lot of the things that RISC designers have left out.
> >
> >I humbly disagree.  Just because *you* never use an integer multiply
> >does not imply that no one else ever does.
> 
> Oh, please!
> I didn't say noone.
> I didn't say ever.
> Of course there are applications that are integer multiplication
> intensive (as opposed to floating point).
> What I did say is that they are quite rare.

They are rare because a good programmer knows that they are slow and
difficult to program.  If an operation takes 10 lines of code, each of
which is expanded into one or more hardware instructions, why use it
unless you cannot find a way around it?  The way around it may be clumsy,
but it is very likely to beat the unavailable operation.

> Integer-multiply intensive is defined (here and now, by me) to
> be an application that will suffer a performance degradation of more
> than 3% without a fast hardware multiplier (2-3 cycles, vs. the
> average 11 cycles that HP can do in pure software).  A back-of-the-envelope
> calculation will show that means .3% of instructions are multiplies,
> which is pretty high for multiply.  Most integer multiplies that I am
> aware of are used for index scaling and other address calculations.
> Good optimizing compilers will strength-reduce these away.

If the double-precision product of two single-precision integers is required,
and only single-precision products are available, it is necessary to go to
single-precision products of half-precision numbers.  This takes about 20
instructions.  How does the poster expect to do it in an average of 11 cycles?
Many of these jobs are not being done, or are being kludged by finding ways to
accomplish more-or-less the same results in 10 instructions.  And if a
subroutine call is made, double the time.

Many mathematical computations should be made in fixed-point arithmetic.  If
one does not have the hardware available, the cost is much greater than
floating point.  If the hardware is available, it is much cheaper.  None
of the major languages support fixed point.  So none of the hardware gurus
put it in, so none of the machines have it, so no one programs in it, so
the inclusion of it is objected to as a waste of resources, etc.

Another hardware operation missing on most machines is square root.  So one
does not use algorithms requiring square roots.  Again, the circular
argument.

An application using accurate arithmetic heavily will be spending most of its
time in multiple-precision subroutines, even with good hardware.  A time
penalty of a factor of 10 here is obviously costly.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

guy@auspex.UUCP (Guy Harris) (11/02/88)

>Yea really ... and compilers use integer multiplies all the
>time when referencing into arrays...

Integer multiplies by *constants*, most of the time; you can often do
those better with shifts and adds than you can with the machine's
multiply instruction (and no, I'm not talking about the 68010, which
doesn't have a 32x32 multiply instruction).
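For example, the shift-and-add decompositions look like this in C (the
constants 10 and 100 are my choice of illustration):

```c
#include <stdint.h>

/* x * 10 without a multiply instruction: 10 = 8 + 2, so two shifts
   and an add. */
static uint32_t mul10(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* x * 100: 100 = 64 + 32 + 4, so three shifts and two adds. */
static uint32_t mul100(uint32_t x) {
    return (x << 6) + (x << 5) + (x << 2);
}
```

A compiler scaling an array index by a constant element size can emit exactly
this kind of sequence, which on many machines beats the general multiply
instruction.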

baum@Apple.COM (Allen J. Baum) (11/02/88)

[]
>In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
(in response to me about integer multiplies)
>They are rare because a good programmer knows that they are slow and
>difficult to program.  If an operation takes 10 lines of code, each of
>which is expanded into one or more hardware instructions, why use it
>unless you cannot find a way around it?  The way around it may be clumsy,
>but it is very likely to beat the unavailable operation.

I'm going further than that. I'm saying they are rare because they are
unnecessary. They are rare because in the USUAL case they can be
strength-reduced to additions by an optimizing compiler. This is faster
than using the obvious multiply instruction.

Integer numeric intensive applications are relatively RARE. You
(Herman Rubin personally) may do them all the time, in which case you
don't think they are rare, but as a percentage of programs run by all
computers, they are.

>If the double-precision product of two single-precision integers is required,
>and only single-precision products are available, it is necessary to go to
>single-precision products of half-precision numbers.  This takes about 20
>instructions.  How does the poster expect to do it in an average of 11 cycles?
>Many of these jobs are not being done, or are being kludged by finding ways to
>accomplish more-or-less the same results in 10 instructions.  And if a
>subroutine call is made, double the time.

First of all, the "if double-precision is required" is a BIG if. It isn't,
usually (i.e. rare-- see above definition of rare).
Secondly, many multipliers do not support single x single -> double.
Thirdly, most languages don't support it either.
And fourthly (sp?), the fact that the operation yields a double-precision
result doesn't mean that all bits are significant. When they aren't (see
rare, above), software can shortcut the operation, making it end faster,
ON THE AVERAGE.
Fifthly, the requirement for double-precision INTEGER results is rare
(see rare, above...)

>Many mathematical computations should be made in fixed-point arithmetic.  If
>one does not have the hardware available, the cost is much greater than
>floating point.  If the hardware is available, it is much cheaper.  None
>of the major languages support fixed point.  So none of the hardware gurus
>put it in, so none of the machines have it, so no one programs in it, so
>the inclusion of it is objected to as a waste of resources, etc.
>
>Another hardware operation missing on most machines is square root.  So one
>does not use algorithms requiring square roots.  Again, the circular
>argument.
>
>An application using accurate arithmetic heavily will be spending most of its
>time in multiple-precision subroutines, even with good hardware.  A time
>penalty of a factor of 10 here is obviously costly.

I agree that language and hardware don't support multiple precision fixed
point arithmetic. The circular argument is valid, but I maintain that it starts
because, as a percentage of all programs any particular family of processors
will run, these kinds of problems are exceedingly rare. 

Square root is in the same category as divide. Hardware is slow, so algorithms
tend to avoid them. The reason is fundamental: the hardware is slow, and it
is exceedingly difficult to make it faster. Strangely enough, floating-point
divide can be made to run much faster, because of its normalized operands.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

PLS@cup.portal.com (Paul L Schauble) (11/02/88)

> Find the distance to the next 1 in a bit stream FAST

Scan for the first non-zero word. Load word and do a floating normalize. Look
at the exponent. 

  ++PLS

seanf@sco.COM (Sean Fagan) (11/02/88)

In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>Maybe we should have a "competition" to see how many operations
>we can come up with for which silicon can do a quick, efficient, accurate job,
>but which are clumsy and expensive in software.
>	Find the distance to the next one in a bit stream FAST.  It would
>be good to have an exception handler if one is not found.  I am considering
>algorithms not worth implementing if the operation is slow.
>	Divide an integer by an integer, obtaining a quotient and a remainder,
>the choice of remainder depending on the signs of the dividend and the divisor.
>	Divide a floating point number by a floating point number, obtaining
>an integer quotient and a floating point remainder.  It would be nice to have
>the choice of remainder depending on the signs here also.

Quick background:  Cyber 170's have 24 registers: A, B, and X, eight of each
(0-7).  A and B registers are 18 bits wide, and are, generally, used for
address variables.  X registers are 60-bits wide, and are usually used for
computation.  To load a value into X<n>, you load the address into A<n>,
where n=[1,5].  To store a value from X<n>, you load the address into A<n>,
where n=[6,7].  A0 has no special values, and B0 is a hardwired 0.

Ok, a few from the Wonderful World of the Cyber:

	Count the set bits in a word (register or memory, it doesn't
matter).  Very useful for some trivial applications (such as playing
Othello), but I haven't seen much else done with it.  It was put in, it
looks like, because the hardware was already there (in the form of parity
checking), but I could be wrong.
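Without such an instruction, the usual software fallback is the parallel
shift-and-mask reduction, sketched here in C (my illustration; it has nothing
to do with how the Cyber actually did it, and it deliberately avoids a
multiply):

```c
#include <stdint.h>

/* Count the set bits in a 32-bit word by summing adjacent fields:
   first 2-bit sums, then 4-bit, 8-bit, and finally the whole word. */
int popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                /* 2-bit field sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums  */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit field sums  */
    x += x >> 8;                                      /* add the bytes...  */
    x += x >> 16;
    return x & 0x3f;                                  /* result is 0..32   */
}
```

About a dozen register operations, versus one instruction on the Cyber: cheap
enough that it rarely matters, expensive enough to notice in an inner loop.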

	Given the, um, interesting setup described above, write a routine to
save all of the registers into memory (hint:  use shifts).  It ain't easy,
and I don't remember how to do it, so don't mail me 8-).  It is, however,
trivial to do in hardware.


Who else has some fun examples?

>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907

-- 
Sean Eric Fagan  | "Engineering without management is *ART*"
seanf@sco.UUCP   |     Jeff Johnson (jeffj@sco)
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/02/88)

In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>> Of course there are applications that are integer multiplication
>> intensive (as opposed to floating point).
>> What I did say is that they are quite rare.
>
>They are rare because a good programmer knows that they are slow and
>difficult to program.  

Integer multiplication hard to program ? Slow ? Is this really what is
meant ?

>
>> Integer-multiply intensive is defined (here and now, by me) to
>> be an application that will suffer a performance degradation of more
>> than 3% without a fast hardware multiplier (2-3 cycles, vs. the
>> average 11 cycles that HP can do in pure software).  A back-of-the-envelope
>> calculation will show that means .3% of instructions are multiplies,
>> which is pretty high for multiply.  Most integer multiplies that I am
>> aware of are used for index scaling and other address calculations.
>> Good optimizing compilers will strength-reduce these away.
>
>If the double-precision product of two single-precision integers is required,
>and only single-precision products are available, it is necessary to go to
>single-precision products of half-precision numbers.  This takes about 20
>instructions.  How does the poster expect to do it in an average of 11 cycles?
>Many of these jobs are not being done, or are being kludged by finding ways to
>accomplish more-or-less the same results in 10 instructions.  And if a
>subroutine call is made, double the time.

I believe the poster is referring to numerous HP publications (open lit
and manuals). Their algorithm is quite clever and makes use of the
fact that certain multipliers are much more common than others.
Special instructions in Spectrum are employed in conjunction with
delayed branches to perform multiply in a max of 11 cycles (and often
one wins and it is less). I do not think that a discussion of DP was meant.
>
>Many mathematical computations should be made in fixed-point arithmetic.  

NO!  I have witnessed far too much of this in DoD embedded computers.
The fixed-point math saved some hardware, but the software was awful.
Because of the handsprings fixed-point math required, it was not
possible to focus on the real big issues (like: is this algorithm
numerically stable?  Can we cut the compute cost by a factor of 10 by
altering the problem?).  Fixed-point math is sometimes appropriate,
but if anything too much is done in fixed point.

>If
>one does not have the hardware available, the cost is much greater than
>floating point.  If the hardware is available, it is much cheaper.  None
>of the major languages support fixed point.  So none of the hardware gurus
>put it in, so none of the machines have it, so no one programs in it, so
>the inclusion of it is objected to as a waste of resources, etc.

Check out DSP chips. Check out generation after generation of military
chips. Fixed point is very common in some environments. PL/I, PL/"x"
(dialects), and misc special mil languages support it. People code in
it, and it is usually a bad design choice.

>
>Another hardware operation missing on most machines is square root.  So one
>does not use algorithms requiring square roots.

Well, I came from the world of Kalman filtering, where the best
algorithms tend to use square roots (though sometimes these can be
avoided). Sqrt improves the numerical reliability of many algorithms (when
employed correctly), which more than makes up for its speed: if you get
the right answer in fewer iterations, the extra cost doesn't
necessarily matter. I have not done a head count, but many machines
have sqrt (8087, 6888x, VAX FPA, IBM FPA, Univac).

>
>An application using accurate arithmetic heavily will be spending most of its
>time in multiple-precision subroutines,

Not if good algorithms are employed. For example, in the early '70s, JPL'ers G.
Bierman and K. Thornton proved that UD and SRIF mechanized Kalman
filters could be run in SP (36 bits) rather than the DP (72 bits)
required by competing algorithms to achieve superior results. Better
algorithm, fewer bits, better results.

Better math logic (like IEEE machines with extended accumulators)
won't spend any time in extended-precision routines. The VAX FPA and IBM FPAs
also do their math in 64 bits, so that the penalty for extended
precision is the cost of moving more bits to memory (and more paging, etc.).




Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

firth@sei.cmu.edu (Robert Firth) (11/02/88)

In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software...I will toss in a few for starters.

>	Find the distance to the next one in a bit stream FAST.  It would
>be good to have an exception handler if one is not found.  I am considering
>algorithms not worth implementing if the operation is slow.

This is implemented in hardware on the MC68020 by the 'BFFFO' instruction.
It takes 18 cycles to find the first bit in a 32-bit register, and up to
32 cycles to find the first bit in a 32-bit bitfield in memory.

I find it hard to believe that a software algorithm on a RISC machine
would do worse.  In the register case, for instance, a simple binary
chop takes at most 5 steps, and each step is just (and-immediate-with-mask;
branch if result zero).  One can get more cunning, but that gives you
the answer in about 12 cycles.
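The binary chop can be sketched in C like so (my rendering of the idea,
counting the offset from the MSB as BFFFO does; not Firth's code, and each
`if` corresponds roughly to one mask-test/branch step):

```c
#include <stdint.h>

/* Offset (0 = MSB) of the first set bit of x; 32 if x is zero.
   Five mask-and-branch steps, halving the search window each time. */
int bfffo(uint32_t x) {
    if (x == 0) return 32;
    int off = 0;
    if (!(x & 0xffff0000u)) { off += 16; x <<= 16; }
    if (!(x & 0xff000000u)) { off += 8;  x <<= 8;  }
    if (!(x & 0xf0000000u)) { off += 4;  x <<= 4;  }
    if (!(x & 0xc0000000u)) { off += 2;  x <<= 2;  }
    if (!(x & 0x80000000u)) { off += 1; }
    return off;
}
```

On a machine issuing one simple operation per cycle, that is on the order of
a dozen cycles, which is the comparison being made against the 68020's 18.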

tim@crackle.amd.com (Tim Olson) (11/03/88)

In article <7575@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
| In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
| >There are many other operations which are cheap in hardware and expensive in
| >software...I will toss in a few for starters.
| 
| >	Find the distance to the next one in a bit stream FAST.  It would
| >be good to have an exception handler if one is not found.  I am considering
| >algorithms not worth implementing if the operation is slow.
| 
| This is implemented in hardware on the MC68020 by the 'BFFFO' instruction.
| It takes 18 cycles to find the first bit in a 32-bit register, and up to
| 32 cycles to find the first bit in a 32-bit bitfield in memory.

"hardware"?  Well, microcode anyway.  If it were implemented in
hardware, it could be done in a single cycle, as it is on the Am29000 with
the CLZ (Count Leading Zeros) instruction.

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

phil@diablo.amd.com (Phil Ngai) (11/03/88)

In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software...I will toss in a few for starters.
>	Find the distance to the next one in a bit stream FAST.  It would

The Am29000 has a CLZ instruction which runs in one clock cycle(33 ns):

Operation: 	Determine number of leading zeros in a (32-bit) word.
Description:	A count of the number of zero-bits to the first one-bit
		in the SRCB operand is placed into the DEST location. 
		If the MSB of the SRCB operand is 1, the resulting
		count is zero. If the SRCB operand is zero, the
		resulting count is 32.




"In the West, to waste water is not to consume it, to let it flow unimpeded 
and undiverted down rivers. Use of water is, by definition, beneficial use."
(from _Cadillac Desert_)
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com

ok@quintus.uucp (Richard A. O'Keefe) (11/03/88)

In article <19811@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
[Talking about integer multiplication and division.]
>I'm going further than that. I'm saying they are rare because they are
>unnecessary. They are rare because in the USUAL case they can be strength
>reduced to additions by an optimizing compiler. This is faster than using
>the obvious multiply instruction.

Did you notice the implicit assumption that multiplications are only
for address calculations?  Avoidable multiplications are rare because
a generation of programmers has been brainwashed that Hardware Rules,
and if some potentially useful operation is expensive it is their job
to avoid it rather than have the hardware and compiler people get it
right.  People are still avoiding procedure calls (and RISC designers
are assuming that procedure calls are not deeply nested) because old
designs made procedure calls expensive.

The one which is really painful is division.  When one codes up a hash
table, one knows (having read the literature) that remainder with a
prime is a Good Thing.  But one also knows that whizzbang machine X
has no hardware support for division, so to avoid a subroutine call to
a routine not known for its speed one sighs, puts in X & 4095 (instead
of X1 % 4097 or whatever), and wishes...
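The two bucket computations being traded off look like this in C (table sizes
are illustrative; 4093 is a nearby prime, and the names are mine):

```c
#include <stdint.h>

#define TABLE_SIZE  4096   /* power of two: bucket is one AND        */
#define TABLE_PRIME 4093   /* nearby prime: better mixing, needs '%' */

/* Cheap bucket: a single AND, but keys differing only in the high
   bits all land in the same bucket. */
unsigned bucket_mask(uint32_t key)  { return key & (TABLE_SIZE - 1); }

/* Prime bucket: one integer remainder, which on a machine without
   divide hardware means the slow subroutine call lamented above. */
unsigned bucket_prime(uint32_t key) { return key % TABLE_PRIME; }
```

The distortion described in the text is choosing `bucket_mask` not because it
hashes better, but because `bucket_prime` is perceived as too expensive.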

But to be realistic about this, let's compare a couple of CISCs with
what a good RISC might do.  The issue is not absolute speed, but the
intensity of the temptation to distort your code to avoid a function
perceived as expensive.  I measure this as cost/(cost of ADD).

		MC68020	80386	generic	R2000	WBMX	88k
MULS.L/ADD.L	~ 20	~ 5-10	~18	14	~44	4
DIVS.L/ADD.L	~ 45	~ 20	~35	35	~150	39+possible trap

MC68020 figures from manufacturer's manual
80386   figures from manufacturer's manual
generic figures assume 2-bit-at-a-time multiply step, 1-b.a.a.t. divide step
WBMX	multiply from Whizzbang Ltd's manual, divide _estimated_ from manual;
	figures include procedure call overhead.  WBMX has no divide step.
88k	figures from article <4759@pdn.UUCP> (Alan Lovejoy)
R2000	figures from article <7472@winchester.mips.COM> (Charlie Price)
	(1 for main op, + delay time 12 or 33, + 1 to pick up result)
The 80386 and 88k multiplies deliver 32 bits, the others 64.
The R2000 figures are worst case: other integer operations can be
overlapped with all but two of these cycles.  It would be interesting
to know how often this pays off.

The bottom line is that architectures should _support_ the operations
programmers find useful, but that some architects have shown that good
enough support can be had by doing part of an operation in hardware,
part in software.  Too bad about Whizzbang.

yuval@taux02.UUCP (Gideon Yuval) (11/03/88)

If you want to find the first (bottom) "1" bit in a machine word X, try:

table [  (X xor (X-1)) mod 37 ] 

Proof is left as an exercise for the reader, as is construction of the array
"table".

Thus, any machine that can implement "C" can do "find first bit" with
reasonable efficiency.

N.B. On a 16-bit CPU, replace 37 by 19.

(I replaced "^" by "xor", and "%" by "mod", to avoid illegibility).
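A sketch in C of how the table can be built and used (this fills in the
construction the posting leaves as an exercise; the names are mine):

```c
#include <stdint.h>

/* table[(x ^ (x-1)) % 37] = position of the lowest set bit of x.
   For x with lowest set bit k, x ^ (x-1) = 2^(k+1) - 1, and these 32
   values are all distinct mod 37 because 2 has order 36 mod 37. */
static int ffs_tab[37];

static void build_table(void) {
    for (int k = 0; k < 32; k++) {
        uint32_t x = (uint32_t)1 << k;
        ffs_tab[(x ^ (x - 1)) % 37] = k;
    }
}

/* Position (0 = LSB) of the lowest set bit of nonzero x. */
int first_one(uint32_t x) {
    return ffs_tab[(x ^ (x - 1)) % 37];
}
```

So the cost is a subtract, an xor, one integer remainder, and a load, on any
machine with those operations.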
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
                                                TWX: 33691, fax: +972-52-558322

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (11/03/88)

In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software.  Maybe we should have a "competition" to see how many operations
>we can come up with for which silicon can do a quick, efficient, accurate job,
>but which are clumsy and expensive in software.
>
>I will toss in a few for starters.
>
>	Find the distance to the next one in a bit stream FAST.  It would
>be good to have an exception handler if one is not found.  I am considering
>algorithms not worth implementing if the operation is slow.

I once had a program that needed the above algorithm (I think we
ended up using a floating point normalize, which was still fairly slow).
In the same program, one of the algorithms that we really wanted was

	Find all legal moves for an Othello board.

If you have one 72-bit register, this can be done in roughly 100 instructions.
With dedicated silicon, I think you can implement this with something
like 1024 gates.  Depending on the exact implementation of the silicon,
using the dedicated silicon would probably take about 10 instructions.

If that instruction is too high level, there are interesting things
one can do if a few instructions existed for permuting the bits in
a register.  For example:  pretend that the register holds an 8x8
board.  Now, rotate the board 90 degrees.  Or reflect the board about
an axis.  Or, permute the board so that the diagonals through the
board are laid end to end...

-- Chuck

cik@l.cc.purdue.edu (Herman Rubin) (11/03/88)

In article <7575@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
> In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >There are many other operations which are cheap in hardware and expensive in
> >software...I will toss in a few for starters.
> 
> >	Find the distance to the next one in a bit stream FAST.  It would
> >be good to have an exception handler if one is not found.  I am considering
> >algorithms not worth implementing if the operation is slow.
> 
> This is implemented in hardware on the MC68020 by the 'BFFFO' instruction.
> It takes 18 cycles to find the first bit in a 32-bit register, and upto
> 32 cycles to find the first bit in a 32-bit bitfield in memory.
> 
> I find it hard to believe that a software algorithm on a RISC machine
> would do worse.  In the register case, for instance, a simple binary
> chop takes at most 5 steps, and each step is just (and-immediate-with-mask;
> branch if result zero).  One can get more cunning, but that gives you
> the answer in about 12 cycles.
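Firth's binary chop might look like the following in C. This is a sketch under assumptions he leaves open: the function name is invented, it counts from the low-order end (the direction Rubin's bit-stream problem wants, whereas BFFFO scans from the most significant bit), and the argument must be nonzero.

```c
#include <stdint.h>

/* Binary chop for the lowest set bit: five mask-and-branch steps,
   halving the search range each time.  x must be nonzero. */
int lowest_one(uint32_t x) {
    int n = 0;
    if ((x & 0x0000FFFF) == 0) { n += 16; x >>= 16; }
    if ((x & 0x000000FF) == 0) { n += 8;  x >>= 8;  }
    if ((x & 0x0000000F) == 0) { n += 4;  x >>= 4;  }
    if ((x & 0x00000003) == 0) { n += 2;  x >>= 2;  }
    if ((x & 0x00000001) == 0) { n += 1; }
    return n;
}
```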

For the purposes I had in mind, this is painful.  I am thinking of what
I call "infinite precision" methods of generating random variables (A
technical report is available on request.); these procedures have no
errors due to accuracy of computer operations.  For generating exponential
random variables, there are several methods which use ~5 BITS on the 
average, and then only use Boolean operations to generate the final result.
Finding the next 1 counts as 2 bits.  However, for such a procedure to be
practical, it must compete with procedures which use about 10 operations,
the slowest of these being two table lookups and a floating-point test.  

These competing procedures are obviously more complex computationally.
However, the architecture makes them much cheaper.  Also, since random
numbers of bits are used, one has to worry about where the current
bit pointer is and that the current word may become exhausted.  It is 
impossible to modify any good algorithm for generating non-uniform random
numbers to avoid uncertainties of this nature.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

henry@utzoo.uucp (Henry Spencer) (11/04/88)

In article <611@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>But it _has_ been tossed out of many designs:  the IBM RT PC (at any
>rate the old ones) had no floating-point instructions, but used a
>National chip (32081?).

I.e., it had floating-point hardware.  Why do you say that floating
point was "tossed out" of its design??

>The AMD 29000 manual says that floating-point
>ops are currently extracodes emulated by a trap handler...

Which may well use the 29xyz (don't remember the number offhand) chip
to do all the real work.  Again, it hasn't been tossed out, just moved
around a bit.

>And of course
>there are all the 680x0, 320xx, 80*86 and so on, with floating-point
>done by a co-processor.

So?  On most of those chips, the software can't tell the difference.
Does the FP hardware have to be on the CPU chip to be "real" somehow?
Why?
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ok@quintus.uucp (Richard A. O'Keefe) (11/04/88)

In article <232@taux02.UUCP> yuval@taux02.UUCP (Gideon Yuval) writes:
[That finding the first bit in a machine word requires NO special
 hardware -- his emphasis.  His formula, for a 32-bit machine, is]
>	table[(X xor (X-1)) mod 37] 

But this *DOES* require a fast modulo-37 operator!
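For the record, the trick works because when bit k is the lowest set bit, X xor (X-1) equals 2^(k+1)-1, and the powers of two are all distinct mod 37 (2 has multiplicative order 36 mod 37). A sketch, with invented names; the table is filled once at startup and x must be nonzero:

```c
#include <stdint.h>

static int table[37];   /* 32 slots used; 5 indices never occur */

/* For each possible lowest set bit k, X ^ (X-1) == 2^(k+1) - 1,
   and these 32 values are pairwise distinct mod 37. */
void init_table(void) {
    for (int k = 0; k < 32; k++)
        table[(((uint64_t)1 << (k + 1)) - 1) % 37] = k;
}

/* Index of the lowest set bit of a nonzero word, via Yuval's formula. */
int lowest_one(uint32_t x) {
    return table[(x ^ (x - 1)) % 37];
}
```

O'Keefe's objection stands: the division by 37 is exactly the kind of operation that may be slower than the five-step binary chop it replaces.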

greg@sce.carleton.ca (Greg Franks) (11/05/88)

In article <19811@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
#>In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
#(in response to me about integer multiplies)
#>They are rare because a good programmer knows that they are slow and
#>difficult to program.  If an operation takes 10 lines of code, each of
#>which is expanded into one or more hardware instructions, why use it
#>unless you cannot find a way around it?  The way around it may be clumsy,
#>but it is very likely to beat the unavailable operation.
#
#I'm going further than that. I'm saying they are rare because they are
#unnecessary. They are rare because in the USUAL case they can be strength
#reduced to additions by an optimizing compiler. This is faster than using
#the obvious multiply instruction.
#
#Integer numeric intensive applications are relatively RARE. You
#(Herman Rubin personally) may do them all the time, in which case you
#don't think they are rare, but as a percentage of programs run by all
#computers, they are.
#
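The strength reduction Baum means is the classic loop transformation: a multiply that depends on the induction variable is replaced by a running addition. A sketch (both functions are invented names, shown before and after the transformation a compiler would apply):

```c
#include <stdint.h>

/* Before: one multiply per iteration to compute the index. */
uint32_t sum_strided_naive(const uint32_t *a, int stride, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i * stride];       /* multiply every trip */
    return sum;
}

/* After strength reduction: the multiply becomes an addition. */
uint32_t sum_strided_reduced(const uint32_t *a, int stride, int n) {
    uint32_t sum = 0;
    int idx = 0;                    /* new induction variable */
    for (int i = 0; i < n; i++) {
        sum += a[idx];
        idx += stride;              /* addition replaces i * stride */
    }
    return sum;
}
```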

Integer multiplies are VERY POPULAR, if you look in the right place.
TI (TMS320) and Motorola (The NeXT machine) and many others are making
lots of money selling digital signal processing chips.  These chips
perform FFTs and all sorts of other operations that execute many
MULTIPLY and ADD steps very quickly.

However, many chip makers are now going to floating point in their
DSP's, so the lowly integer multiply may once again become somewhat
useless. 
-- 
Greg Franks                  Systems and Computer Engineering, 
utzoo!dciem!nrcaer!sce!greg  Carleton University, Ottawa, Ontario, Canada.
greg@sce.carleton.ca
   ACME Electric: When you can't find your shorts, call us!

cik@l.cc.purdue.edu (Herman Rubin) (11/05/88)

In article <232@taux02.UUCP>, yuval@taux02.UUCP (Gideon Yuval) writes:
> 
> If you want to find the first (bottom) "1" bit in a machine word X, try:
> 
> table [  (X xor (X-1)) mod 37 ] 

The question I asked is the spacing between BITS in a BIT stream.  It is
possible that the best way on a particular machine is to use words, but 
not here.

What is wanted is to get the distance to the next bit in the stream on
demand (this can be vectorized) and then pass over to the next location
for the next step.  Once a particular bit in the stream is read, it cannot
be used for anything else.

A similar problem is a conditional transfer on the next bit.  The whole 
process is bit-oriented, and some extraordinary procedure must be used to
use larger units.

There is also the problem of the bit stream becoming empty and needing
refilling.  In some cases, this problem can be minimized, even to the 
point of some cases being able to ignore it.  But it cannot be eliminated.
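One way the bookkeeping Rubin describes might look in software, as a hedged sketch (the struct layout, names, and low-bit-first ordering are all assumptions): the stream keeps a word index and a bit pointer, returns the distance to the next 1 while consuming it, and signals exhaustion with -1, the case he wants an exception handler for.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint32_t *words;   /* backing buffer */
    size_t nwords;
    size_t word_idx;         /* current word */
    int bit_idx;             /* next unread bit within it (0..31) */
} bitstream;

/* Distance (bits skipped) to the next 1, consuming through it;
   -1 when the stream runs dry. */
int next_one_distance(bitstream *s) {
    int dist = 0;
    while (s->word_idx < s->nwords) {
        uint32_t w = s->words[s->word_idx] >> s->bit_idx;
        if (w != 0) {
            int k = 0;
            while ((w & 1) == 0) { w >>= 1; k++; }   /* or a binary chop */
            dist += k;
            s->bit_idx += k + 1;                     /* consume the 1 too */
            if (s->bit_idx == 32) { s->bit_idx = 0; s->word_idx++; }
            return dist;
        }
        dist += 32 - s->bit_idx;                     /* word exhausted */
        s->bit_idx = 0;
        s->word_idx++;
    }
    return -1;
}
```

The word-boundary and refill checks on every call are precisely the overhead that makes Rubin want the operation in hardware.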
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

henry@utzoo.uucp (Henry Spencer) (11/05/88)

In article <75772@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>... I have not done a head count, but many machines
>have sqrt (8087, 6888x, vax fpa, ibm fpa, univac).

In fact, if I recall correctly, you cannot have an IEEE-conforming
floating-point implementation without it.
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ok@quintus.uucp (Richard A. O'Keefe) (11/05/88)

In article <1988Nov3.185535.28850@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <611@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>>But it _has_ been tossed out of many designs:  the IBM RT PC (at any
>>rate the old ones) had no floating-point instructions, but used a
>>National chip (32081?).

>I.e., it had floating-point hardware.  Why do you say that floating
>point was "tossed out" of its design??

I was responding to someone else who said that floating point ops
_ought_ to be tossed out, and by his phrasing appeared to be implying
that this was a prospect for improvement which RISC designers had so far
ignored.  "tossed out" was _his_ phrase, not mine.

As for whether the RT PC "had floating-point hardware", the floating-
point chip was not accessed as a co-processor, but as a device, and fast
it was not.  By that criterion, an 8008 talking down an RS232 line to a
/370 to get floating-point done could be regarded as "having floating-
point hardware".  The RT's instruction set has _no_ floating-point ops,
not even co-processor ones.

>>The AMD 29000 manual says that floating-point
>>ops are currently extracodes emulated by a trap handler...

>Which may well use the 29xyz (don't remember the number offhand) chip
>to do all the real work.  Again, it hasn't been tossed out, just moved
>around a bit.

Again, your reply misses the point of my message.  I was responding to
someone who claimed that RISC designers _ought_ to be throwing floating-
point calculation out of the CPU chip, and I was merely pointing out
that this had already been done.  (My copy of the AMD29000 manual does
not mention a 29xyz, but says FP is currently done in software.  If you
do recall the part number & what manual to ask for, please let me know.)

>So?  On most of those chips, the software can't tell the difference.
>Does the FP hardware have to be on the CPU chip to be "real" somehow?
>Why?

This is a gross distortion of my position.  You will search my former
posting in vain for any suggestion that offboarding floating point is
_bad_, only that it has already been _done_.

cjosta@taux01.UUCP (Jonathan Sweedler) (11/06/88)

In article <1988Nov4.194026.22806@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <75772@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>>... I have not done a head count, but many machines
>>have sqrt (8087, 6888x, vax fpa, ibm fpa, univac).
>
>In fact, if I recall correctly, you cannot have an IEEE-conforming
>floating-point implementation without it.

Yes, but IEEE conformity may be accomplished in hardware or software.  A
*system* conforms to the IEEE standard, not necessarily the processor.
-- 
Jonathan Sweedler  ===  National Semiconductor Israel
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain:  cjosta@taux01.nsc.com

mac3n@babbage.acc.virginia.edu (Alex Colvin) (11/09/88)

In article <475@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes:
> In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >     Find the distance to the next one in a bit stream FAST.

> ended up using a floating point normalize, which was still fairly slow).

In 1972 in "Structured Programming", C.A.R.  Hoare suggests using
(floating) standardisation to locate elements in a powerset representation.
This operation seems to have been lost in Pascal.

Charles Simmons may remember the GE635's GTB (Gray to Binary) instruction,
which executes a long chain of XORs, used to convert radar data to binary.
We had real trouble coming up with another use for this.  Maybe Hypercube
routing?
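In software, the same chain of XORs collapses to a logarithmic fold; a sketch (function names invented, not the GE635 encoding specifically):

```c
#include <stdint.h>

/* Gray-to-binary as a chain of XORs: each step folds the parity
   of the upper bits into the lower half. */
uint32_t gray_to_binary(uint32_t g) {
    g ^= g >> 16;
    g ^= g >> 8;
    g ^= g >> 4;
    g ^= g >> 2;
    g ^= g >> 1;
    return g;
}

/* The inverse direction needs only one shift and XOR. */
uint32_t binary_to_gray(uint32_t b) {
    return b ^ (b >> 1);
}
```

The hypercube-routing connection is real: adjacent Gray codes differ in one bit, so successive codes name neighboring hypercube nodes.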

randys@omews2.intel.com (Randy Steck) (11/11/88)

In article <10774@cup.portal.com> PLS@cup.portal.com (Paul L Schauble) writes:
>> Find the distance to the next 1 in a bit stream FAST
>
>Scan for the first non-zero word. Load word and do a floating normalize. Look
>at the exponent. 
>
>  ++PLS

I'm afraid that it isn't quite this simple.  Leaving aside the complexities of
what FP format you use to load it with, you first of all have to determine
what exponent you are going to use.  Zero exponents are a special case.  You
have to preset the exponent, shift and mask the bit stream, do the
normalization, shift and mask to get the exponent value out, and unbias it
from the original input.  Hmmmmmm.  Sounds messy to me.

However, this is the right idea.  On the 80960, we used the floating point
normalizer, without all the checks on the values, to implement the scan_bit
operation (finds the first set bit in a word).  The thing that saved us here
was independent access to the exponent values and fast transfer and operation
on bit values.
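On a machine without such independent exponent access, the same idea can be sketched in portable C: convert the word to a double (exact for any 32-bit value), then read the unbiased exponent straight out of the IEEE-754 bit pattern. The function name is invented and x must be nonzero.

```c
#include <stdint.h>
#include <string.h>

/* Highest set bit of a nonzero word via FP normalization:
   the double's unbiased exponent is floor(log2(x)). */
int high_bit(uint32_t x) {
    double d = (double)x;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* reinterpret, don't convert */
    return (int)((bits >> 52) & 0x7FF) - 1023;
}
```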

Randy Steck
Intel Corp.