[comp.arch] new instructions

jbs@WATSON.IBM.COM (05/20/91)

        Herman Rubin writes:
Fixed point arithmetic is little used now because the hardware to support
it reasonably well does not exist.  It is worse than the floating problems
before hardware floating arithmetic, especially if floating is automatically
normalized.  THAT feature of "modern" architectures is, in my opinion, a sheer
horror.

        I think fixed point arithmetic is little used now because the vast
majority of users find it harder to use than floating point, with no
compensating advantages.  Does anyone seriously believe that if a few
instructions were added to provide hardware support (btw, what is missing?
In what way has fixed point support deteriorated?) fixed point usage would
increase significantly?  Regarding unnormalized floating point, what is this
good for besides simulating multiple precision integer arithmetic?

        Herman Rubin also writes:
In the early FP computers, much function calculation was done in fixed point,
to get increased accuracy at little cost.

        This only makes sense when the floating point fraction length is
less than a full integer.  With 64-bit floating point there is little need
for increased accuracy in any case.

        Herman Rubin also writes:
How do you expect users who do not even know of the existence of the operations
to use them?

        I expect the compiler to generate the instructions for them.  If
the compiler won't generate an instruction this is a strong reason for
not having it in the instruction set.

        Herman Rubin also writes:
There are many more algorithms than are in the philosophy of software, and
especially hardware, designers.

        The argument here seems to be that there are numerous algorithms
which are not used at all today which suddenly would be used all over the
place if only a few changes were made to the instruction set.  I don't
agree.  If two algorithms have similar performance both will be in use, as
there will be some problems which are particularly suited to one or the
other.  If an algorithm is not used at all on today's machines, this means
it is not competitive even on those problems for which it is particularly
suited.  Making changes to the instruction set is unlikely to drastically
change the relative performance of two algorithms, hence is unlikely
to drastically change the amount of usage each gets.  I believe it is
perfectly sensible for hardware designers to concentrate on speeding up
the existing mix of applications.
                          James B. Shearer

mac@eleazar.dartmouth.edu (Alex Colvin) (05/20/91)

>Fixed point arithmetic is little used now because the hardware to support
>it reasonably well does not exist.  

Would anyone who works with DSPs comment on this?  They seem to be built for
fixed point and fractional arithmetic, with the ALU connected to the registers
through shifters.

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/20/91)

In article <9105200213.AA05095@ucbvax.Berkeley.EDU>, James B. Shearer
(jbs@WATSON.IBM.COM) writes:

>         Herman Rubin writes:
> Fixed point arithmetic is little used now because the hardware to support
> it reasonably well does not exist.  It is worse than the floating problems
> before hardware floating arithmetic, especially if floating is automatically
> normalized.  THAT feature of "modern" architectures is, in my opinion, a sheer
> horror.
> 
>         I think fixed point arithmetic is little used now because the vast
> majority of users find it harder to use than floating point, with no
> compensating advantages.  Does anyone seriously believe that if a few
> instructions were added to provide hardware support (btw, what is missing?
> In what way has fixed point support deteriorated?) fixed point usage would
> increase significantly?  Regarding unnormalized floating point, what is this
> good for besides simulating multiple precision integer arithmetic?

The last question first:  How about multiple precision floating (or fixed)
arithmetic?  Considering that there are quite a few papers on this, it is
certainly a topic of interest.  I do not believe it should be necessary here
to go into the full range of situations I can list NOW where this would be
useful.  

Now what is the benefit of allowing only normalized floating point?  It 
eliminates the need for a normalization option in floating instructions,
and it provides ONE more bit of accuracy.  Is that ONE bit exactly what
is needed?  This is very unlikely.

Now what is the cost of not having forced normalization, besides the one
bit?  There would have to be a method for indicating which result of the
operation is wanted (upper, lower, normalized).  There would be little
additional hardware other than the decoding, by the floating unit, of
this information.

There are algorithms which benefit from using both Boolean and arithmetic
operations on either fixed point or floats.  These are not even readily
available on many of the newer machines.

>         Herman Rubin also writes:
> In the early FP computers, much function calculation was done in fixed point,
> to get increased accuracy at little cost.
> 
>         This only makes sense when the floating point fraction length is
> less than a full integer.  With 64-bit floating point there is little need
> for increased accuracy in any case.

So why not have increased integer accuracy?  It is no harder to do this, and
the same units can be used.  To someone with a mathematical outlook, the 
distinction is not integer/float but short integer/good arithmetic.

There are algorithms which call, at some stage at least, for fixed point 
arithmetic.  Infinite precision (no mistake here) methods of generating
non-uniform random numbers tend to be of this type.  Converting the fixed
point results to floating can be a major problem, as 0 is a possible value,
again a real problem only with automatic normalization.

>         Herman Rubin also writes:
> How do you expect users who do not even know of the existence of the operations
> to use them?
> 
>         I expect the compiler to generate the instructions for them.  If
> the compiler won't generate an instruction this is a strong reason for
> not having it in the instruction set.

The chicken and the egg again.  Anyone who is willing to say that something
is not useful is either ignorant, arrogant, or stupid.  Nobody can, or should,
attempt to ever do this.  Even the best people can make big mistakes.

This assumes the language designers and hardware designers perceive the need
for the instructions.  This is clearly not the case.  An example is the
attempt to add integer multiplication to the CDC 6x00 series.  The fact
that floating multiplication automatically normalized the product when
both factors were normalized made the operation far less useful than
intended.  There was too much already in the hardware to correct this.

How many current languages have user-definable operations?  That the designers
of C did not think of multiple precision arithmetic, or quotient and remainder,
or exponentiation as an operation and not a function, etc., does not mean that
these should have been omitted.  There are provisions for octal and hex
integers, but none for explicitly writing floats or fixed-point numbers to
those bases.  Even at my age, I am quite capable of operating entirely in
binary even if there is not a good reason for it, and often there is.

>         Herman Rubin also writes:
> There are many more algorithms than are in the philosophy of software, and
> especially hardware, designers.

>         The argument here seems to be that there are numerous algorithms
> which are not used at all today which suddenly would be used all over the
> place if only a few changes were made to the instruction set.  I don't
> agree.  If two algorithms have similar performance both will be in use, as
> there will be some problems which are particularly suited to one or the
> other.  If an algorithm is not used at all on today's machines, this means
> it is not competitive even on those problems for which it is particularly
> suited.  Making changes to the instruction set is unlikely to drastically
> change the relative performance of two algorithms, hence is unlikely
> to drastically change the amount of usage each gets.  I believe it is
> perfectly sensible for hardware designers to concentrate on speeding up
> the existing mix of applications.

Like the infamous frexp function in C?  For those not familiar with it,
frexp(x,&n) took a floating number x and transformed it to y*2^k, where
.5 <= |y| < 1, and the value of k is stored as n.  Instead of completely
inlining this operation, which would have made the whole thing simple and
suggested the obvious alternate form

	y,n = frexp(x)

which also assigns the values to registers or memory as wanted, and avoids
a clumsy function call, they did the other.  The 4.2BSD library even used
a machine independent algorithm, which needless to say was frequently slower
by a factor of more than 100.  It even went into an infinite loop in 0.  
Of course the work has to be done in machine-dependent code, and in the
integer registers, if they are separate.
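
For concreteness, here is a rough sketch (not the 4.2BSD code, and only for
machines with IEEE-754 doubles) of what an inlined frexp amounts to; the name
my_frexp and the 64-bit integer type are purely illustrative, and denormals,
infinities and NaNs are ignored:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch only: an "inlined" frexp for IEEE-754 doubles, done with integer
 * operations on the representation. */
static double my_frexp(double x, int *n)
{
	uint64_t bits;

	memcpy(&bits, &x, sizeof bits);          /* view the double as 64 bits */
	if ((bits & ~(1ULL << 63)) == 0) {       /* +0.0 or -0.0 */
		*n = 0;
		return x;
	}
	*n = (int)((bits >> 52) & 0x7ff) - 1022; /* unbias so |y| lands in [.5,1) */
	bits = (bits & ~(0x7ffULL << 52)) | (0x3feULL << 52);  /* force exponent */
	memcpy(&x, &bits, sizeof bits);
	return x;
}

int main(void)
{
	int n;
	double y = my_frexp(96.0, &n);

	printf("96 = %g * 2^%d\n", y, n);        /* prints 96 = 0.75 * 2^7 */
	return 0;
}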

Except for ALGOL and possibly COMMON LISP, I know of no language designers
who even made a real attempt to put in the variety needed.  Neither group
really succeeded.  

On vector and parallel processors, things which are of essentially no
importance on scalar machines suddenly become important.  On modern 
machines, even local transfers can be costly, and context switches more
so.  On parallel machines, how does one handle conditional transfers and
conditional calls?  They are deadly, and if one has 2^14 units (already
in existence), a one-in-a-million condition becomes one-in-62.  If the
condition calls for major work, say 100,000 cycles, and I have examples
of this, this can occupy most of the time.  There are single operations,
which deviate from the parallel idea, but are similar to those already
in use there, which can alleviate the mess.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

nather@ut-emx.uucp (Ed Nather) (05/21/91)

In article <1991May20.121708.5023@dartvax.dartmouth.edu>, mac@eleazar.dartmouth.edu (Alex Colvin) writes:
> >Fixed point arithmetic is little used now because the hardware to support
> >it reasonably well does not exist.  
> 
> Would anyone who works with DSPs comment on this?  They seem to be built for
> fixed point and fractional arithmetic, with the ALU connected to the registers
> through shifters.

For people who use computers for real-time data acquisition and control,
and there are a lot of them, floating point operations are very seldom
used, if at all.  Floating point is valuable for number-crunching jobs,
but running stepping motors or adjudicating multiple interrupts doesn't
need it, and really can't use it.

People use computers for many different jobs, and it's not surprising that
different jobs need different architectures.  What's sort of amazing is how
well our "general purpose" computer architectures do so many different
things quite well indeed.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin

dik@cwi.nl (Dik T. Winter) (05/21/91)

In article <12526@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
In reply to:
 > In article <9105200213.AA05095@ucbvax.Berkeley.EDU>, James B. Shearer
 > (jbs@WATSON.IBM.COM) writes:
Who again replies to:
 > >         Herman Rubin writes:
etc.  I hope I get everything right.

JBS:
 > >                 Regarding unnormalized floating point what is this good
 > > for besides simulating multiple precision integer arithmetic?
I do not find it even good for that either.
 > 
 > The last question first:  How about multiple precision floating (or fixed)
 > arithmetic?  Considering that there are quite a few papers on this, it is
 > certainly a topic of interest.  I do not believe it should be necessary here
 > to go into the full range of situations I can list NOW where this would be
 > useful.  
Considering that the oldest paper I know about multiple precision FP is from
T.J.Dekker, and as it only considers normalized FP (eh, standardizing as
on the Electrologica X8, but he does not use it), I find this answer
surprising.
 > 
 > Now what is the benefit of allowing only normalized floating point?  It 
 > eliminates the need for a normalization option in floating instructions,
 > and it provides ONE more bit of accuracy.  Is that ONE bit exactly what
 > is needed?  This is very unlikely.
Nope.  That bit is not exactly what is needed, but it came in as a surplus bit,
so why not use it?  For multi-precision FP arithmetic unnormalized arithmetic
is not needed, just as it is not for multi-precision integer arithmetic.
 > 
 > Now what is the cost of not having forced normalization, besides the one
 > bit?  There would have to be a method for indicating which result of the
 > operation is wanted (upper, lower, normalized).  There would be little
 > additional hardware other than the decoding, by the floating unit, of
 > this information.
This is the 205 way, but the 205 implemented it awfully (and the reason is
just that: cost).  On the 205 you could specify for an add that one of the
three results was returned.  But the normalizing add was just that: it first
took the upper 47 bits of the result, normalizing afterwards.  This gives
a loss in precision.
 > 
 > There are algorithms which benefit from using both Boolean and arithmetic
 > operations on either fixed point or floats.  These are not even readily
 > available on many of the newer machines.
Just the same as on other machines.  If current machines do not implement a
specific round float to integer instruction, one can add a simple constant
to a float to get the integer part in a specific portion of the register.
Again I ask, how much is it used?  You say table look up for some elementary
functions.  My reply is, maybe on scalar machines, but if you do that on
vector machines, expect a big loss in performance.  So table look up is not
in general the most efficient way.  (Oh, btw, the same may hold for scalar
machines because of cache misses etc.)
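
For what it is worth, the constant-adding trick mentioned above sketched in C,
assuming IEEE-754 doubles and the default round-to-nearest mode (so it rounds
rather than truncates); the names below are invented for illustration:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Sketch: adding 2^52 + 2^51 lines the value up so that the (rounded)
 * integer appears in the low 32 bits of the representation.  Works for
 * values that fit comfortably in 31 bits. */
static long fast_round(double x)
{
	double shifted = x + 6755399441055744.0;   /* 2^52 + 2^51 */
	uint64_t bits;

	memcpy(&bits, &shifted, sizeof bits);
	return (int32_t)(uint32_t)bits;            /* low half holds the result */
}

int main(void)
{
	printf("%ld %ld\n", fast_round(12345.7), fast_round(-3.2));  /* 12346 -3 */
	return 0;
}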
 > 
 > >         Herman Rubin also writes:
 > > In the early FP computers, much function calculation was done in fixed point,
 > > to get increased accuracy at little cost.
Eh?  I always thought it was because of speed (FP was slow, fixed point was
fast).  At least on the Electrologica X8 (1964 vintage) no calculation in
those functions was done in fixed point.  And yes, fixed point was much faster
than floating point.  Some timings (in milliseconds):
	fixed		FP
add	5.0		8.75-18.75
mul	8.75-40.0	12.5-68.75
(for some reason I have a manual here at home!)  As you see, especially
fixed point add (and sub) were much faster.
 > > 
 > So why not have increased integer accuracy?  It is no harder to do this, and
 > the same units can be used.  To someone with a mathematical outlook, the 
 > distinction is not integer/float but short integer/good arithmetic.
But you are not advocating 'good' arithmetic.  At least from the viewpoint
of many people.  And why do you think that 64 bit integer arithmetic is not
much harder?  If we consider the time it took for micros to go from 8 to 16
and again to 32 bit integer arithmetic....
 > 
 >                                                      Converting the fixed
 > point results to floating can be a major problem, as 0 is a possible value,
 > again a real problem only with automatic normalization.
I do not understand this at all.  If the hardware does not provide instructions
to convert fixed point results to floating points, only 2 instructions are
required to do it.  That 0 is a possible value is no problem at all.  But I
think you mean 'converting fixed point results to floating by a single
boolean operation', i.e. or'ing in the exponent.  But in that case negative
numbers are also a problem.
 > 
 >                        This is clearly not the case.  An example is the
 > attempt to add integer multiplication to the CDC 6x00 series.  The fact
 > that floating multiplication automatically normalized the product when
 > both factors were normalized made the operation far less useful than
 > intended.  There was too much already in the hardware to correct this.
This is taking it backwards.
1.  Integer multiplication was not added later, but designed in from the start.
2.  The fact that multiplication automatically normalizes has nothing to do
    with the problems.
3.  The fact that multiplication assumes that apparently normalized fp numbers
    are indeed normalized fp numbers has everything to do with it.
But on the CDC 6x00 series you could get a 94 bit integer product from two
47 bit integer factors; although it was undocumented.

On the other hand, we see with the Cray (where normalization is done too late)
that we can lose a lot of precision in the fp multiplication.

 >                                  There are provisions for octal and hex
 > integers, but none for explicitly writing floats or fixed-point numbers to
 > those bases.
Ada provides all this.  However, up to now I have not seen much use of the
possibility to write fp numbers in any base other than decimal.  To wit:
	8#0.12345#e-4
means 0.12345 in octal times 8^-4.

 > Like the infamous frexp function in C?  For those not familiar with it,
 > frexp(x,&n) took a floating number x and transformed it to y*2^k, where
 > .5 <= |y| < 1, and the value of k is stored as n.
Not *took* Herman, it still *takes*.

 >                                                    Instead of completely
 > inlining this operation, which would have made the whole thing simple and
 > suggested the obvious alternate form
 > 	y,n = frexp(x)
(Yup, *you* design the language where that is possible.  Start looking at Mesa.)
 > which also assigns the values to registers or memory as wanted, and avoids
 > a clumsy function call, they did the other.
The fact that it is formally declared as a function does not inhibit inlining,
nor the assignment of values to registers.
 >                                              The 4.2BSD library even used
 > a machine independent algorithm, which needless to say was frequently slower
 > by a factor of more than 100.  It even went into an infinite loop in 0.  
Yes, yell at the language designers if the implementers goof it.
 > Of course the work has to be done in machine-dependent code, and in the
 > integer registers, if they are separate.
Not at all.  If the hardware completely implements IEEE, I want it in the
FP registers!

I get tired.  Perhaps more tomorrow.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

bengtl@maths.lth.se (Bengt Larsson) (05/21/91)

In article <12526@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu 
(Herman Rubin) writes:

>The last question first:  How about multiple precision floating (or fixed)
>arithmetic?  Considering that there are quite a few papers on this, it is
>certainly a topic of interest.  I do not believe it should be necessary here
>to go into the full range of situations I can list NOW where this would be
>useful.  

Please indulge. Where would this be useful?

>So why not have increased integer accuracy?

Yes, why not? But the important question is *why*?

>There are algorithms which call, at some stage at least, for fixed point 
>arithmetic. Infinite precision (no mistake here) methods of generating
>non-uniform random numbers tend to be of this type.  

Interesting progression. From "there are algorithms", to "generating
non-uniform random numbers". That is 1 algorithm.

>The chicken and the egg again.  Anyone who is willing to say that something
>is not useful is either ignorant, arrogant, or stupid.  Nobody can, or should,
>attempt to ever do this.  

You seem to have adopted a favorite metaphor. It doesn't apply here
though. You can implement just about any algorithm you want to.
If not, give counterexamples.

As for ignorant, arrogant, or stupid, this may occasionally apply to others
than the computer architects in the discussion.

>Even the best people can make big mistakes.

Well, name a favorite big mistake of your own doing, Mr Rubin :-)

>How many current languages have user-definable operations?

Practically all of them, assuming a function syntax.

>There are provisions for octal and hex
>integers, but none for explicitly writing floats or fixed-point numbers to
>those bases.  

Ada.

>Like the infamous frexp function in C? 

>The 4.2BSD library even used
>a machine independent algorithm, which needless to say was frequently slower
>by a factor of more than 100.  

Then the 4.2 BSD library could be rewritten, without changing the machine.

>Of course the work has to be done in machine-dependent code, and in the
>integer registers, if they are separate.

Naturally.

>On parallel machines, how does one handle conditional transfers and
>conditional calls?  They are deadly, and if one has 2^14 units (already
>in existence), a one-in-a-million condition becomes one-in-62.  If the
>condition calls for major work, say 100,000 cycles, and I have examples
>of this, this can occupy most of the time.  There are single operations,
>which deviate from the parallel idea, but are similar to those already
>in use there, which can alleviate the mess.

The programming of 2^14 units working in parallel using separate
instruction streams is not a solved problem.

I assume that a clarification of the operations which can alleviate
the mess would be in order. 

Bengt Larsson.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L

clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/22/91)

In article <9105200213.AA05095@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
>
>
>        Herman Rubin also writes:
>How do you expect users who do not even know of the existence of the operations
>to use them?
>
>        I expect the compiler to generate the instructions for them.  If
>the compiler won't generate an instruction this is a strong reason for
>not having it in the instruction set.

As a compiler writer in this RISCy age, I am inclined to agree with this
statement. If we compiler writers, in our infinite brilliance, cannot
figure out how to use an instruction, it doesn't belong in the instruction
set. Before this becomes a mantra that is chanted mindlessly, let's look
at an example that might make us think a little harder.

On the VAX architecture, there are two common ways for a compiler to generate
code for the integer remainder operation. Let's say that register r6 contains
the value of variable "x", register r7 contains the value of variable "y",
and we want register r11, currently unused, to contain the value of variable
"z" (as determined by a register allocator.) All three variables are 32-bit
integers (VAX "longword" types.) Registers r6 through r11 are the general
purpose arithmetic registers; we can use others, such as r0 and r1, as scratch
registers that are not live across a function call. We will see this scratch
register usage in both examples below.

Given the C source code statement:

	z = x % y;   /* z gets the remainder of x divided by y */

We will see the "cc" compiler generate the following code:

	divl3	r7,r6,r0	/* divide r6/r7, put quotient in r0 */
	mull2	r7,r0		/* multiply r7*r0, put product in r0 */
	subl3	r0,r6,r11	/* subtract x - (x DIV y)*y; put in r11 */

So, the remainder is left in r11, the location for further uses of "z".
Thus, we computed "z = x - ((x/y)*y);" where the division is integer
division.

The main alternative to the above is to use the nonorthogonal VAX instruction
called "ediv". This takes a 64-bit (VAX "quadword") dividend and 32-bit divisor
and divides them, producing a 32-bit quotient and a 32-bit remainder. Very
CISCy 4-operand instruction; no equivalent exists for 32-bit dividend. We need
to create a 64-bit dividend to use it; unfortunately, we tend to have our
32-bit dividend in a register surrounded by live registers. So we generate
the 3-instruction sequence:

	movl	r6,r1		/* Transfer dividend to r1 */
	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
				   register pair dividend */
	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
				   into r2 and keep remainder in r11 */

Intuitively, it seems like a lot of work to use the extended division when
you really don't need 64-bit division. But the first two operations are
very simple register operations (and optimization might eliminate the
second instruction, as r0 might already be cleared.) I was told that the
"cc" compiler did things the first way because "ediv" is such a slow
instruction. But I got curious and timed the alternatives myself:

[from an old description of the experiment] :

I then created a shell script that executes the programs 20 times each,
for a total of 20 times 30,000 == 600,000 runs through the loop. I ran
the script three times to judge the variance. Results:

................................................... So, looking at the 
time added by the computation with which we are concerned, we get about
a 24% speedup when we only want the remainder, and a 27% speedup when we
want the quotient and the remainder.

[end excerpt]

(I was referring to running the experiment two ways: producing the quotient
AND remainder by two methods, and producing just the remainder by the two
methods. The advantage of ediv is greater when you use it to get two results
in one instruction, as the first method needs to copy the quotient result
somewhere before destroying it.)

So, a 24% speedup over what the "cc" compiler produces on a VAX 8600 running
BSD Unix 4.2. When I heard that hardware changes in the VAX 8600 might have
obsoleted an instruction selection made by "cc" for the VAX 11/780, I did
the test again on a VAX 11/750 running Ultrix. More than 20% improvement
for ediv there, too.

MORAL: Just because the compiler doesn't use it doesn't mean that the
compiler SHOULDN'T have used it. Use many compilers for many languages
and applications to measure instruction usage, and second-guess the
compiler to ensure good conclusions.

Question for RISC architects: When evaluating a large set of customer programs
to find instruction frequencies, how do you avoid biasing results because of
inefficient compilers? Even if you ran all C programs through "cc" and "gcc",
and let's say "gcc" uses ediv for remaindering, the count for ediv is lower
than it ought to be. "cc" will only use it for true 64/32 bit division. Perhaps
we should be counting intermediate code statement frequencies to decide what
higher level operations are important, and then deciding what lower level
instructions are needed?

-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be 
regarded as a criminal offence." E.W.Dijkstra, 18th June 1975.
|||  clc5q@virginia.edu (Clark L. Coleman)

rh@craycos.com (Robert Herndon) (05/22/91)

In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, Clark
Coleman writes:

> As a compiler writer in this RISCy age, I am inclined to agree with this
> statement. If we compiler writers, in our infinite brilliance, cannot
> figure out how to use an instruction, it doesn't belong in the instruction
> set. Before this becomes a mantra that is chanted mindlessly, let's look
> at an example that might make us think a little harder.

While I generally agree with his sentiment and his example, I am
inclined to believe in at least one major exception.  Not to mindlessly
encourage H. L. Rubin, but population count and leading zero count
on a word value are quite useful, yet (though I'm not a compiler writer)
I can't think of easy methods for a (non-vector) compiler to use these
instructions often.

Yet I think they are undeniably useful.  Many simple operations can
be built easily and quickly in straight-line code with these
instructions, and simple things like dispatching on a bit mask and
converting octal mask bit constants to shift counts, etc. are much
simplified with them.  Examples of functions easily built using
population_count(x) and leading_zeroes(x) include signum(x), abs(x),
!a, a == b, a < b, a > b, ... trailing_zeroes(x), ceil(log2(x)), and
more.  At the same time, these operations are not that common in
most codes, so I wouldn't expect large performance gains, except
in programs that make heavy use of either assembly codes, compiler
intrinsics, or library routines written in assembler or with
intrinsics.
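
To make that concrete, here is a small sketch: the loop versions of
population_count and leading_zeroes below merely stand in for the hardware
instructions or intrinsics, the helper names are invented, and 32-bit words
are assumed:

#include <stdio.h>

/* Loop-based stand-ins for the two primitives; on a machine that has them,
 * these would be single instructions or compiler intrinsics. */
static unsigned population_count(unsigned long x)
{
	unsigned c = 0;

	while (x) {
		x &= x - 1;	/* clear lowest set bit */
		c++;
	}
	return c;
}

static unsigned leading_zeroes(unsigned long x)
{
	unsigned c = 32;

	while (x) {
		x >>= 1;
		c--;
	}
	return c;
}

/* Two of the helpers mentioned above, built directly on the primitives. */
static unsigned trailing_zeroes(unsigned long x)
{
	/* the bits below the lowest set bit become ones in ~x & (x - 1) */
	return population_count(~x & (x - 1) & 0xffffffffUL);
}

static unsigned ceil_log2(unsigned long x)	/* x != 0 */
{
	unsigned floor_log2 = 31 - leading_zeroes(x);

	return floor_log2 + (population_count(x) > 1);
}

int main(void)
{
	printf("%u %u\n", trailing_zeroes(40UL), ceil_log2(40UL));	/* 3 6 */
	return 0;
}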

Perhaps I am mistaken here, and you compiler writers can use these
instructions to significant advantage.  If so, I'd be curious to
know how much these instructions typically save.  Also, in this one
case, I can sympathise with Mr. Rubin, because there isn't any elegant
and portable way to express, for instance, leading_zeroes(x) in a
language, other than (say, in C) making it a function or #define
and defining it according to the capabilities of the hardware/compiler
at hand.  Not terribly difficult, but it does seem to lack elegance,
and not all compilers for machines that have these functions have such
intrinsics or deal with them well.  Still, IMHO that is something to
take up with the compiler writers, rather than the language designers.

I will note however, that I am very happy to see that many micros
are beginning to include these instructions...

				Robert Herndon
-- 
Robert Herndon		    --	not speaking officially for Cray Computer.
Cray Computer Corporation			719/540-4240
1110 Bayfield Dr.				rh@craycos.com
Colorado Springs, CO   80906		   Ignore these three words.

jlg@cochiti.lanl.gov (Jim Giles) (05/22/91)

In article <1991May23.084258.5062@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
|> [...]
|> You can do pop count quite quickly with table lookup.  Yes, you have to
|> break it up, but, on a byte-addressable machine, it's not a major hassle.
|> You will need a 256-byte table, of course, but that, also, isn't a major
|> hassle.  CLZ I can also do somewhat quickly with table lookup on a
|> byte-addressable machine.

Now, let's see.  A pop count instruction is 4 clocks on an X/MP.  _ONE_
memory access is 14 (assuming no bank conflicts).  So, to do a pop count
on just one word (64 bits) using a 256 element table would take 8 memory
references (start them together so they pipeline - still 16 clocks to
issue all those references), plus the time taken to break the word into
bytes, plus the time to add the results of the loads.  I'd guess a
minimum of 40 clocks total.  Does this count as "quite quickly"?

Leading zero count is a 3 clock instruction.  I doubt that could be
implemented in fewer clocks than the pop count if you did it with a
table.

Now, slower CPUs (for which the memory would seem _relatively_ faster)
would take fewer clocks to load the table entries.  If the table _happened_
to be in the cache (for machines that _have_ a cache) the memory fetches
are sometimes only a couple of clocks.  Even so, on such a slower machine,
the hardware pop count and lead zero instructions could also be implemented
as _relatively_ faster instructions - one clock each wouldn't be a surprise.

It would amaze me to find any machine (on which the test could be done) 
where a table lookup came within an order of magnitude of a hardware
instruction on these functions.

J. Giles

jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)

In article <1991May23.192557.7558@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
|> [...]                                                  How long would
|> 
|> 	char *byte = (char *)&word;
|> 	pop_count = table[byte[0]] + table[byte[1]] + table[byte[2]] +
|> 		table[byte[3]];
|> 
|> take on a machine with somewhat better memory accesses?  Say, an R6000, or
|> even a Sparc?

First, this code only does a pop count on a 32 bit object, not 64.  Second,
I mentioned this case on my last posting: I would bet that an implementation
of pop count as a hardware instruction on either of these machines (using the
technology they were built with) would be _ONE_ clock long.  The above
sequence takes in excess of 10.

|> And don't forget that, for serial code, the R6000 is faster than the Cray.
|> So that doesn't quite count as a "slow machine," does it?

What is the relevance of a comparison of the R6000 to the Cray in the
context of this discussion?  The issue is whether pop count can be
performed as quickly in software as in hardware.  This is an issue to 
be decided on the basis of each machine individually.  The R6000 would
be even _faster_ if pop count were a hardware instruction!!

J. Giles

firth@sei.cmu.edu (Robert Firth) (05/23/91)

In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:

>Given the C source code statement:

>	z = x % y;   /* z gets the remainder of x divided by y */

>... we generate
>the 3-instruction sequence:
>
>	movl	r6,r1		/* Transfer dividend to r1 */
>	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
>				   register pair dividend */
>	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
>				   into r2 and keep remainder in r11 */

I hope not.  From the previous code fragment, it is clear you are
expecting the remainder from SIGNED division.  If you want the same
answer as before, the code must be

	MOVL R6,R1		; construct the sign-extended 64-bit ...
	ASHQ #-32,R0,R0		; dividend in the register pair <R0,R1>
	EDIV ... as before

You might like to time THAT sequence, and rethink your post.  Or you
could take my word for it, that when you include the cost of having
to reserve and target into an even-odd register pair, the EDIV is
almost always slower.

sef@kithrup.COM (Sean Eric Fagan) (05/23/91)

In article <1991May22.001620.751@craycos.com> rh@craycos.com (Robert Herndon) writes:
>but population count and leading zero count
>on a word value are quite useful, yet (though I'm not a compiler writer)
>I can't think of easy methods for a (non-vector) compiler to use these
>instructions often.

You can do pop count quite quickly with table lookup.  Yes, you have to
break it up, but, on a byte-addressable machine, it's not a major hassle.
You will need a 256-byte table, of course, but that, also, isn't a major
hassle.  CLZ I can also do somewhat quickly with table lookup on a
byte-addressable machine.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

peter@ficc.ferranti.com (peter da silva) (05/23/91)

In article <24216@lanl.gov>, jlg@cochiti.lanl.gov (Jim Giles) writes:
> Leading zero count is a 3 clock instruction.  I doubt that could be
> implemented in fewer clocks than the pop count if you did it with a
> table.

I suspect it could. You would only have to do a table-lookup on the first
non-zero byte. You could even use binary search techniques to reduce the
number of compares with zero.
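
Something like the following sketch, say, for 32-bit words; the table, its
initialization, and the function name are all invented for illustration:

#include <stdio.h>

/* Binary-search down to the first non-zero byte, then finish with one
 * 256-entry table lookup. */
static unsigned char lz_table[256];	/* leading zeroes within one byte */

static void init_lz_table(void)
{
	int i, b;

	lz_table[0] = 8;
	for (i = 1; i < 256; i++) {
		for (b = 7; !(i & (1 << b)); b--)
			;
		lz_table[i] = (unsigned char)(7 - b);
	}
}

static int count_leading_zeroes(unsigned long n)
{
	if (n & 0xffff0000UL) {			/* a one in the top half? */
		if (n & 0xff000000UL)
			return lz_table[(n >> 24) & 0xff];
		return 8 + lz_table[(n >> 16) & 0xff];
	}
	if (n & 0x0000ff00UL)
		return 16 + lz_table[(n >> 8) & 0xff];
	return 24 + lz_table[n & 0xff];
}

int main(void)
{
	init_lz_table();
	printf("%d %d %d\n", count_leading_zeroes(1UL),
	    count_leading_zeroes(0x00800000UL), count_leading_zeroes(0UL));
	/* prints 31 8 32 */
	return 0;
}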

But the question isn't whether you can do these operations in software
anywhere near as fast as they can be done in hardware, but rather whether
they can be done fast enough. That is, how much faster would applications
that make heavy use of these instructions run if you sped them up by a
factor of 10 (which is about what I would have guessed the speed difference
to be)? If you speed up 3% of the program by a factor of 10, it's pretty
much lost in the noise.
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)

In article <1991May23.220143.8515@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
|> [...]
|> It would?  John, would you care to comment on that?  It seems that MIPS
|> engineers were derelict in their job, and didn't realize that adding
|> hardware pop-count would speed up everything else the chip does.

As they would have said on the old Perry Mason show: incompetent, 
irrelevant, and immaterial.  I never said that pop count would speed
the rest of the chip's functions - so your comment has _absolutely_
_no_relevance_ to the discussion at all.

Pop count _would_ speed the machine by allowing the operation to be
done in hardware instead of your slow table lookup.  This would speed
the very benchmarks you were bragging about in your previous post (the
R6000 is faster than Cray on scalar - sometimes this is even true).

J. Giles

jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)

In article <QAIBQ2@xds13.ferranti.com>, peter@ficc.ferranti.com (peter da silva) writes:
|> [...]
|> But the question isn't whether you can do these operations in software
|> anywhere near as fast as they can be done in hardware, but rather whether
|> they can be done fast enough. That is, how much faster would applications
|> that make heavy use of these instructions run if you sped them up by a
|> factor of 10 (which is about what I would have guessed the speed difference
|> to be)? If you speed up 3% of the program by a factor of 10, it's pretty
|> much lost in the noise.

Exactly so.  The only thing I was saying is that the software mechanisms
are _not_ done "quite quickly" compared to a hardware implementation of
the same thing.  Whether it is worthwhile to put an instruction into the
hardware is dependent on two things: 1) the commonality of the operation,
and 2) the amount by which the hardware is faster.  Sean Fagan was claiming
that the speedup was insignificant - it isn't.  Now, if you want to claim
the instruction is rarely needed, that's a completely different issue.
It also depends on two things: a) what is the application domain, and 
b) what features _are_ present in the hardware (individual instructions
cannot be selected in isolation from the rest of the instruction set).

J. Giles

sef@kithrup.COM (Sean Eric Fagan) (05/24/91)

In article <24216@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes:
>It would amaze me to find any machine (on which the test could be done) 
>where a table lookup came within an order of magnitude of a hardware
>instruction on these functions.

How about a Cyber?  A Cyber, without the pop-count hardware, takes
something like 60 cycles to do a popcount.  And the Cray has lousy
memory-access times, and isn't a byte-addressable machine.  How long would

	char *byte = (char *)&word;
	pop_count = table[byte[0]] + table[byte[1]] + table[byte[2]] +
		table[byte[3]];

take on a machine with somewhat better memory accesses?  Say, an R6000, or
even a Sparc?

And don't forget that, for serial code, the R6000 is faster than the Cray.
So that doesn't quite count as a "slow machine," does it?

So here is a way of doing pop-count, quite quickly (it's possible for a
compiler to put the byte[x] into registers and not have to access memory,
the first reference to table could put quite a bit of table into a cache,
and if you have pipelined loads, it *does* go *very* quickly), that doesn't
require any special instructions.  And will work the same way, if not
faster, on later versions of the processor.  This is not true with
instructions that don't get a lot of use:  witness the 68040 and
transcendental instructions.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

jimb@shograf.com (Jim battle) (05/24/91)

One place I worked used to ask job applicants to write a program to count
the number of 1's in an integer.  Everyone pretty much did the same thing;
see if the number is odd, tally, and shift.  One person came up with an
ingenious method that does not involve tables or conditionals.  Here is his
solution, given in pseudo assembly language; it assumes 32-bit integers,
although it would work for any word size with little modification:

   Given: r1 is a 32-bit integer with an unknown number of 1's set.

   move r1,r2               /* consider r1 to hold 32 1-bit sums; */
   shr  #1,r2               /*   destinations are on the right */
   and  #55555555,r1
   and  #55555555,r2
   add  r2,r1               /* now r1 has 16 two-bit sums; b0+b1, b2+b3, etc */
   move r1,r2
   shr  #2,r2
   and  #33333333,r1
   and  #33333333,r2
   add  r2,r1               /* now r1 has 8 four-bit sums; b0+b1+b2+b3, etc */
   ...
   move r1,r2
   shr  #16,r2
   and  #0000ffff,r1
   and  #0000ffff,r2
   add  r2,r1               /* now r1 has the total of all 1 bits */

For a 64-bit number, just one more step is needed.
Maybe not useful, but I thought I'd mention it.

As far as I know, this is an original approach; credit goes to Richard Poppen.
I believe he works at Etak.

clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/24/91)

In article <25874@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
>
>>Given the C source code statement:
>
>>	z = x % y;   /* z gets the remainder of x divided by y */
>
>>... we generate
>>the 3-instruction sequence:
>>
>>	movl	r6,r1		/* Transfer dividend to r1 */
>>	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
>>				   register pair dividend */
>>	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
>>				   into r2 and keep remainder in r11 */
>
>I hope not.  From the previous code fragment, it is clear you are
>expecting the remainder from SIGNED division.  If you want the same
>answer as before, the code must be
>
>	MOVL R6,R1		; construct the sign-extended 64-bit ...
>	ASHQ #-32,R0,R0		; dividend in the register pair <R0,R1>
>	EDIV ... as before
>

Thanks for pointing out my error. I looked into the VAX Architecture Handbook
and it seems that you are trying to get "ASHL  #-32,R1,R0" in your second
statement. "ASHQ #-32,R0,R0" takes a heck of a long time and gives the wrong
answer. "ASHL #-32,R1,R0" takes the sign bit of R1 and fills R0 with it.
Unfortunately, this seems to be the best way to sign extend on the VAX.
(The coercion instructions don't include CVTLQ == coerce longword to quadword,
so the apparently slower pseudo-shift is the best we can do.)

>You might like to time THAT sequence, and rethink your post.  Or you
>could take my word for it, that when you include the cost of having
>to reserve and target into an even-odd register pair, the EDIV is
>almost always slower.

Well, I timed the new sequence, and a little over half of my speedup 
disappeared, but it is still faster by more than 10% compared to what
"cc -O" does.

As for register allocation issues, that is a complex subject on the VAX.
Registers R6 through R11 are "allocable" general-purpose registers. When
translating most source code statements, you can consider R0 through R5
available. (BTW, Robert, this little tutorial is not directed at you, but
at those new to the VAX register set.) As long as you aren't doing weird
assembly language instructions like CRC or POLY or the string instructions,
R2 through R5 are not going to be trampled by anything. R0 and R1 are used
for the return value of a function, so usage of them has to be temporary
and not live across a function call.

Really good register allocation will use R0 through R11 as much as is legal.
Simpler and poorer register allocation will only use R6 through R11 except
when a single intermediate-code statement is translated into multiple
assembly language statements, and those statements need scratch registers
that will be dead upon conclusion of the single intermediate code operation.
Thus, my code above used R0 through R2 temporarily.

The point here is that "cc -O" is doing the same thing. It generated a sequence
of 3 instructions for the remainder operation, and used R0 as a scratch 
register. Thus, for simple and stupid register allocators, R0 and R1 are
always available as a nice even-odd register pair for scratch usage.
(Although the VAX does not care about even-odd pairs, so I am not sure
why you mentioned them. A contiguous pair is all that is needed.) A smarter
allocator might want to avoid using "ediv" for the remainder operation
because of the need to reserve a pair of registers. (A REALLY smart code
generator might look to see if a pair is available for scratch use, and
generate the "ediv" code if it were, and the "cc -O" sequence otherwise.
And the first instruction of my sequence is unnecessary if the next lower
numbered register is unused at the moment; the ASHL and EDIV can be done
in place.)

The point still remains: "cc -O" produces less than optimal code that
biases instruction count analysis of the architecture. I am still wondering
how system architects handle this bias when determining the future path
of the architecture. And how it affects the famous pronouncements about
how CISCs all have umpteen never-used instructions and umpteen more
rarely-used instructions.

(BTW, I will check the timings again on the VAX 11/750. The speedup I
confirmed was on the VAX 8600. If it is different on the VAX 11/750,
that just points out that a code generator can get outdated and bias
instruction counts. So the point remains the same.)




-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be 
regarded as a criminal offence." E.W.Dijkstra, 18th June 1975.
|||  clc5q@virginia.edu (Clark L. Coleman)

sef@kithrup.COM (Sean Eric Fagan) (05/24/91)

In article <24263@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes:
>The R6000 would
>be even _faster_ if pop count were a hardware instruction!!

It would?  John, would you care to comment on that?  It seems that MIPS
engineers were derelict in their job, and didn't realize that adding
hardware pop-count would speed up everything else the chip does.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

bruce@contingency.think.com (Bruce Walker) (05/24/91)

As long as we are on the subject of bitcounting algorithms...

One of my favorite toys is the fact that the expression "N & (N-1)" turns
off the lowest order one-bit in N.  Hence, it's a quick "find first", and
can be made into a 4-cycle SPARC popcount loop which runs proportional to
the number of ones (repeat N &= N-1 until N==0).  I like the Poppen
partial-sum algorithm, though.
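
Written out in C, the loop is simply the following (a sketch, assuming the
value fits in an unsigned long):

#include <stdio.h>

/* Each pass turns off the lowest-order one-bit, so the number of passes
 * is the number of one-bits. */
static int popcount_sparse(unsigned long n)
{
	int count = 0;

	while (n != 0) {
		n &= n - 1;		/* clear the lowest set bit */
		count++;
	}
	return count;
}

int main(void)
{
	printf("%d\n", popcount_sparse(0xf0f0UL));	/* prints 8 */
	return 0;
}
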
--
--Bruce Walker
  Thinking Machines Corporation, Cambridge, MA
  bruce@think.com; +1 617 234 4810; N1IKV

christer@cs.umu.se (Christer Ericson) (05/24/91)

In article <1991May23.204624.22872@shograf.com> jimb@shograf.UUCP (Jim battle) writes:
>One place I worked used to ask job applicants to write a program to count
>the number of 1's in an integer.  Everyone pretty much did the same thing;
>see if the number is odd, tally, and shift.  One person came up with an
>ingenious method that does not involve tables or conditionals.  Here is his
>solution, given in pseudo assembly language; it assumes 32-bit integers,
>although it would work for any word size with little modification:
>
> [ code deleted]
>
>As far as I know, this is an original approach; credit goes to Richard Poppen.
>I believe he works at Etak.

Nope, it isn't really. See Mike Morton's article in the December 1990 issue
of Computer Language for numerous ways to count the 1s in an integer.


| Christer Ericson                            Internet: christer@cs.umu.se |
| Department of Computer Science, University of Umea, S-90187 UMEA, Sweden |

rabin@CS.Yale.Edu (Dan Rabin) (05/24/91)

In article <1991May23.204624.22872@shograf.com> jimb@shograf.com (Jim
battle) gives a technique (credited to Richard Poppen) for counting
the number of one-bits in a word.

For historical interest, a similar technique was proposed in 

    HAKMEM  -  Artificial Intelligence Memo No. 239
    Massachusetts Institute of Technology A. I. Laboratory
    M. Beeler, R. W. Gosper & R. Schroeppel, February 29,1972

The relevant excerpt is:

    ITEM 169 (in order of one-ups-manship:  Gosper, Mann, Lenard, [Root and
    Mann]):

    To count the ones in a PDP-6/10 word:

	    LDB B,[014300,,A]       ;or MOVE B,A then LSH B,-1
	    AND B,[333333,,333333]
	    SUB A,B
	    LSH B,-1
	    AND B,[333333,,333333]
	    SUBB A,B   ;each octal digit is replaced by number of 1's in it
	    LSH B,-3
	    ADD A,B
	    AND A,[070707,,070707]
	    IDIVI A,77              ;casting out 63.'s

    These ten instructions, with constants extended, would  work  on  word
    lengths up to 62.;  eleven suffice up to 254..

This proposal uses an integer divide instruction to carry out the
summation; the remainder is the result of interest.  This is
ingenious, but possibly disadvantageous on a RISC that lacks a divide
instruction.  The same result could be obtained by a short sequence of
shifts and adds, I think.

  -- Dan Rabin
     rabin-dan@cs.yale.edu

craig@netcom.COM (Craig Hansen) (05/25/91)

The C code below uses the well-known doubling technique
to count ones in a bit string of arbitrary length.

#define UNSINT_BIT 32
#define LOG_UNSINT_BIT 5

typedef unsigned int UnsInt;
typedef UnsInt *PToUnsInt;
typedef PToUnsInt Bits;

UnsInt bcount(Bits b, UnsInt size) {
    UnsInt count = 0;
    UnsInt words = (size+UNSINT_BIT-1)>>LOG_UNSINT_BIT;
    UnsInt m = ~((-2)<<(UNSINT_BIT-1 - ((words<<LOG_UNSINT_BIT)-size)));
    UnsInt i;
    
    for (i=0; i!=words; i++) {
	UnsInt v = b[i];
	if (i==words-1) v &= m;
#if UNSINT_BIT == 32
	v = ((v>> 1)&0x55555555) + (v&0x55555555);
	v = ((v>> 2)&0x33333333) + (v&0x33333333);
	v = ((v>> 4)&0x0f0f0f0f) + (v&0x0f0f0f0f);
	v = ((v>> 8)&0x00ff00ff) + (v&0x00ff00ff);
	v = ((v>>16)&0x0000ffff) + (v&0x0000ffff);
	count += v;
#else
	while (v != 0) {
	    count++;
	    v = v & (v-1);
	}
#endif
    }
    return count;
}


With a little effort, this code can be further improved by doubling up
words before reducing the word down to a single sum. This improvement
takes advantage of the fact that an n-bit field holds sums of up to
2**n-1.  After the "v = ((v>> 2)&0x33333333) + (v&0x33333333);"
statement, you have sums in 4 bit fields that can hold sums from two
words, and after the "v = ((v>> 4)..."  statement you have sums in 8
bit fields that can hold sums from as many as 31 words.
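
A rough sketch of that pairing idea, assuming 32-bit unsigned ints as in
bcount() above (the function name is made up):

#include <stdio.h>

/* Run the first two reduction steps on two words separately, add them while
 * the fields are still only 4 bits wide (each field then holds at most
 * 4 + 4 = 8), and finish the reduction once for the pair. */
static unsigned count_pair(unsigned a, unsigned b)
{
	unsigned v;

	a = ((a >> 1) & 0x55555555) + (a & 0x55555555);
	a = ((a >> 2) & 0x33333333) + (a & 0x33333333);
	b = ((b >> 1) & 0x55555555) + (b & 0x55555555);
	b = ((b >> 2) & 0x33333333) + (b & 0x33333333);
	v = a + b;				/* 4-bit sums from both words */
	v = ((v >> 4) & 0x0f0f0f0f) + (v & 0x0f0f0f0f);
	v = ((v >> 8) & 0x00ff00ff) + (v & 0x00ff00ff);
	return ((v >> 16) & 0x0000ffff) + (v & 0x0000ffff);
}

int main(void)
{
	printf("%u\n", count_pair(0xffffffff, 0x0000000f));	/* prints 36 */
	return 0;
}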

I would also note that a branch-free "count leading zeros" sequence
can be constructed from the sequence:

	v = v | (v>>16);
	v = v | (v>> 8);
	v = v | (v>> 4);
	v = v | (v>> 2);
	v = v | (v>> 1);

...followed by the population count sequence, and a subtraction of the
result from the word size.
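
Put together, a branch-free count-leading-zeros for 32-bit words might look
like the following sketch (the name clz32 is made up):

#include <stdio.h>

/* Smear the leading one rightwards, count the ones, subtract from 32. */
static unsigned clz32(unsigned v)
{
	v |= v >> 16;
	v |= v >> 8;
	v |= v >> 4;
	v |= v >> 2;
	v |= v >> 1;			/* ones from the leading one downwards */
	v = ((v >>  1) & 0x55555555) + (v & 0x55555555);
	v = ((v >>  2) & 0x33333333) + (v & 0x33333333);
	v = ((v >>  4) & 0x0f0f0f0f) + (v & 0x0f0f0f0f);
	v = ((v >>  8) & 0x00ff00ff) + (v & 0x00ff00ff);
	v = ((v >> 16) & 0x0000ffff) + (v & 0x0000ffff);
	return 32 - v;
}

int main(void)
{
	printf("%u %u %u\n", clz32(1), clz32(0x00400000), clz32(0));
	/* prints 31 9 32 */
	return 0;
}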

Regards,
Craig Hansen
craig@microunity.com

torek@elf.ee.lbl.gov (Chris Torek) (05/25/91)

In article <1991May24.122453.11011@cs.umu.se> christer@cs.umu.se
(Christer Ericson) writes:
>... See Mike Morton's article in the December 1990 issue
>of Computer Language for numerous ways to count the 1s in an integer.

There was an interesting little battle on `fastest way to count 1s'
in comp.lang.c some time ago, and I put together this program from
other people's methods, plus a few variants of my own.  The comments
in front of a function are usually the originals, sometimes edited a
bit by me.

Note that the fastest count function is quite machine dependent.

#ifndef lint
static char rcsid[] = "$Id: bct.c,v 1.5 90/10/13 08:44:12 chris Exp $";
#endif

/*
 * bct - bitcount timing
 *
 * Runs a bunch of different functions-to-count-bits-in-a-longword
 * (many assume 32 bits per long) with a timer around each loop, and
 * tries to calculate the time used.
 */
#include <stdio.h>
#include <sys/types.h>

#ifdef FD_SETSIZE
# define USE_GETRUSAGE
# include <sys/time.h>
# include <sys/resource.h>
#else
# include <sys/times.h>
#endif

#define SIZEOF(a) (sizeof(a)/sizeof(a[0]))


/*
 * This function is used to calibrate the timing mechanism.
 * This way we can subtract the loop and call overheads.
 */
int
calibrate(n)
	register unsigned long n;
{
	return 0;
}


/*
 * This function counts the number of bits in a long.
 * It is limited to 63 bit longs, but a minor mod can cope with 511 bits.
 *
 * It is so magic, an explanation is required:
 * Consider a 3 bit number as being
 *	4a+2b+c
 * if we shift it right 1 bit, we have
 *	2a+b
 * subtracting this from the original gives
 *	2a+b+c
 * if we shift the original 2 bits right we get
 *	a
 * and so with another subtraction we have
 *	a+b+c
 * which is the number of bits in the original number.
 * Suitable masking allows the sums of the octal digits in a
 * 32 bit number to appear in each octal digit.  This isn't much help
 * unless we can get all of them summed together.
 * This can be done by modulo arithmetic (sum the digits in a number modulo
 * the base of the number minus one), the old "casting out nines" trick they
 * taught in school before calculators were invented.
 * Now, using mod 7 won't help us, because our number will very likely have
 * more than 7 bits set.  So add the octal digits together to get base64
 * digits, and use modulo 63.
 * (Those of you with 64 bit machines need to add 3 octal digits together to
 * get base512 digits, and use mod 511.)
 *
 * This is HACKMEM 169, as used in X11 sources.
 */
int
t0_hackmemmod(n)
	register unsigned long n;
{
	register unsigned long tmp;

	tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
	return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}


/*
 * This is the same as the above, but does not use the % operator.
 * Few machines have single-cycle division, so the modulo can be much more
 * expensive than, say, an addition.
 */
int
t1_hackmemloop(n)
	register unsigned long n;
{
	register unsigned long tmp;

	tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
	tmp = (tmp + (tmp >> 3)) & 030707070707;
	while (tmp > 63)
		tmp = (tmp & 63) + (tmp >> 6);
	return tmp;
}

/*
 * Above, without using while loop.  It takes at most 5 iterations of the
 * loop, so we just do all 5 in-line.  The final result is never 63
 * (this is assumed above as well).
 */
int
t2_hackmemunrolled(n)
	register unsigned long n;
{
	register unsigned long tmp;

	tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
	tmp = (tmp + (tmp >> 3)) & 030707070707;
	tmp = (tmp & 63) + (tmp >> 6);
	tmp = (tmp & 63) + (tmp >> 6);
	tmp = (tmp & 63) + (tmp >> 6);
	tmp = (tmp & 63) + (tmp >> 6);
	tmp = (tmp & 63) + (tmp >> 6);
	return (tmp);
}


/*
 * This function counts the bits in a long.
 *
 * It works by removing the lsb and counting the number of times round the loop.
 * The expression (n & -n) yields the lsb of a number,
 * but it only works on 2's complement machines.
 */
int
t3_rmlsbsub(n)
	register unsigned long n;
{
	register int count;

	for (count = 0; n; n -= (n & -n))
		count++;
	return count;
}

int
t4_rmlsbmask(n)
	register unsigned long n;
{
	register int count;

	for (count = 0; n; count++)
		n &= n - 1;	/* take away lsb */
	return (count);
}

/*
 * This function counts the bits in a long.
 *
 * It works by shifting the number down and testing the bottom bit.
 */
int
t5_testlsb(n)
	register unsigned long n;
{
	register int count;

	for (count = 0; n; n >>= 1)
		if (n & 1)
			count++;
	return count;
}

/*
 * This function counts the bits in a long.
 *
 * It works by shifting the number left and testing the top bit.
 * On many machines shift is expensive, so it uses a cheap addition instead.
 */
int
t6_testmsb(n)
	register unsigned long n;
{
	register int count;

	for (count = 0; n; n += n)
		if (n & ~(~(unsigned long)0 >> 1))
			count++;
	return count;
}

int
t7_testsignandshift(n)
	register unsigned long n;
{
	register int count;

	for (count = 0; n; n <<= 1)
		if ((long)n < 0)
			count++;
	return (count);
}


/*
 * This function counts the bits in a long.
 *
 * It works by masking each bit.
 * This is the second most intuitively obvious method,
 * and is independent of the number of bits in the long.
 */
int
t8_testeachbit(n)
	register unsigned long n;
{
	register int count;
	register unsigned long mask;

	count = 0;
	for (mask = 1; mask; mask += mask)
		if (n & mask)
			count++;
	return count;
}


/*
 * This function counts the bits in a long.
 *
 * It works by masking each bit.
 * This is the most intuitively obvious method,
 * but how do you know a priori how many bits are in a long?
 * (except via the ``sizeof(long) * CHAR_BIT'' expression)
 */
int
t9_testeachbit1shl(n)
	register unsigned long n;
{
	register int count;
	register int bit;

	count = 0;
	for (bit = 0; bit < 32; ++bit)
		if (n & ((unsigned long)1 << bit))
			count++;
	return count;
}

static char nbits[256] = {
	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

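/*
 * The same table again, but with int entries instead of char, apparently
 * so the cost of byte loads can be compared with the cost of full-word
 * loads.  (This comment is an editorial addition.)
 */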
static int inbits[256] = {
	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
};

int
tA_tableshift(n)
	register unsigned long n;
{

	return (nbits[n & 0xff] + nbits[(n >> 8) & 0xff] +
		nbits[(n >> 16) & 0xff] + nbits[n >> 24]);
}

int
tB_tableuchar(n)
	unsigned long n;
{
	register unsigned char *p = (unsigned char *)&n;

	return (nbits[p[0]] + nbits[p[1]] + nbits[p[2]] + nbits[p[3]]);
}

int
tC_tableshiftcast(n)
	register unsigned long n;
{

	return nbits[(unsigned char) n] +
		nbits[(unsigned char) (n >> 8)] +
		nbits[(unsigned char) (n >> 16)] +
		nbits[(unsigned char) (n >> 24)];
}

int
tD_itableshift(n)
	register unsigned long n;
{

	return (inbits[n & 0xff] + inbits[(n >> 8) & 0xff] +
		inbits[(n >> 16) & 0xff] + inbits[n >> 24]);
}

int
tE_itableuchar(n)
	unsigned long n;
{
	register unsigned char *p = (unsigned char *)&n;

	return (inbits[p[0]] + inbits[p[1]] + inbits[p[2]] + inbits[p[3]]);
}

int
tF_itableshiftcast(n)
	register unsigned long n;
{

	return inbits[(unsigned char) n] +
		inbits[(unsigned char) (n >> 8)] +
		inbits[(unsigned char) (n >> 16)] +
		inbits[(unsigned char) (n >> 24)];
}

/*
 * Explanation:
 * First we add 32 1-bit fields to get 16 2-bit fields.
 * Each 2-bit field is one of 00, 01, or 10 (binary).
 * We then add all the two-bit fields to get 8 4-bit fields.
 * These are all one of 0000, 0001, 0010, 0011, or 0100.
 *
 * Now we can do something different, because for the first
 * time the value in each k-bit field (k now being 4) is small
 * enough that adding two k-bit fields results in a value that
 * still fits in the k-bit field.  The result is four 4-bit
 * fields containing one of {0000,0001,...,0111,1000} and four
 * more 4-bit fields containing junk (sums that are uninteresting).
 * Pictorially:
 *	    n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
 *	 n>>4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
 *	  sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
 * where W, X, Y, and Z are the interesting sums (each at most 1000,
 * or 8 decimal).  Masking with 0x0f0f0f0f extracts these.
 *
 * Now we can change tactics yet again, because now we have:
 *	    n = 0000WWWW0000XXXX0000YYYY0000ZZZZ
 *	 n>>8 = 000000000000WWWW0000XXXX0000YYYY
 * so	  sum = 0000WWWW000ppppp000qqqqq000rrrrr
 * where p and r are the interesting sums (and each is at most
 * 10000, or 16 decimal).  The sum `q' is junk, like i, j, and
 * k above; but it is not necessary to discard it this time.
 * One more fold, this time by sixteen bits, gives
 *	    n = 0000WWWW000ppppp000qqqqq000rrrrr
 *	n>>16 = 00000000000000000000WWWW000ppppp
 * so	  sum = 0000WWWW000ppppp000sssss00tttttt
 * where s is at most 11000 (24 decimal) and t is at most 100000 (32 decimal).
 *
 * Now we have t = r+p = (Z+Y)+(X+W) = ((h+g)+(f+e))+((d+c)+(b+a)),
 * or in other words, t is the number of bits set in the original
 * 32-bit longword.  So all we have to do is return the low byte
 * (or low 6 bits, but `low byte' is typically just as easy if not
 * easier).
 *
 * This technique is also applicable to 64 and 128 bit words, but
 * 256 bit or larger word sizes require at least one more masking
 * step.
 */
int
tG_sumbits(n)
	register unsigned long n;
{

	n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
	n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
	n = (n + (n >> 4)) & 0x0f0f0f0f;
	n += n >> 8;
	n += n >> 16;
	return (n & 0xff);
}
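
/*
 * The same folding technique applied to a 64-bit word, as the comment
 * above notes is possible (an added sketch, not part of the original
 * posting).  Assumes a 64-bit "unsigned long long".
 */
int
tGa_sumbits64(unsigned long long n)
{
	n = (n & 0x5555555555555555ULL) + ((n >> 1) & 0x5555555555555555ULL);
	n = (n & 0x3333333333333333ULL) + ((n >> 2) & 0x3333333333333333ULL);
	n = (n + (n >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	n += n >> 8;
	n += n >> 16;
	n += n >> 32;
	return (int)(n & 0xff);
}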

/*
 * build a long random number.
 * The standard rand() returns at least a 15 bit number.
 * We use the top 9 of 15 (since the lower N bits of the usual rand()
 * repeat with a period of 2^N).
 */
unsigned long
bigrand()
{
#define randbits() ((unsigned long)((rand() >> 6) & 0777))
	register unsigned long r;

	r = randbits() << 27;
	r |= randbits() << 18;
	r |= randbits() << 9;
	r |= randbits();
	return (r);
}

/*
 * Run the test many times.
 * You will need a running time in the 10s of seconds
 * for an accurate result.
 *
 * To give a fair comparison, the random number generator
 * is seeded the same for each measurement.
 *
 * Return value is seconds per iteration.
 */
#ifndef REPEAT
#if defined(mips) || defined(sparc)
#define REPEAT (1L<<20)
#else
#define REPEAT (1L<<18)
#endif
#endif

double
measure(func)
	register int	(*func)();
{
#ifdef USE_GETRUSAGE
	struct rusage ru0, ru1;
	register long j;

	srand(1);
	(void) getrusage(RUSAGE_SELF, &ru0);
	for (j = 0; j < REPEAT; ++j)
		func(bigrand());
	(void) getrusage(RUSAGE_SELF, &ru1);
	ru1.ru_utime.tv_sec -= ru0.ru_utime.tv_sec;
	if ((ru1.ru_utime.tv_usec -= ru0.ru_utime.tv_usec) < 0) {
		ru1.ru_utime.tv_usec += 1000000;
		ru1.ru_utime.tv_sec--;
	}
	return ((ru1.ru_utime.tv_sec + (ru1.ru_utime.tv_usec / 1000000.0)) /
		(double)REPEAT);
#else
	register long	j;
	struct tms	start;
	struct tms	finish;

	srand(1);
	times(&start);
	for (j = 0; j < REPEAT; ++j)
		func(bigrand());
	times(&finish);
	return ((finish.tms_utime - start.tms_utime) / (double)REPEAT);
#endif
}

struct table {
	char	*name;
	int	(*func)();
	double	measurement;
	int	trial;
};

struct table	table[] =
{
	{ "hackmemmod",		t0_hackmemmod },
	{ "hackmemloop", 	t1_hackmemloop },
	{ "hackmemunrolled",	t2_hackmemunrolled },
	{ "rmlsbsub", 		t3_rmlsbsub },
	{ "rmlsbmask",	 	t4_rmlsbmask },
	{ "testlsb",		t5_testlsb },
	{ "testmsb", 		t6_testmsb },
	{ "testsignandshift", 	t7_testsignandshift },
	{ "testeachbit", 	t8_testeachbit },
	{ "testeachbit1shl", 	t9_testeachbit1shl },
	{ "tableshift", 	tA_tableshift },
	{ "tableuchar", 	tB_tableuchar },
	{ "tableshiftcast",	tC_tableshiftcast },
	{ "itableshift", 	tD_itableshift },
	{ "itableuchar", 	tE_itableuchar },
	{ "itableshiftcast",	tF_itableshiftcast },
	{ "sumbits",		tG_sumbits },
};

main(argc, argv)
	int argc;
	char **argv;
{
	double calibration, v, best;
	int j, k, m, verbose;

	verbose = argc > 1;
	/*
	 * run a few tests to make sure they all agree
	 */
	srand(getpid());
	for (j = 0; j < 10; ++j) {
		unsigned long n;
		int bad;

		n = bigrand();
		for (k = 0; k < SIZEOF(table); ++k)
			table[k].trial = table[k].func(n);
		bad = 0;
		for (k = 0; k < SIZEOF(table); ++k)
			for (m = 0; m < SIZEOF(table); ++m)
				if (table[k].trial != table[m].trial)
					bad = 1;
		if (bad) {
			printf("wrong: %08lX", n);
			for (k = 0; k < SIZEOF(table); ++k)
				printf(" %3d", table[k].trial);
			printf("\n");
		}
	}

	/*
	 * calibrate the timing mechanism
	 */
	calibration = measure(calibrate);
	if (verbose)
		printf("calibration: %g\n", calibration);

	/*
	 * time them all, keeping track of the best (smallest)
	 */
	for (j = 0; j < SIZEOF(table); ++j) {
		v = measure(table[j].func) - calibration;
		if (verbose)
			printf("%s: %g\n", table[j].name, v);
		table[j].measurement = v;
		if (!j || v < best)
			best = v;
	}
	(void) printf("%-24s   %-14sratio\n", "function", "time");
	for (j = 0; j < SIZEOF(table); ++j) {
		(void) printf("%-24s %#10.8g  %6.3f\n",
		    table[j].name,
		    table[j].measurement,
		    table[j].measurement / best);
	}
	exit(0);
}
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

peter@ficc.ferranti.com (peter da silva) (05/25/91)

In article <1991May24.175323.9730@netcom.COM>, craig@netcom.COM (Craig Hansen) writes:
> The C code below uses the well-known doubling technique
> to count ones in a bit string of arbitrary length.

Yow. How about combining it with Duff's Device to *really* make it confusing,
and allow the bit string to start on an arbitrary byte boundary.
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

torek@elf.ee.lbl.gov (Chris Torek) (05/25/91)

In article <13555@dog.ee.lbl.gov> I wrote:
>... and I put together this program from other people's methods,
>plus a few variants of my own.

I had forgotten, but have now remembered, that I based it on someone
else's program.  Unfortunately, I have forgotten who the Someone Else
was, so one of the Teeming Millions will have to supply the Credit
Where Credit Is Due Department with the answer to that one.

I also have some more bitcount functions from Arthur Olson and others;
one of these (ab)uses the C preprocessor to build a 65536-entry count
table at compile time, so that 32-bit counts can be computed as

	tab[n & 0xffff] + tab[(unsigned long)n >> 16]

which is often faster than any of the methods in my previous posting.
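
[The following sketch is an editorial addition, not the code referred to
above: it gets an equivalent 65536-entry table by filling it at run time
rather than with the preprocessor, and the names are made up.]

#include <stdio.h>

static unsigned char tab16[1 << 16];

static void
init_tab16(void)
{
	unsigned long i;

	tab16[0] = 0;
	for (i = 1; i < (1UL << 16); i++)
		tab16[i] = tab16[i >> 1] + (unsigned char)(i & 1);
}

static int
count32(unsigned long n)
{
	return tab16[n & 0xffff] + tab16[(n >> 16) & 0xffff];
}

int
main(void)
{
	init_tab16();
	printf("%d\n", count32(0xdeadbeefUL));	/* prints 24 */
	return 0;
}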
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

root@apollo.HP.COM ( root) (05/25/91)

for generating hi-bit counts... maybe someone can provide a reference?

Sorry for the vagueness of the post...


Dipankar


Dipankar Gupta
Hewlett-Packard
Bangalore, INDIA			messiah@india.hp.com

gideony@microsoft.UUCP (Gideon YUVAL) (05/27/91)

In article <49278@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes:
>For people who use computers for real-time data acquisition and control,
>and there are a lot of them, floating point operations are very seldom
>used, if at all.  ...

Obvious question: is FP avoided because it is the wrong thing, or is it
avoided due to its cost (a 387 is a few hundred $) and slowness
(memory-to-memory floating-add can be 100 cycles)?

If anyone has data on this, I would be grateful.
-- 
Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)

herrickd@iccgcc.decnet.ab.com (05/30/91)

In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[Herman supplies some of the specifics people have been asking for]

> There are provisions for octal and hex
> integers [in c],

I thought so too, and one day I tried to write an integer constant
in octal.  The compiler said, "Nuts to you!"  It took some hours,
but I finally convinced myself that the compiler manual and then
Kernighan and Ritchie provide octal notation for CHARACTERS.  Nothing
else!  I thought it was a major design flaw, and no accident.

dan herrick
herrickd@iccgcc.decnet.ab.com

herrickd@iccgcc.decnet.ab.com (05/30/91)

Aren't you afraid you will spoil the argument by bringing some
facts into it?

In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
[most of his wonderful data omitted - see the original article]
> 
> Given the C source code statement:
> 
> 	z = x % y;   /* z gets the remainder of x divided by y */
> 
> 
> 	movl	r6,r1		/* Transfer quotient to r1 */
> 	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
> 				   register pair quotient */
> 	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
> 				   into r2 and keep remainder in r11 */
> 
This looks to me like it does only unsigned div/rem.  Is that a feature
of the ediv instruction or do we need a way to sign extend r1 into r0
for the signed division case?

dan herrick
herrickd@iccgcc.decnet.ab.com

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/30/91)

In article <4712.2843b438@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com writes:
> Aren't you afraid you will spoil the argument by bringing some
> facts into it?
> 
> In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
> [most of his wonderful data omitted - see the original article]
> 
> > Given the C source code statement:
 
> > 	z = x % y;   /* z gets the remainder of x divided by y */
 
 
> > 	movl	r6,r1		/* Transfer quotient to r1 */
> > 	clrl	r0		/* Zero out upper word to form 64-bit r0/r1
> > 				   register pair quotient */
> > 	ediv	r7,r0,r2,r11	/* Divide r0-r1 pair by r7; throw away quotient
> > 				   into r2 and keep remainder in r11 */
 
> This looks to me like it does only unsigned div/rem.  Is that a feature
> of the ediv instruction or do we need a way to sign extend r1 into r0
> for the signed division case?

This points out another reason for cheap instructions at little cost.  
There are a rather large number of cases in which unsigned arithmetic
is needed.  In fact, the EMUL description in the VAX manual even
indicates that, for multiple precision arithmetic, this is quite
definitely the case.

Now how much additional silicon would it take to provide both signed
and unsigned?  The VAX does have instructions to sign extend 8 and
16 bit units to 16 and 32; why not to extend 32 to 64?  These are
cheap instructions.  Of course, if it is fast enough, a shift of -32
would sign-extend a 32 to a 64.
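
[Editorial sketch, not from the VAX manual or the posting: in C terms,
producing the high half of the sign-extended 64-bit dividend is just
replicating the sign bit, e.g. with an arithmetic right shift.  Assumes
32-bit longs and that >> on a negative long shifts arithmetically, which
C leaves implementation-defined.]

#include <stdio.h>

int
main(void)
{
	long x = -1234;				/* low half of the dividend */
	unsigned long lo = (unsigned long)x;
	unsigned long hi = (unsigned long)(x >> 31);	/* all ones if x < 0, else 0 */

	printf("hi = %08lx, lo = %08lx\n", hi, lo);
	return 0;
}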
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

dab@ubitrex.mb.ca (Danny Boulet) (05/30/91)

In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:
>In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>[Herman supplies some of the specifics people have been asking for]
>
>> There are provisions for octal and hex
>> integers [in c],
>
>I thought so too, and one day I tried to write an integer constant
>in octal.  The compiler said, "Nuts to you!"  It took some hours,
>but I finally convinced myself that the compiler manual and then
>Kernigan and Ritchie provide octal notation for CHARACTERS.  Nothing
>else!  I thought it was a major design flaw and was no accident.
>
>dan herrick
>herrickd@iccgcc.decnet.ab.com

Octal constants are written with the first digit zero.  The following
program

    main()
    {
	printf("test number is %d\n",01234567);
	printf("reverse is 0%o\n",342391);
	exit(0);
    }

produced the following output using cc on my Sun Sparc IPC running
SunOS 4.1.1 and should produce identical output on any 32-bit int
implementation of C:

    test number is 342391
    reverse is 01234567

richard@aiai.ed.ac.uk (Richard Tobin) (05/30/91)

In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:
>> There are provisions for octal and hex
>> integers [in c],

>I thought so too, and one day I tried to write an integer constant
>in octal.  The compiler said, "Nuts to you!"  It took some hours,
>but I finally convinced myself that the compiler manual and then
>Kernigan and Ritchie provide octal notation for CHARACTERS.  Nothing
>else!  I thought it was a major design flaw and was no accident.

But you were mistaken.  Octal constants are written by putting a
leading zero on the number (a horrid convention).  Check out "octal
constant" in the index of K&R (either edition).

   Example:    printf("%d\n", 01234);

-- Richard

-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

ted@nmsu.edu (Ted Dunning) (05/31/91)

In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

   > There are provisions for octal and hex
   > integers [in c],

   I thought so too, and one day I tried to write an integer constant
   in octal.  The compiler said, "Nuts to you!"  It took some hours,
   but I finally convinced myself that the compiler manual and then
   Kernigan and Ritchie provide octal notation for CHARACTERS.  Nothing
   else!  I thought it was a major design flaw and was no accident.


kythera% cat x.c
#include <stdio.h>

main()
{
    printf("%d\n", 010);
}
kythera% x
8
kythera%

herrickd@iccgcc.decnet.ab.com (05/31/91)

In article <4711.2843a523@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com writes:
> In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> [Herman supplies some of the specifics people have been asking for]
> 
>> There are provisions for octal and hex
>> integers [in c],

[Never being slow with the foot around an open mouth, I said]
> 
> I thought so too, and one day I tried to write an integer constant
> in octal.  The compiler said, "Nuts to you!"  It took some hours,
> but I finally convinced myself that the compiler manual and then
> Kernigan and Ritchie provide octal notation for CHARACTERS.  Nothing
> else!  I thought it was a major design flaw and was no accident.
> 
> dan herrick
> herrickd@iccgcc.decnet.ab.com

Several people have quietly pointed out that c recognizes constants
with a leading zero as octal numerals.  Even one from Edinburgh
Castle (according to his domain name)!

So now I know I was dumb that day several years ago - I honestly did
work hard trying to find it in the indexes and so on, but I was
fixated on the \377 notation, which probably blinded me to 0777.

Thank you all.

dan herrick

mac@gold.kpc.com (Mike McNamara) (05/31/91)

>  In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:
>  >  In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>  [Herman supplies some of the specifics people have been asking for]
>  
>  > There are provisions for octal and hex
>  > integers [in c],
>  
>  I thought so too, and one day I tried to write an integer constant
>  in octal.  The compiler said, "Nuts to you!"  It took some hours,
>  but I finally convinced myself that the compiler manual and then
>  Kernigan and Ritchie provide octal notation for CHARACTERS.  Nothing
>  else!  I thought it was a major design flaw and was no accident.
>  
>  dan herrick
>  herrickd@iccgcc.decnet.ab.com

	What vendor's compiler are you using?

consider
main(){

    int a = 0xabcdef01;
    int b = 0123567;
        
    printf("A is %#x (%d)\n",a,a);
    printf("B is %#o (%d)\n",b,b);
}

Produces for me:
A is 0xabcdef01 (-1412567295)
B is 0123567 (42871)
	
--
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+

abaum (Allen Baum) (05/31/91)

In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>This points out another reason for cheap instructions at little cost.  
>There are a rather large number of cases in which unsigned arithmetic
>is needed.  In fact, the EMUL description in the VAX manual even
>indicates that, for multiple precision arithmetic, this is quite
>definitely the case.
>
>Now how much additional silicon would it take to provide both signed
>and unsigned?  

I couldn't let this go by.
Herman, I accept the fact that you know far more about numerical
problems and what you need to speed them up. (Note that this is
damning with faint praise. It's easy to know more than I do.)

However, you are not a hardware designer. Do not presume that because
unsigned & signed multiply are both multiplies, including both
is cheap. It may not be the case. Booth-style multipliers are implicitly
signed, for example. To pretend they aren't means multiplying two
33-bit numbers.

This is just one example. There are others from your postings.
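
[Editorial sketch of the 33-bit remark at a smaller width, nothing to do
with any particular multiplier circuit: an unsigned nxn product equals the
signed product of the operands zero-extended by one bit, so a signed-only
multiplier needs (n+1)-bit operands to deliver an n-bit unsigned multiply.]

#include <stdio.h>

int
main(void)
{
	unsigned char a = 0xf0, b = 0xe0;	/* 8-bit values, top bit set */
	int sa = a, sb = b;			/* zero-extended: 9-bit, nonnegative */

	/* both lines print 53760 */
	printf("unsigned: %u\n", (unsigned)a * (unsigned)b);
	printf("signed:   %d\n", sa * sb);
	return 0;
}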

hrubin@pop.stat.purdue.edu (Herman Rubin) (06/01/91)

In article <182@armltd.uucp>, abaum (Allen Baum) writes:
> In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

			.....................

> I couldn't let this go by.
> Herman, I accept the fact that you know far more about numerical
> problems and what you need to speed them up. (Note that this is
> damning with faint praise. Its easy to know more than I do)

> However, you are not a hardware designer. Do not presume that because
> unsigned & signed multiply are both multiplies, that including both
> is cheap. It may not be the case. Boothe style mults. are implicitly
> signed, for example. To pretend they aren't mean multiplying two
> 33 bit numbers.
> 
> This is just one example. There are others from your postings.

That is one way of doing it.  Another is to conditionally put in
adds depending on sign.  This is easily done in hardware.

The point is that to make the greater flexibility cheap, it MUST be
done in the design.

For addition, I agree that 2's complement arithmetic is easier, but
not for anything else.  For multiprecision work, I have not seen any
non-redundant procedure which makes any sense other than sign-magnitude.
It is true that, with some trickery (which to be efficient would have
to be in hardware), giving up one bit/word would enable the procedures
without carry propagation to be used, but the hardware for that would be
more complicated than allowing the choice.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

herrickd@iccgcc.decnet.ab.com (06/04/91)

In article <182@armltd.uucp>, abaum (Allen Baum) writes:
> In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> 
[much omitted throughout the quotations]
>>This points out another reason for cheap instructions at little cost.  
>>
>>Now how much additional silicon would it take to provide both signed
>>and unsigned?  
> 
> Do not presume that because
> unsigned & signed multiply are both multiplies, that including both
> is cheap. It may not be the case. Boothe style mults. are implicitly
> signed, for example. To pretend they aren't mean multiplying two
> 33 bit numbers.

What fantastic advantage does one derive from using "Boothe style mults."
that makes up for losing unsigned arithmetic?  This sounds to me like a
reason for not using them in a general-purpose computer.

dan herrick
herrickd@iccgcc.decnet.ab.com

martin@adpplz.UUCP (Martin Golding) (06/05/91)

In <182@armltd.uucp> abaum (Allen Baum) writes:

>In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>>There are a rather large number of cases in which unsigned arithmetic
>>is needed.  
>>Now how much additional silicon would it take to provide both signed
>>and unsigned?  

>                                          Do not presume that because
>unsigned & signed multiply are both multiplies, that including both
>is cheap. It may not be the case. Boothe style mults. are implicitly
>signed, for example. To pretend they aren't mean multiplying two
>33 bit numbers.

No, for a 32x32=>32 (or nxn=>n) you need do nothing at all. For
32x32=>64 (nxn=>2n) you need the (logical) extended bits to be
copies of the sign bit (signed) or copies of 0 (unsigned).
Since accommodating the sign is what requires hardware, multiplexing
whatever is done with a force to zero should take only a few gates.
In either case, of course, you need some extra logic for the status
bits (overflow, particularly), if you keep status bits.
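
[Editorial sketch, not Golding's: the nxn=>n half of this is easy to check
from C, since signed and unsigned multiplication agree in the low n bits,
which is why the truncating multiply needs no separate signed and unsigned
forms.]

#include <stdio.h>

int
main(void)
{
	unsigned long ua = 0xfffffffdUL;	/* bit pattern of -3 */
	unsigned long ub = 7;
	long sa = -3, sb = 7;

	/* both print ffffffeb: the low 32 bits of the product are the same
	 * whether the operands are treated as signed or unsigned */
	printf("unsigned: %08lx\n", (ua * ub) & 0xffffffffUL);
	printf("signed:   %08lx\n", (unsigned long)(sa * sb) & 0xffffffffUL);
	return 0;
}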


>However, you are not a hardware designer. 

Are you a hardware designer, and if so, and if you think I'm an idiot
(which is entirely possible), could you email me a description (brief
if you prefer) of a signed multiply algorithm that can't easily
handle unsigned values? I'm always eager to learn, but the rest of
these people don't need to be annoyed. You can still flame me on the
net, of course.


Martin Golding    | sync, sync, sync, sank ... sunk:
Dod #0236         |  He who steals my code steals trash.
A poor old decrepit Pick programmer. Sympathize at:
{mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp