jbs@WATSON.IBM.COM (05/20/91)
Herman Rubin writes: Fixed point arithmetic is little used now because the hardware to support it reasonably well does not exist. It is worse than the floating problems before hardware floating arithmetic, especially if floating is automatically normalized. THAT feature of "modern" architectures is, in my opinion, a sheer horror. I think fixed point arithmetic is little used now because the vast majority of users find it harder to use than floating point with no comp- ensating advantagres. Does anyone seriously believe if a few instructions were added to provide hardware support (btw what is missing? In what way has fixed point support deteriorated?) fixed point usage would increase significantly? Regarding unnormalized floating point what is this good for besides simulating multiple precision integer arithmetic? Herman Rubin also writes: In the early FP computers, much function calculation was done in fixed point, to get increased accuracy at little cost. This only makes sense when the floating point fraction length is less than a full integer. With 64-bit floating point there is little need for increased accuracy in any case. Herman Rubin also writes: How do you expect users who do not even know of the existence of the operations to use them? I expect the compiler to generate the instructions for them. If the compiler won't generate an instruction this is a strong reason for not having it in the instruction set. Herman Rubin also writes: There are many more algorithms than are in the philosophy of software, and especially hardware, designers. The argument here seems to be that there are numerous algorithms which are not used at all today which suddenly would be used all over the place if only a few changes were made to the instruction set. I don't agree. If two algorithms have similar performance both will be in use as there will be some problems which are particually suited to one or the other. If an algorithm is not used at all on today's machines this means it is not competive even on those problems for which it is particually suited. Making changes to the instruction set is unlikely to drastical- ly change the relative performance of two algorithms, hence is unlikely to drastically change the amount of usage each gets. I believe it is perfectly sensible for hardware designers to concentrate on speeding up the existing mix of applications. James B. Shearer
mac@eleazar.dartmouth.edu (Alex Colvin) (05/20/91)
>Fixed point arithmetic is little used now because the hardware to support >it reasonably well does not exist. Would anyone who works with DSPs comment on this? They seem to be built for fixed point and fractional arithmetic, with the ALU connected to the registers through shifters.
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/20/91)
Subject: Re: new instructions Newsgroups: comp.arch References: <9105200213.AA05095@ucbvax.Berkeley.EDU> In article <9105200213.AA05095@ucbvax.Berkeley.EDU>, James B. Shearer (jbs@WATSON.IBM.COM) writes: > Herman Rubin writes: > Fixed point arithmetic is little used now because the hardware to support > it reasonably well does not exist. It is worse than the floating problems > before hardware floating arithmetic, especially if floating is automatically > normalized. THAT feature of "modern" architectures is, in my opinion, a sheer > horror. > > I think fixed point arithmetic is little used now because the vast > majority of users find it harder to use than floating point with no comp- > ensating advantagres. Does anyone seriously believe if a few instructions > were added to provide hardware support (btw what is missing? In what way > has fixed point support deteriorated?) fixed point usage would increase > significantly? Regarding unnormalized floating point what is this good > for besides simulating multiple precision integer arithmetic? The last question first: How about multiple precision floating (or fixed) arithmetic? Considering that there are quite a few papers on this, it is certainly a topic of interest. I do not believe it should be necessary here to go into the full range of situations I can list NOW where this would be useful. Now what is the benefit of allowing only normalized floating point? It eliminates the need for a normalization option in floating instructions, and it provides ONE more bit of accuracy. Is that ONE bit exactly what is needed? This is very unlikely. Now what is the cost of not having forced normalization, besides the one bit? There would have to be a method for indicating which result of the operation is wanted (upper, lower, normalized). There would be little additional hardware other than the decoding, by the floating unit, of this information. There are algorithms which benefit from using both Boolean and arithmetic operations on either fixed point or floats. These are not even readily available on many of the newer machines. > Herman Rubin also writes: > In the early FP computers, much function calculation was done in fixed point, > to get increased accuracy at little cost. > > This only makes sense when the floating point fraction length is > less than a full integer. With 64-bit floating point there is little need > for increased accuracy in any case. So why not have increased integer accuracy? It is no harder to do this, and the same units can be used. To someone with a mathematical outlook, the distinction is not integer/float but short integer/good arithmetic. There are algorithms which call, at some stage at least, for fixed point arithmetic. Infinite precision (no mistake here) methods of generating non-uniform random numbers tend to be of this type. Converting the fixed point results to floating can be a major problem, as 0 is a possible value, again a real problem only with automatic normalization. > Herman Rubin also writes: > How do you expect users who do not even know of the existence of the operations > to use them? > > I expect the compiler to generate the instructions for them. If > the compiler won't generate an instruction this is a strong reason for > not having it in the instruction set. The chicken and the egg again. Anyone who is willing to say that something is not useful is either ignorant, arrogant, or stupid. Nobody can, or should, attempt to ever do this. Even the best people can make big mistakes. This assumes the language designers and hardware designers perceive the need for the instructions. This is clearly not the case. An example is the attempt to add integer multiplication to the CDC 6x00 series. The fact that floating multiplication automatically normalized the product when both factors were normalized made the operation far less useful than intended. There was too much already in the hardware to correct this. How many current languages have user-definable operations? That the designers of C did not think of multiple precision arithmetic, or quotient and remainder, or exponentiation as an operation and not a function, etc., does not mean that these should have been omitted. There are provisions for octal and hex integers, but none for explicitly writing floats or fixed-point numbers to those bases. Even at my age, I am quite capable of operating entirely binary even if there is not a good reason for it, and often there is. > Herman Rubin also writes: > There are many more algorithms than are in the philosophy of software, and > especially hardware, designers. > The argument here seems to be that there are numerous algorithms > which are not used at all today which suddenly would be used all over the > place if only a few changes were made to the instruction set. I don't > agree. If two algorithms have similar performance both will be in use as > there will be some problems which are particually suited to one or the > other. If an algorithm is not used at all on today's machines this means > it is not competive even on those problems for which it is particually > suited. Making changes to the instruction set is unlikely to drastical- > ly change the relative performance of two algorithms, hence is unlikely > to drastically change the amount of usage each gets. I believe it is > perfectly sensible for hardware designers to concentrate on speeding up > the existing mix of applications. Like the infamous frexp function in C? For those not familiar with it, frexp(x,&n) took a floating number x and transformed it to y*2^k, where .5 <= |y| < 1, and the value of k is stored as n. Instead of completely inlining this operation, which would have made the whole thing simple and suggested the obvious alternate form y,n = frexp(x) which also assigns the values to registers or memory as wanted, and avoids a clumsy function call, they did the other. The 4.2BSD library even used a machine independent algorithm, which needless to say was frequently slower by a factor of more than 100. It even went into an infinite loop in 0. Of course the work has to be done in machine-dependent code, and in the integer registers, if they are separate. Except for ALGOL and possibly COMMON LISP, I know of no language designers who even made a real attempt to put in the variety needed. Neither group really succeeded. On vector and parallel processors, things which are of essentially no importance on scalar machines suddenly become important. On modern machines, even local transfers can be costly, and context switches more so. On parallel machines, how does one handle conditional transfers and conditional calls? They are deadly, and if one has 2^14 units (already in existence), a one-in-a-million condition becomes one-in-62. If the condition calls for major work, say 100,000 cycles, and I have examples of this, this can occupy most of the time. There are single operations, which deviate from the parallel idea, but are similar to those already in use there, which can alleviate the mess. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
nather@ut-emx.uucp (Ed Nather) (05/21/91)
In article <1991May20.121708.5023@dartvax.dartmouth.edu>, mac@eleazar.dartmouth.edu (Alex Colvin) writes: > >Fixed point arithmetic is little used now because the hardware to support > >it reasonably well does not exist. > > Would anyone who works with DSPs comment on this? They seem to be built for > fixed point and fractional arithmetic, with the ALU connected to the registers > through shifters. For people who use computers for real-time data acquisition and control, and there are a lot of them, floating point operations are very seldom used, if at all. Floating point is valuable for number-crunching jobs, but running stepping motors or adjudicating multiple interrupts doesn't need it, and really can't use it. People use computers for many different jobs, and it's not surprising that different jobs need different architectures. What's sort of amazing is how well our "general purpose" computer architectures do so many different things quite well indeed. -- Ed Nather Astronomy Dept, U of Texas @ Austin
dik@cwi.nl (Dik T. Winter) (05/21/91)
In article <12526@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: In reply to: > In article <9105200213.AA05095@ucbvax.Berkeley.EDU>, James B. Shearer > (jbs@WATSON.IBM.COM) writes: Who again replies to: > > Herman Rubin writes: etc. I hope I get everything right. JBS: > > Regarding unnormalized floating point what is this good > > for besides simulating multiple precision integer arithmetic? I do not find it even good for that either. > > The last question first: How about multiple precision floating (or fixed) > arithmetic? Considering that there are quite a few papers on this, it is > certainly a topic of interest. I do not believe it should be necessary here > to go into the full range of situations I can list NOW where this would be > useful. Considering that the oldest paper I know about multiple precision FP is from T.J.Dekker, and as it only considers normalizing FP (eh. standardizing as on the Electrologica X8, but he does not use it), I find this answer surprizing. > > Now what is the benefit of allowing only normalized floating point? It > eliminates the need for a normalization option in floating instructions, > and it provides ONE more bit of accuracy. Is that ONE bit exactly what > is needed? This is very unlikely. Nope. That bit is not exactly what is needed, but it came in as a surplus bit, so why not use it? For multi-precision FP arithmetic unnormalized arithmetic is not needed, as for multi-precision integer arithmetic. > > Now what is the cost of not having forced normalization, besides the one > bit? There would have to be a method for indicating which result of the > operation is wanted (upper, lower, normalized). There would be little > additional hardware other than the decoding, by the floating unit, of > this information. This is the 205 way, but the 205 implemented it awful (and the reason is just that: cost). On the 205 you could specify for an add that one of the three results was returned. But the normalizing add was just that: it first took the upper 47 bits of the result, normalizing afterwards. This gives a loss in precision. > > There are algorithms which benefit from using both Boolean and arithmetic > operations on either fixed point or floats. These are not even readily > available on many of the newer machines. Just the same as on other machines. If current machines do not implement a specific round float to integer instruction, one can add a simple constant to a float to get the integer part in a specific portion of the register. Again I ask, how much is it used? You say table look up for some elementary functions. My reply is, maybe on scalar machines, but if you do that on vector machines, expect a big loss in performance. So table look up is not in general the most efficient way. (Oh, btw, the same may hold for scalar machines because of cache misses etc.) > > > Herman Rubin also writes: > > In the early FP computers, much function calculation was done in fixed point, > > to get increased accuracy at little cost. Eh? I always thought it was because of speed (FP was slow, fixed point was fast). At least on the Electrologica X8 (1964 vintage) no calculation in those functions was done in fixed point. And yes, fixed point was much faster than floating point. Some timings (in milliseconds): fixed FP add 5.0 8.75-18.75 mul 8.75-40.0 12.5-68.75 (for some reasons I have a manual here at home!) As you see, especially fixed point add (and sub) were much faster. > > > So why not have increased integer accuracy? It is no harder to do this, and > the same units can be used. To someone with a mathematical outlook, the > distinction is not integer/float but short integer/good arithmetic. But you are not advocating 'good' arithmetic. At least from the viewpoint of many people. And why do you think that 64 bit integer arithmetic is not much harder? If we consider the time it took for micros to go from 8 to 16 and again to 32 bit integer arithmetic.... > > Converting the fixed > point results to floating can be a major problem, as 0 is a possible value, > again a real problem only with automatic normalization. I do not understand thgis at all. If the hardware does not provide instructions to convert fixed point results to floating points, only 2 instructions are required to do it. That 0 is a possible value is no problem at all. But I think you mean 'converting fixed point results to floating by a single boolean operation', i.e. or'ing in the exponent. But in that case negative numbers are also a problem. > > This is clearly not the case. An example is the > attempt to add integer multiplication to the CDC 6x00 series. The fact > that floating multiplication automatically normalized the product when > both factors were normalized made the operation far less useful than > intended. There was too much already in the hardware to correct this. This is taking it backwards. 1. Integer multiplication was not added later, but designed in from the start. 2. The fact that multiplication automatically normalizes has nothing to do with the problems. 3. The fact that mutliplication assumes that apparently normalized fp numbers are indeed normalized fp numbers has all to do with it. But on the CDC 6x00 series you could get a 94 bit integer product from two 47 bit integer factors; although it was undocumented. On the other hand, we see with the Cray (were normalization is done too late) that we can lose a lot of precision in the fp multiplication. > There are provisions for octal and hex > integers, but none for explicitly writing floats or fixed-point numbers to > those bases. Ada provides all this. However, upto now I have not seen many use of the possibility to write fp numbers in any than decimal base. To wit: 8#0.12345#e-4 means 0.12345 in octal times 8^-4. > Like the infamous frexp function in C? For those not familiar with it, > frexp(x,&n) took a floating number x and transformed it to y*2^k, where > .5 <= |y| < 1, and the value of k is stored as n. Not *took* Herman, it still *takes*. > Instead of completely > inlining this operation, which would have made the whole thing simple and > suggested the obvious alternate form > y,n = frexp(x) (Yup, *you* design the language where that is possible. Start looking at Mesa.) > which also assigns the values to registers or memory as wanted, and avoids > a clumsy function call, they did the other. The fact that it is formally declared as a function does not inhibit inlining, nor the assignement of values to registers. > The 4.2BSD library even used > a machine independent algorithm, which needless to say was frequently slower > by a factor of more than 100. It even went into an infinite loop in 0. Yes, yell at the language designers if the implementers goof it. > Of course the work has to be done in machine-dependent code, and in the > integer registers, if they are separate. Not at all. If the hardware completely implements IEEE, I want it in the FP registers! I get tired. Perhaps more tomorrow. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
bengtl@maths.lth.se (Bengt Larsson) (05/21/91)
In article <12526@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >The last question first: How about multiple precision floating (or fixed) >arithmetic? Considering that there are quite a few papers on this, it is >certainly a topic of interest. I do not believe it should be necessary here >to go into the full range of situations I can list NOW where this would be >useful. Please indulge. Where would this be useful? >So why not have increased integer accuracy? Yes, why not? But the important question is *why*? >There are algorithms which call, at some stage at least, for fixed point >arithmetic. Infinite precision (no mistake here) methods of generating >non-uniform random numbers tend to be of this type. Interesting progression. From "there are algorithms", to "generating non-uniform random numbers". That is 1 algorithm. >The chicken and the egg again. Anyone who is willing to say that something >is not useful is either ignorant, arrogant, or stupid. Nobody can, or should, >attempt to ever do this. You seem to have adopted a favorite metaphor. It doesn't apply here though. You can implement just about any algorithm you want to. If not, give counterexamples. As for ignorant, arrogant, or stupid, this may occasionally apply to others than the computer architects in the discussion. >Even the best people can make big mistakes. Well, name a favorite big mistake of your own doing, Mr Rubin :-) >How many current languages have user-definable operations? Practically all of them, assuming a function syntax. >There are provisions for octal and hex >integers, but none for explicitly writing floats or fixed-point numbers to >those bases. Ada. >Like the infamous frexp function in C? >The 4.2BSD library even used >a machine independent algorithm, which needless to say was frequently slower >by a factor of more than 100. Then the 4.2 BSD library could be rewritten, without changing the machine. >Of course the work has to be done in machine-dependent code, and in the >integer registers, if they are separate. Naturally. >On parallel machines, how does one handle conditional transfers and >conditional calls? They are deadly, and if one has 2^14 units (already >in existence), a one-in-a-million condition becomes one-in-62. If the >condition calls for major work, say 100,000 cycles, and I have examples >of this, this can occupy most of the time. There are single operations, >which deviate from the parallel idea, but are similar to those already >in use there, which can alleviate the mess. The programming of 2^14 parallelly working units using separate instruction streams is not a solved problem. I assume that a clarification of the operations which can alleviate the mess would be in order. Bengt Larsson. -- Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden Internet: bengtl@maths.lth.se SUNET: TYCHE::BENGT_L
clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/22/91)
In article <9105200213.AA05095@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes: > > > Herman Rubin also writes: >How do you expect users who do not even know of the existence of the operations >to use them? > > I expect the compiler to generate the instructions for them. If >the compiler won't generate an instruction this is a strong reason for >not having it in the instruction set. As a compiler writer in this RISCy age, I am inclined to agree with this statement. If we compiler writers, in our infinite brilliance, cannot figure out how to use an instruction, it doesn't belong in the instruction set. Before this becomes a mantra that is chanted mindlessly, let's look at an example that might make us think a little harder. On the VAX architecture, there are two common ways for a compiler to generate code for the integer remainder operation. Let's say that register r6 contains the value of variable "x", register r7 contains the value of variable "y", and we want register r11, currently unused, to contain the value of variable "z" (as determined by a register allocator.) All three variables are 32-bit integers (VAX "longword" types.) Registers r6 through r11 are the general purpose arithmetic registers; we can use others, such as r0 and r1, as scratch registers that are not live across a function call. We will see this scratch register usage in both examples below. Given the C source code statement: z = x % y; /* z gets the remainder of x divided by y */ We will see the "cc" compiler generate the following code: divl3 r7,r6,r0 /* divide r6/r7, put quotient in r0 */ mull2 r7,r0 /* multiply r7*r0, put product in r0 */ subl3 r0,r6,r11 /* subtract x - (x DIV y)*y; put in r11 */ So, the remainder is left in r11, the location for further uses of "z". Thus, we computed "z = x - ((x/y)*y);" where the division is integer division. The main alternative to the above is to use the nonorthogonal VAX instruction called "ediv". This takes a 64-bit (VAX "quadword") dividend and 32-bit divisor and divides them, producing a 32-bit quotient and a 32-bit remainder. Very CISCy 4-operand instruction; no equivalent exists for 32-bit dividend. We need to create a 64-bit dividend to use it; unfortunately, we tend to have our 32-bit quotient in a register surrounded by live registers. So we generate the 3-instruction sequence: movl r6,r1 /* Transfer quotient to r1 */ clrl r0 /* Zero out upper word to form 64-bit r0/r1 register pair quotient */ ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient into r2 and keep remainder in r11 */ Intuitively, it seems like a lot of work to use the extended division when you really don't need 64-bit division. But the first two operations are very simple register operations (and optimization might eliminate the second instruction, as r0 might already be cleared.) I was told that the "cc" compiler did things the first way because "ediv" is such a slow instruction. But I got curious and timed the alternatives myself: [from an old description of the experiment] : I then created a shell script that executes the programs 20 times each, for a total of 20 times 30,000 == 600,000 runs through the loop. I ran the script three times to judge the variance. Results: ................................................... So, looking at the time added by the computation with which we are concerned, we get about a 24% speedup when we only want the remainder, and a 27% speedup when we want the quotient and the remainder. [end excerpt] (I was referring to running the experiment two ways: producing the quotient AND remainder by two methods, and producing just the remainder by the two methods. The advantage of ediv is greater when use it to get two results in one instruction, as the first method needs to copy the quotient result somewhere before destroying it.) So, a 24% speedup over what the "cc" compiler produces on a VAX 8600 running BSD Unix 4.2. When I heard that hardware changes in the VAX 8600 might have obsoleted an instruction selection made by "cc" for the VAX 11/780, I did the test again on a VAX 11/750 running Ultrix. More than 20% improvement for ediv there, too. MORAL: Just because the compiler doesn't use it doesn't mean that the compiler SHOULDN'T have used it. Use many compilers for many languages and applications to measure instruction usage, and second-guess the compiler to ensure good conclusions. Question for RISC architects: When evaluating a large set of customer programs to find instruction frequencies, how do you avoid biasing results because of inefficient compilers? Even if you ran all C programs through "cc" and "gcc", and let's say "gcc" uses ediv for remaindering, the count for ediv is lower than it ought to be. "cc" will only use it for true 64/32 bit division. Perhaps we should be counting intermediate code statement frequencies to decide what higher level operations are important, and then deciding what lower level instructions are needed? ----------------------------------------------------------------------------- "The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence." E.W.Dijkstra, 18th June 1975. ||| clc5q@virginia.edu (Clark L. Coleman)
rh@craycos.com (Robert Herndon) (05/22/91)
In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, Clark Coleman writes: > As a compiler writer in this RISCy age, I am inclined to agree with this > statement. If we compiler writers, in our infinite brilliance, cannot > figure out how to use an instruction, it doesn't belong in the instruction > set. Before this becomes a mantra that is chanted mindlessly, let's look > at an example that might make us think a little harder. While I generally agree with his sentiment and his example, I am inclined to believe in at least one major exception. Not to mindlessly encourage H. L. Rubin, but population count and leading zero count on a word value are quite useful, yet (though I'm not a compiler writer) I can't think of easy methods for a (non-vector) compiler to use these instructions often. Yet I think they are undeniably useful. Many simple operations can be built easily and quickly in straight-line code with these instructions, and simple things like dispatching on a bit mask and converting octal mask bit constants to shift counts, etc. are much simplified with them. Examples of functions easily built using population_count(x) and leading_zeroes(x) include signum(x), abs(x), !a, a == b, a < b, a > b, ... trailing_zeroes(x), ceil(log2(x)), and more. At the same time, these operations are not that common in most codes, so I wouldn't expect large performance gains, except in programs that make heavy use of either assembly codes, compiler intrinsics, or library routines written in assembler or with intrinsics. Perhaps I am mistaken here, and you compiler writers can use these instructions to significant advantage. If so, I'd be curious to know how much these instructions typically save. Also, in this one case, I can sympathise with Mr. Rubin, because there isn't any elegant and portable way to express, for instance, leading_zeroes(x) in a language other than (say, in C) to make it a function or #define and define it according to the capabilities of the hardware/compiler at hand. Not terribly difficult, but it does seem to lack elegance, and not all compilers for machines that have these functions have such intrinsics or deal with them well. Still, IMHO that is something to take up with the compiler writers, rather than the language designers. I will note however, that I am very happy to see that many micros are beginning to include these instructions... Robert Herndon -- Robert Herndon -- not speaking officially for Cray Computer. Cray Computer Corporation 719/540-4240 1110 Bayfield Dr. rh@craycos.com Colorado Springs, CO 80906 Ignore these three words.
jlg@cochiti.lanl.gov (Jim Giles) (05/22/91)
In article <1991May23.084258.5062@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: |> [...] |> You can do pop count quite quickly with table lookup. Yes, you have to |> break it up, but, on a byte-addressable machine, it's not a major hassle. |> You will need a 256-byte table, of course, but that, also, isn't a major |> hassle. CLZ I can also do somewhat quickly with table lookup on a |> byte-addressable machine. Now, let's see. A pop count instruction is 4 clocks on an X/MP. _ONE_ memory access is 14 (assuming no bank conflicts). So, to do a pop count on just one word (64 bits) using a 256 element table would take 8 memory references (start them together so they pipeline - still 16 clocks to issue all those references), plus the time taken to break the word into bytes, plus the time to add the results of the loads. I'd guess a minimum of 40 clocks total. Does this count as "quite quickly"? Leading zero count is a 3 clock instruction. I doubt that could be implemented in fewer clocks than the pop count if you did it with a table. Now, slower CPUs (for which the memory would seem _relatively_ faster) would take fewer clocks to load the table entries. If the table _happened_ to be in the cache (for machines that _have_ a cache) the memory fetches are sometimes only a couple of clocks. Even so, on such a slower machine, the hardware pop count and lead zero instructions could also be implemented as _relatively_ faster instructions - one clock each wouldn't be a surprise. It would amaze me to find any machine (on which the test could be done) where a table lookup came within an order of magnitude of a hardware instruction on these functions. J. Giles
jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)
In article <1991May23.192557.7558@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: |> [...] How long would |> |> char *byte = (char *)&word; |> pop_count = table[byte[0]] + table[byte[1]] + table[byte[2]] + |> table[byte[3]]; |> |> take on a machine with somewhat better memory accesses? Say, an R6000, or |> even a Sparc? First, this code only does a pop count on a 32 bit object, not 64. Second, I mentioned this case on my last posting: I would bet that an implementation of pop count as a hardware instruction on either of these machines (using the technology they were built with) would be _ONE_ clock long. The above sequence takes in excess of 10. |> And don't forget that, for serial code, the R6000 is faster than the Cray. |> So that doesn't quite count as a "slow machine," does it? What is the relevance of a comparison of the R6000 to the Cray in the context of this discussion? The issue is whether pop count can be performed as quickly in software as in hardware. This is an issue to be decided on the basis of each machine individually. The R6000 would be even _faster_ if pop count were a hardware instruction!! J. Giles
firth@sei.cmu.edu (Robert Firth) (05/23/91)
In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes: >Given the C source code statement: > z = x % y; /* z gets the remainder of x divided by y */ >... we generate >the 3-instruction sequence: > > movl r6,r1 /* Transfer quotient to r1 */ > clrl r0 /* Zero out upper word to form 64-bit r0/r1 > register pair quotient */ > ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient > into r2 and keep remainder in r11 */ I hope not. From the previous code fragment, it is clear you are expecting the remainder from SIGNED division. If you want the same answer as before, the code must be MOVL R6,R1 ; construct the sign-extended 64-bit ... ASHQ #-32,R0,R0 ; dividend in the register pair <R0,R1> EDIV ... as before You might like to time THAT sequence, and rethink your post. Or you could take my word for it, that when you include the cost of having to reserve and target into an even-odd register pair, the EDIV is almost always slower.
sef@kithrup.COM (Sean Eric Fagan) (05/23/91)
In article <1991May22.001620.751@craycos.com> rh@craycos.com (Robert Herndon) writes: >but population count and leading zero count >on a word value are quite useful, yet (though I'm not a compiler writer) >I can't think of easy methods for a (non-vector) compiler to use these >instructions often. You can do pop count quite quickly with table lookup. Yes, you have to break it up, but, on a byte-addressable machine, it's not a major hassle. You will need a 256-byte table, of course, but that, also, isn't a major hassle. CLZ I can also do somewhat quickly with table lookup on a byte-addressable machine. -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
peter@ficc.ferranti.com (peter da silva) (05/23/91)
In article <24216@lanl.gov>, jlg@cochiti.lanl.gov (Jim Giles) writes: > Leading zero count is a 3 clock instruction. I doubt that could be > implemented in fewer clocks than the pop count if you did it with a > table. I suspect it could. You would only have to do a table-lookup on the first non-zero byte. You could even use binary search techniques to reduce the number of compares with zero. But the question isn't whether you can do these operations in software anywhere near as fast as they can be done in hardware, but rather whether they can be done fast enough. That is, how much faster would applications that make heavy use of these instructions run if you sped them up by a factor of 10 (which is about what I would have guessed the speed difference to be)? If you speed up 3% of the program by a factor of 10, it's pretty much lost in the noise. -- Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180; Sugar Land, TX 77487-5012; `-_-' "Have you hugged your wolf, today?"
jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)
In article <1991May23.220143.8515@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: |> [...] |> It would? John, would you care to comment on that? It seems that MIPS |> engineers were derelict in their job, and didn't realize that adding |> hardware pop-count would speed up everything else the chip does. As they would have said on the old Perry Mason show: incompetent, irrelevant, and immaterial. I never said that pop count would speed the rest of the chip's functions - so your comment has _absolutely_ _no_relevance_ to the discussion at all. Pop count _would_ speed the machine by allowing the operation to be done in hardware instead of your slow table lookup. This would speed the very benchmarks you were bragging about in your previous post (the R6000 is faster than Cray on scalar - sometimes this is even true). J. Giles
jlg@cochiti.lanl.gov (Jim Giles) (05/23/91)
In article <QAIBQ2@xds13.ferranti.com>, peter@ficc.ferranti.com (peter da silva) writes: |> [...] |> But the question isn't whether you can do these operations in software |> anywhere near as fast as they can be done in hardware, but rather whether |> they can be done fast enough. That is, how much faster would applications |> that make heavy use of these instructions run if you sped them up by a |> factor of 10 (which is about what I would have guessed the speed difference |> to be)? If you speed up 3% of the program by a factor of 10, it's pretty |> much lost in the noise. Exactly so. The only thing I was saying is that the software mechanisms are _not_ done "quite quickly" compared to a hardware implementation of the same thing. Whether is is worthwhile to put an instruction into the hardware is dependent on two things: 1) the commonality of the operation, and 2) the amount by which the hardware is faster. Sean Fagan was claiming the the speedup was insignificant - it isn't. Now, if you want to claim the instruction is rarely needed, that's a completely different issue. It also depends on two things: a) what is the application domain, and b) what features _are_ present in the hardware (individual instructions cannot be selected in isolation from the rest of the instruction set). J. Giles
sef@kithrup.COM (Sean Eric Fagan) (05/24/91)
In article <24216@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes: >It would amaze me to find any machine (on which the test could be done) >where a table lookup came within an order of magnitude of a hardware >instruction on these functions. How about a Cyber? A Cyber, without the pop-count hardware, takes something like 60 cycles to do a popcount. And the Cray has lousy memory-access times, and isn't a byte-addressable machine. How long would char *byte = (char *)&word; pop_count = table[byte[0]] + table[byte[1]] + table[byte[2]] + table[byte[3]]; take on a machine with somewhat better memory accesses? Say, an R6000, or even a Sparc? And don't forget that, for serial code, the R6000 is faster than the Cray. So that doesn't quite count as a "slow machine," does it? So here is a way of doing pop-count, quite quickly (it's possible for a compiler to put the byte[x] into registers and not have to access memory, the first reference to table could put quite a bit of table into a cache, and if you have pipelined loads, it *does* go *very* quickly), that doesn't require any special instructions. And will work the same way, if not faster, on later versions of the processor. This is not true with instructions that don't get a lot of use: witness the 68040 and transcendental instructions. -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
jimb@shograf.com (Jim battle) (05/24/91)
One place I worked used to ask job applicants to write a program to count the number of 1's in an integer. Everyone pretty much did the same thing; see if the number is odd, tally, and shift. One person came up with an ingenious method that does not involve tables or conditionals. Here is his solution, given in pseudo assembly language; it assumes 32-bit integers, although it would work for any word size with little modification: Given: r1 is a 32-bit integer with an unknown number of 1's set. move #n,r1 /* consider r1 to have 32 1-bit sums */ move r1,r2 /* destinations are on the right */ shr #1,r2 and #55555555,r1 and #55555555,r2 add r2,r1 /* now r1 has 16 two-bit sums; b0+b1, b2+b3, etc */ move r1,r2 shr #2,r2 and #33333333,r1 and #33333333,r2 add r2,r1 /* now r1 has 8 four-bit sums; b0+b1+b2+b3, etc */ ... move r1,r2 shr #16,r2 and #0000ffff,r1 and #0000ffff,r2 add r2,r1 /* now r1 has the total of all 1 bits */ for a 64-bit number, just one more step is needed. maybe not useful, but I thought I'd mention it. As far as I know, this is an original approach; credit goes to Richard Poppen. I believe he works at Etak.
clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/24/91)
In article <25874@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: >In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU> clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes: > >>Given the C source code statement: > >> z = x % y; /* z gets the remainder of x divided by y */ > >>... we generate >>the 3-instruction sequence: >> >> movl r6,r1 /* Transfer quotient to r1 */ >> clrl r0 /* Zero out upper word to form 64-bit r0/r1 >> register pair quotient */ >> ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient >> into r2 and keep remainder in r11 */ > >I hope not. From the previous code fragment, it is clear you are >expecting the remainder from SIGNED division. If you want the same >answer as before, the code must be > > MOVL R6,R1 ; construct the sign-extended 64-bit ... > ASHQ #-32,R0,R0 ; dividend in the register pair <R0,R1> > EDIV ... as before > Thanks for pointing out my error. I looked into the VAX Architecture Handbook and it seems that you are trying to get "ASHL #-32,R1,R0" in your second statement. "ASHQ #-32,R0,R0" takes a heck of a long time and gives the wrong answer. "ASHL #-32,R1,R0" takes the sign bit of R1 and fills R0 with it. Unfortunately, this seems to be the best way to sign extend on the VAX. (The coercion instructions don't include CVTLQ == coerce longword to quadword, so the apparently slower pseudo-shift is the best we can do.) >You might like to time THAT sequence, and rethink your post. Or you >could take my word for it, that when you include the cost of having >to reserve and target into an even-odd register pair, the EDIV is >almost always slower. Well, I timed the new sequence, and a little over half of my speedup disappeared, but it is still faster by more than 10% compared to what "cc -O" does. As for register allocation issues, that is a complex subject on the VAX. Registers R6 through R11 are "allocable" general-purpose registers. When translating most source code statements, you can consider R0 through R5 available. (BTW, Robert, this little tutorial is not directed at you, but at those new to the VAX register set.) As long as you aren't doing weird assembly language instructions like CRC or POLY or the string instructions, R2 through R5 are not going to be trampled by anything. R0 and R1 are used for the return value of a function, so usage of them has to be temporary and not live across a function call. Really good register allocation will use R0 through R11 as much as is legal. Simpler and poorer register allocation will only use R6 through R11 except when a single intermediate-code statement is translated into multiple assembly language statements, and those statements need scratch registers that will be dead upon conclusion of the single intermediate code operation. Thus, my code above used R0 through R2 temporarily. The point here is that "cc -O" is doing the same thing. It generated a sequence of 3 instructions for the remainder operation, and used R0 as a scratch register. Thus, for simple and stupid register allocators, R0 and R1 are always available as a nice even-odd register pair for scratch usage. (Although the VAX does not care about even-odd pairs, so I am not sure why you mentioned them. A contiguous pair is all that is needed.) A smarter allocator might want to avoid using "ediv" for the remainder operation because of the need to reserve a pair of registers. (A REALLY smart code generator might look to see if a pair is available for scratch use, and generate the "ediv" code if it were, and the "cc -O" sequence otherwise. And the first instruction of my sequence is unnecessary if the next lower numbered register is unused at the moment; the ASHL and EDIV can be done in place.) The point still remains: "cc -O" produces less than optimal code that biases instruction count analysis of the architecture. I am still wondering how system architects handle this bias when determining the future path of the architecture. And how it affects the famous pronouncements about how CISCs all have umpteen never-used instructions and umpteen more rarely-used instructions. (BTW, I will check the timings again on the VAX 11/750. The speedup I confirmed was on the VAX 8600. If it is different on the VAX 11/750, that just points out that a code generator can get outdated and bias instruction counts. So the point remains the same.) ----------------------------------------------------------------------------- "The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offence." E.W.Dijkstra, 18th June 1975. ||| clc5q@virginia.edu (Clark L. Coleman)
sef@kithrup.COM (Sean Eric Fagan) (05/24/91)
In article <24263@lanl.gov> jlg@cochiti.lanl.gov (Jim Giles) writes: >The R6000 would >be even _faster_ if pop count were a hardware instruction!! It would? John, would you care to comment on that? It seems that MIPS engineers were derelict in their job, and didn't realize that adding hardware pop-count would speed up everything else the chip does. -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
bruce@contingency.think.com (Bruce Walker) (05/24/91)
As long as we are on the subject of bitcounting algorithms... One of my favorite toys is the fact that the expression "N & (N-1)" turns off the lowest order one-bit in N. Hence, it's a quick "find first", and can be made into a 4-cycle SPARC popcount loop which runs proportional to the number of ones (repeat N &= N-1 until N==0). I like the Poppen partial-sum algorithm, though. -- --Bruce Walker Thinking Machines Corporation, Cambridge, MA bruce@think.com; +1 617 234 4810; N1IKV
christer@cs.umu.se (Christer Ericson) (05/24/91)
In article <1991May23.204624.22872@shograf.com> jimb@shograf.UUCP (Jim battle) writes: >One place I worked used to ask job applicants to write a program to count >the number of 1's in an integer. Everyone pretty much did the same thing; >see if the number is odd, tally, and shift. One person came up with an >ingenious method that does not involve tables or conditionals. Here is his >solution, given in pseudo assembly language; it assumes 32-bit integers, >although it would work for any word size with little modification: > > [ code deleted] > >As far as I know, this is an original approach; credit goes to Richard Poppen. >I believe he works at Etak. Nope, it isn't really. See Mike Morton's article in the December 1990 issue of Computer Language for numerous ways to count the 1s in an integer. | Christer Ericson Internet: christer@cs.umu.se | | Department of Computer Science, University of Umea, S-90187 UMEA, Sweden |
rabin@CS.Yale.Edu (Dan Rabin) (05/24/91)
In article <1991May23.204624.22872@shograf.com> jimb@shograf.com (Jim
battle) gives a technique (credited to Richard Poppen) for counting
the number of one-bits in a word.
For historical interest, a similar technique was proposed in
HAKMEM - Artificial Intelligence Memo No. 239
Massachusetts Institute of Technology A. I. Laboratory
M. Beeler, R. W. Gosper & R. Schroeppel, February 29,1972
The relevant excerpt is:
ITEM 25 (in order of one-ups-manship: Gosper, Mann, Lenard, [Root and
Mann]):
To count the ones in a PDP-6/10 word:
LDB B,[014300,,A] ;or MOVE B,A then LSH B,-1
AND B,[333333,,333333]
SUB A,B
LSH B,-1
AND B,[333333,,333333]
SUBB A,B ;each octal digit is replaced by number of 1's in it
LSH B,-3
ADD A,B
AND A,[070707,070707]
IDIVI A,77 ;casting out 63.'s
These ten instructions, with constants extended, would work on word
lengths up to 62.; eleven suffice up to 254..
This proposal uses an integer divide instruction to carry out the
summation; the remainder is the result of interest. This is
ingenious, but possibly disadvantageous on a RISC that lacks a divide
instruction. The same result could be obtained by a short sequence of
shifts and adds, I think.
-- Dan Rabin
rabin-dan@cs.yale.edu
craig@netcom.COM (Craig Hansen) (05/25/91)
The C code below uses the well-known doubling technique to count ones in a bit string of arbitrary length. #define UNSINT_BIT 32 #define LOG_UNSINT_Bit 5 typedef unsigned int UnsInt; typedef UnsInt *PToUnsInt; typedef PToUnsInt Bits; UnsInt bcount(Bits b, UnsInt size) { UnsInt count = 0; UnsInt words = (size+UNSINT_BIT-1)>>LOG_UNSINT_BIT; UnsInt m = ~((-2)<<(UNSINT_BIT-1 - ((words<<LOG_UNSINT_BIT)-size))); UnsInt i; for (i=0; i!=words; i++) { UnsInt v = b[i]; if (i==words-1) v &= m; #if UNSINT_BIT == 32 v = ((v>> 1)&0x55555555) + (v&0x55555555); v = ((v>> 2)&0x33333333) + (v&0x33333333); v = ((v>> 4)&0x0f0f0f0f) + (v&0x0f0f0f0f); v = ((v>> 8)&0x00ff00ff) + (v&0x00ff00ff); v = ((v>>16)&0x0000ffff) + (v&0x0000ffff); count += v; #else while (v != 0) { count++; v = v & (v-1); } #endif } return count; } With a little effort, this code can be further improved by doubling up words before reducing the word down to a single sum. This improvement takes advantage of the fact that an n-bit field holds sums of up to 2**n-1. After the "v = ((v>> 2)&0x33333333) + (v&0x33333333);" statement, you have sums in 4 bit fields that can hold sums from two words, and after the "v = ((v>> 4)..." statement you have sums in 8 bit fields that can hold sums from 32 words. I would also note that a branch-free "count leading zeros" sequence can be constructed from the sequence: v = v | (v>>16); v = v | (v>> 8); v = v | (v>> 4); v = v | (v>> 2); v = v | (v>> 1); ...followed by the population count sequence, and a subtraction of the result from the word size. Regards, Craig Hansen craig@microunity.com
torek@elf.ee.lbl.gov (Chris Torek) (05/25/91)
In article <1991May24.122453.11011@cs.umu.se> christer@cs.umu.se (Christer Ericson) writes: >... See Mike Morton's article in the December 1990 issue >of Computer Language for numerous ways to count the 1s in an integer. There was an interesting little battle on `fastest way to count 1s' in comp.lang.c some time ago, and I put together this program from other people's methods, plus a few variants of my own. The comments in front of a function are usually the originals, sometimes edited a bit by me. Note that the fastest count function is quite machine dependent. #ifndef lint static char rcsid[] = "$Id: bct.c,v 1.5 90/10/13 08:44:12 chris Exp $"; #endif /* * bct - bitcount timing * * Runs a bunch of different functions-to-count-bits-in-a-longword * (many assume 32 bits per long) with a timer around each loop, and * tries to calcuate the time used. */ #include <stdio.h> #include <sys/types.h> #ifdef FD_SETSIZE # define USE_GETRUSAGE # include <sys/time.h> # include <sys/resource.h> #else # include <sys/times.h> #endif #define SIZEOF(a) (sizeof(a)/sizeof(a[0])) /* * This function is used to calibrate the timing mechanism. * This way we can subtract the loop and call overheads. */ int calibrate(n) register unsigned long n; { return 0; } /* * This function counts the number of bits in a long. * It is limited to 63 bit longs, but a minor mod can cope with 511 bits. * * It is so magic, an explanation is required: * Consider a 3 bit number as being * 4a+2b+c * if we shift it right 1 bit, we have * 2a+b * subtracting this from the original gives * 2a+b+c * if we shift the original 2 bits right we get * a * and so with another subtraction we have * a+b+c * which is the number of bits in the original number. * Suitable masking allows the sums of the octal digits in a * 32 bit number to appear in each octal digit. This isn't much help * unless we can get all of them summed together. * This can be done by modulo arithmetic (sum the digits in a number by molulo * the base of the number minus one) the old "casting out nines" trick they * taught in school before calculators were invented. * Now, using mod 7 wont help us, because our number will very likely have * more than 7 bits set. So add the octal digits together to get base64 * digits, and use modulo 63. * (Those of you with 64 bit machines need to add 3 octal digits together to * get base512 digits, and use mod 511.) * * This is HACKMEM 169, as used in X11 sources. */ int t0_hackmemmod(n) register unsigned long n; { register unsigned long tmp; tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111); return ((tmp + (tmp >> 3)) & 030707070707) % 63; } /* * This is the same as the above, but does not use the % operator. * Most modern machines have clockless division, and so the modulo is as * expensive as, say, an addition. */ int t1_hackmemloop(n) register unsigned long n; { register unsigned long tmp; tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111); tmp = (tmp + (tmp >> 3)) & 030707070707; while (tmp > 63) tmp = (tmp & 63) + (tmp >> 6); return tmp; } /* * Above, without using while loop. It takes at most 5 iterations of the * loop, so we just do all 5 in-line. The final result is never 63 * (this is assumed above as well). */ int t2_hackmemunrolled(n) register unsigned long n; { register unsigned long tmp; tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111); tmp = (tmp + (tmp >> 3)) & 030707070707; tmp = (tmp & 63) + (tmp >> 6); tmp = (tmp & 63) + (tmp >> 6); tmp = (tmp & 63) + (tmp >> 6); tmp = (tmp & 63) + (tmp >> 6); tmp = (tmp & 63) + (tmp >> 6); return (tmp); } /* * This function counts the bits in a long. * * It removes the lsb and counting the number of times round the loop. * The expression (n & -n) yields the lsb of a number, * but it only works on 2's compliment machines. */ int t3_rmlsbsub(n) register unsigned long n; { register int count; for (count = 0; n; n -= (n & -n)) count++; return count; } int t4_rmlsbmask(n) register unsigned long n; { register int count; for (count = 0; n; count++) n &= n - 1; /* take away lsb */ return (count); } /* * This function counts the bits in a long. * * It works by shifting the number down and testing the bottom bit. */ int t5_testlsb(n) register unsigned long n; { register int count; for (count = 0; n; n >>= 1) if (n & 1) count++; return count; } /* * This function counts the bits in a long. * * It works by shifting the number left and testing the top bit. * On many machines shift is expensive, so it uses a cheap addition instead. */ int t6_testmsb(n) register unsigned long n; { register int count; for (count = 0; n; n += n) if (n & ~(~(unsigned long)0 >> 1)) count++; return count; } int t7_testsignandshift(n) register unsigned long n; { register int count; for (count = 0; n; n <<= 1) if ((long)n < 0) count++; return (count); } /* * This function counts the bits in a long. * * It works by masking each bit. * This is the second most intuitively obvious method, * and is independent of the number of bits in the long. */ int t8_testeachbit(n) register unsigned long n; { register int count; register unsigned long mask; count = 0; for (mask = 1; mask; mask += mask) if (n & mask) count++; return count; } /* * This function counts the bits in a long. * * It works by masking each bit. * This is the most intuitively obvious method, * but how do you a priori know how many bits in the long? * (except for ''sizeof(long) * CHAR_BITS'' expression) */ int t9_testeachbit1shl(n) register unsigned long n; { register int count; register int bit; count = 0; for (bit = 0; bit < 32; ++bit) if (n & ((unsigned long)1 << bit)) count++; return count; } static char nbits[256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, }; static int inbits[256] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, }; int tA_tableshift(n) register unsigned long n; { return (nbits[n & 0xff] + nbits[(n >> 8) & 0xff] + nbits[(n >> 16) & 0xff] + nbits[n >> 24]); } int tB_tableuchar(n) unsigned long n; { register unsigned char *p = (unsigned char *)&n; return (nbits[p[0]] + nbits[p[1]] + nbits[p[2]] + nbits[p[3]]); } int tC_tableshiftcast(n) register unsigned long n; { return nbits[(unsigned char) n] + nbits[(unsigned char) (n >> 8)] + nbits[(unsigned char) (n >> 16)] + nbits[(unsigned char) (n >> 24)]; } int tD_itableshift(n) register unsigned long n; { return (inbits[n & 0xff] + inbits[(n >> 8) & 0xff] + inbits[(n >> 16) & 0xff] + inbits[n >> 24]); } int tE_itableuchar(n) unsigned long n; { register unsigned char *p = (unsigned char *)&n; return (inbits[p[0]] + inbits[p[1]] + inbits[p[2]] + inbits[p[3]]); } int tF_itableshiftcast(n) register unsigned long n; { return inbits[(unsigned char) n] + inbits[(unsigned char) (n >> 8)] + inbits[(unsigned char) (n >> 16)] + inbits[(unsigned char) (n >> 24)]; } /* * Explanation: * First we add 32 1-bit fields to get 16 2-bit fields. * Each 2-bit field is one of 00, 01, or 10 (binary). * We then add all the two-bit fields to get 8 4-bit fields. * These are all one of 0000, 0001, 0010, 0011, or 0100. * * Now we can do something different, because for the first * time the value in each k-bit field (k now being 4) is small * enough that adding two k-bit fields results in a value that * still fits in the k-bit field. The result is four 4-bit * fields containing one of {0000,0001,...,0111,1000} and four * more 4-bit fields containing junk (sums that are uninteresting). * Pictorially: * n = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh * n>>4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg * sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ * where W, X, Y, and Z are the interesting sums (each at most 1000, * or 8 decimal). Masking with 0x0f0f0f0f extracts these. * * Now we can change tactics yet again, because now we have: * n = 0000WWWW0000XXXX0000YYYY0000ZZZZ * n>>8 = 000000000000WWWW0000XXXX0000YYYY * so sum = 0000WWWW000ppppp000qqqqq000rrrrr * where p and r are the interesting sums (and each is at most * 10000, or 16 decimal). The sum `q' is junk, like i, j, and * k above; but it is not necessarry to discard it this time. * One more fold, this time by sixteen bits, gives * n = 0000WWWW000ppppp000qqqqq000rrrrr * n>>16 = 00000000000000000000WWWW000ppppp * so sum = 0000WWWW000ppppp000sssss00tttttt * where s is at most 11000 and t is it most 100000 (32 decimal). * * Now we have t = r+p = (Z+Y)+(X+W) = ((h+g)+(f+e))+((d+c)+(b+a)), * or in other words, t is the number of bits set in the original * 32-bit longword. So all we have to do is return the low byte * (or low 6 bits, but `low byte' is typically just as easy if not * easier). * * This technique is also applicable to 64 and 128 bit words, but * 256 bit or larger word sizes require at least one more masking * step. */ int tG_sumbits(n) register unsigned long n; { n = (n & 0x55555555) + ((n >> 1) & 0x55555555); n = (n & 0x33333333) + ((n >> 2) & 0x33333333); n = (n + (n >> 4)) & 0x0f0f0f0f; n += n >> 8; n += n >> 16; return (n & 0xff); } /* * build a long random number. * The standard rand() returns at least a 15 bit number. * We use the top 9 of 15 (since the lower N bits of the usual rand() * repeat with a period of 2^N). */ unsigned long bigrand() { #define randbits() ((unsigned long)((rand() >> 6) & 0777)) register int r; r = randbits() << 27; r |= randbits() << 18; r |= randbits() << 9; r |= randbits(); return (r); } /* * Run the test many times. * You will need a running time in the 10s of seconds * for an accurate result. * * To give a fair comparison, the random number generator * is seeded the same for each measurement. * * Return value is seconds per iteration. */ #ifndef REPEAT #if defined(mips) || defined(sparc) #define REPEAT (1L<<20) #else #define REPEAT (1L<<18) #endif #endif double measure(func) register int (*func)(); { #ifdef USE_GETRUSAGE struct rusage ru0, ru1; register long j; srand(1); (void) getrusage(RUSAGE_SELF, &ru0); for (j = 0; j < REPEAT; ++j) func(bigrand()); (void) getrusage(RUSAGE_SELF, &ru1); ru1.ru_utime.tv_sec -= ru0.ru_utime.tv_sec; if ((ru1.ru_utime.tv_usec -= ru0.ru_utime.tv_usec) < 0) { ru1.ru_utime.tv_usec += 1000000; ru1.ru_utime.tv_sec--; } return ((ru1.ru_utime.tv_sec + (ru1.ru_utime.tv_usec / 1000000.0)) / (double)REPEAT); #else register long j; struct tms start; struct tms finish; srand(1); times(&start); for (j = 0; j < REPEAT; ++j) func(bigrand()); times(&finish); return ((finish.tms_utime - start.tms_utime) / (double)REPEAT); #endif } struct table { char *name; int (*func)(); double measurement; int trial; }; struct table table[] = { { "hackmemmod", t0_hackmemmod }, { "hackmemloop", t1_hackmemloop }, { "hackmemunrolled", t2_hackmemunrolled }, { "rmlsbsub", t3_rmlsbsub }, { "rmlsbmask", t4_rmlsbmask }, { "testlsb", t5_testlsb }, { "testmsb", t6_testmsb }, { "testsignandshift", t7_testsignandshift }, { "testeachbit", t8_testeachbit }, { "testeachbit1shl", t9_testeachbit1shl }, { "tableshift", tA_tableshift }, { "tableuchar", tB_tableuchar }, { "tableshiftcast", tC_tableshiftcast }, { "itableshift", tD_itableshift }, { "itableuchar", tE_itableuchar }, { "itableshiftcast", tF_itableshiftcast }, { "sumbits", tG_sumbits }, }; main(argc, argv) int argc; char **argv; { double calibration, v, best; int j, k, m, verbose; verbose = argc > 1; /* * run a few tests to make sure they all agree */ srand(getpid()); for (j = 0; j < 10; ++j) { unsigned long n; int bad; n = bigrand(); for (k = 0; k < SIZEOF(table); ++k) table[k].trial = table[k].func(n); bad = 0; for (k = 0; k < SIZEOF(table); ++k) for (m = 0; m < SIZEOF(table); ++m) if (table[k].trial != table[m].trial) bad = 1; if (bad) { printf("wrong: %08lX", n); for (k = 0; k < SIZEOF(table); ++k) printf(" %3d", table[k].trial); printf("\n"); } } /* * calibrate the timing mechanism */ calibration = measure(calibrate); if (verbose) printf("calibration: %g\n", calibration); /* * time them all, keeping track of the best (smallest) */ for (j = 0; j < SIZEOF(table); ++j) { v = measure(table[j].func) - calibration; if (verbose) printf("%s: %g\n", table[j].name, v); table[j].measurement = v; if (!j || v < best) best = v; } (void) printf("%-24s %-14sratio\n", "function", "time"); for (j = 0; j < SIZEOF(table); ++j) { (void) printf("%-24s %#10.8g %6.3f\n", table[j].name, table[j].measurement, table[j].measurement / best); } exit(0); } -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
peter@ficc.ferranti.com (peter da silva) (05/25/91)
In article <1991May24.175323.9730@netcom.COM>, craig@netcom.COM (Craig Hansen) writes: > The C code below uses the well-known doubling technique > to count ones in a bit string of arbitrary length. Yow. How about combining it with Duff's Device to *really* make it confusing, and allow the bit string to start on an arbitrary byte boundary. -- Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180; Sugar Land, TX 77487-5012; `-_-' "Have you hugged your wolf, today?"
torek@elf.ee.lbl.gov (Chris Torek) (05/25/91)
In article <13555@dog.ee.lbl.gov> I wrote: >... and I put together this program from other people's methods, >plus a few variants of my own. I had forgotten, but have now remembered, that I based it on someone else's program. Unfortunately, I have forgotten who the Someone Else was, so one of the Teeming Millions will have to supply the Credit Where Credit Is Due Department with the answer to that one. I also have some more bitcount functions from Arthur Olson and others; one of these (ab)uses the C preprocessor to build a 65536-entry count table at compile time, so that 32-bit counts can be computed as tab[n & 0xffff] + tab[(unsigned long)n >> 16] which is often faster than any of the methods in my previous posting. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
root@apollo.HP.COM ( root) (05/25/91)
for generating hi-bit counts... maybe someone can provide a reference? From: gupta_d@apollo.HP.COM (Dipankar Gupta) Path: apollo!gupta_d Sorry for the vagueness of the post... Dipankar Dipankar Gupta Hewlett-Packard Bangalore, INDIA messiah@india.hp.com
gideony@microsoft.UUCP (Gideon YUVAL) (05/27/91)
In article <49278@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes: >For people who use computers for real-time data acquisition and control, >and there are a lot of them, floating point operations are very seldom >used, if at all. ... Obvious question: is FP avoided because it is the wrong thing, or is it avoided due to its cost (a 387 is a few hundred $) and slowness (memory-to-memory floating-add can be 100 cycles)? If anyone has data on this, I would be grateful -- Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)
herrickd@iccgcc.decnet.ab.com (05/30/91)
In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: [Herman supplies some of the specifics people have been asking for] > There are provisions for octal and hex > integers [in c], I thought so too, and one day I tried to write an integer constant in octal. The compiler said, "Nuts to you!" It took some hours, but I finally convinced myself that the compiler manual and then Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing else! I thought it was a major design flaw and was no accident. dan herrick herrickd@iccgcc.decnet.ab.com
herrickd@iccgcc.decnet.ab.com (05/30/91)
Aren't you afraid you will spoil the argument by bringing some facts into it? In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes: [most of his wonderful data omitted - see the original article] > > Given the C source code statement: > > z = x % y; /* z gets the remainder of x divided by y */ > > > movl r6,r1 /* Transfer quotient to r1 */ > clrl r0 /* Zero out upper word to form 64-bit r0/r1 > register pair quotient */ > ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient > into r2 and keep remainder in r11 */ > This looks to me like it does only unsigned div/rem. Is that a feature of the ediv instruction or do we need a way to sign extend r1 into r0 for the signed division case? dan herrick herrickd@iccgcc.decnet.ab.com
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/30/91)
In article <4712.2843b438@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com writes: > Aren't you afraid you will spoil the argument by bringing some > facts into it? > > In article <1991May21.191034.25980@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes: > [most of his wonderful data omitted - see the original article] > > > Given the C source code statement: > > z = x % y; /* z gets the remainder of x divided by y */ > > movl r6,r1 /* Transfer quotient to r1 */ > > clrl r0 /* Zero out upper word to form 64-bit r0/r1 > > register pair quotient */ > > ediv r7,r0,r2,r11 /* Divide r0-r1 pair by r7; throw away quotient > > into r2 and keep remainder in r11 */ > This looks to me like it does only unsigned div/rem. Is that a feature > of the ediv instruction or do we need a way to sign extend r1 into r0 > for the signed division case? This points out another reason for cheap instructions at little cost. There are a rather large number of cases in which unsigned arithmetic is needed. In fact, the EMUL description in the VAX manual even indicates that, for multiple precision arithmetic, this is quite definitely the case. Now how much additional silicon would it take to provide both signed and unsigned? The VAX does have instructions to sign extend 8 and 16 bit units to 16 and 32; why not to extend 32 to 64? These are cheap instructions. Of course, if it is fast enough, a shift of -32 would sign-extend a 32 to a 64. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
dab@ubitrex.mb.ca (Danny Boulet) (05/30/91)
In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes: >In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >[Herman supplies some of the specifics people have been asking for] > >> There are provisions for octal and hex >> integers [in c], > >I thought so too, and one day I tried to write an integer constant >in octal. The compiler said, "Nuts to you!" It took some hours, >but I finally convinced myself that the compiler manual and then >Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing >else! I thought it was a major design flaw and was no accident. > >dan herrick >herrickd@iccgcc.decnet.ab.com Octal constants are written with the first digit zero. The following program main() { printf("test number is %d\n",01234567); printf("reverse is 0%o\n",342391); exit(0); } produced the following output using cc on my Sun Sparc IPC running SunOS 4.1.1 and should produce identical output on any 32-bit int implementation of C: test number is 342391 reverse is 01234567
richard@aiai.ed.ac.uk (Richard Tobin) (05/30/91)
In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes: >> There are provisions for octal and hex >> integers [in c], >I thought so too, and one day I tried to write an integer constant >in octal. The compiler said, "Nuts to you!" It took some hours, >but I finally convinced myself that the compiler manual and then >Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing >else! I thought it was a major design flaw and was no accident. But you were mistaken. Octal constants are written by putting a leading zero on the number (a horrid convention). Check out "octal constant" in the index of K&R (either edition). Example: printf("%d\n", 01234); -- Richard -- Richard Tobin, JANET: R.Tobin@uk.ac.ed AI Applications Institute, ARPA: R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk Edinburgh University. UUCP: ...!ukc!ed.ac.uk!R.Tobin
ted@nmsu.edu (Ted Dunning) (05/31/91)
In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes: > There are provisions for octal and hex > integers [in c], I thought so too, and one day I tried to write an integer constant in octal. The compiler said, "Nuts to you!" It took some hours, but I finally convinced myself that the compiler manual and then Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing else! I thought it was a major design flaw and was no accident. kythera% cat x.c #include <stdio.h> main() { printf("%d\n", 010); } kythera% x 8 kythera%
herrickd@iccgcc.decnet.ab.com (05/31/91)
In article <4711.2843a523@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com writes: > In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > [Herman supplies some of the specifics people have been asking for] > >> There are provisions for octal and hex >> integers [in c], [Never being slow with the foot around an open mouth, I said] > > I thought so too, and one day I tried to write an integer constant > in octal. The compiler said, "Nuts to you!" It took some hours, > but I finally convinced myself that the compiler manual and then > Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing > else! I thought it was a major design flaw and was no accident. > > dan herrick > herrickd@iccgcc.decnet.ab.com Several people have quietly pointed out that c recognizes constants with a leading zero as octal numerals. Even one from Edinburgh Castle (according to his domain name)! So now I know I was dumb that day several years ago - I honestly did work hard trying to find it in the indexes and so on, but I was fixated on the \377 notation, which probably blinded me to 0777. Thank you all. dan herrick
mac@gold.kpc.com (Mike McNamara) (05/31/91)
> In article <4711.2843a523@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes: > > In article <12526@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > [Herman supplies some of the specifics people have been asking for] > > > There are provisions for octal and hex > > integers [in c], > > I thought so too, and one day I tried to write an integer constant > in octal. The compiler said, "Nuts to you!" It took some hours, > but I finally convinced myself that the compiler manual and then > Kernigan and Ritchie provide octal notation for CHARACTERS. Nothing > else! I thought it was a major design flaw and was no accident. > > dan herrick > herrickd@iccgcc.decnet.ab.com What vendors compiler are you using? consider main(){ int a = 0xabcdef01; int b = 0123567; printf("A is %#x (%d)\n",a,a); printf("B is %#o (%d)\n",b,b); } Produces for me: A is 0xabcdef01 (-1412567295) B is 0123567 (42871) -- +-----------+-----------------------------------------------------------------+ |mac@kpc.com| Increasing Software complexity lets us sell Mainframes as | | | personal computers. Carry on, X windows/Postscript/emacs/CASE!! | +-----------+-----------------------------------------------------------------+
abaum (Allen Baum) (05/31/91)
In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >This points out another reason for cheap instructions at little cost. >There are a rather large number of cases in which unsigned arithmetic >is needed. In fact, the EMUL description in the VAX manual even >indicates that, for multiple precision arithmetic, this is quite >definitely the case. > >Now how much additional silicon would it take to provide both signed >and unsigned? I couldn't let this go by. Herman, I accept the fact that you know far more about numerical problems and what you need to speed them up. (Note that this is damning with faint praise. Its easy to know more than I do) However, you are not a hardware designer. Do not presume that because unsigned & signed multiply are both multiplies, that including both is cheap. It may not be the case. Boothe style mults. are implicitly signed, for example. To pretend they aren't mean multiplying two 33 bit numbers. This is just one example. There are others from your postings.
hrubin@pop.stat.purdue.edu (Herman Rubin) (06/01/91)
In article <182@armltd.uucp>, abaum (Allen Baum) writes: > In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: ..................... > I couldn't let this go by. > Herman, I accept the fact that you know far more about numerical > problems and what you need to speed them up. (Note that this is > damning with faint praise. Its easy to know more than I do) > However, you are not a hardware designer. Do not presume that because > unsigned & signed multiply are both multiplies, that including both > is cheap. It may not be the case. Boothe style mults. are implicitly > signed, for example. To pretend they aren't mean multiplying two > 33 bit numbers. > > This is just one example. There are others from your postings. That is one way of doing it. Another is to conditionally put in adds depending on sign. This is easily done in hardware. The point is that to make the greater flexibility cheap, it MUST be done in the design. For addition, I agree that 2's complement arithmetic is easier, but not for anything else. For multiprecision work, I have not seen any non-redundant procedure which makes any sense other than sign-magnitude. It is true that by using some trickery, which to be efficient would have to be hardware, that giving up one bit/word would enable the procedures without carry propagation to be used, but the hardware for that would be more complicated than allowing the choice. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
herrickd@iccgcc.decnet.ab.com (06/04/91)
In article <182@armltd.uucp>, abaum (Allen Baum) writes: > In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > [much omitted throughout the quotations] >>This points out another reason for cheap instructions at little cost. >> >>Now how much additional silicon would it take to provide both signed >>and unsigned? > > Do not presume that because > unsigned & signed multiply are both multiplies, that including both > is cheap. It may not be the case. Boothe style mults. are implicitly > signed, for example. To pretend they aren't mean multiplying two > 33 bit numbers. What fantastic advantage does one derive from using "Boothe style mults." that makes up for losing unsigned arithmetic. This sounds to me like the reason for not using them in a general purpose computer. dan herrick herrickd@iccgcc.decnet.ab.com
martin@adpplz.UUCP (Martin Golding) (06/05/91)
In <182@armltd.uucp> abaum (Allen Baum) writes: >In article <12947@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >>There are a rather large number of cases in which unsigned arithmetic >>is needed. >>Now how much additional silicon would it take to provide both signed >>and unsigned? > Do not presume that because >unsigned & signed multiply are both multiplies, that including both >is cheap. It may not be the case. Boothe style mults. are implicitly >signed, for example. To pretend they aren't mean multiplying two >33 bit numbers. No, for a 32x32=>32 (or nxn=>n) you need do nothing at all. For 32x32=>64 (nxn=>2n) you need the (logical) extended bits to be copies of the sign bit (signed) or copies of 0 (unsigned). Since accomodating the sign is what requires hardware, multiplexing whatever is done with a force to zero should take only a few gates. In either case, of course, you need some extra logic for the status bits (overflow, particularly), if you keep status bits. >However, you are not a hardware designer. Are you a hardware designer, and if so, and if you think I'm an idiot (which is entirely possible), could you email me a description (brief if you prefer) of a signed multiply algorithm that can't easily handle unsigned values? I'm always eager to learn, but the rest of these people don't need to be annoyed. You can still flame me on the net, of course. Martin Golding | sync, sync, sync, sank ... sunk: Dod #0236 | He who steals my code steals trash. A poor old decrepit Pick programmer. Sympathize at: {mcspdx,pdxgate}!adpplz!martin or martin@adpplz.uucp