kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/28/91)
In article <1991Feb27.183718.638@jetsun.weitek.COM> weaver@jetsun.WEITEK.COM (Michael Weaver) writes:
>Does anyone know of a pipelined divider that gives a result every cycle?

I've read of at least one real product. No ref -- memory fault.

>As far as I know, they exist only on paper, and for good reason:
>every algorithm I have heard of for an 'array' divider starts with
>an iterative algorithm, and simply repeats the hardware for each
>iteration. This means that the amount of hardware is increased by

I thought they started with a good table lookup. I seem to remember that
it's been shown that a good initial approximation is more important than
the actual iterative scheme.

>the same factor as the increase in throughput, and the latency
>remains the same, which is not very attractive. Also, the actual
>number of transistors is horrendous -- it would take perhaps
>ten times as many transistors as an array multiplier, which
>is a large thing.

I don't know about the order of magnitude difference, but I'll agree I
wouldn't usually build such things on the basis of economics.

It always seemed like nature has it in for (integer) division -- from a
stats perspective many of the answers will be 0, yet we have to wait much
longer in the worst case (i.e. O(n^k) or O(n log n) as opposed to O(1)).
If the numerator & denominator are taken from uniform distributions
(which is pretty much the case in modular arithmetic & a few other
realms), half the time num < denom, and that case is obviously reasonably
quick. A reconfiguring pipe, anyone?

-kym
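[To put a number on that skew: for uniform operands the quotient is 0
about half the time, and 0 or 1 about three-quarters of the time. The
throwaway C sketch below estimates the distribution by Monte Carlo;
rand() is just a stand-in for a real operand stream, and the 15-bit
operand range is arbitrary (it keeps us within RAND_MAX on any libc).]

--- sketch ---
#include <stdio.h>
#include <stdlib.h>

/* Estimate the quotient distribution for uniform 15-bit operands:
   roughly half the quotients come out 0 (num < denom). */
int main(void)
{
	long counts[4] = {0};		/* quotients 0, 1, 2, and >2 */
	long i, trials = 1000000;

	srand(1);
	for (i = 0; i < trials; i++) {
		unsigned num   = rand() & 0x7fff;
		unsigned denom = (rand() & 0x7fff) + 1;	/* avoid /0 */
		unsigned q = num / denom;
		counts[q > 2 ? 3 : q]++;
	}
	printf("q=0: %4.1f%%  q=1: %4.1f%%  q=2: %4.1f%%  q>2: %4.1f%%\n",
	       100.0*counts[0]/trials, 100.0*counts[1]/trials,
	       100.0*counts[2]/trials, 100.0*counts[3]/trials);
	return 0;
}
--- end sketch ---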
baum@Apple.COM (Allen J. Baum) (03/01/91)
[]
In article <1991Feb28.010339.20055@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <1991Feb27.183718.638@jetsun.weitek.COM> weaver@jetsun.WEITEK.COM (Michael Weaver) writes:
>>Does anyone know of a pipelined divider that gives a result every cycle?
>
>I've read of at least one real product. No ref -- memory fault.

I believe that HP had a chip that was a purely combinatorial divider. I
don't remember what system it was shipped with, though, and I don't think
it was IEEE conforming.

--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/01/91)
In article <6968@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[there are pipes that divide in 1 cycle]
>Of the machines I know, this is not the case. The CYBER 205, which
>pipelines just about everything, does not pipeline divide, square root
>(the same unit), or convert between decimal and binary, which I believe
>also uses that unit. The CRAY does not provide a division at all,
>presumably because it could not be pipelined. It does have a pipelined
>approximation to reciprocal, and a pipelined correction factor.

I know the Cray-2 has problems with chaining (:-), but on the X-MP wasn't
it possible to chain recip & multiply?

-kym
lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (03/07/91)
In article <1991Mar1.033353.28445@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <6968@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>[there are pipes that divide in 1 cycle]
>>Of the machines I know, this is not the case. The CYBER 205, which
>>pipelines just about everything, does not pipeline divide, square root
>>(the same unit),
>>also uses that unit. The CRAY does not provide a division at all, presumably
>>because it could not be pipelined.
>... but on the X-MP
>wasn't it possible to chain recip & multiply?
>
>-kym

Several points here:

Many of the postings have confused result-within-1-cycle with
1-result-per-cycle (or some small number of cycles per result, due to
pipelining). I don't know of any machine which produces the result within
1 cycle. However, there are a number of machines which have pipelined
divides.

The Cyber 205 *scalar* divide unit was not pipelined. The vector unit
*was* pipelined. In the basic 2-pipe machine, it was 1 result per 3.57
cycles (that is not a typo -- I don't know why it is not an integral
number of cycles), with a faster (roughly twice as fast) pipeline
available (more hardware had to be added) on the four-pipe version.
Divide and square root were both provided. The Cray provides a similar
level of performance in a different way: by explicitly pipelining the
reciprocal approximation unit with add/multiply to get the final result.

A question that is sometimes asked by vendors is: how important is
divide? Because divide is slow, many codes avoid using it whenever
possible. Because they do, dynamic instruction counts show divide as
seldom used, etc. -- a familiar situation to readers of this group. I
don't know of any effort to determine the "natural" frequency of divides,
though codes written for vector machines with pipelined divides are
probably a lot closer to the truth than old codes written on machines
that took 60-100 cycles to produce one non-pipelined result.

Hugh LaMaster, M/S 233-9,     UUCP: ames!lamaster
NASA Ames Research Center     Internet: lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035       With Good Mailer: lamaster@george.arc.nasa.gov
Phone: 415/604-6117           #include <std.disclaimer>
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/07/91)
In article <1991Mar6.183927.16256@news.arc.nasa.gov> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>Many of the postings have confused result-within-1-cycle with
>1-result-per-cycle (or some small number, due to pipelining e.g.).

I don't think we are confused, just indulging (certainly I'm guilty) in
the kind of sloppy `shop talk' mathematicians accuse physicists of all
the time. :-)

>A question that is sometimes asked by vendors, is, how important is divide?
>Because divide is slow, many codes avoid using it whenever possible.
>Because they do, dynamic instruction counts show divide as seldom used. etc.
>A familiar situation to readers of this group. I don't know of
>any effort to determine the "natural" frequency of divides, though codes
>written for vector machines with pipelined divides are probably a lot
>closer to the truth than old codes written on machines that took 60-100
>cycles to produce one, non-pipelined result.

This is of course a nontrivial question with interesting ramifications
from a `philosophy of engineering' point of view. An engineer tries to
get a job done as economically as possible (in the ideal case, anyway).
One technique used is to follow the maxim ``what must be done must be
done well''. But how do you know what _needs_ to be done? Obviously the
RISC community and others use the approximation -- what is done often is
more important than what is done less. However, this brings up the
question -- is what _is_ done the best way that things _can_ be done? I
think the obvious answer is -- we can't be _sure_ (but the pocketbook has
gotta count for something :-).

As a bumbling attempt to figure out what the `natural frequency' of
divides might be, we have to consider quite a few things. The algebraic
properties of division -- as representing either an inverse over the
field of reals (:-) or, more simplistically, repeated subtraction --
show that it should occur less often than, say, addition/subtraction
(i.e. it can be made to distribute over these operations). So why not
take addition as the `model' of a low-cost operation that may represent
at least an upper bound on the natural frequency of division (or any
other arithmetic operation for that matter)?

Simple measurements show that addition can account for 20% or so
(actually less) of dynamic opcodes, whereas division normally occurs in
the 1% to 3% range. Let's presume all operations other than divide take
about the same work. If we say division occurs at a frequency alpha and
takes time D, then the proportion of time taken by all divisions is
obviously:

	              1
	-------------------------
	1 + (1-alpha)/(alpha D)

If we plug in .2 for alpha and make a little table, we find, for example
on a Magnum -- where (on average, for uniform random operands) a divide
takes about 4 adds -- that about 55% of the runtime _might_ be accounted
for by division, with the above (obviously very loose) assumptions. (See
the sketch at the end of this article.)

I also feel some rough analysis, akin (in accuracy) to the ``editor's
formula'', might be jury-rigged to provide other measures of natural
operator frequencies. Obviously Herman Rubin & others would be far more
qualified than I to do such an analysis, but I shall attempt something
very simple.

Let's say 3% of operations are divides, given that they are thought to be
very expensive. Of the remaining 97% of instructions _some_ of them also
represent divisions, although in disguise.
Let's presume the work of an average divide is D and, further, that at
most D instructions will `substitute' for a divide. If we start with N
instructions we will therefore have

	d/N = 3%
	other divides >= 3% of (N-d)/D

where d = number of divides. The lower bound on the `natural frequency'
of divisions would then be

	3% + other divides/N >= 3% + 3% (N-d)/(N D)
	                      = 3% + (1/4) 3% - (1/4) 3% 3%	[with D = 4]
	                      =about 3.7%

Thus the natural frequency of division is found to be only a little more
(as a lower bound) than actually measured -- somewhat contradicting the
previous discussion. But this may not be too far wrong -- the more
expensive division _is_, the fewer will be the divides that can result
(given division becomes really cheap) from substituting it back for these
other instructions.

I would be interested in seeing any other discussion (although less
interested in poking fun at my lame arguments above :-|).

-kym
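[The arithmetic above is easy to reproduce. A minimal C sketch,
evaluating the runtime fraction 1/(1 + (1-alpha)/(alpha D)) for a few
assumed values of alpha and D, plus the 3%-plus-hidden-divides lower
bound; the alpha = .2 and D = 4 figures are the assumptions from the
article, not measurements.]

--- sketch ---
#include <stdio.h>

int main(void)
{
	static const double alphas[] = { 0.03, 0.10, 0.20 };
	static const double costs[]  = { 2, 4, 8, 16 };
	int i, j;

	/* fraction of runtime spent dividing, given divide frequency
	   alpha and divide cost D (in units of one add) */
	for (i = 0; i < 3; i++)
		for (j = 0; j < 4; j++) {
			double a = alphas[i], D = costs[j];
			double frac = 1.0 / (1.0 + (1.0 - a) / (a * D));
			printf("alpha=%.2f D=%2.0f -> %5.1f%% of runtime\n",
			       a, D, 100.0 * frac);
		}

	/* lower bound on the `natural' divide frequency: 3% measured,
	   plus hidden divides at 3% of the remaining 97%, each worth
	   1/D = 1/4 of a divide */
	printf("natural freq lower bound = %.2f%%\n",
	       100.0 * (0.03 + 0.03 * 0.97 / 4.0));
	return 0;
}
--- end sketch ---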
mash@mips.com (John Mashey) (03/07/91)
In article <1991Mar7.043931.13552@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>A question that is sometimes asked by vendors, is, how important is divide?
>>Because divide is slow, many codes avoid using it whenever possible.
>>Because they do, dynamic instruction counts show divide as seldom used. etc.
>>A familiar situation to readers of this group. I don't know of
>>any effort to determine the "natural" frequency of divides, though codes
....
>done as economically as possible (in the ideal case, anyway). One technique
>used is to follow the maxim ``what must be done must be done well''. But how
>do you know what _needs_ to be done? Obviously the RISC community and others
>use the approximation -- what is done often is more important than what is
>done less. However, this brings up the question -- is what _is_ done the best
>way that things _can_ be done? I think the obvious answer is -- we can't be
>_sure_ (but the pocketbook has gotta count for something :-).
...
>fewer will be the divides that can result (given division becomes really
>cheap) by substituting it for these other instructions.

(As one reference to the following, see Hennessy & Patterson's appendix
on computer arithmetic. Note: cheap = 1, fairly cheap = 4, moderate-cost
= 12, expensive = 32 or more, very expensive can be hundreds.)

You can make floating point go faster and faster by throwing more and
more hardware at it. Current and/or forthcoming chips illustrate the fact
that you can profitably spend 600K transistors or more on an FPU all by
itself.

+ & -	You can make integer addition cheap by throwing more hardware at
	it; the nature of these is that you must put enough hardware in
	to make integer add be as fast as it needs to be.

*	You can make integer multiplication moderate-cost by throwing
	some hardware at it, or fairly cheap by using more. It is a
	useful analysis to consider the tradeoffs here of best use of
	space.

/	You can make integer division very expensive by throwing zero
	hardware at it (SPARC), somewhat expensive by throwing a little
	bit of hardware at it, or somewhat less expensive by throwing
	more hardware at it (R3000, 88K, etc).

No matter what you do with CPU logic, you are NOT likely to make integer
division "very cheap" or even "somewhat cheap"; once you have a
reasonable implementation, it's HARD to speed it up by allocating it more
and more space.... (Of course, if you could afford the
appropriately-sized lookup table, you could do it by table lookup, but
that is a LARGE table not likely to be placed onto a CPU. :-)
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
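[For a feel of just how LARGE: a full n-bit by n-bit quotient table has
2^(2n) entries of n bits each, before any of the compression tricks
discussed in the follow-up. A back-of-envelope sketch:]

--- sketch ---
#include <stdio.h>

/* Raw size of a full n/n-bit quotient lookup table:
   2^(2n) entries, n bits apiece. */
int main(void)
{
	int n;

	for (n = 4; n <= 16; n += 4) {
		double entries = (double)(1L << n) * (double)(1L << n);
		double bytes = entries * n / 8.0;
		printf("n=%2d: 2^%2d entries, %.3g bytes\n",
		       n, 2 * n, bytes);
	}
	return 0;
}
--- end sketch ---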
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/08/91)
In article <777@spim.mips.COM> mash@mips.com (John Mashey) writes:
>You can make integer division very expensive by throwing zero hardware at it
>(SPARC), somewhat expensive by throwing a little bit of hardware
>at it, or somewhat less expensive by throwing more hardware at it
>(R3000, 88K, etc).
>No matter what you do with CPU logic, you are NOT likely to make
>integer division "very cheap" or even "somewhat cheap";
>once you have a reasonable implementation, it's HARD to speed it
>up by allocating it more and more space....
>(Of course, if you could afford the appropriately-sized lookup table,
>you could do it by table lookup, but that is a LARGE table not likely
>to be placed onto a CPU. :-)

A table lookup, sigh... Maybe! What with all these sneaky compression
techniques around these days -- not that I've looked very closely, mind
you -- maybe you _could_ fit a whole division table in there. Besides the
actual bits needed to store all the results (which works out at about 1
bit per result due to divide's nice nature (e.g. more than half the
results are either 0 or 1)), we need to also handle all those address
lines. Maybe some kind of `perfect hashing' scheme? (I remember Knuth
remarked somewhere that perfect hashing was going to be very difficult --
and then some smartys came _up_ with some fairly good approximations to
it in several domains.) I am not sure what has/can be done with a table
and (very) simple hardware to do interpolation.

And then again -- maybe you can just live with a (relatively) fast
reciprocal as in the program below. (Of course this _still_ isn't cheap
in terms of gates.) Although the algorithm appears to bristle with
multiplies (apart from the one we'll need to actually get a division
going), they aren't all general. One is simply a shift. (It turns out
playing around with exponents via C's standard library functions is not
worth the effort.) The other is a `small' multiply (although indicated as
a short x double in the program, it could be implemented in a relatively
small amount of dedicated hardware). The only remaining multiply is a
full one -- but it doesn't need to be performed in the majority of cases
(given uniform operands to recip). (Some extant hardware relegates
similar final `correction steps' to a separate function unit.)

Although it's similar to a one-step Newton-Raphson, it isn't. It's a
simpler scheme based on the expansion:

	(a+b)^-1 =  a^-1 (1 + a^-1 b)^-1
	   =about   a^-1 (1 - a^-1 b)		when b << a.

We ensure b << a by the appropriate FFO-like manipulations at the start
of the routine.

The software recip compares with the built-in (or not :-) versions on
several machines as follows:

	Machine		ratio of software/`builtin'
	VAX 6440	24%
	Mips Magnum	30%
	VAX 8530	30%
	Sun 3/60	55%
	Sun SS1		70% ?!

I shall not comment further.

In summary: it seems as though there are some mechanisms for obtaining
divide pipes that return a result per clock, but given the relatively low
proportion of overall dynamic opcodes that divide represents (<=3% as
pixie & I measure) -- and possibly not much more no matter HOW cheap
and/or fast it could be made -- further effort is probably wasted.

-kym

C code ahead -- BEWARE.
--------------------------------------------------------------------------------
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

#define NIT 100000
#define NTAB 1048

int mask[32];
double shift[32];
double rtab[NTAB];

/* Approximate reciprocal: table lookup on the top bits of x, a shift,
   and a `small' multiply correction when low bits are present.
   Leaves the result in `res' in the enclosing scope. */
#define recip(x) {\
	register y;\
	register n;\
	for(y=x,n=0; y>=NTAB; y>>=1,n++);\
	res = rtab[y];\
	if(n) {\
		register short b;\
		res *= shift[n];\
		b = x & mask[n];\
		if(b) res *= 1-res*b;\
	}}

double test1(){		/* software reciprocal */
	register i;
	register double ans=0;
	register double res;

	for(i=1;i<=NIT;i++){
		recip(i);
		ans+= res;
	}
	return ans;
}

double test2() {	/* built-in divide */
	register i;
	register double ans=0;

	for(i=1;i<=NIT;i++) ans+= 1.0/i;
	return ans;
}

main(){
	double t0,t1,t2;

	init();
	t0=clock();
	printf("%g\n", test1());
	t1=clock();
	printf("%g\n", test2());
	t2=clock();
	printf("%g\n",(t2-t1)/(t1-t0));	/* ratio of the two timings */
}

init(){
	int i;

	rtab[0]=1e38;
	for(i=1;i<NTAB;i++) rtab[i]=1.0/i;
	for(i=0;i<32;i++) mask[i]=(1<<i) - 1;
	for(i=0;i<32;i++) shift[i]=1.0/(1<<i);
}
--------------------------------------------------------------------------------
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/11/91)
My earlier mumblings about ``interpolation'' & ``table-lookups not being
quite as expensive as all that'' have resulted in a modest amount of
email asking for clarification. So after a bit of experimentation I am
posting yet _another_ reciprocal routine which attempts to illustrate
some of the ideas to which I alluded.

This present routine uses linear interpolation (after some initial
adjustment of the argument) to find the reciprocal of an unsigned 16-bit
number. It uses 512 words of table storage. It is based, again, on an
approximate expansion:

	1/(L a + b) =about 1/L 1/a (1 - 1/L 1/a b)
	            =      1/L (A[a] - B[a] (1/L b))

where A[a] = 1/a and B[a] = 1/a^2. It is convenient to choose L = 2^k
:-).

Over the argument range the absolute error in the result is small enough
to make any division accurate (i.e. the largest abs err * largest
numerator < 1 bit). Since the scheme only involves

	(a) a subtract,
	(b) two `small' shifts, and
	(c) a `small' multiply,

it is slightly :-) cheaper in terms of h/w than the scheme I posted
previously. It again could perform at the rate of one result per cycle
with a suitable tree multiplier and shifters.

The accuracy of the calculation can be improved somewhat if the
parameters A[a] and B[a] are displaced; I have tried a simple regression
to minimize the square error, and that seems satisfactory; but for the
numerical purists a minimax procedure might be better.

Since the abs error was originally largest for a power of two, I have
included a shift in the program to remove trailing 0's from the initial
argument; this would probably be reduced to a single-bit shift in a
hardware version, or eliminated entirely if the table entries A[a] and
B[a] were adjusted appropriately.

The program also includes a possible single Newton-Raphson correction
step; but it does not seem to be required for the purposes of
implementing integer division.

Although obviously a toy, expansion to your preference in wordsize is
straightforward (if expensive).

-kym

C code ahead.

--- program ---
#include <assert.h>
#include <math.h>
#include <stdio.h>

/*
	1/(La+b) = 1/L 1/a (1 + 1/L 1/a b)^-1
	   =about  1/L 1/a (1 - 1/L 1/a b)
	         = 1/L 1/a - 1/L^2 1/a^2 b
	         = 1/L A(a) - 1/L^2 B(a) b
	Good if L = 2^k.
*/

#define MAX 0xffff
#define NTAB (1<<8)

float atab[NTAB],btab[NTAB];
int mask[10];
float shift[16];

float recip0(x)
{
	int a;
	unsigned short b;
	unsigned char m=0;
	unsigned char n;

	/* trim trailing 0's from initial argument */
	for(a=x; (a&1)==0; a>>=1, m++);

	/* top `nbits' of x -> a ; rest -> b */
	for(n=0; a>=NTAB; a>>=1, n++);
	assert(n<=10);
	b=x & mask[n];
	m += n;
	assert(m<=15);

	return shift[m]*(atab[a] - shift[n]*btab[a]*b);
}

#ifdef NEWTON
float recip(x)
{
	/* perform single Newton-Raphson correction step */
	float r = recip0(x);

	r *= 2 - r*x;
	return r;
}
#else
/* perform no Newton-Raphson correction step */
#define recip(x) recip0(x)
#endif

main()
{
	int x;
	float mine,std;
	float abs_err,rel_err;
	float prod_rel_err=1,sum_abs_err=0;
	float max_abs_err=0,min_abs_err=1e38;
	int min_abs_err_x= -1,max_abs_err_x= -1;
	float max_rel_err=0,min_rel_err=1e38;
	int min_rel_err_x= -1,max_rel_err_x= -1;

	init();
	for(x=1; x<=MAX; x++) {
		std=1.0/x;
		mine=recip(x);
		assert(mine>=0);	/* just in case... */
		abs_err=fabs(mine-std);
		sum_abs_err+=abs_err;
		rel_err=abs_err/std;
		prod_rel_err *=rel_err;
		if(abs_err>max_abs_err)
			max_abs_err=abs_err,max_abs_err_x=x;
		else if(abs_err<min_abs_err)
			min_abs_err=abs_err,min_abs_err_x=x;
		if(rel_err>max_rel_err)
			max_rel_err=rel_err,max_rel_err_x=x;
		else if(rel_err<min_rel_err)
			min_rel_err=rel_err,min_rel_err_x=x;
	}
	printf("av abs err=%g ",sum_abs_err/MAX);
	printf("min abs err =%g ",min_abs_err);
	printf("(%d) ",min_abs_err_x);
	printf("max abs err =%g ",max_abs_err);
	printf("(%d) ",max_abs_err_x);
	putchar('\n');
	printf("av rel err=%g ",pow(prod_rel_err,1.0/MAX));
	printf("min rel err =%g ",min_rel_err);
	printf("(%d) ",min_rel_err_x);
	printf("max rel err =%g ",max_rel_err);
	printf("(%d) ",max_rel_err_x);
	putchar('\n');
	exit(0);
}

init()
{
	int i;

	atab[0] = 0;
	btab[0] = 0;
	for(i=1; i<NTAB; i++) {
		atab[i] = 1.0/i;
		btab[i] = 1.0/((long)i*i);
	}
	for(i=0; i<10; i++) mask[i] = (1L<<i) - 1;
	for(i=0; i<16; i++) shift[i] = 1.0/(1L<<i);
}
--- make #1 ---
cc recip.c -lm
a.out
--- log file #1 ---
av abs err=5.93951e-08 min abs err =0 (1) max abs err =7.59971e-06 (514)
av rel err=0 min rel err =0 (1) max rel err =0.00732425 (33008)
--- make #2 ---
cc -DNEWTON recip.c -lm
--- log file #2 ---
av abs err=1.51684e-10 min abs err =0 (1) max abs err =2.96859e-08 (514)
av rel err=0 min rel err =0 (1) max rel err =5.36168e-05 (33008)
--- end ---
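[To close the loop on the ``largest abs err * largest numerator < 1
bit'' claim: with that bound, a trial quotient computed from the
approximate reciprocal is off by at most one, so a single
compare-and-adjust step gives an exact integer divide. A sketch in the
same K&R style, assuming recip() and init() from the program above:]

--- sketch ---
/* Exact 16-bit unsigned division via the approximate reciprocal.
   Since |x*recip(y) - x/y| < 1/2 over the argument range, the
   truncated trial quotient is within 1 of the true quotient. */
unsigned idiv(x, y)
	unsigned x, y;
{
	unsigned q = x * recip(y);	/* trial quotient */
	long r = (long)x - (long)q*y;	/* trial remainder */

	if (r < 0) q--;			/* trial was one too high */
	else if (r >= (long)y) q++;	/* trial was one too low */
	return q;
}
--- end sketch ---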
bob@tera.com (Bob Alverson) (03/12/91)
Although the Tera computer is not built yet, it will be able to divide by
an arbitrary integer *constant* (or loop invariant value, etc.), while
producing one result per cycle. Mind you, I can't imagine any reason
anyone would care. The latency is the same as for a "fused multiply add"
a la IBM's RS/6000 (because the divide is done with the same logic).

For the unlucky whose divisors aren't known to the compiler and aren't
loop invariant, the divide rate drops to one result every nine ticks. The
only significant hardware dedicated to divide is a 256-entry lookup table
and an 8x8 -> 16 multiplier for the initial reciprocal approximation.

Now, if I could just figure out how to do float divides at one per
tick...

Bob (bob@tera.com)
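[The constant-divisor trick is the same one compilers play in software:
replace x/d by a multiply-high and a shift, using a precomputed
``reciprocal'' of d. A sketch for d = 10 on 32-bit unsigned operands --
the magic constant is ceil(2^35/10); this is the textbook recipe, not a
description of Tera's actual logic:]

--- sketch ---
#include <assert.h>
#include <stdio.h>

/* q = floor(x * ceil(2^35/10) / 2^35) equals x/10 for all 32-bit x. */
unsigned div10(unsigned x)
{
	return (unsigned)(((unsigned long long)x * 3435973837u) >> 35);
}

int main(void)
{
	unsigned long x;

	/* spot-check the low values and the extreme */
	for (x = 0; x < 5000000; x++)
		assert(div10((unsigned)x) == (unsigned)x / 10);
	assert(div10(0xffffffffu) == 0xffffffffu / 10);
	printf("div10 OK\n");
	return 0;
}
--- end sketch ---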
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <1991Mar12.043839.11068@tera.com> bob@tera.com (Bob Alverson) writes:
| For the unlucky whose divisors aren't known to the compiler and aren't loop
| invariant, the divide rate drops to one result every nine ticks. The only
| significant hardware dedicated to divide is a 256 entry lookup table and
| an 8x8 -> 16 multiplier for the initial reciprocal approximation.

  I got these results from a 386-25:

# System id: Dell 325, 4MB, 150MB, Xenix/386 2.3.3, 387
#
#	Math operations, effective instructions/sec (thousands)
#
#             Add       Sub       Mpy       Div     Wtd HM
#  short:   7451.0    7378.6    3023.3    2933.3    4656.6
#  long:    7600.0    7368.4    2692.3    2000.0    4031.5
#  float:   1168.8    1168.8     975.6     933.3    1074.9
#  double:  1025.6    1012.7     750.0     789.5     899.2
#
#	Test and branch timing:
#  integer compare and branch    0.688 uSec,   1453.5K/sec
#  float compare and branch      4.320 uSec,    231.5K/sec

  The divide speed would indicate about 9 ticks for 16 bit, about 12.5
for 32 bit. The 486-25 looks like this:

# System id: HP 486-25, SCO ODT 1.0, 10MB, 300MB, cc
#
#	Math operations, effective instructions/sec (thousands)
#
#             Add       Sub       Mpy       Div     Wtd HM
#  short:  17934.1   18000.0    3483.9    2936.2    6400.4
#  long:   20400.0   19695.6    3225.8    2042.6    5528.7
#  float:   4258.1    4252.0    3829.8    1515.8    3276.1
#  double:  4129.0    4087.9    3260.9    1345.8    2992.5
#
#	Test and branch timing:
#  integer compare and branch    0.247 uSec,   4054.0K/sec
#  float compare and branch      0.850 uSec,   1176.5K/sec

  This would indicate that the 486 didn't get much help on divide, but
add and subtract, as well as compare and branch, got a big boost. My
overall results for a bunch of programs were that the 486 was about 2.6x
faster than the 386 at the same clock speed.

  Note: these figures are presented as ballpark figures, and represent
measured performance obtained using C rather than assembler. While they
are proportional to actual hardware performance, they are not best-case
performance. On the other hand, I started building this benchmark suite
in 1970... it measures individual performance aspects, looking for those
"jackpot cases" where performance is really bad.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    "Most of the VAX instructions are in microcode, but halt and no-op
     are in hardware for efficiency"
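[The ``about 9 ticks'' estimate falls straight out of the tables:
implied cycles per operation = clock rate / operation rate. A two-line
check, assuming the 25 MHz clocks implied by the machine names:]

--- sketch ---
#include <stdio.h>

/* Implied cycles per divide = clock (Hz) / measured divide rate.
   The Div-column rates are taken from the tables above
   (thousands of operations per second). */
int main(void)
{
	double clock_hz = 25e6;

	printf("386 short div: %4.1f cycles\n", clock_hz / 2933.3e3);
	printf("386 long  div: %4.1f cycles\n", clock_hz / 2000.0e3);
	printf("486 short div: %4.1f cycles\n", clock_hz / 2936.2e3);
	printf("486 long  div: %4.1f cycles\n", clock_hz / 2042.6e3);
	return 0;
}
--- end sketch ---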