baum@Apple.COM (Allen J. Baum) (10/18/88)
[]
>In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
>A while back, I was really hot on the idea of RISC. Then a friend
>pointed out a few things that set me straight...
>
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

Some are trying to do it, but they can't do it as well as a RISC
processor, because they don't have room on the chip: they're busy
supporting instructions and addressing modes that hardly ever get used.

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

First of all, that's not necessarily true. Second of all, so what? The
number of instructions used by RISCs has been greatly exaggerated.
First-generation RISCs may have a code expansion of 50% in bytes.
However, this is a static measurement. It doesn't mean that the inner
loops will use twice as many cycles. And, since the datapath of a RISC
machine is simpler, its cycles are likely faster.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

Perhaps, but the microcode is only good for one thing: interpreting
macrocode. The code in a cache is good for whatever I happen to be
executing at the time.

>As an aside, the 68030 can do a 32 bit multiply in about (If I remember
>correctly -- I don't have the book in front of me) 40 cycles. A while
>back, I tried to write a 32 bit multiply macro that would take less
>than the 40 or so that the '030 took. I didn't even come close (even
>assuming lots of registers and a 32 bit word size (which the 6502
>doesn't have)).

Check out the ASPLOS II proceedings. There is an article by Dan
Magenheimer of HP explaining how they do 11-cycle average multiplies,
without a multiplier and without a multiply-step instruction.
Besides, if you have a RISC machine, and multiply is truly important,
then you have room for multiply support, up to having a full
multiplier. You may find, however, that it won't make any difference
in your performance, because no one needs an integer multiplier very
often. Like a lot of things that RISC designers have left out.
--
{decwrl,hplabs}!nsc!baum@apple.com	(408)973-3385
snoopy@sopwith.UUCP (Snoopy T. Beagle) (10/31/88)
In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
| You may find, however, that it won't make any difference in your
| performance because no one needs an integer multiplier very
| often. Like a lot of things that RISC designers have left out.

I humbly disagree. Just because *you* never use an integer multiply
does not imply that no one else ever does.

    _____
   /_____\	Snoopy
  /_______\
    |___|	tektronix!tekecs!sopwith!snoopy
    |___|	sun!nosun!illian!sopwith!snoopy
cik@l.cc.purdue.edu (Herman Rubin) (10/31/88)
In article <40@sopwith.UUCP>, snoopy@sopwith.UUCP (Snoopy T. Beagle) writes:
> In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
> | You may find, however, that it won't make any difference in your
> | performance because no one needs an integer multiplier very
> | often. Like a lot of things that RISC designers have left out.
>
> I humbly disagree. Just because *you* never use an integer multiply
> does not imply that no one else ever does.

There are many other operations which are cheap in hardware and
expensive in software. Maybe we should have a "competition" to see how
many operations we can come up with for which silicon can do a quick,
efficient, accurate job, but which are clumsy and expensive in
software. I will toss in a few for starters.

	Find the distance to the next one in a bit stream FAST. It
would be good to have an exception handler if one is not found. I
consider algorithms not worth implementing if the operation is slow.

	Divide an integer by an integer, obtaining a quotient and a
remainder, the choice of remainder depending on the signs of the
dividend and the divisor.

	Divide a floating-point number by a floating-point number,
obtaining an integer quotient and a floating-point remainder. It would
be nice to have the choice of remainder depending on the signs here
also.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
david@ms.uky.edu (David Herron -- One of the vertebrae) (10/31/88)
Yea really ... and compilers use integer multiplies all the time when referencing into arrays... I for one would like to see floating point tossed out, since I never use it (and I happen to dislike my numerical analysis class :-) -- <-- David Herron; an MMDF guy <david@ms.uky.edu> <-- ska: David le casse\*' {rutgers,uunet}!ukma!david, david@UKMA.BITNET <-- <-- Controlled anarchy -- the essence of the net.
david@sun.uucp (David DiGiacomo) (11/01/88)
In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software. ... I will toss in a few for starters.

I think you're going to have to find some better examples...

>	Find the distance to the next one in a bit stream FAST. It would
>be good to have an exception handler if one is not found. I am considering
>algorithms not worth implementing if the operation is slow.

This is very cheap in software (it runs at close to memory bandwidth).
Find the first nonzero word, then the first nonzero byte in that word,
then do a table lookup. If it really has to be FAST, use a 16-bit
table.

>	Divide an integer by an integer, obtaining a quotient and a remainder,
>the choice of remainder depending on the signs of the dividend and the divisor.
>
>	Divide a floating point number by a floating point number, obtaining
>an integer quotient and a floating point remainder. It would be nice to have
>the choice of remainder depending on the signs here also.

These operations are *extremely* expensive in hardware.
--
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david  david@sun.com
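[Editorial note: DiGiacomo's word-then-byte-then-table scan can be
sketched in a few lines of C. The names and the little-endian bit
numbering are illustrative, not from the post:]

```c
#include <stddef.h>
#include <stdint.h>

/* low_bit_of_byte[b] = index of the lowest 1-bit in byte b (b != 0). */
static uint8_t low_bit_of_byte[256];

static void init_table(void) {
    for (int b = 1; b < 256; b++)
        for (int k = 0; k < 8; k++)
            if (b & (1 << k)) { low_bit_of_byte[b] = (uint8_t)k; break; }
}

/* Distance in bits to the first 1 in the stream, or -1 if all zero. */
static long first_one(const uint32_t *words, size_t nwords) {
    static int ready = 0;
    if (!ready) { init_table(); ready = 1; }
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = words[i];
        if (w == 0) continue;                    /* runs at memory bandwidth */
        for (int byte = 0; byte < 4; byte++) {   /* first nonzero byte */
            uint8_t b = (uint8_t)(w >> (8 * byte));
            if (b != 0)
                return 32 * (long)i + 8 * byte + low_bit_of_byte[b];
        }
    }
    return -1;                                   /* the "exception" case */
}
```

The inner loop is the only per-hit cost; the all-zero prefix is one
compare per word, which is DiGiacomo's bandwidth argument.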
baum@Apple.COM (Allen J. Baum) (11/01/88)
[]
>In article <40@sopwith.UUCP> snoopy@sopwith.UUCP (Snoopy T. Beagle) writes:
>In article <18931@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>| You may find, however, that it won't make any difference in your
>| performance because no one needs an integer multiplier very
>| often. Like a lot of things that RISC designers have left out.
>
>I humbly disagree. Just because *you* never use an integer multiply
>does not imply that no one else ever does.

Oh, please!
I didn't say no one.
I didn't say ever.
Of course there are applications that are integer-multiplication
intensive (as opposed to floating point). What I did say is that they
are quite rare. Integer-multiply intensive is defined (here and now,
by me) to be an application that will suffer a performance degradation
of more than 3% without a fast hardware multiplier (2-3 cycles, vs.
the average 11 cycles that HP can do in pure software; a
back-of-the-envelope calculation will show that means .3% of
instructions, pretty high for multiply). Most integer multiplies that
I am aware of are used for index scaling and other address
calculations. Good optimizing compilers will strength-reduce these
away.
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
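[Editorial note: the strength reduction Baum describes is easy to see
in source form. The two loops below are equivalent; the second is what
an optimizing compiler makes of the first, replacing the implied
i*sizeof(int) multiply in a[i] with a pointer bumped by the element
size. A hypothetical sketch, not from the post:]

```c
/* As written: each a[i] implies an index-scaling multiply (i * 4). */
long sum_indexed(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* After strength reduction: the multiply becomes a pointer add. */
long sum_reduced(const int *a, int n) {
    long s = 0;
    for (const int *p = a; p < a + n; p++)
        s += *p;
    return s;
}
```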
kyriazis@rpics (George Kyriazis) (11/01/88)
In article <10471@s.ms.uky.edu> you write:
>Yea really ... and compilers use integer multiplies all the
>time when referencing into arrays...
>
>I for one would like to see floating point tossed out, since I
>never use it (and I happen to dislike my numerical analysis class :-)

Floating point is getting faster and faster every day. One example
that I know of is the AT&T DSP32 Digital Signal Processor. It needs 1
CPU cycle to do a floating-point multiplication and accumulation. An
interesting side effect shows up in the calculation of array
references: the compiler expands the array index to FP, does the array
indexing, and then truncates the resulting address to an integer. (The
compiler I am talking about is the one that AT&T gives out for their
Pixel Machine.)
--
George Kyriazis
kyriazis@turing.cs.rpi.edu
ok@quintus.uucp (Richard A. O'Keefe) (11/01/88)
In article <10471@s.ms.uky.edu> david@ms.uky.edu (David Herron -- One of the vertebrae) writes: >I for one would like to see floating point tossed out, since I >never use it (and I happen to dislike my numerical analysis class :-) But it _has_ been tossed out of many designs: the IBM RT PC (at any rate the old ones) had no floating-point instructions, but used a National chip (32081?). The AMD 29000 manual says that floating-point ops are currently extracodes emulated by a trap handler. And of course there are all the 680x0, 320xx, 80*86 and so on, with floating-point done by a co-processor.
cik@l.cc.purdue.edu (Herman Rubin) (11/01/88)
In article <19762@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> Of course there are applications that are integer multiplication
> intensive (as opposed to floating point).
> What I did say is that they are quite rare.

They are rare because a good programmer knows that they are slow and
difficult to program. If an operation takes 10 lines of code, each of
which is expanded into one or more hardware instructions, why use it
unless you cannot find a way around it? The way around it may be
clumsy, but it is very likely to beat the unavailable operation.

> Integer-multiply intensive is defined (here and now, by me) to
> be an application that will suffer a performance degradation of more
> than 3% without a fast hardware multiplier (2-3 cycles, vs. the
> average 11 cycles that HP can do in pure software). Most integer
> multiplies that I am aware of are used for index scaling and other
> address calculations. Good optimizing compilers will strength-reduce
> these away.

If the double-precision product of two single-precision integers is
required, and only single-precision products are available, it is
necessary to go to single-precision products of half-precision
numbers. This takes about 20 instructions. How does the poster expect
to do it in an average of 11 cycles?
Many of these jobs are not being done, or are being kludged by finding
ways to accomplish more-or-less the same results in 10 instructions.
And if a subroutine call is made, double the time.

Many mathematical computations should be made in fixed-point
arithmetic. If one does not have the hardware available, the cost is
much greater than floating point; if the hardware is available, it is
much cheaper. None of the major languages supports fixed point. So
none of the hardware gurus put it in, so none of the machines have it,
so no one programs in it, so its inclusion is objected to as a waste
of resources, etc.

Another hardware operation missing on most machines is square root. So
one does not use algorithms requiring square roots. Again, the
circular argument.

An application using accurate arithmetic heavily will be spending most
of its time in multiple-precision subroutines, even with good
hardware. A time penalty of a factor of 10 here is obviously costly.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
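[Editorial note: fixed-point arithmetic is also where the
"double-precision product of two single-precision integers" bites. A
minimal Q16.16 sketch (the format and names are illustrative, not
from the thread) shows why the full 64-bit product is needed before
rescaling:]

```c
#include <stdint.h>

typedef int32_t q16_16;        /* fixed point: value = raw / 65536.0 */

static q16_16 q_from_int(int32_t i) {
    return i * 65536;          /* shift the binary point up 16 places */
}

static q16_16 q_mul(q16_16 a, q16_16 b) {
    /* The raw 32x32 product has 32 fraction bits; all 64 bits are
       needed before shifting back down to 16 fraction bits. */
    int64_t wide = (int64_t)a * (int64_t)b;
    return (q16_16)(wide >> 16);
}
```

Without a single x single -> double multiply (in hardware or, as here,
via a wider type), q_mul falls back to exactly the four half-precision
partial products Rubin describes.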
guy@auspex.UUCP (Guy Harris) (11/02/88)
>Yea really ... and compilers use integer multiplies all the
>time when referencing into arrays...

Integer multiplies by *constants*, most of the time; you can often do
those better with shifts and adds than you can with the machine's
multiply instruction (and no, I'm not talking about the 68010, which
doesn't have a 32x32 multiply instruction).
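[Editorial note: Harris's point in code. A multiply by a known
constant decomposes into shifts and adds; two examples (illustrative,
not from the post):]

```c
#include <stdint.h>

/* x * 10 = x*8 + x*2: two shifts and one add. */
static uint32_t times10(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* x * 36 = (x*9) * 4 = ((x<<3) + x) << 2: two shifts and one add. */
static uint32_t times36(uint32_t x) {
    return ((x << 3) + x) << 2;
}
```

On machines of this era a general multiply instruction was tens of
cycles, so a three-operation sequence like either of these was a clear
win, and compilers emitted them routinely.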
baum@Apple.COM (Allen J. Baum) (11/02/88)
[]
>In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
(in response to me about integer multiplies)
>They are rare because a good programmer knows that they are slow and
>difficult to program. If an operation takes 10 lines of code, each of
>which is expanded into one or more hardware instructions, why use it
>unless you cannot find a way around it? The way around it may be clumsy,
>but it is very likely to beat the unavailable operation.

I'm going further than that. I'm saying they are rare because they are
unnecessary. They are rare because in the USUAL case they can be
strength-reduced to additions by an optimizing compiler. This is
faster than using the obvious multiply instruction.

Integer numeric intensive applications are relatively RARE. You
(Herman Rubin personally) may do them all the time, in which case you
don't think they are rare, but as a percentage of programs run by all
computers, they are.

>If the double-precision product of two single-precision integers is required,
>and only single-precision products are available, it is necessary to go to
>single-precision products of half-precision numbers. This takes about 20
>instructions. How does the poster expect to do it in an average of 11 cycles?
>Many of these jobs are not being done, or are being kludged by finding ways to
>accomplish more-or-less the same results in 10 instructions. And if a
>subroutine call is made, double the time.

First of all, the "if double precision is required" is a BIG if. It
isn't, usually (i.e. rare; see above definition of rare). Secondly,
many multipliers do not support single x single -> double. Thirdly,
most languages don't support it either. And fourthly, the fact that
the operation yields a double-precision result doesn't mean that all
bits are significant. When they aren't (see rare, above), software can
shortcut the operation, making it end faster, ON THE AVERAGE.
Fifthly, the requirement for double-precision INTEGER products is rare
(see rare, above...).

>Many mathematical computations should be made in fixed-point arithmetic. If
>one does not have the hardware available, the cost is much greater than
>floating point. If the hardware is available, it is much cheaper. None
>of the major languages support fixed point. So none of the hardware gurus
>put it in, so none of the machines have it, so no one programs in it, so
>the inclusion of it is objected to as a waste of resources, etc.
>
>Another hardware operation missing on most machines is square root. So one
>does not use algorithms requiring square roots. Again, the circular
>argument.
>
>An application using accurate arithmetic heavily will be spending most of its
>time in multiple-precision subroutines, even with good hardware. A time
>penalty of a factor of 10 here is obviously costly.

I agree that languages and hardware don't support multiple-precision
fixed-point arithmetic. The circular argument is valid, but I maintain
that it starts because, as a percentage of all programs any particular
family of processors will run, these kinds of problems are exceedingly
rare.

Square root is in the same category as divide. Hardware is slow, so
algorithms tend to avoid them. The reason is fundamental: the hardware
is slow, and it is exceedingly difficult to make it faster. Strangely
enough, floating-point divide can be made to run much faster, because
of its normalized operands.
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
PLS@cup.portal.com (Paul L Schauble) (11/02/88)
> Find the distance to the next 1 in a bit stream FAST
Scan for the first non-zero word. Load word and do a floating normalize. Look
at the exponent.
++PLS
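[Editorial note: Schauble's normalize trick can be written in portable
C. Isolate the lowest set bit with x & -x, convert it to floating
point, and read the bit index out of the exponent, here via frexp()
rather than an explicit normalize instruction. A sketch only, with an
illustrative name:]

```c
#include <math.h>
#include <stdint.h>

/* Index of the lowest set bit of x (x must be nonzero). */
static int lowest_set_bit(uint32_t x) {
    uint32_t lsb = x & (0u - x);   /* isolate lowest 1-bit, e.g. 101000 -> 1000 */
    int exp;
    frexp((double)lsb, &exp);      /* lsb == 0.5 * 2^exp exactly */
    return exp - 1;                /* so the bit index is exp - 1 */
}
```

On a machine with a fast float normalize (or a hardware frexp
equivalent) this is a handful of cycles; the conversion to double is
exact because an isolated power of two always fits the mantissa.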
seanf@sco.COM (Sean Fagan) (11/02/88)
In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>Maybe we should have a "competition" to see how many operations
>we can come up with for which silicon can do a quick, efficient, accurate job,
>but which are clumsy and expensive in software.
>	Find the distance to the next one in a bit stream FAST. It would
>be good to have an exception handler if one is not found. I am considering
>algorithms not worth implementing if the operation is slow.
>	Divide an integer by an integer, obtaining a quotient and a remainder,
>the choice of remainder depending on the signs of the dividend and the divisor.
>	Divide a floating point number by a floating point number, obtaining
>an integer quotient and a floating point remainder. It would be nice to have
>the choice of remainder depending on the signs here also.

Quick background: Cyber 170's have 24 registers: A, B, and X, eight of
each (0-7). A and B registers are 18 bits wide and are generally used
for address variables. X registers are 60 bits wide and are usually
used for computation. To load a value into X<n>, you load the address
into A<n>, where n=[1,5]. To store a value from X<n>, you load the
address into A<n>, where n=[6,7]. A0 has no special values, and B0 is
a hardwired 0.

Ok, a few from the Wonderful World of the Cyber:

Count the set bits in a word (register or memory, it doesn't matter).
Very useful for some trivial applications (such as playing Othello),
but I haven't seen much else done with it. It was put in, it looks
like, because the hardware was already there (in the form of parity
checking), but I could be wrong.

Given the, um, interesting setup described above, write a routine to
save all of the registers into memory (hint: use shifts). It ain't
easy, and I don't remember how to do it, so don't mail me 8-). It is,
however, trivial to do in hardware.

Who else has some fun examples?
--
Sean Eric Fagan		| "Engineering without management is *ART*"
seanf@sco.UUCP		|		Jeff Johnson (jeffj@sco)
(408) 458-1422		| Any opinions expressed are my own, not my employers'.
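[Editorial note: the Cyber's count-set-bits instruction can be
approximated in software with the classic divide-and-conquer
reduction; the masks are the standard SWAR constants, not anything
from the post:]

```c
#include <stdint.h>

/* Number of 1-bits in x: sum adjacent fields of doubling width. */
static int popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (int)((x * 0x01010101u) >> 24);            /* add the 4 bytes */
}
```

About a dozen operations with no branches or tables: decent, but still
a long way from the Cyber's single instruction, which is Fagan's
point.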
khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/02/88)
In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> Of course there are applications that are integer multiplication
>> intensive (as opposed to floating point).
>> What I did say is that they are quite rare.
>
>They are rare because a good programmer knows that they are slow and
>difficult to program.

Integer multiplication hard to program? Slow? Is this really what is
meant?

>If the double-precision product of two single-precision integers is required,
>and only single-precision products are available, it is necessary to go to
>single-precision products of half-precision numbers. This takes about 20
>instructions. How does the poster expect to do it in an average of 11 cycles?

I believe the poster is referring to numerous HP publications (open
literature and manuals). Their algorithm is quite clever and makes use
of the fact that certain multiplier values are much more common than
others. Special instructions in Spectrum are employed in conjunction
with delayed branches to perform multiply in a max of 11 cycles (and
often one wins and it is less). I do not think that a discussion of DP
was meant.

>Many mathematical computations should be made in fixed-point arithmetic.

NO! Having witnessed far too much of this in DoD embedded computers:
the fixed-point math saved some hardware, but the software was awful.
Because of the handsprings fixed-point math required, it was not
possible to focus on the real big issues (like: is this algorithm
numerically stable? can we cut the compute cost by a factor of 10 by
altering the problem?). Fixed-point math is sometimes appropriate, but
if anything too much is done in fixed point.

>If one does not have the hardware available, the cost is much greater than
>floating point. If the hardware is available, it is much cheaper. None
>of the major languages support fixed point. So none of the hardware gurus
>put it in, so none of the machines have it, so no one programs in it, so
>the inclusion of it is objected to as a waste of resources, etc.

Check out DSP chips. Check out generation after generation of military
chips. Fixed point is very common in some environments. PL/I, PL/"x"
dialects, and miscellaneous special mil languages support it. People
code in it, and it is usually a bad design choice.

>Another hardware operation missing on most machines is square root. So one
>does not use algorithms requiring square roots.

Well, I came from the world of Kalman filtering, where the best
algorithms tend to use square roots (though sometimes these can be
avoided). sqrt improves the numerical reliability of many algorithms
(when employed correctly), which more than makes up for its speed: if
you get the right answer in fewer iterations, the extra cost doesn't
necessarily matter. I have not done a head count, but many machines
have sqrt (8087, 6888x, VAX FPA, IBM FPA, Univac).

>An application using accurate arithmetic heavily will be spending most of its
>time in multiple-precision subroutines,

Not if good algorithms are employed. For example, in the early '70s,
JPL'ers G. Bierman and K. Thornton proved that UD and SRIF mechanized
Kalman filters could be run in SP (36 bits) rather than the DP (72
bits) required by competing algorithms to achieve superior results.
Better algorithm, fewer bits, better results.

Better math logic (like IEEE machines with extended accumulators)
won't spend any time in extended-precision routines. VAX FPAs and IBM
FPAs also do their math in 64 bits, so that the penalty for extended
precision is the cost of moving more bits to memory (and more paging,
etc.).

Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus
firth@sei.cmu.edu (Robert Firth) (11/02/88)
In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software...I will toss in a few for starters.
>	Find the distance to the next one in a bit stream FAST. It would
>be good to have an exception handler if one is not found. I am considering
>algorithms not worth implementing if the operation is slow.

This is implemented in hardware on the MC68020 by the 'BFFFO'
instruction. It takes 18 cycles to find the first bit in a 32-bit
register, and up to 32 cycles to find the first bit in a 32-bit
bitfield in memory.

I find it hard to believe that a software algorithm on a RISC machine
would do worse. In the register case, for instance, a simple binary
chop takes at most 5 steps, and each step is just
(and-immediate-with-mask; branch if result zero). One can get more
cunning, but even that gives you the answer in about 12 cycles.
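[Editorial note: Firth's binary chop, written out in C. Each of the
five steps is one mask test and one conditional adjust, narrowing in
on the most significant 1-bit (BFFFO's convention: offset 0 is the
MSB). The function name is illustrative:]

```c
#include <stdint.h>

/* Bit offset of the highest set bit, counted from the MSB down
   (BFFFO-style: 0 means bit 31 is set).  x must be nonzero. */
static int bit_offset_of_msb(uint32_t x) {
    int off = 0;
    if (!(x & 0xFFFF0000u)) { off += 16; x <<= 16; }  /* chop: top half? */
    if (!(x & 0xFF000000u)) { off += 8;  x <<= 8;  }
    if (!(x & 0xF0000000u)) { off += 4;  x <<= 4;  }
    if (!(x & 0xC0000000u)) { off += 2;  x <<= 2;  }
    if (!(x & 0x80000000u)) { off += 1; }
    return off;
}
```

Five test-and-shift steps, so roughly 10-15 simple instructions on a
RISC: in the same neighborhood as the 68020's 18-cycle microcoded
BFFFO, which is Firth's point.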
tim@crackle.amd.com (Tim Olson) (11/03/88)
In article <7575@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
| This is implemented in hardware on the MC68020 by the 'BFFFO' instruction.
| It takes 18 cycles to find the first bit in a 32-bit register, and up to
| 32 cycles to find the first bit in a 32-bit bitfield in memory.

"Hardware"? Well, microcode anyway. If it were implemented in
hardware, it could be done in a single cycle, as it is on the Am29000
with the CLZ (Count Leading Zeros) instruction.
--
	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)
phil@diablo.amd.com (Phil Ngai) (11/03/88)
In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There are many other operations which are cheap in hardware and expensive in
>software...I will toss in a few for starters.
>	Find the distance to the next one in a bit stream FAST.

The Am29000 has a CLZ instruction which runs in one clock cycle (33 ns):

Operation:	Determine number of leading zeros in a (32-bit) word.

Description:	A count of the number of zero-bits to the first one-bit
		in the SRCB operand is placed into the DEST location.
		If the MSB of the SRCB operand is 1, the resulting
		count is zero. If the SRCB operand is zero, the
		resulting count is 32.

"In the West, to waste water is not to consume it, to let it flow
unimpeded and undiverted down rivers. Use of water is, by definition,
beneficial use." (from _Cadillac Desert_)

Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com
ok@quintus.uucp (Richard A. O'Keefe) (11/03/88)
In article <19811@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
[Talking about integer multiplication and division.]
>I'm going further than that. I'm saying they are rare because they are
>unnecessary. They are rare because in the USUAL case they can be strength
>reduced to additions by an optimizing compiler. This is faster than using
>the obvious multiply instruction.

Did you notice the implicit assumption that multiplications are only
for address calculations? Avoidable multiplications are rare because a
generation of programmers has been brainwashed that Hardware Rules,
and that if some potentially useful operation is expensive it is their
job to avoid it, rather than have the hardware and compiler people get
it right. People are still avoiding procedure calls (and RISC
designers are assuming that procedure calls are not deeply nested)
because old designs made procedure calls expensive.

The one which is really painful is division. When one codes up a hash
table, one knows (having read the literature) that remainder with a
prime is a Good Thing. But one also knows that whizzbang machine X has
no hardware support for division, so to avoid a subroutine call to a
routine not known for its speed, one sighs, puts in X & 4095 (instead
of X % 4097 or whatever), and wishes...

But to be realistic about this, let's compare a couple of CISCs with
what a good RISC might do. The issue is not absolute speed, but the
intensity of the temptation to distort your code to avoid a function
perceived as expensive. I measure this as cost/(cost of ADD).

		MULS.L/ADD.L	DIVS.L/ADD.L
    MC68020	~20		~45
    80386	~5-10		~20
    generic	~18		~35
    R2000	14		35
    WBMX	~44		~150
    88k		4		39 + possible trap

MC68020 figures from manufacturer's manual.
80386 figures from manufacturer's manual.
generic figures assume a 2-bit-at-a-time multiply step, 1-b.a.a.t.
divide step.
WBMX multiply from Whizzbang Ltd's manual, divide _estimated_ from the
manual; figures include procedure-call overhead. WBMX has no divide
step.
88k figures from article <4759@pdn.UUCP> (Alan Lovejoy).
R2000 figures from article <7472@winchester.mips.COM> (Charlie Price)
(1 for main op, + delay time 12 or 33, + 1 to pick up the result).

The 80386 and 88k multiplies deliver 32 bits, the others 64. The R2000
figures are worst case: other integer operations can be overlapped
with all but two of these cycles. It would be interesting to know how
often this pays off.

The bottom line is that architectures should _support_ the operations
programmers find useful, but some architects have shown that good
enough support can be had by doing part of an operation in hardware,
part in software. Too bad about Whizzbang.
yuval@taux02.UUCP (Gideon Yuval) (11/03/88)
If you want to find the first (bottom) "1" bit in a machine word X, try:

	table[ (X xor (X-1)) mod 37 ]

Proof is left as an exercise for the reader, as is construction of the
array "table". Thus, any machine that can implement "C" can do "find
first bit" with reasonable efficiency.

N.B. On a 16-bit CPU, replace 37 by 19.

(I replaced "^" by "xor", and "%" by "mod", to avoid illegibility.)
--
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
TWX: 33691, fax: +972-52-558322
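[Editorial note: Yuval's formula, spelled out. If the lowest set bit
of X is bit k, then X xor (X-1) is a run of k+1 one-bits, and because
2 is a primitive root of 37 those 32 possible values are all distinct
mod 37, so a 37-entry table recovers k. A sketch that builds the
table rather than leaving it as the exercise:]

```c
#include <stdint.h>

/* Index of the lowest set bit of x (x must be nonzero). */
static int lowest_bit_mod37(uint32_t x) {
    static int table[37];
    static int ready = 0;
    if (!ready) {
        /* For each bit k, record where (2^(k+1) - 1) mod 37 lands. */
        for (int k = 0; k < 32; k++) {
            uint32_t bit = 1u << k;
            table[(bit ^ (bit - 1)) % 37] = k;
        }
        ready = 1;
    }
    return table[(x ^ (x - 1)) % 37];
}
```

One subtract, one xor, one remainder, one load: cheap if, as O'Keefe
notes in the follow-up, your machine has a fast modulo.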
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (11/03/88)
In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: >There are many other operations which are cheap in hardware and expensive in >software. Maybe we should have a "competition" to see how many operations >we can come up with for which silicon can do a quick, efficient, accurate job, >but which are clumsy and expensive in software. > >I will toss in a few for starters. > > Find the distance to the next one in a bit stream FAST. It would >be good to have an exception handler if one is not found. I am considering >algorithms not worth implementing if the operation is slow. I once had a program that needed the above algorithm (I think we ended up using a floating point normalize, which was still fairly slow). In the same program, one of the algorithms that we really wanted was Find all legal moves for an Othello board. If you have 1 72-bit register, this can be done in roughly 100 instructions. With dedicated silicon, I think you can implement this with something like 1024 gates. Depending on the exact implementation of the silicon, using the dedicated silicon would probably take about 10 instructions. If that instruction is too high level, there are interesting things one can do if a few instructions existed for permuting the bits in a register. For example: pretend that the register holds an 8x8 board. Now, rotate the board 90 degrees. Or reflect the board about an axis. Or, permute the board so that the diagonals through the board are laid end to end... -- Chuck
cik@l.cc.purdue.edu (Herman Rubin) (11/03/88)
In article <7575@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes: > In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: > >There are many other operations which are cheap in hardware and expensive in > >software...I will toss in a few for starters. > > > Find the distance to the next one in a bit stream FAST. It would > >be good to have an exception handler if one is not found. I am considering > >algorithms not worth implementing if the operation is slow. > > This is implemented in hardware on the MC68020 by the 'BFFFO' instruction. > It takes 18 cycles to find the first bit in a 32-bit register, and upto > 32 cycles to find the first bit in a 32-bit bitfield in memory. > > I find it hard to believe that a software algorithm on a RISC machine > would do worse. In the register case, for instance, a simple binary > chop takes at most 5 steps, and each step is just (and-immediate-with-mask; > branch if result zero). One can get more cunning, but that gives you > the answer in about 12 cycles. For the purposes I had in mind, this is painful. I am thinking of what I call "infinite precision" methods of generating random variables (A technical report is available on request.); these procedures have no errors due to accuracy of computer operations. For generating exponential random variables, there are several methods which use ~5 BITS on the average, and then only use Boolean operations to generate the final result. Finding the next 1 counts as 2 bits. However, for such a procedure to be practical, it must compete with procedures which use about 10 operations, for one the slowest being two table lookups and a floating-point test. These competing procedures are obviously more complex computationally. However, the architecture makes them much cheaper. Also, since random numbers of bits are used, one has to worry about where the current bit pointer is and that the current word may become exhausted. 
It is impossible to modify any good algorithm for generating non-uniform
random numbers to avoid uncertainties of this nature.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054   hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
henry@utzoo.uucp (Henry Spencer) (11/04/88)
In article <611@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>But it _has_ been tossed out of many designs: the IBM RT PC (at any
>rate the old ones) had no floating-point instructions, but used a
>National chip (32081?).

I.e., it had floating-point hardware. Why do you say that floating
point was "tossed out" of its design??

>The AMD 29000 manual says that floating-point
>ops are currently extracodes emulated by a trap handler...

Which may well use the 29xyz (I don't remember the number offhand) chip
to do all the real work. Again, it hasn't been tossed out, just moved
around a bit.

>And of course
>there are all the 680x0, 320xx, 80*86 and so on, with floating-point
>done by a co-processor.

So? On most of those chips, the software can't tell the difference.
Does the FP hardware have to be on the CPU chip to be "real" somehow?
Why?
--
The Earth is our mother.   | Henry Spencer at U of Toronto Zoology
Our nine months are up.    | uunet!attcan!utzoo!henry  henry@zoo.toronto.edu
ok@quintus.uucp (Richard A. O'Keefe) (11/04/88)
In article <232@taux02.UUCP> yuval@taux02.UUCP (Gideon Yuval) writes:
[that finding the first bit in a machine word requires NO special hardware
-- his emphasis. His formula, for a 32-bit machine, is]
> table[(X xor (X-1)) mod 37]

But this *DOES* require a fast modulo-37 operator!
greg@sce.carleton.ca (Greg Franks) (11/05/88)
In article <19811@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
#>In article <1002@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
#(in response to me about integer multiplies)
#>They are rare because a good programmer knows that they are slow and
#>difficult to program. If an operation takes 10 lines of code, each of
#>which is expanded into one or more hardware instructions, why use it
#>unless you cannot find a way around it? The way around it may be clumsy,
#>but it is very likely to beat the unavailable operation.
#
#I'm going further than that. I'm saying they are rare because they are
#unnecessary. They are rare because in the USUAL case they can be strength
#reduced to additions by an optimizing compiler. This is faster than using
#the obvious multiply instruction.
#
#Integer numeric intensive applications are relatively RARE. You
#(Herman Rubin personally) may do them all the time, in which case you
#don't think they are rare, but as a percentage of programs run by all
#computers, they are.

Integer multiplies are VERY POPULAR, if you look in the right place. TI
(TMS320) and Motorola (the NeXT machine) and many others are making lots
of money selling digital signal processing chips. These chips perform
FFTs and all sorts of other operations which do many MULTIPLY-and-ADD
steps very quickly. However, many chip makers are now going to floating
point in their DSPs, so the lowly integer multiply may once again become
somewhat useless.
--
Greg Franks                    Systems and Computer Engineering,
utzoo!dciem!nrcaer!sce!greg    Carleton University, Ottawa, Ontario, Canada.
greg@sce.carleton.ca           ACME Electric: When you can't find your shorts, call us!
cik@l.cc.purdue.edu (Herman Rubin) (11/05/88)
In article <232@taux02.UUCP>, yuval@taux02.UUCP (Gideon Yuval) writes:
>
> If you want to find the first (bottom) "1" bit in a machine word X, try:
>
> table [ (X xor (X-1)) mod 37 ]

The question I asked is the spacing between BITS in a BIT stream. It is
possible that the best way on a particular machine is to use words, but
not here. What is wanted is to get the distance to the next bit in the
stream on demand (this can be vectorized) and then pass over to the next
location for the next step. Once a particular bit in the stream is read,
it cannot be used for anything else. A similar problem is a conditional
transfer on the next bit. The whole process is bit-oriented, and some
extraordinary procedure must be used to use larger units.

There is also the problem of the bit stream becoming empty and needing
refilling. In some cases this problem can be minimized, even to the point
that it can sometimes be ignored. But it cannot be eliminated.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054   hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
henry@utzoo.uucp (Henry Spencer) (11/05/88)
In article <75772@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>... I have not done a head count, but many machines
>have sqrt (8087, 6888x, vax fpa, ibm fpa, univac).

In fact, if I recall correctly, you cannot have an IEEE-conforming
floating-point implementation without it.
--
The Earth is our mother.   | Henry Spencer at U of Toronto Zoology
Our nine months are up.    | uunet!attcan!utzoo!henry  henry@zoo.toronto.edu
ok@quintus.uucp (Richard A. O'Keefe) (11/05/88)
In article <1988Nov3.185535.28850@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <611@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>>But it _has_ been tossed out of many designs: the IBM RT PC (at any
>>rate the old ones) had no floating-point instructions, but used a
>>National chip (32081?).
>I.e., it had floating-point hardware. Why do you say that floating
>point was "tossed out" of its design??

I was responding to someone else who said that floating-point ops _ought_
to be tossed out, and by his phrasing appeared to be implying that this
was a prospect for improvement which RISC designers had so far ignored.
"Tossed out" was _his_ phrase, not mine.

As for whether the RT PC "had floating-point hardware", the floating-
point chip was not accessed as a co-processor, but as a device, and fast
it was not. By that criterion, an 8008 talking down an RS232 line to a
/370 to get floating-point done could be regarded as "having floating-
point hardware". The RT's instruction set has _no_ floating-point ops,
not even co-processor ones.

>>The AMD 29000 manual says that floating-point
>>ops are currently extracodes emulated by a trap handler...
>Which may well use the 29xyz (don't remember the number offhand) chip
>to do all the real work. Again, it hasn't been tossed out, just moved
>around a bit.

Again, your reply misses the point of my message. I was responding to
someone who claimed that RISC designers _ought_ to be throwing floating-
point calculation out of the CPU chip, and I was merely pointing out that
this had already been done. (My copy of the AMD 29000 manual does not
mention a 29xyz, but says FP is currently done in software. If you do
recall the part number and what manual to ask for, please let me know.)

>So? On most of those chips, the software can't tell the difference.
>Does the FP hardware have to be on the CPU chip to be "real" somehow?
>Why?

This is a gross distortion of my position.
You will search my former posting in vain for any suggestion that offboarding floating point is _bad_, only that it has already been _done_.
cjosta@taux01.UUCP (Jonathan Sweedler) (11/06/88)
In article <1988Nov4.194026.22806@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <75772@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
>>... I have not done a head count, but many machines
>>have sqrt (8087, 6888x, vax fpa, ibm fpa, univac).
>
>In fact, if I recall correctly, you cannot have an IEEE-conforming
>floating-point implementation without it.

Yes, but IEEE conformity may be accomplished in hardware or software. A
*system* conforms to the IEEE standard, not necessarily the processor.
--
Jonathan Sweedler === National Semiconductor Israel
UUCP:   ...!{amdahl,hplabs,decwrl}!nsc!taux01!cjosta
Domain: cjosta@taux01.nsc.com
mac3n@babbage.acc.virginia.edu (Alex Colvin) (11/09/88)
In article <475@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes:
> In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> > Find the distance to the next one in a bit stream FAST.
> ended up using a floating point normalize, which was still fairly slow).

In 1972 in "Structured Programming", C.A.R. Hoare suggests using (floating)
standardisation to locate elements in a powerset representation. This
operation seems to have been lost in Pascal.

Charles Simmons may remember the GE635's GTB (Gray to Binary) instruction,
which executes a long chain of XORs, used to convert radar data to binary.
We had real trouble coming up with another use for this. Maybe hypercube
routing?
randys@omews2.intel.com (Randy Steck) (11/11/88)
In article <10774@cup.portal.com> PLS@cup.portal.com (Paul L Schauble) writes:
>> Find the distance to the next 1 in a bit stream FAST
>
>Scan for the first non-zero word. Load word and do a floating normalize.
>Look at the exponent.
>
>  ++PLS

I'm afraid that it isn't quite this simple. Leaving aside the complexities
of what FP format you load it into, you first of all have to determine
what exponent you are going to use. Zero exponents are a special case.
You have to preset the exponent, shift and mask the bit stream, do the
normalization, shift and mask to get the exponent value out, and unbias it
relative to the original input. Hmmmmmm. Sounds messy to me.

However, this is the right idea. On the 80960, we used the floating-point
normalizer, without all the checks on the values, to implement the
scan_bit operation (which finds the first set bit in a word). The thing
that saved us here was independent access to the exponent values and fast
transfer and operation on bit values.

Randy Steck
Intel Corp.