JBS@IBM.COM (02/21/91)
Herman Rubin says:

  Do you mean to tell me that computations involving both integers and
  floating-point numbers are not important? Or that dividing floats by
  floats, obtaining an integer quotient and a floating remainder,
  likewise? That particular step is the first step of any trigonometric
  or exponential function computation when it is not known in advance
  that the argument is small. There are other periodic and related
  functions for which this is useful. It would also speed up
  interpolation, etc.

Since argument reduction for the trigonometric functions is done in
practice by multiplying by 1/pi I do not see how it would benefit from
your proposed instruction. I believe this instruction would have little
general utility.

An efficient way to do the fortran "x=dint(y)" is important. This is
lacking on the Risc System 6000 as was pointed out in a Los Alamos
report. By the way is it true that it is impossible to reasonably
express this in ADA?

Peter Montgomery says:

  Yes, most programs are written in these languages. As Dan says, the
  language designers made mistakes. During one review period for
  Fortran 90, I requested an operation which takes four nonnegative
  integers a, b, c, n with a < n or b < n (c is optional and defaults
  to 0). The requested operation returns q and/or r, where

        a*b + c = q*n + r   and   0 <= r < n

Why didn't you ask for an integer*8 data type? Wouldn't that give you
what you want and have much more general utility?

I would like an instruction which counts the number of ones in the
binary representation of an integer but I find it hard to argue that
this would be widely used.
                               James B. Shearer
dik@cwi.nl (Dik T. Winter) (02/21/91)
In article <9102210042.AA12291@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
 >> Herman Rubin says:
 >> Or that dividing floats by floats, obtaining an integer quotient and
 >> a floating remainder, likewise? That particular step is the first
 >> step of any trigonometric or exponential function computation when
 >> it is not known in advance that the argument is small.
 > Since argument reduction for the trigonometric functions is done in
 > practice by multiplying by 1/pi I do not see how it would benefit
 > from your proposed instruction.
You must have a very limited experience. You do it only by multiplying
by 1/pi if you are willing to forego a lot of precision. (Check your
favorite hand-held calculator. If sin(pi) = 0, the implementation is
wrong!)
 >> Peter Montgomery says:
 >> I requested an operation which takes four nonnegative integers
 >> a, b, c, n with a < n or b < n (c is optional and defaults to 0).
 >> The requested operation returns q and/or r, where
 >>       a*b + c = q*n + r   and   0 <= r < n
 > Why didn't you ask for an integer*8 data type? Wouldn't that give
 > you what you want and have much more general utility?
No. That would require that the hardware has some support for 64 bit
ints. What Peter Montgomery asks is: given three ints (a, b, c),
multiply a and b together to a double size int; add c to it; calculate
quotient and remainder when dividing by n. Most 32 bit processors
provide the instructions to do just that. Asking for an integer*8 data
type would be asking for a 64*64 multiply which would then be used, and
would be much slower.
 > I would like an instruction which counts the number of ones in the
 > binary representation of an integer but I find it hard to argue that
 > this would be widely used.
Ask Seymour Cray why that instruction was put in the Cray-1 as an
afterthought.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/21/91)
In article <9102210042.AA12291@ucbvax.Berkeley.EDU>, JBS@IBM.COM writes:
> > Herman Rubin says:
> > Do you mean to tell me that computations involving both integers
> > and floating-point numbers are not important? Or that dividing
> > floats by floats, obtaining an integer quotient and a floating
> > remainder, likewise? That particular step is the first step of any
> > trigonometric or exponential function computation when it is not
> > known in advance that the argument is small. There are other
> > periodic and related functions for which this is useful. It would
> > also speed up interpolation, etc.
>
> Since argument reduction for the trigonometric functions is done in
> practice by multiplying by 1/pi I do not see how it would benefit
> from your proposed instruction. I believe this instruction would
> have little general utility.

There is somewhat greater loss of accuracy in this, and it is still
needed to extract the integer part to an integer register and the
fractional part. Thus, it needs at least three operations, instead of
one. Also, if one is calculating something like x - ln(1+x), a natural
operation in certain problems, the computing problems become a little
larger than one would expect. In fact, to avoid a loss of accuracy in
some quite usual situations, it would even be a good idea to have
Boolean operations on floats, and unnormalized floating arithmetic.
Floating arithmetic, apart from some preliminaries, and normalization
problems, is exactly the same as integer, so why should there even be a
separate arithmetic unit?

> An efficient way to do the fortran "x=dint(y)" is important. This is
> lacking on the Risc System 6000 as was pointed out in a Los Alamos
> report. By the way is it true that it is impossible to reasonably
> express this in ADA?
> Peter Montgomery says:
> Yes, most programs are written in these languages. As Dan says, the
> language designers made mistakes. During one review period for
> Fortran 90, I requested an operation which takes four nonnegative
> integers a, b, c, n with a < n or b < n (c is optional and defaults
> to 0). The requested operation returns q and/or r, where
>
>       a*b + c = q*n + r   and   0 <= r < n
>
> Why didn't you ask for an integer*8 data type? Wouldn't that give
> you what you want and have much more general utility?

Yes and no. It would have much more general utility, but it would do an
abysmally inefficient job in this situation. You would need to have a
way of indicating that the product of two integer*4s is an integer*8,
which I do not know how to do in any language with which I am familiar
without writing it as a function, and I do not think that one should
have to write mult48(a,b) instead of a*b. In addition, how would you
write the operation which, when applied to a*b+c and n, yields both q
and r? Is the reason it is difficult to get double length
multiplication, and division with quotient and remainder, that the
language designers left these out?

As far as adding this type, why not allow the user to add any type?
This was present for up to three additional types in Fortran VI(?) on
the CDC 3600.

> I would like an instruction which counts the number of ones in the
> binary representation of an integer but I find it hard to argue that
> this would be widely used.

Cray seems to find this one important enough to include on his
machines, which have a real paucity of instructions.

I think you will find that those who want to add "bizarre" instructions
in both hardware and software to do what the present stuff does not do
well understand the problems of both, and have some idea of what is
feasible at reasonable cost.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (02/22/91)
In article <2998@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
favorite hand-held calculator. If sin(pi) = 0, the implementation is
wrong!)
No. Since that is what people expect, many library writers go to
special pains to make sure that this case works out "right".
Historically, Suns (for example) provided user selectable pi precision
(see the Numerical Computation Guide for details) .... this hasn't
proved to be very useful because few know or seem to care; but has
caused a bit of grief once or twice.
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
new@ee.udel.edu (Darren New) (02/22/91)
How about this? A language in which it is possible to write functions
in assembler and have them inlined automatically, and to have the
compiler smart enough to do dataflow analysis on the resultant code and
such. Possibly some syntax close to what PCC uses could be used inside
such functions. I would imagine that GCC is close to capable of doing
this if it isn't already. The minor problem of efficient multi-variable
returns might need to be worked out.

Then, the only stumbling block would be the flexible syntax for
procedure calls, which could be addressed with a preprocessor rather
than an entire language.

Am I missing something here, or is this problem really as easy to solve
as it seems?           -- Darren
--
--- Darren New --- Grad Student --- CIS --- Univ. of Delaware ---
----- Network Protocols, Graphics, Programming Languages,
Formal Description Techniques (esp. Estelle), Coffee, Amigas -----
=+=+=+ Let GROPE be an N-tuple where ... +=+=+=
JBS@IBM.COM (02/22/91)
Regarding my comment:
> Since argument reduction for the trigonometric functions is done in
> practice by multiplying by 1/pi I do not see how it would benefit
> from your proposed instruction.
Dik Winter says:
> You must have a very limited experience. You do it only by
> multiplying by 1/pi if you are willing to forego a lot of precision.
> (Check your favorite hand-held calculator. If sin(pi) = 0, the
> implementation is wrong!)

I don't understand this comment. This problem can occur with the divide
with remainder instruction as well. The problem is that approximating
x mod pi by x mod y, where y is the machine representation of pi, may
lose all accuracy.

Herman Rubin says:
> There is somewhat greater loss of accuracy in this, and it is still
> needed to extract the integer part to an integer register and the
> fractional part. Thus, it needs at least three operations, instead
> of one. Also, if one is calculating something like x - ln(1+x), a
> natural operation in certain problems, the computing problems become
> a little larger than one would expect. In fact, to avoid a loss of
> accuracy in some quite usual situations, it would even be a good
> idea to have Boolean operations on floats, and unnormalized floating
> arithmetic. Floating arithmetic, apart from some preliminaries, and
> normalization problems, is exactly the same as integer, so why
> should there be a separate arithmetic unit even?

I don't understand the comments about loss of accuracy. As noted above
some or all of the bits of the remainder will be bogus because pi is
not a machine number. Accurate argument reduction requires first
finding the integer part, n, of the quotient then computing x-n*pi
carefully using pi to more than machine precision. It is more
convenient to do this if n is in floating format. Counting operations
is a poor test of speed when floating divide typically takes 5 times or
more as long as floating multiply.

I don't understand the reference to x-ln(1+x). This will always lose
accuracy if evaluated in this form for x near 0. It is usually possible
to perform boolean operations on floats although it may be a little
awkward. I presume the reason for a separate floating point unit is
that this leads to faster machines.

Regarding my suggestion for an integer*8 type Herman Rubin comments:
> Yes and no. It would have much more general utility, but it would do
> an abysmally inefficient job in this situation. You would need to
> have a way of indicating that the product of two integer*4s is an
> integer*8, which I do not know how to do in any language with which
> I am familiar without writing it as a function, and I do not think
> that one should have to write mult48(a,b) instead of a*b. In
> addition, how would you write the operation which, when applied to
> a*b+c and n, yields both q and r?

Regarding a function to get the 8 byte product of 2 4 byte integers I
don't see why this is any worse than Montgomery's function. In any case
it is not needed. Let i4,j4 be 4 byte and i8,j8,k8 be 8 byte. Then
write
      i8=i4
      j8=j4
      k8=i8*j8
A sufficiently intelligent compiler will do the right thing. It may
work to write
      k8=i4*j4
This sometimes works in the analogous case where k8 is 4 bytes and i4
and j4 are 2 bytes. However I am not sure strictly speaking that it
should. What does the Fortran standard say?

As for getting both q and r this is easy, just write
      iq=ia/ib
      ir=mod(ia,ib)
A reasonable compiler will only generate 1 divide with remainder. I
will confess however that while this works when everything is 4 bytes I
don't quite see how to make it work when ia is 8 bytes (since it seems
unreasonable to define the quotient of an 8 byte integer by a 4 byte
integer to be 4 bytes, and if it is defined to be 8 bytes it is then
unsafe to use the usual 8 byte by 4 byte giving 4 byte quotient
instruction).
                               James B. Shearer
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/22/91)
new@ee.udel.edu (Darren New) wrote:
>How about this? A language in which it is possible to write functions
>in assembler and have them inlined automatically, and to have the
>compiler smart enough to do dataflow analysis on the resultant code and
>such. Possibly some syntax close to what PCC uses could be used inside
>such functions. I would imagine that GCC is close to capable of doing
>this if it isn't already. The minor problem of efficient multi-variable
>returns might need to be worked out.

GCC is already. I saw a 68881 floating-point include file for gcc that
was nothing more than

static inline const double sin(register double x)
{
	register double ans;

	asm("fsin %1,%0" : "=f" (ans) : "f" (x));
	return ans;
}

The syntax is a bit cryptic to express everything you can tell the
optimizer about, but basically, the first string is a pattern, and the
colon-separated parts are output and input operands (you can add
another section for clobbered registers). "=f" means it needs an "f"
type operand (which the machine description uses for floating point
args) and it's assigned to. "f" means it's not assigned to.

Gcc will merge your code in with the caller, hoist sin() out of loops
it's invariant in (const in the function type says you can do that with
this function, although once it's inlined, gcc can see that for
itself), and generally behave as if sin() was an intrinsic rather than
a library call. You can also express the idea that a given input is not
read until after the outputs are written, that an input and output have
to be in the same place, and other things.

>Then, the only stumbling block would be the flexible syntax for
>procedure calls, which could be addressed with a preprocessor rather
>than an entire language.
>
>Am I missing something here, or is this problem really as easy to
>solve as it seems? -- Darren

Well, the syntax is clumsy for really large blocks, but it works well
for dropping, for example, an atomic test-and-set or compare-and-swap
into some C code. I believe the reason it isn't more commonly done is
that few compilers have the strong separation gcc does between the
compiler and the machine description, so you can't easily add a
description of an instruction at run time and have the optimizer work
with it smoothly.
--
	-Colin
khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (02/23/91)
In article <15485@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
approximation you use). Any user who needs trig functions should be
_expected_ to know that PI is not a rational number and that floating
Tell that to a spreadsheet user. It is impossible to legislate what
people _will_ expect. We can pontificate on what they should, but
unless we are working in the education field we can't expect it work.
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/23/91)
In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> > I would like an instruction which counts the number of ones in
> > the binary representation of an integer but I find it hard to argue
> > that this would be widely used.
> Ask Seymour Cray why that instruction was put in the Cray-1 as an
> afterthought.

I haven't seen Seymour posting here, and I don't often meet him. Now
that I am suitably impressed, would you share with us what he would
tell us if we did ask him?

> --
> dik t. winter, cwi, amsterdam, nederland
> dik@cwi.nl

dan herrick
herrickd@iccgcc.decnet.ab.com
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/23/91)
new@ee.udel.edu (Darren New) wrote:
>How about this? A language in which it is possible to write functions
>in assembler and have them inlined automatically, and to have the
>compiler smart enough to do dataflow analysis on the resultant code and
>such. Possibly some syntax close to what PCC uses could be used inside
>such functions. I would imagine that GCC is close to capable of doing
>this if it isn't already. The minor problem of efficient multi-variable
>returns might need to be worked out.

Here's a (sort of) working example using gcc 1.39 on a uVAX. "g" is
the "general" operand class that covers all the VAX addressing modes.

int main()
{
	int i, j, k, l;

	i = 12;
	for (k = 0; k < 10; k++) {
		asm("foo %1,%0": "=g" (j) : "g" (i));
		asm("bar %1,%0": "=g" (l) : "g" (i));
	}
	printf("%d\n", j);
	return 0;
}

Note that foo is loop-invariant, while bar is dead.
gcc -O -fstrength-reduce produces:

#NO_APP
gcc_compiled.:
.text
LC0:
	.ascii "%d\12\0"
	.align 1
.globl _main
_main:
	.word 0x0
#APP
	foo $12,r1
#NO_APP
	movl $9,r0
L4:
	decl r0
	jgeq L4
	pushl r1
	pushab LC0
	calls $2,_printf
	clrl r0
	ret

I'm slightly annoyed by gcc's failure to delete the empty loop.
Testing reveals it fails to delete even trivially empty loops. At
least it's not a common case.

Interestingly, if we make bar depend on the loop index k, it's still
dead code, and still deleted, but it inhibits the loop optimisation to
count-down form. I guess it's the order in which optimisations are
applied. The code is deleted after the loop is generated in the
count-up form:

int main()
{
	int i, j, k, l;

	i = 12;
	for (k = 0; k < 10; k++) {
		asm("foo %1,%0": "=g" (j) : "g" (i));
		asm("bar %1,%0": "=g" (l) : "g" (k));
	}
	printf("%d\n", j);
	return 0;
}

#NO_APP
gcc_compiled.:
.text
LC0:
	.ascii "%d\12\0"
	.align 1
.globl _main
_main:
	.word 0x0
	clrl r0
#APP
	foo $12,r1
#NO_APP
L4:
	incl r0
	cmpl r0,$9
	jleq L4
	pushl r1
	pushab LC0
	calls $2,_printf
	clrl r0
	ret

Oh, well, 2.0 is coming, and there has to be something to fix...
--
	-Colin
sef@kithrup.COM (Sean Eric Fagan) (02/23/91)
In article <3386.27c55e18@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>I haven't seen Seymour posting here, and I don't often meet him.
>Now that I am suitably impressed, would you share with us what
>he would tell us if we did ask him?

Pop count. Count the set bits in a word. CXi on the CDC Cyber series.
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
glew@pdx007.intel.com (Andy Glew) (02/23/91)
> > > I would like an instruction which counts the number of ones in
> > > the binary representation of an integer but I find it hard to
> > > argue that this would be widely used.
> > Ask Seymour Cray why that instruction was put in the Cray-1 as an
> > afterthought.
>
> I haven't seen Seymour posting here, and I don't often meet him.
> Now that I am suitably impressed, would you share with us what
> he would tell us if we did ask him?

I don't know about Cray, but I'm told that POP count was put in the
Gould machines (which tended to be used for things like telemetry, i.e.
listening to and decoding Russian phone calls and missile launches) for
the security boys.
--
Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124-6497
vhs@britesun.radig.de (Volker Herminghaus-Shirai) (02/24/91)
JBS@IBM.COM writes:
> I would like an instruction which counts the number of ones in
> the binary representation of an integer but I find it hard to argue
> that this would be widely used.

It does seem to be widely used indeed; as far as I know the inmos
transputer T800 has this instruction, along with other instructions
like inverting the sequence of the bits in an int etc. I think those
are used for CRC calculation and picture processing, though I'm not
sure about that.
--
Volker Herminghaus-Shirai (vhs@britesun.radig.de)
panic: 80x86! Trying to vomit...
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/24/91)
In article <6503@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>I think you will find that those who want to add "bizarre" instructions in
>both hardware and software to do what the present stuff does not do well
>understand the problems of both, and have some idea of what is feasible
>at reasonable cost.

In fact, the old literature (and old machines!) are full of "bizarre"
instructions that proved ill-advised. A few examples:

- "uninterruptible" instructions which complicated the memory interface
  to the point of slowing down normal accesses.

- addressing modes which studies could not find a single use of.
  (PDP-11 autodecrement deferred - omitted from the VAX.)

- my personal favorite, the Star-100 instruction which took Fortran
  source (a vector of characters), and returned a vector with: comments
  and sequence numbers stripped; continuation lines catenated; and
  non-significant blanks elided. This was done because of the
  then-famous observation that the XPL compiler spent more time in
  "get next character" than in any other routine. The drawbacks,
  however, are serious. How to add new compiler directives? How to map
  line numbers back to the original line numbers, so as to have decent
  error messages? And most important, why bother, since modern
  compilers now spend their time elsewhere.

- The Symbol machine, which put the entire compiler in hardware. Do
  not be fooled by all the laudatory retrospective articles. They were
  written by the machine's designers, and the machine's gross failure
  is underdescribed in the literature.

I collect these little bird droppings of history. Any other favorites
(that hold lessons)?
--
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
pmk@craycos.com (Peter Klausler) (02/24/91)
In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> > I would like an instruction which counts the number of ones in
> > the binary representation of an integer but I find it hard to argue
> > that this would be widely used.
> Ask Seymour Cray why that instruction was put in the Cray-1 as an
> afterthought.

Almost right. A scalar population count instruction was not new to the
CRAY-1, since the CDC 6600 and descendants all supported it ("CXi Xk",
opcode 47). What was missing on some CRAY-1As and CRAY-1Bs was a
vector population count instruction. The S and M models do support
that. The CRAY-1 (and X-MP/Y-MP follow-ons) *is* missing a vector
leading zero count instruction, though it's there in scalar form. You
need a CRAY-2 for the vector version. Both operations are quite useful,
as has been noted in comp.arch often in the past.
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/25/91)
In article <9102220245.AA14853@ucbvax.Berkeley.EDU>, JBS@IBM.COM writes:
	................
> Regarding a function to get the 8 byte product of 2 4 byte integers
> I don't see why this is any worse than Montgomery's function. In any
> case it is not needed. Let i4,j4 be 4 byte and i8,j8,k8 be 8 byte.
> Then write
>       i8=i4
>       j8=j4
>       k8=i8*j8
> A sufficiently intelligent compiler will do the right thing. It may
> work to write
>       k8=i4*j4
> This sometimes works in the analogous case where k8 is 4 bytes and
> i4 and j4 are 2 bytes. However I am not sure strictly speaking that
> it should. What does the Fortran standard say?

The first approach would require a very intelligent compiler, indeed.
The second approach is the "right" one. However, every standard I have
read would try the first. Since i8*j8 might not even exist on the
machine, at best an inlined function would have to be brought in
instead of the single hardware instruction for k8=i4*j4. I have never
seen any language description in which the type of the result in a
replacement statement affected what operation is performed. The closest
I know is the C cast, which may or may not work: k8 = (int8)i4*j4.

> As for getting both q and r this is easy, just write
>       iq=ia/ib
>       ir=mod(ia,ib)
> A reasonable compiler will only generate 1 divide with remainder. I
> will confess however that while this works when everything is 4
> bytes I don't quite see how to make it work when ia is 8 bytes
> (since it seems unreasonable to define the quotient of an 8 byte
> integer by a 4 byte integer to be 4 bytes, and if it is defined to
> be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte
> giving 4 byte quotient instruction).
>                               James B. Shearer

I do not see why the case of ia being 8 bytes is any worse. Sure, there
is the problem of overflow, but so what?
Also, it should never be necessary to do something as complicated as
the mod(ia,ib) notation for basic operators, or even not so basic ones,
any more than it should be necessary to write add(a,b) for a+b. But
what is really needed in notation is something like iq,ir = ia/ib.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
peter@ficc.ferranti.com (Peter da Silva) (02/25/91)
In article <12071@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> - addressing modes which studies could not find a single use of.
>   (PDP-11 autodecrement deferred - omitted from the VAX.)

Oh, that's not nearly as exciting as autodecrementing the PC.
(Immediate and immediate indirect mode were implemented as
autoincrement and autoincrement deferred on the PC.)

	TST -(PC)

(On the other hand, I would expect that autodecrement deferred would
have cost some to leave out, given the PDP-11 instruction coding.)
--
Peter da Silva.  `-_-'  peter@ferranti.com +1 713 274 5180.
                 'U`  "Have you hugged your wolf today?"
augustss@cs.chalmers.se (Lennart Augustsson) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes:
>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
>
>stuff deleted
>
>> As for getting both q and r this is easy just write
>> iq=ia/ib
>> ir=mod(ia,ib)
>>A reasonable compiler will only generate 1 divide with remainder.
>
>I have done numerical analysis work and multi-precision arithmetic on
>many different machines and operating systems.
>
>Only ONCE have I ever seen a compiler bright enough to do the above
>optimization. (Burroughs 6700; a stack machine; Algol 68)

I've seen another one! Well, it's not a compiler, it's the MIPS
assembler which recognises what you are doing. I think this was very
nice, since it meant that I didn't have to build this optimisation into
my compiler; I got it for free from the assembler.
--
Lennart Augustsson
[This signature is intentionally left blank.]
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes:
>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
>... what one usually wants is (A*B + C)/D and (A*B + C) mod D.
>Even on machines that support double length integer multiplies, one
>cannot put the above operations into HLL because the compiler will not
>generate the double length multiply (say 32 x 32 --> 64) nor will it
>then do the (64 / 32 --> 32 bit quotient & remainder). Since A*B can
>overflow 32 bits one is FORCED to call assembler routines to do this.

Ah yes. Clearly the following does not work....

/*
 * return quotient and remainder from (a*b + c) divrem d
 */
#if 0
/*
 * This is the way we would like to do it, but gcc emits one extra
 * instruction, as it is not smart enough to completely eliminate the
 * addressing on r (it uses a register for r, rather than a pointer,
 * but never quite goes all the way).
 */
static __inline int divrem(int a, int b, int c, int d, int *r)
{
	int q;
	double tmp;	/* force reg pair allocation */

	asm("emul %1,%2,%3,%0" : "=g"(tmp) : "g"(a), "g"(b), "g"(c));
	asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(*r) : "r"(tmp), "g"(d));
	return q;
}
#else
/*
 * So instead we will use rather peculiar gcc syntax.
 * Note that the macro uses a, b, c, d, q, and r exactly once each,
 * and thus side effects (*p++, etc.) are safe.
 */
#define divrem(q, r, a, b, c, d) ({ \
	double divrem_tmp; \
	asm("emul %1,%2,%3,%0" : "=g"(divrem_tmp) : \
		"g"(a), "g"(b), "g"(c)); \
	asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(r) : \
		"r"(divrem_tmp), "g"(d)); \
})
#endif

int a[100], b[100], c[100], d[100];
int q[100], r[100];

void doit(int n)
{
	int i;

	for (i = 0; i < n; i++) {
#if 0
		q[i] = divrem(a[i], b[i], c[i], d[i], &r[i]);
#else
		divrem(q[i], r[i], a[i], b[i], c[i], d[i]);
#endif
	}
}

But wait! Maybe, just *maybe*, we should try it out before dismissing
it. Well goll-ee, it seems to work!
When compiled on a Tahoe (the Tahoe is a `RISC'---a `Reused Instruction
Set Computer'; its emul and ediv are just like those on the VAX) with
`gcc -O -S' this compiles to (compiler comments and other drek
stripped):

_doit:
	.word 0x3c0
	movl 4(fp),r4
	clrl r2
	cmpl r2,r4
	jgeq L13
	movab _a,r9
	movab _b,r8
	movab _c,r7
	movab _q,r6
	movab _r,r5
	movab _d,r3
L12:
	emul (r9)[r2],(r8)[r2],(r7)[r2],r0
	ediv (r3)[r2],r0,(r6)[r2],(r5)[r2]
	incl r2
	cmpl r2,r4
	jlss L12
L13:
	ret

(Note that the Tahoe does not have auto-increment addressing modes, and
this is in fact the best that can be done.) On the VAX the loop changes
to (gcc 1.37.1, -fstrength-reduce -mgnu):

L12:
	emul (r2)+,(r3)+,(r4)+,r0	# the registers
	ediv (r5),r0,r1,r0		# are allocated in
	movl r1,(r6)+			# a different order.
	movl r0,(r7)+
	addl2 $4,r5
	jaoblss r9,r8,L12

Apparently the machine-dependent part has not been taught to combine
`ediv' properly; it should be:

L12:
	emul (r2)+,(r3)+,(r4)+,r0
	ediv (r5)+,r0,(r6)+,(r7)+
	jaoblss r9,r8,L12

A bit of work on vax.md should fix it.

This has its drawbacks: the syntax is distinctly un-pretty, and it
requires gcc, and it is machine-dependent. It does, however, work.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
bs@faron.mitre.org (Robert D. Silverman) (02/26/91)
In article <10244@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
:In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org
:(Robert D. Silverman) writes:
:>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
:>... what one usually wants is (A*B + C)/D and (A*B + C) mod D.
:>Even on machines that support double length integer multiplies, one
:>cannot put the above operations into HLL because the compiler will not
:>generate the double length multiply (say 32 x 32 --> 64) nor will it
:>then do the (64 / 32 --> 32 bit quotient & remainder). Since A*B can
:>overflow 32 bits one is FORCED to call assembler routines to do this.
:
:Ah yes. Clearly the following does not work....
:
 stuff deleted...
:
:This has its drawbacks: the syntax is distinctly un-pretty, and it
:requires gcc, and it is machine-dependent. It does, however, work.

Ah yes. A universal solution. Machine dependent. Ugly syntax. Only
works with one compiler.

Furthermore, it does nothing to contradict the assertions I made above
that the arithmetic could not be done in the HLL and that assembler
routines had to be called. It does inline the subroutine, however.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
terry@venus.sunquest.com (Terry R. Friedrichsen) (02/26/91)
Donald Lindsay writes: >In fact, the old literature (and old machines!) are full of "bizarre" >instructions that proved ill-advised. A few examples: > > [ some good examples removed] > >- addressing modes which studies could not find a single use of. > (PDP-11 autodecrement deferred - omitted from the VAX.) This example differs somewhat from his others, which were basically deliberate design decisions that are not now generally seen as the wise way to go. But the autodecrement-deferred addressing mode arose more out of symmetry (== simple instruction decoding) than from somebody saying "gee, I'll bet that would be a useful addressing mode". The pdp-11 had autoincrement/autodecrement, and it had deferred/non-deferred. The "ill-advised" combination of autodecrement deferred was just a result of symmetrical instruction-decode logic. DEC's pdp-10 instruction set had the same oddities. The machine had a plethora of no-op-equivalent instructions (some of which referenced memory and some of which did not), not because the designers thought it would be advisable to have lots of no-ops, but because the very simple, highly gate-efficient instruction decode logic took advantage of instruction-set symmetries. (I think I have notes around here that say the instruction decode for the original instruction set was done in something like 700 gates - but don't quote this; I may have the number wrong ...) The whole point is just that unused instructions or opcodes don't necessarily equate to stupidity or lack of foresight on the part of the designers. Sometimes they're really a result of cleverness :-). Terry R. Friedrichsen terry@venus.sunquest.com (Internet) uunet!sunquest!terry (Usenet) terry@sds.sdsc.edu (alternate address; I live in Tucson) Quote: "Do, or do not. There is no 'try'." - Yoda, The Empire Strikes Back
hassey@matrix.rtp.dg.com (John Hassey) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes: >> > >stuff deleted > >> As for getting both q and r this is easy just write >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. > >I have done numerical analysis work and multi-precision arithmetic on many >different machines and operating systems. > >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) > >Furthermore, what one usually wants is (A*B + C)/D and (A*B + C) mod D. >(see Peter Montgomery's requests) >Even on machines that support double length integer multiplies, one >cannot put the above operations into HLL because the compiler will not >generate the double length multiply (say 32 x 32 --> 64) nor will it >then do the (64 /32 --> 32 bit quotient & remainder). Since A*B can overflow >32 bits one is FORCED to call assembler routines to do this. Calling an >assembler routine for simple arithmetic is EXPENSIVE because of subroutine >call overhead. > > I disagree ! Below is the output produced by gcc on a 386. It looks like gcc is able to handle both of these cases. It does cheat a bit: in the multiply it uses only the lower 32 bits of the result (even though the machine produces 64). This would be a fairly simple problem to fix. 
===== sample 1 =====

int i,j,k,l;

main()
{
	i = k/l;
	j = k%l;
}

	.file	"showem.c"
gcc_compiled.:
.text
	.align 4
.globl main
main:
	pushl %ebp
	movl %esp,%ebp
	subl $4,%esp
	movl k,%eax
	cltd
	movl %eax,k
	idivl l
	movl %eax,i
	movl %edx,j
	leave
	ret
	.comm l,4
	.comm k,4
	.comm j,4
	.comm i,4

==== sample 2 ====

int a,b,c,d,j,k;

main()
{
	j = (a*b+c)/d;
	k = (a*b+c)%d;
}

	.file	"showem2.c"
gcc_compiled.:
.text
	.align 4
.globl main
main:
	pushl %ebp
	movl %esp,%ebp
	movl a,%eax
	imull b,%eax
	addl c,%eax
	cltd
	idivl d
	movl %eax,j
	movl %edx,k
	leave
	ret
	.comm k,4
	.comm j,4
	.comm d,4
	.comm c,4
	.comm b,4
	.comm a,4

======
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
[Bob Silverman asks for a way to compute (A * B + C) divrem D and claims that `Even on machines that support double length integer multiplies, one cannot put the above operations into HLL' (because the compiler cannot use 32x32=>64 bit operations) and that `one is FORCED to call assembler routines to do this'.] In article <10244@dog.ee.lbl.gov> I demonstrate that the last claim is false. >>This has its drawbacks: the syntax is distinctly un-pretty, and it >>requires gcc, and it is machine-dependent. It does, however, work. In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes: >Ah yes. A universal solution. Machine dependent. Ugly syntax. Only >works with one compiler. Do you want a machine-independent solution for using the machine instruction(s) that compute(s) (A * B + C) divrem D? This is clearly impossible, since some machines have no such instructions. Do you want a machine-independent (but language-dependent) syntax for expressing `(A * B + C) divrem D', and one which does not require a subroutine call per operation? This is what I provided. If you believe you can improve on the syntax, go to it; knock yourself out. (Incidentally, the only reason it `only works with one compiler' is that only one compiler compiles the particular language GCC implements. GCC is not a C compiler [well, it is if you run it with `-ansi -pedantic']. There is, however, no constraint that says that GCC's language may not be adopted by other compilers.) I dare say you do not want a language-independent syntax, since `syntax' and `language' are tightly-interwoven concepts. Do you want machine-independent semantics, given some syntax, for (A * B + C) divrem D? I believe this not to be the case (since I believe that you would prefer 64-bit A,B,C,D with 128-bit intermediate, which I believe you believe is not available on many machines). 
>Furthermore, it does nothing to contradict the assertions I made above >that the arithmetic could not be done in the HLL and that assembler >routines had to be called. It does inline the subroutine, however.

I have only a vague idea what `could not be done in the HLL' means. Did I use a construct that is not provided by the compiler? Or is `assembler routines had to be called' the key phrase? Do you mean `machine dependent instructions had to be emitted'? Of course this is the case. Do you mean `the rules for emitting the machine dependent instructions were not built into the compiler'? This depends on what one means by `built in'. Is the following sin() routine `built in' to the compiler?:

	static __inline const double sin(double x) {
		double result;
		__asm("sin %1,%0" : "=f"(result) : "f"(x));
		return result;
	}

I claim that if this appears in <math.h>, it *is* built-in to the compiler, despite the fact that the compiler reads these rules from an external file rather than having them embedded directly in its executable. I further claim that if you require an <mp.h> to exist and to define a mul_add_div_rem operation, and make constructing a proper <mp.h> part of `building the compiler' (just as constructing a proper <math.h> and <string.h> and <stdlib.h> is already part of building any hosted ANSI C compiler), that your mul_add_div_rem operation will be `built in'.

Or maybe this is not what you wanted? What exactly was it you *did* want? I know what *I* would want: I would want a syntax I could use to write mul_add_div_rem operations such that:

 - they were not `fragile' (i.e., could be inserted anywhere);
 - they could directly use machine instructions that implement the
   operation wherever such are available;
 - they did not impose extra overhead;

and this is exactly what I constructed. Perhaps I could have done a better job; but perhaps I would have spent more than an hour on it.... 
-- In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
dik@cwi.nl (Dik T. Winter) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > [Bob Silverman asks for a way to compute (A * B + C) divrem D and > claims that `Even on machines that support double length integer > multiplies, one cannot put the above operations into HLL' (because > the compiler cannot use 32x32=>64 bit operations) and that `one is > FORCED to call assembler routines to do this'.] etc. Bob Silverman complains about the solution; Chris Torek writes: > I further claim that if you require an <mp.h> to exist and > to define a mul_add_div_rem operation, and make constructing a proper > <mp.h> part of `building the compiler' (just as constructing a proper > <math.h> and <string.h> and <stdlib.h> is already part of building any > hosted ANSI C compiler), that your mul_add_div_rem operation will be > `built in'. And this is exactly what Bob Silverman wants (if I read his mind correctly, that is). As it is now, every user has to write his own for every platform he encounters (and I have written some 40 up to now, though through the use of subroutine calls, not through inlining in gcc). And that was also the complaint of Peter Montgomery, who has proposed such a thing for the Fortran standard, and it is not there. The current state of affairs is that it is not in the standard for most languages. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
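A minimal sketch of what such an <mp.h> entry could look like, written here purely as an illustration (the name mul_add_div_rem follows the thread's usage; no actual header is being quoted). It assumes a 64-bit unsigned type for the intermediate product, which an inlining compiler could map onto a 32x32->64 multiply and a 64/32 divide on machines that have them:

```c
#include <stdint.h>

/* Hypothetical mul_add_div_rem: given a, b, c, n with a < n or b < n,
 * produce q and r such that a*b + c == q*n + r and 0 <= r < n.
 * The precondition keeps q representable in 32 bits. */
static void mul_add_div_rem(uint32_t a, uint32_t b, uint32_t c, uint32_t n,
                            uint32_t *q, uint32_t *r)
{
    uint64_t t = (uint64_t)a * b + c;   /* double-length intermediate */

    *q = (uint32_t)(t / n);
    *r = (uint32_t)(t % n);
}
```

On hardware with the extended multiply and divide, the body reduces to two instructions; everywhere else it still gives the right answer, just more slowly.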
JBS@IBM.COM (02/26/91)
In discussing what a compiler should do with i8=j4*k4 where i8 is integer*8, j4,k4 are integer*4, Herman Rubin says:

     I have never seen any language description in which the type of the
     result in a replacement statement affected what operation is performed.

I would not propose this. I would propose defining the type of an integer*4 x integer*4 multiplication to be integer*8. Then the right thing will be done whether i8 is 8 bytes or 4 bytes long.

Several people have discussed various instruction sequences with the assumption that all fixed point instructions take the same amount of time. While on many machines most fixed point instructions are 1 cycle, this is generally not true of fixed point multiply and especially not true of fixed point divide. Thus a compiler which uses 2 divides for

     iq=ia/ib
     ir=mod(ia,ib)

may be wasting the equivalent of 20+ typical fixed point instructions. In fact if you know the compiler will do this, the sequence

     iq=ia/ib
     ir=ia-iq*ib

will be much faster on many machines. The 11 instruction pop subroutine used a divide with remainder. On many machines this instruction will be considerably slower than the other 10 combined. If one is doing many divides by the same integer p it will often pay to replace them with multiplies by some "magic constant" based on the binary representation of 1/p. Regarding wrongheaded instructions I believe there are numerous examples where a complicated instruction was implemented in such a way that a sequence of simpler instructions doing exactly the same thing would execute faster. Of course this may be an implementation problem rather than an architecture problem.
                   James B. Shearer
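The divide-once trick reads, in C (a sketch, assuming positive operands so that truncating division makes the identity exact; the function name is made up for illustration):

```c
/* One hardware divide; the remainder falls out of a multiply and a
 * subtract instead of a second divide. */
static void divrem(int ia, int ib, int *iq, int *ir)
{
    *iq = ia / ib;
    *ir = ia - *iq * ib;   /* same value as ia % ib for these operands */
}
```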
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > Do you want machine-independent semantics, given some syntax, for > (A * B + C) divrem D? Yes. That's exactly the problem here: given an operation glorp, we need to express glorp in a portable way with machine-independent semantics. Worse than that: when a chip supports glorp, we should be able to write the code so that it will be compiled the right way for that chip---but we shouldn't have to lose portability. Worse than that: the language designer probably never heard of glorp. No matter how nearsighted the language designer was, the chip designer has to be able to show his support for glorp. And no matter how nearsighted the language designer and chip designer were, the programmer has to be able to write glorp portably yet efficiently. Is this too much to ask? I don't think so. See my previous article for my thoughts on how this can and should be done. > I further claim that if you require an <mp.h> to exist and > to define a mul_add_div_rem operation, and make constructing a proper > <mp.h> part of `building the compiler' (just as constructing a proper > <math.h> and <string.h> and <stdlib.h> is already part of building any > hosted ANSI C compiler), that your mul_add_div_rem operation will be > `built in'. That's exactly the wrong answer. Programmers can't wait for chip designers, and nobody can wait for language designers. It has to be possible to set up a good enough structure beforehand that (1) chip designers can make innovations available without changing the language, (2) programmers can use those innovations without changing the chips or the language, and (3) programmers don't have to sacrifice portability for efficiency. I believe this is possible, and requires only a small addition to languages and a bit of discipline on the part of compiler writers. ---Dan
joeo@sharp..westford.ccur.com (Joe Orost) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > >I have only a vague idea what `could not be done in the HLL' means. >Did I use a construct that is not provided by the compiler? Or is >`assembler routines had to be called' the key phrase? Do you mean >`machine dependent instructions had to be emitted'? Of course this is >the case. Do you mean `the rules for emitting the machine dependent >instructions were not built into the compiler'? This depends on what >one means by `built in'. Is the following sin() routine `built in' >to the compiler?: > > static __inline const double sin(double x) { > double result; > __asm("sin %1,%0" : "=f"(result) : "f"(x)); > return result; > } > >I claim that if this appears in <math.h>, it *is* built-in to the >compiler, despite the fact that the compiler reads these rules from an >external file rather than having them embedded directly in its >executable.

Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler have the transcendentals built-in. This means much more than just compiling the calls inline. Built-in means that:

 1. Constant evaluation is performed at compile time:
    SIN(0.0) is replaced by 0.0, etc.
 2. Common subexpressions and loop invariant code motion are
    performed on these operations at compile time:
    SIN(A) ... SIN(A) replaced by one call using a temporary.
 3. Special-case optimizations are performed:
    A**0.5 -> SQRT(A)
    A**1.0 -> A
 4. The compiler knows the argument profile for the routines.
    In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).
 5. The calls generate inline code.

regards, joe
--
Full-Name: Joseph M. Orost
Email: joeo@tinton.ccur.com
Phone: (908) 758-7284
Fax: (908) 758-7113
US Mail: MS 322; Concurrent Computer Corporation; 106 Apple St
         Tinton Falls, NJ 07724
marc@marc.watson.ibm.com (Marc Auslander) (02/26/91)
The C compiler for the Risc System/6000 produces one divide for i/j and i%j. -- Marc Auslander <marc@ibm.com>
gah@mt-xia.watson.ibm.com (Gary A. Hoffman) (02/26/91)
In fact, RS/6000 xlc compiler does have builtins. -- g
don@dgbt.doc.ca (Donald McLachlan) (02/26/91)
>In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes: >> > I would like an instruction which counts the number of ones in >> > the binary representation of an integer but I find it hard to argue >> > that this would be widely used. >> Ask Seymour Cray why that instruction was put in the Cray-1 as an >> afterthought. > >I haven't seen Seymour posting here, and I don't often meet him. >Now that I am suitably impressed, would you share with us what >he would tell us if we did ask him? >> -- >> dik t. winter, cwi, amsterdam, nederland >> dik@cwi.nl > >dan herrick >herrickd@iccgcc.decnet.ab.com

I don't know what Seymour Cray would say, but I have an application that would really benefit from this. When performing digital communication over HF (a highly error-prone medium), major problems are encountered:

1. Most HF modems are synchronous.
2. Due to the error-proneness of the medium, you will rarely get a hard
   (exact) sync match.
3. You can't go to a short sync sequence because of the high probability
   of false sync.

The solution requires that you write a correlator to count the number of bits in error in the sync sequence, and only accept it as a valid sync sequence if the distance is less than a predetermined limit. The distance is simply the number_of_ones(received_seq XOR sync_pattern).

I must admit that this is a very specific problem, but one that must be performed once every bit time, and it would benefit greatly from having such an instruction in hardware.

Donald McLachlan
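A sketch of that correlator test in C; the function names are illustrative, and the bit count uses the portable clear-lowest-bit loop that a hardware population-count instruction would replace:

```c
#include <stdint.h>

/* count the one bits; one instruction on a machine with popcount */
static int number_of_ones(uint32_t x)
{
    int n = 0;

    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* accept sync only if the Hamming distance is under the limit */
static int sync_ok(uint32_t received_seq, uint32_t sync_pattern, int limit)
{
    return number_of_ones(received_seq ^ sync_pattern) < limit;
}
```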
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/26/91)
joeo@tinton.ccur.com (Joe Orost) wrote:
>In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>	static __inline const double sin(double x) {
>>		double result;
>>		__asm("sin %1,%0" : "=f"(result) : "f"(x));
>>		return result;
>>	}
>>
>>I claim that if this appears in <math.h>, it *is* built-in to the
>>compiler, despite the fact that the compiler reads these rules from an
>>external file rather than having them embedded directly in its
>>executable.
>
>Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler have
>the transcendentals built-in. This means much more than just compiling the
>calls inline. Built-in means that:
>
> 1. Constant evaluation is performed at compile time:
>    SIN(0.0) is replaced by 0.0, etc.

Not done by the above; you're one up.

> 2. Common subexpressions and loop invariant code motion are
>    performed on these operations at compile time:
>    SIN(A) ... SIN(A) replaced by one call using a temporary.

Done by the above; that's what the "const" qualifier on the function does, and after inlining, gcc can see that there are no side effects. (I did some experiments with a non-existent "foo" instruction... it works fine.)

> 3. Special-case optimizations are performed:
>    A**0.5 -> SQRT(A)
>    A**1.0 -> A

Not done by the above. I can see how the A**0.5 is useful; I can't see how A**1.0 is - do programmers often write this?

> 4. The compiler knows the argument profile for the routines.
>    In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).

Done by the above. The prototype forces conversion to the right type (although silently; there may be a warning flag you can turn on).

> 5. The calls generate inline code.

Of course, this is done by the above.

So, the only advantages your built-in transcendentals have are compile-time evaluation of constant expressions and some special cases. I don't deny their usefulness, but claim it's not tremendous, and the cost of a more complex compiler is not zero. 
If you support a cross-compiler, I hope you worked hard to ensure you're doing target arithmetic and not host arithmetic when evaluating transcendentals. Numerically unstable applications might die horribly otherwise. -- -Colin
peter@ficc.ferranti.com (Peter da Silva) (02/27/91)
In article <22605:Feb2608:04:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > In practice this isn't such a big problem. Joe just writes the code in > several different ways (he has to, if he wants a fighting chance) and > uses #ifdef or something similar to select the right solution for each > machine. But this is still a pain. Yes. It's even more of a pain than just #ifdefing a bunch of inline assembly. So why not just do that for all the obscure instructions and have done with it? -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
bs@linus.mitre.org (Robert D. Silverman) (02/27/91)
In article <6870@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: :In article <10278@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes: :> [Bob Silverman asks for a way to compute (A * B + C) divrem D and :> claims that `Even on machines that support double length integer :> multiplies, one cannot put the above operations into HLL' (because :> the compiler cannot use 32x32=>64 bit operations) and that `one is :> FORCED to call assembler routines to do this'.] :> :> In article <10244@dog.ee.lbl.gov> I demonstrate that the last claim :> is false. :> :> >>This has its drawbacks: the syntax is distinctly un-pretty, and it :> >>requires gcc, and it is machine-dependent. It does, however, work. :> :> In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org :> (Robert D. Silverman) writes: :> >Ah yes. A universal solution. Machine dependent. Ugly syntax. Only :> >works with one compiler. :> :> Do you want a machine-independent solution for using the machine :> instruction(s) that compute(s) (A * B + C) divrem D? :> :> This is clearly impossible, since some machines have no such instructions. : :I agree partly with both. The LANGUAGE should probably contain some such :instruction, and it certainly should allow the writing : : E,F = (A * B + C) divrem D : :or even : : E,F = (A * B + C) / D : Actually I would settle for the following::: I am more than willing to write the assembler code for an operation ---- we will call FOO. FOO takes 3 arguments and returns 2 more. I would then like to see the following in a HLL: asm_routine(FOO(a,b,c,d,e)) The compiler will then fetch my assembler routine, and RESOLVE any register/stack/addressing conflicts that might arise because the assembler routine uses the same registers/addresses etc. that are produced by the calling routine. The compiler will straighten out any conflicts [renaming registers within the assembler routine as needed etc.] and then in-line the assembler routine. 
It is true that I must rewrite FOO for every new machine, but because FOO is time critical it is worth my trouble to do just that. FOO would be written as an ordinary subroutine; it would assume that the arguments to it had been placed on the stack the same as any other subroutine call. There are currently machines which have no multiply or divide instruction at all, e.g. SPARC. When the compiler sees an operation a*b it generates a subroutine call. Building a 'divrem' operator into a language could be done the same way. For machines that have the instruction there is no problem. For those that don't, a subroutine call is generated. The only difficulty in the above paragraph would be in deciding that a 'divrem' operator was worth adding to the language. If enough people can agree that it IS worthwhile, then actually doing it is a non-problem. As for all of you who claim that 'divrem' (or 'foo' or any other special operation) is not needed, let me remind you of one thing: the current SPARC does not have integer multiply or divide in hardware. The next release (Version 8) will have it. If, as all of you have been claiming, multiplication & division are so infrequent as not to be needed, then why did SUN add them in? Now I can't say of my own personal knowledge, of course, but ask yourself what user or set of users would have enough clout [i.e. a big enough market] to convince SUN to add such a thing. Hint: think 'cryptography'. Many of the instructions under discussion are essential for good performance in cryptographic applications. As more and more PC users hook into networks, computer security applications will become more important. As will the necessary instructions to support those applications. -- Bob Silverman #include <std.disclaimer> Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"
jbuck@galileo.berkeley.edu (Joe Buck) (02/27/91)
In article <6870@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: [ Chris Torek's example of how to do optimal "divrem" in gcc ] |> On the other hand, I have to agree with Torek about the implementation on |> a given machine. This must depend on the hardware. But I can hope, with |> Silverman, that if the language allows writing things like this, hardware |> people might put them in. |> |> A question for Torek: Suppose I want to do the above for a variety of |> types, for example, have A, B, C, D, and F floating, but E integer. Can |> I use the same overloaded operator symbol in both cases with his approach? |> Can I do even more complicated overloaded situations, where the translation |> needs to take into account the types? I'm not Chris Torek, but the answer is: yes you can, with g++. g++ accepts exactly the same code Chris wrote for gcc, and, since it is a C++ compiler, allows functions and operators to be overloaded based on argument type.
jbuck@galileo.berkeley.edu (Joe Buck) (02/27/91)
|> Donald Lindsay writes: |> >- addressing modes which studies could not find a single use of. |> > (PDP-11 autodecrement deferred - omitted from the VAX.) Way back in 1979 I had a job writing code for an LSI-11 board with 4K words of core memory and no O/S (it was basically an interrupt-driven controller for interfacing a couple of array processors together), and I found use for several of the weird autodecrement-deferred instructions: including a "program" with one instruction word and one data word that would zero all of memory and then halt, returning control to the built-in ODT debugger. Can any PDP-11 programmers figure out what it was? -- Joe Buck jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
spot@CS.CMU.EDU (Scott Draves) (02/27/91)
In article <1991Feb26.155558.4215@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:

   > 3. Special-case optimizations are performed:
   >    A**0.5 -> SQRT(A)
   >    A**1.0 -> A

   Not done by the above. I can see how the A**0.5 is useful; I can't
   see how A**1.0 is - do programmers often write this?

probably just about never. but that doesn't mean the compiler shouldn't deal with it. A lot of code is generated by cpp (or lisp macros), and therefore contains all sorts of stupidity. this is an important point to keep in mind. as an example, suppose I wrote the cpp macro

#define GammaCorrect(intensity, gamma) \
	(pow((double)(intensity),1.0/(double)(gamma)))

A perfectly sane programmer might write GammaCorrect(i, 1.0).
--
			IBM
Scott Draves		Intel
spot@cs.cmu.edu		Microsoft
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/27/91)
bs@linus.mitre.org (Robert D. Silverman) wrote: >:> In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org >:> (Robert D. Silverman) writes: >:>> Ah yes. A universal solution. Machine dependent. Ugly syntax. Only >:>> works with one compiler. I submit it's better than most, and allows mostly what you'd like. Also, gcc is very widely available, and might inspire others. >Actually I would settle for the following::: > >I am more than willing to write the assembler code for an operation > ---- >we will call FOO. FOO takes 3 arguments and returns 2 more. > >I would then like to see the following in a HLL: > > asm_routine(FOO(a,b,c,d,e)) > >The compiler will then fetch my assembler routine, and RESOLVE any >register/stack/addressing conflicts that might arise because the >assembler routine uses the same registers/addresses etc. that are >produced by the calling routine. > >FOO would be written as an ordinary subroutine; it would assume that >the arguments to it had been placed on the stack the same as any other >subroutine call. > >The compiler will straighten out any conflicts [renaming registers >within the assembler routine as needed etc.] and then in-line the >assembler routine. So the compiler has to be able to parse either object files or assembler source (I'm not sure which you mean), follow the data flow (with possible aliasing) in the procedure-calling convention, know the operand restrictions (does it need to be in registers or can it accept a memory operand?) and implicit operands (MQ on the MIPS chip, r0/r1/r2/... for VAX movec instructions, etc.) of every instruction, and in-line it? I think this is awfully impractical. The gcc solution, while syntactically messier, tells the compiler the things it needs to know for optimisation purposes in a form it can easily understand, and just supplies a string pattern to emit to the assembler, so the compiler doesn't have to know how to parse assembler input. 
It can also handle new instructions the compiler has not yet been extended to handle when a 32732/68050/80586/R5000/whatever comes out. I respectfully suggest that you wouldn't like the results of what you suggest either, like the people in Jon Bentley's story who wanted to type a list of ~200 districts into the computer and get a random sample (~5) out. He suggested perhaps having the computer emit a few random numbers and looking them up on the paper list would be a lot easier than typing in the paper list. I also suggest that you don't want it embedded in the source that the code is an assembler routine, so you can supply a high level version if it becomes necessary. -- -Colin
ram+@cs.cmu.edu (Rob MacLachlan) (02/28/91)
>From: hrubin@pop.stat.purdue.edu (Herman Rubin) >Newsgroups: comp.arch,comp.lang.misc >Subject: Re: bizarre instructions >Summary: Syntax changes would be needed >Date: 24 Feb 91 18:18:53 GMT > >> A sufficiently intelligent compiler will do the right thing. It may >> work to write >> k8=i4*j4 > >The first approach would require a very intelligent compiler, indeed. >The second approach is the "right" one. However, every standard I have >read would try the first. Since i8*j8 might not even exist on the machine, >at best an inlined function would have to be brought in instead of the >single hardware instruction for k8=i4*j4. > >I have never seen any language description in which the type of the result >in a replacement statement affected what operation is performed. Actually, this use of output type information to select operators is relatively common in Lisp compilers, especially Common Lisp. It is necessitated by the default assumption of indefinite-precision arithmetic. In order to do fixed precision, you must somehow know the result is a "small" integer. The CMU Common Lisp compiler (Python) is (so far as I know) unique among Lisp compilers in efficiently compiling full-word integer arithmetic (no tag bits). And our BIGNUM code is implemented with this facility combined with some additional primitives such as 32x32 -> 64. The 64 bit result is returned as two 32 bit values, using the Common Lisp multiple value mechanism. It would be relatively straightforward to add support for larger fixed-precision arithmetic. Robert A MacLachlan (ram@cs.cmu.edu)
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/28/91)
don@dgbt.doc.ca (Donald McLachlan) wrote: >Solution, requires that you write a correlator to count the number of bits in >error in the sync sequence, and only accept it as a valid sync sequence if >the distance is less than a predetermined limit. > >The distance is simply the number_of_ones(receieved_seq XOR sync_pattern). > >I must admit that this is a very specific problem, but one that must be >performed once every bit time, and it would benefit greatly from having >such an instruction in hardware. Just wondering: the usual hack to count bits is a table of, say, 256 entries. Look up each byte and add the values together. If you know the sync sequence ahead of time and can afford one table per byte of the sync sequence, you can get a matched bits count directly by pre-diddling the table. Another approach, since what you really want is to know whether there are fewer than N bits set in the word, is to repeat distance &= (distance-1); N times, and if the result is zero, you have a match. For small n, this is quite fast. -- -Colin
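Both hacks, sketched in C (the 256-entry table and the clear-lowest-bit threshold test are as described above; the function names are made up for illustration):

```c
/* 256-entry table: count bits per byte, sum four lookups per word */
static int popcount32(unsigned x)
{
    static unsigned char t[256];
    int i;

    if (t[1] == 0)                     /* build the table on first use */
        for (i = 1; i < 256; i++)
            t[i] = (unsigned char)(t[i >> 1] + (i & 1));

    return t[x & 0xff] + t[(x >> 8) & 0xff]
         + t[(x >> 16) & 0xff] + t[(x >> 24) & 0xff];
}

/* threshold test without a full count: clearing the lowest set bit
 * n times leaves zero exactly when at most n bits were set */
static int at_most_n_bits(unsigned x, int n)
{
    while (n-- > 0)
        x &= x - 1;
    return x == 0;
}
```

Pre-diddling the table for a known sync pattern, as suggested, just means storing the per-byte match count instead of the raw bit count.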
pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)
On 26 Feb 91 02:25:39 GMT, dik@cwi.nl (Dik T. Winter) said: [ ... mul-add-div-rem primitive should be defined in some header, rather than the compiler provide, like GCC, the primitives to define it efficiently ... ] dik> And this is exactly what Bob Silverman wants (if I read his mind dik> correctly that is). As it is now, every user has to write his own dik> for every platform he encounters (and I have written some 40 up to dik> now, though through the use of subroutine calls, not through dik> inlining in gcc). This ought to be encouraged. I think that the FSF should be much more about libraries than about commands. The emphasis on programs/commands instead of reusable libraries is one of the tragic historic legacies of Unix. The technical question is which language architecture best supports coding efficient libraries and efficient library use; then there is a practical (economics, politics) question of writing those libraries. dik> And that was also the complaint of Peter Montgomery who has dik> proposed such a thing for the Fortran standard, and it is not dik> there. The current state of affairs is that it is not in the dik> standard for most languages. But this is not something we ought to argue about in comp.arch; it is no longer about how to define a language architecture so that it may take advantage of specifics of the CPU architecture if possible. What you want is not a technical solution, it is legislation. "There ought to be a law..."; in your case it is one that makes compiler writers provide your favourite primitives on all the platforms you want to use. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/02/91)
In article <PCG.91Feb28182206@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: > What you want is not a technical solution, it is legislation. "There > ought to be a law..."; in your case it is one that makes compiler > writers provide your favourite primitives on all the platforms you want > to use. I don't think this is practical. I have my favorites, and you have your favorites, and when you take the overall favorites of programmers around the world you have way too many operations for any one language to support. To put it differently, there will always be *some* primitive that a chip designer implements or that a programmer wants, and that isn't in the language. If the chip designer provides X and the programmer wants X but the language doesn't understand X, the programmer is screwed. What I do think is practical is having each compiler writer at least provide a portable idiom for each instruction on his machine. (Portable doesn't mean everyone will recognize the same idiom the same way; it just means that the idiom is written portably in the language.) This is much more manageable, and it ensures that the programmer will never hit a wall between what the compiler lets him do and what he can accomplish in assembly. Note that some compilers go halfway, and provide unportable idioms (e.g., directives). I find these to be no better than assembly language, because I can't take a program with directives and use it without change on another system. ---Dan
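A concrete instance of the portable-idiom idea (my example, not Dan's): the rotate below is plain, portable C, yet a compiler that recognizes the pattern can emit a single rotate instruction, so the programmer never hits the wall between the language and the hardware.

```c
#include <stdint.h>

/* Portable idiom for a 32-bit left rotate: legal C on any machine,
   one instruction on a machine whose compiler knows the pattern. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                               /* keep both shift counts defined */
    return (x << n) | (x >> ((32 - n) & 31));
}
```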
mash@mips.com (John Mashey) (03/03/91)
In article <7063:Mar202:29:0091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >What I do think is practical is having each compiler writer at least >provide a portable idiom for each instruction on his machine. (Portable I'm NOT suggesting there is no use to the various kinds of inlining and other ways to get access to machine instructions that exist, as has been discussed. HOWEVER, I would observe that: a) People must be careful NOT to assume that function calls are inordinately expensive. This is simply NOT true these days. In particular, on most of the current RISCs, especially when aided and abetted by good register allocators, function calls only cost a few cycles; in particular, calls to leaf routines (i.e., those that do not call others) are usually very cheap, because they almost never save/restore any registers. b) Fast function calls are necessary for many purposes. I cannot think of any popular architecture designed in the last 5-8 years that hasn't paid at least some attention to making fast function calls, one way or another. c) They solve a LOT of the problem being discussed. d) They do not solve 100% of the problem being discussed. However, it is always worth doing the cycle count for a given language and architecture choice between: 1) The BEST that you can possibly do, with hand-coding 2) The BEST you can do by writing leaf-level assembly programs on the machine. I believe that in most cases, measured over the complete program, that case 2) does pretty well, especially because the instructions you're trying to get to are often high cycle-count operations, where the % overhead for getting them is low. The one obvious exception is if you have low-cycle count operations (like rotate, or population count, or byte-swapping, or string-primitives, etc) that you'd like to get inlined, and have no way to express directly. At least C is moving, albeit slowly, in the direction of better support for the possibility of these things. 
However, the bottom line still is: analyze the cycle count differences to see if mechanisms are worth it. Sometimes they are, sometimes they're not. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/03/91)
In article <8UQ9CL7@xds13.ferranti.com>, peter@ficc.ferranti.com (Peter da Silva) writes: > In article <22605:Feb2608:04:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: ...................... > Yes. It's even more of a pain than just #ifdefing a bunch of inline > assembly. So why not just do that for all the obscure instructions > and have done with it? ALL the obscure instructions? How about the ones which we haven't discovered yet? -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/03/91)
In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes: > a) People must be careful NOT to assume that function calls > are inordinately expensive. This is simply NOT true these > days. In particular, on most of the current RISCs, Look, so-called RISC chips are hardly the mainstream. Function calls are relatively expensive on most of the chips I use; the SPARC and the MIPS are the only exceptions. On the other hand, I consider this whole function-call issue to be orthogonal to the issue of having the compiler provide portable access to all machine instructions. I assume that any good language or optimizer will handle inlining. > b) Fast function calls ae necessary for many purposes. > I cannot think of any popular architecture designed in the > last 5-8 years that hasn't paid at least some attention to > making fast function calls, one way or another. How about the Astronautics ZS supermini? It's a hellishly fast machine, and beats the Convex hands-down on non-vectorizable code. It has many registers, to be sure, and a nice optimizer, but a typical function call still takes several load-saves. As the machine has no built-in stack support, and as every memory reference takes two instructions, function calls are quite expensive. Inlining is a must for fast code on that machine. [ some opinions on whether the current situation works ] > The one obvious exception is if you have low-cycle count > operations (like rotate, or population count, or byte-swapping, > or string-primitives, etc) that you'd like to get inlined, > and have no way to express directly. That's a pretty big exception. As Peter points out, fast divrem would noticeably speed up integer formatting, not to mention multiple-precision arithmetic. Pop count is useful for algebraic coding (e.g., as in the Goppa public-key system). Byte reversal (not that I know any chips that have it) is a costly step in most applications of FFTs. 
Getting the high word of a multiply is practically necessary for fast multiple-precision arithmetic and other number-theoretic applications. And so on. ---Dan
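Two of the operations Dan lists can be sketched in C (my code; the 64-bit intermediate is an emulation of what is a single instruction on machines with a widening multiply, and well beyond what most 1991 compilers offered portably):

```c
#include <stdint.h>

/* "High word of a multiply": emulated with a 64-bit intermediate. */
uint32_t mulhi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Divrem in integer formatting: each division yields both the next
   digit (the remainder) and the reduced value (the quotient), so a
   divide-with-remainder instruction halves the work of the loop. */
int format_decimal(uint32_t v, char *buf)
{
    char tmp[11];
    int i = 0, n = 0;
    do {
        tmp[i++] = (char)('0' + v % 10);  /* remainder: next digit */
        v /= 10;                          /* quotient: rest of the number */
    } while (v);
    while (i)
        buf[n++] = tmp[--i];              /* digits come out reversed */
    buf[n] = '\0';
    return n;
}
```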
quiroz@cs.rochester.edu (Cesar Quiroz) (03/04/91)
In <23358:Mar310:45:5591@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote: | ...On the other hand, I consider this whole function-call issue to | be orthogonal to the issue of having the compiler provide portable | access to all machine instructions. ... I have been following this discussion quietly (and probably I should not break that), and I have tried to understand this position, but it still doesn't make sense to me. It resists definition, and therefore it is easy to use it to make the discussion infinitely long. There is something to be said for recommending that the compilers pick up idioms and builtins and treat them especially---there is a large opportunity for improvement there. But it seems obvious that keeping the semantics straight and guaranteeing a specific mapping to machine code are incompatible if pursued to extremes. Let's see it this way: the other camp considers "portability" and "efficiency" desirable, but sees them as non-binary and as commodities to be traded. An occasional lack of portability to ensure large local benefit is OK to this camp. Say, no one expects GCC asm to be portable at all, but it gives a clear interface to most current---not imaginary or prospective---hardware. Similarly, a little efficiency can be lost for portability: your compiler can be out and running on more machines faster if it doesn't recognize the many special cases of pow() and just treats it as another function to be called. As a user, you get the tools to make your own tradeoffs. How to make the decision? Maximum payoff first. If pow appears often enough to warrant it, put some effort into it, etc. Don't get stuck with extremes: maximum portability, maximum efficiency and your pet syntax to boot. Get a lollipop. Blah, this has been said a bunch of times already. | As Peter points out, fast divrem would noticeably speed up integer | formatting, not to mention multiple-precision arithmetic. 
Pop count is | useful for algebraic coding (e.g., as in the Goppa public-key system). | Byte reversal (not that I know any chips that have it) is a costly step | in most applications of FFTs. Getting the high word of a multiply is | practically necessary for fast multiple-precision arithmetic and other | number-theoretic applications. And so on. And so on indeed. The list is infinite by rights; should compiler writers give the same ad-hoc treatment to each and every one of those algorithms? This departs from giving ad-hoc treatment to each and every machine instruction, but that confusion seems to lie at the root of Bernstein's position. -- Cesar Quiroz
dik@cwi.nl (Dik T. Winter) (03/04/91)
In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes: > a) People must be careful NOT to assume that function calls > are inordinately expensive. This is simply NOT true these > days. Sometimes yes, sometimes no. But that is not much different from years ago. > In particular, on most of the current RISCs, especially > when aided and abetted by good register allocators, function > calls only cost a few cycles; in particular, calls to > leaf routines (i.e., those that do not call others) are > usually very cheap, because they almost never save/restore > any registers. Ah, that is sometimes the problem, but again sometimes not. Whether you are looking at CISC or RISC, it depends. > b) Fast function calls are necessary for many purposes. > I cannot think of any popular architecture designed in the > last 5-8 years that hasn't paid at least some attention to > making fast function calls, one way or another. I have extensive figures about subroutine call overhead (in this case about leaf routines). The background is a library of routines to do extended precision integer arithmetic (quite apt in this discussion). I did port it (both in C and in Fortran) to some 40 different platforms. For most, coding in assembly was necessary to get performance. The source is extremely flexible. It knows about different optimization levels, will compile for a different number of bits used in a word, and so on. There is a completely portable variant (optimization level 0) that only avoids all the known compiler bugs (C and Fortran). It is fairly simple to get an optimized version on a new platform (at least I think so). The problem is of course that all routines using the library need to be recompiled to use the more efficient version because the number of bits in a word can change. Optimization level 1 implies assembly routines for (integer) multiplication and division. It is however sometimes necessary to code these routines to use floating-point arithmetic. 
This only reduces the number of bits in a word that are used from 30 to 26 (on a 32 bit machine). This is needed on the i860, which has no integer multiply and no divide at all. It can be useful on others (e.g. MIPS). The routines vary in size from a few instructions to quite a lot. E.g. the Vax multiply is under 10 instructions while the SPARC divide is over 200. In this case the operations are performed on the available registers, i.e. no registers are saved. Optimization level 2 means that loops over the multiplications and divisions are coded in assembly. This is roughly equivalent to inlining the simple functions (although you would not do that if the simple function is fairly large). In this case sometimes registers are saved because they are needed for the loop. (Level 3 and level 4 optimization are also defined, but not relevant here.) Below follows a table for a number of machines that indicates the speedup when optimization level 2 is used. The last two columns indicate whether the base multiplication and division routines are small, medium size or large. I use small when a single instruction could be used, medium when the hardware provides step instructions (multiply step or divide step), and large when bit fiddling must be done. Medium is also used if FP operations are used.

    Gould NP1                         32.3%   small    small
    Gould PowerNode                   32.1%   small    small
    Alliant FX/4 (68020 look-alike)   27.5%   small    small
    Sony (68030)                      27.1%   small    small
    Vax 780                           26.0%   small    small
    Encore (32532)                    24.9%   small    small
    Sequent Balance (32032)           21.9%   small    small
    Sun3 (68020)                      20.4%   small    small
    IBM RT/PC                         18.6%   medium   medium
    SGI 4D/25 (MIPS) FP               14.5%   medium   medium
    Alliant FX2800 (i860)             13.8%   medium   medium
    SGI 4D/25 (MIPS) INT              10.0%   small    large
    Harris HCX7 (Tahoe)                3.8%   small    small
    Sun4 (SPARC)                       3.1%   medium   large

Two figures are given for the SGI: one for when complete integer arithmetic is used, one when floating-point is used in part. Floating point is faster by 12% on the SGI. 
The reason for the difference between Alliant FX/4 and Sun3 is that the Alliant uses a caller-saves calling sequence while the Sun3 uses a callee-saves sequence. We see indeed that the subroutine call overhead decreases when subroutines become larger. What is extremely surprising is the position of the Tahoe in this table! This is not a modern machine by any means, nor is it RISC (it is the smaller brother of the Vax). But while the base routines use only 5 resp. 4 instructions (4 only because of a bug? in the hardware, it would have been 3 otherwise), we see nearly no effect of the CALL. Also interesting is that Sony's 68030 has larger subroutine call overhead than Sun's 68020. Approximately the same code was run on both machines. The only differences are due to compiler differences. So this contradicts John Mashey's remark that modern machines have lower subroutine call overhead. That depends on your definition of modern. What does this teach us? Really nothing. Inlining may or may not help. On the Tahoe inlining will help a bit, but not much. But on the MIPS with integer arithmetic inlining helps (this means inlining some 20 instructions for the multiply and some 430 for the divide). In all, it would be good to look at how the Tahoe does subroutine calls. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
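For readers without Dik's library at hand, a hypothetical sketch (not his actual routines) of the kind of code in question: the inner loop of an extended-precision multiply is exactly where a widening multiply instruction, or its absence, dominates.

```c
#include <stdint.h>

/* Multiply an n-limb number (16-bit limbs, least significant first)
   by a single limb m; dst must have room for n+1 limbs.  Each step
   is one widening multiply plus a carry propagation. */
void mp_mul_limb(uint16_t *dst, const uint16_t *src, int n, uint16_t m)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t t = (uint32_t)src[i] * m + carry;
        dst[i] = (uint16_t)t;     /* low half: this limb */
        carry = t >> 16;          /* high half: carry out */
    }
    dst[n] = (uint16_t)carry;
}
```

On a machine without an integer multiply (Dik's i860 example) each `src[i] * m` becomes a long sequence of step instructions or a floating-point detour, which is why the limb width drops from 30 to 26 bits there.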
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/04/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) The last time I had my hands on a B6700 (a *lovely* machine) was in 1979. However, I did write a compiler for it and assist with a couple of others, and I am quite sure that there was no instruction that yielded both quotient and remainder. There was an instruction with the effect of Algol 60's integer divide, and an instruction with the effect of Algol 60's 'mod'. Later models in the same architectural family have varied the instruction set somewhat (such as dropping VECTORMODE), perhaps this was a later model? Several years ago I posted a "<q,r> = (a*b+c) <div,mod> d" function for Sun compilers, using their ".il" files. I used to use a language (Pop-2) in which the integer division operator *always* returned both results; oddly enough this was almost always a pain. -- The purpose of advertising is to destroy the freedom of the market.
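The optimization being discussed, in C terms (my sketch): the two-statement pattern a compiler can fuse into one divide-with-remainder instruction, and the explicit route C's library already offers via ldiv().

```c
#include <stdlib.h>

/* The source pattern: a smart compiler notices the same operands
   and emits a single divide that yields both results. */
void quotrem(long a, long b, long *q, long *r)
{
    *q = a / b;
    *r = a % b;       /* same divide; a good compiler reuses it */
}

/* The explicit route: ldiv() returns quotient and remainder
   together in one call. */
long rem_via_ldiv(long a, long b)
{
    ldiv_t qr = ldiv(a, b);
    return qr.rem;
}
```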
mash@mips.com (John Mashey) (03/05/91)
In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >Look, so-called RISC chips are hardly the mainstream.... >> b) Fast function calls are necessary for many purposes. >> I cannot think of any popular architecture designed in the >> last 5-8 years that hasn't paid at least some attention to >> making fast function calls, one way or another. > >How about the Astronautics ZS supermini? It's a hellishly fast >machine, and beats the Convex hands-down on non-vectorizable code. ... I guess I must be confused. For some weird reason, I thought RISCs were at least part of the mainstream of current computing, based on the almost-nonexistence of new CISC designs. (Please note, that in all talks that I give, that doesn't mean I claim CISCs are going away, especially the X86; just that new architectures look more RISCy than CISCy). Clearly, I have somehow missed the idea that the ZS might be the mainstream... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
jpk@ingres.com (Jon Krueger) (03/05/91)
From article <3062@charon.cwi.nl>, by dik@cwi.nl (Dik T. Winter): > > a) People must be careful NOT to assume that function calls > > are inordinately expensive. This is simply NOT true these > > days. > [table of machines and speedups; in no case > does the entire speedup exceed 33%; 20% is more typical.] > So this is > contradictionary to John Mashey's remark that modern machines have lower > subroutine call overhead. That depends on your definition of modern. And your definition of "lower", apparently. Since your speedups are achieved both from inlining and nonportable assembler, it's impossible to tell how much function call overhead contributes. Even if it's entirely responsible, it doesn't look expensive to me. Since it's probably not entirely responsible, it looks even cheaper. Finally, even if you could guarantee me a portable 30% overall speedup, your code isn't a real program, it's a set of support routines. What speedups have you measured in real programs? For instance, if you like, in real floating point intensive programs calling your routines? -- Jon -- Jon Krueger, jpk@ingres.com
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/06/91)
In article <677@spim.mips.COM> mash@mips.com (John Mashey) writes: > In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > >Look, so-called RISC chips are hardly the mainstream.... Okay, I agree that this is arguable, but only because ``RISC'' is as meaningful as ``object-oriented.'' Why is the CDC 6600 not a RISC architecture? > >> I cannot think of any popular architecture designed in the > >> last 5-8 years that hasn't paid at least some attention to > >> making fast function calls, one way or another. > >How about the Astronautics ZS supermini? It's a hellishly fast > >machine, and beats the Convex hands-down on non-vectorizable code. > I guess I must be confused. For some weird reason, I thought RISCs > were at least part of the mainstream of current computer, The issue of fast function calls has nothing to do with whether RISC is in the mainstream. You are confusing the issues. The ZS has 64 registers. It is highly pipelined. It has no particularly complex instructions. But I wouldn't say it pays much attention to fast function calls. ---Dan
mash@mips.com (John Mashey) (03/06/91)
In article <3304:Mar601:52:2891@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >In article <677@spim.mips.COM> mash@mips.com (John Mashey) writes: >> In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >> >Look, so-called RISC chips are hardly the mainstream.... > >Okay, I agree that this is arguable, but only because ``RISC'' is as >meaningful as ``object-oriented.'' Why is the CDC 6600 not a RISC >architecture? I would point out most RISC-genealogy charts start with a Cray bubble; mine certainly do, and thousands of people have seen them; as well, there have been numerous postings over the past years in this newsgroup acknowledging the inspiration of Cray designs, especially the CDC 6600, etc, etc. What I don't understand is what of the immediately preceding discussion would lead anyone to bring up CDC 6600, as though a RISC person WOULDN'T be aware of it? >> >> I cannot think of any popular architecture designed in the >> >> last 5-8 years that hasn't paid at least some attention to >> >> making fast function calls, one way or another. >> >How about the Astronautics ZS supermini? It's a hellishly fast >> >machine, and beats the Convex hands-down on non-vectorizable code. >> I guess I must be confused. For some weird reason, I thought RISCs >> were at least part of the mainstream of current computer, > >The issue of fast function calls has nothing to do with whether RISC is >in the mainstream. You are confusing the issues. > >The ZS has 64 registers. It is highly pipelined. It has no particularly >complex instructions. But I wouldn't say it pays much attention to fast >function calls. Everybody please re-read what was posted, quoted above. If somebody wants to challenge what was said in some useful way, we'd all be pleased to hear of counter-examples. 
In this particular case, I'd like to hear why the ZS is 'a popular architecture' while "so-called RISC chips are hardly the mainstream." Perhaps I am just ignorant, but I've not come across the ZS very often ... well, ever.... and I'm not at all sure what the issues are that I'm confusing, but am happy to become educated ... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
ph@ama-1.ama.caltech.edu (Paul Hardy) (03/07/91)
In article <14710@sunquest.UUCP> terry@venus.sunquest.com (Terry R. Friedrichsen) writes: > >In fact, the old literature (and old machines!) are full of "bizarre" > >instructions that proved ill-advised. A few examples: > > > > [ some good examples removed] > > > >- addressing modes which studies could not find a single use of. > > (PDP-11 autodecrement deferred - omitted from the VAX.) > > This example differs somewhat from his others, which were basically > deliberate design decisions that are not now generally seen as the > wise way to go. Actually, I've used this addressing mode to scan backwards in a linked list. Extremely handy. I've seen other uses for it. It was not, however, used by any *DEC* compiler. That's what did it in. Apart from that, there was something about the symmetry of the machine (lacking in the VAX) that was useful in itself. For one thing, it allowed compiler writers to very quickly take advantage of the PDP-11's orthogonal architecture: get something working with one instruction in one addressing mode, then extend it from there with a flick of the wrist. > DEC's pdp-10 instruction set had the same oddities. The machine had > a plethora of no-op equivalent instructions My favorite will always be the "JUMP" instruction, which of course does not jump since there's no condition associated with it. --Paul -- This is my address: ph@ama.caltech.edu This is UUCP: ...!{decwrl,uunet}! This is my address on UUCP: ...!{decwrl,uunet}!caltech.edu!ama!ph Any questions? "Does Emacs have the Buddha nature?" --Paul Hardy "Yow!" --Zippy
dik@cwi.nl (Dik T. Winter) (03/07/91)
(Later tonight I will respond to some questions from John Mashey, now this) In article <1991Mar5.080911.12759@ingres.Ingres.COM> jpk@ingres.com (Jon Krueger) writes: > From article <3062@charon.cwi.nl>, by dik@cwi.nl (Dik T. Winter): > > So this is > > contradictionary to John Mashey's remark that modern machines have lower > > subroutine call overhead. That depends on your definition of modern. > > And your definition of "lower", apparently. Since your speedups are > achieved both from inlining and nonportable assembler, it's impossible > to tell how much function call overhead contributes. Even if it's > entirely responsible, it doesn't look expensive to me. It is entirely responsible. > Since it's > probably not entirely responsible, it looks even cheaper. Finally, > even if you could guarantee me a portable 30% overall speedup, your > code isn't a real program, it's a set of support routines. What > speedups have you measured in real programs? This was a real program that was using those support routines. The speedup was *not* for the routines themselves but for a program running from 1.5 to 155 seconds, depending on the architecture. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
jsp@glia.biostr.washington.edu. (Jeff Prothero) (03/08/91)
In article <4882@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes: In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) I've seen MetaWare's High C do something very close to this on the '386. I found this depressing at the time, as I couldn't find a nice way to make *my* code generator do it. Also a little surprising, since High C (at least the version I was playing with) was not a strongly optimising compiler, just had lots of local tweaks. jsp@milton.u.washington.edu -- Jeff Prothero (jsp@u.washington.edu) <std disclaimer> Biological Structure Graphics Lab, University of Washington