hrubin@pop.stat.purdue.edu (Herman Rubin) (03/09/91)
There clearly is a major disagreement on what "should" be in an architecture or language. One of the arguments given against including assembler code is that compiler optimization is at least made more difficult, and portability is lost. What people like Montgomery, Silverman, and I, as well as many others, have pointed out is that if the operation is not in the language, the compiler cannot do a good job of getting the architecture to do it. Furthermore, while one can always simulate what is wanted by a clumsy sequence of operations in the language, this is at least extremely likely to generate efficient code. Optimizing compilers "know" about certain means of translating the language into organized combinations of hardware operations, and have some abilities to pick efficient ones. But they cannot include things about which they do not have any cognizance. The "solution" I suggest is to allow the programmer, etc., to set up idioms in whatever syntax is easiest to use for that programmer, and to provide the translations into some adequate intermediate language. There may be, and in fact should be, alternate translations. Even the addition of two vectors to produce a result should have different C code on different machines. Presumably, those types of expressions which are found useful will eventually get into the languages. But the expressions themselves will have considerable portability, although if the expression is unknown in the target dialect, a dictionary will have to be provided. There are two reasons for posting this to comp.arch as well as comp.lang.misc. For one, much of the discussion has been there, and it seems that many of the posters there consider language architecture to be architecture. For another, by having this type of "portable" representation of what people want computed, hardware designers may learn something about costs and tradeoffs which they are not getting now. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
gsteckel@vergil.East.Sun.COM (Geoff Steckel - Sun BOS Hardware CONTRACTOR) (03/09/91)
In article <7499@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >There clearly is a major disagreement on what "should" be in an architecture >or language. One of the arguments given against including assembler code is >that compiler optimization is at least made more difficult, and portability is >lost. One thing lost in the recent debate is the possibility of applying `object' methodology to the complex operation debate. The C language already has the ability to return a structure from a function; the `divrem' function could return struct div_and_rem { int quotient ; int remainder } ; C++ (admitting its many faults) can implement this sort of operation easily. The usage (not the formal definition) of the language (and object-like languages in general) includes the idea of `standard object heirarchies'. These are currently distributed for C++ from a number of sources. This seems to be a way for the {numerical analysts, crypto specialists, molbiogeneticists, etc.} to introduce their special operators, distribute them, standardize them, etc. For efficiency of human time, the users of the extended functionality would have to organize a bit to coordinate development and distribution of prototypes and specifications. Once common sets of these functions stabilized a bit, it would then be more likely that vendors would be willing to invest in special efforts to improve versions for particular hardware. Just a thought... geoff steckel (gwes@wjh12.harvard.EDU) (...!husc6!wjh12!omnivore!gws) Disclaimer: I am not affiliated with Sun Microsystems, despite the From: line. This posting is entirely the author's responsibility.
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/10/91)
In article <4748@eastapps.East.Sun.COM>, gsteckel@vergil.East.Sun.COM (Geoff Steckel - Sun BOS Hardware CONTRACTOR) writes: > In article <7499@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > >There clearly is a major disagreement on what "should" be in an architecture > >or language. One of the arguments given against including assembler code is > >that compiler optimization is at least made more difficult, and portability is > >lost. > > One thing lost in the recent debate is the possibility of applying `object' > methodology to the complex operation debate. The C language already has > the ability to return a structure from a function; the `divrem' function > could return struct div_and_rem { int quotient ; int remainder } ; > > C++ (admitting its many faults) can implement this sort of operation easily. ........................... These suggestions are not enough, for two reasons. Structs are not adequate for two reasons. One is that the results need not be stored, but frequently, possibly even predominately, they should be kept in registers for proximate use. The other is that struct is too clumsy, as there is no automatic reason (although in some cases hardware may do it for simplicity) to assign adjacent locations for the results. Despite some formal similarities, a list is not a struct. Some examples of this are dividing float by float with an integer quotient and a floating remainder, which is a widely used operation, and the frexp function in C, which writes a float as as one whose magnitude is in [.5,1) and puts the power of 2 into a memory location y = frexp(x,&n) instead of y,n = frexp(x) The other is notation. I see not good reason why I should not be able to write q,r = a/b for both the integer and the floating case, with q being the integer quotient in both cases. I would like to write exp,mant =UP x, where exp is the exponent and mant the mantissa for the floating point x, etc. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
chip@tct.uucp (Chip Salzenberg) (03/12/91)
According to hrubin@pop.stat.purdue.edu (Herman Rubin): >The "solution" I suggest is to allow the programmer, etc., to set up idioms >in whatever syntax is easiest to use for that programmer, and to provide the >translations into some adequate intermediate language. A spec, Herman. Surely you can afford some few hours out of your busy academic schedule to write a spec. -- Chip Salzenberg at Teltronics/TCT <chip@tct.uucp>, <uunet!pdn!tct!chip> "Most of my code is written by myself. That is why so little gets done." -- Herman "HLLs will never fly" Rubin
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/12/91)
In article <7571@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > The other is that struct is too clumsy, as there is no automatic reason > (although in some cases hardware may do it for simplicity) to assign adjacent > locations for the results. Despite some formal similarities, a list is not a > struct. Data point: Q has both ``struct'' and ``bstruct''. The former makes no guarantees about memory arrangement; the latter makes similar guarantees to ANSI C's struct. As the language's sole user for now, I can assure you that it's normal practice to have a function return a struct, and to write (q,r) = div(a,b). (Yes, you can make a struct of lvalues.) In the C code output by q2c, the place to store a struct returned from a function is currently passed as an initial argument, in this case pointers to q and r. Somewhere in my optimization list is having a function use registers for this when possible. ---Dan
alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (03/13/91)
hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > Even the addition of two vectors to > produce a result should have different C code on different machines. For the purposes of design diversity? Are you sure you did not want to type `machine' instead of `C' there? > But the expressions themselves will have considerable > portability, although if the expression is unknown in the target dialect, a > dictionary will have to be provided. Was hat man im Zusammenhang mit Compilertechnologie unter einem `Dictionary' zu verstehen? Unter `Portabilitaet von Ausdruecken'? [sorry, i could not resist :-)] -- Alexander Vrchoticky | alex@vmars.tuwien.ac.at TU Vienna, CS/Real-Time Systems | +43/222/58801-8168 "those who feel they're touched by madness, sit down next to me" (james)
jbuck@galileo.berkeley.edu (Joe Buck) (03/14/91)
In article <7571@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: |> In article <4748@eastapps.East.Sun.COM>, gsteckel@vergil.East.Sun.COM (Geoff Steckel - Sun BOS Hardware CONTRACTOR) writes: |> > One thing lost in the recent debate is the possibility of applying `object' |> > methodology to the complex operation debate. The C language already has |> > the ability to return a structure from a function; the `divrem' function |> > could return struct div_and_rem { int quotient ; int remainder } ; |> These suggestions are not enough, for two reasons. Structs are not adequate |> for two reasons. One is that the results need not be stored, but frequently, |> possibly even predominately, they should be kept in registers for proximate |> use. The other is that struct is too clumsy, as there is no automatic reason |> (although in some cases hardware may do it for simplicity) to assign adjacent |> locations for the results. Despite some formal similarities, a list is not a |> struct. Who said that a struct needs adjacent memory locations? gcc returns a two-int (or two-float, or two-double) struct in a pair of registers. Functions can be declared inline, so you might even be able to return three or more results this way. It really does behave like a list. Check out this short program: struct div_and_rem { int quotient; int remainder; }; inline struct div_and_rem divide(int dividend, int divisor) { struct div_and_rem t; t.quotient = dividend/divisor; t.remainder = dividend % divisor; return t; } foop (int x, int y) { struct div_and_rem t; t = divide(x,y); printf ("%d %d\n", t.quotient, t.remainder); } For a 68020, gcc generates LC0: .ascii "%d %d\12\0" .even .globl _foop _foop: link a6,#0 moveml #0x3800,sp@- movel a6@(8),d0 movel a6@(12),d1 movel d0,d4 divsl d1,d4 movel d4,d2 divsll d1,d1:d0 movel d1,d3 movel d3,sp@- movel d2,sp@- pea LC0 jbsr _printf moveml a6@(-12),#0x1c unlk a6 rts It's not optimal: there are two divides. Still, if you write kindly I'll bet you could talk RMS into seeing if he could recognize the pattern and produce one divide in gcc 2.0 (or for some higher number). But notice the important thing: the struct has vanished! The routine truly returns two results, in registers, without requiring adjacent memory locations. -- Joe Buck jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
sef@kithrup.COM (Sean Eric Fagan) (03/14/91)
In article <11964@pasteur.Berkeley.EDU> jbuck@galileo.berkeley.edu (Joe Buck) writes: >Check out this short program: [code deleted] >It's not optimal: there are two divides. Still, if you write kindly >I'll bet you could talk RMS into seeing if he could recognize the >pattern and produce one divide in gcc 2.0 (or for some higher number). foop: pushl %ebp movl %esp,%ebp pushl %ebx movl 8(%ebp),%eax cltd idivl 12(%ebp) movl %eax,%ecx movl %edx,%ebx pushl %ebx pushl %ecx pushl $.LC0 call printf leal -4(%ebp),%esp popl %ebx leave ret It was already done, for gcc 1.3[89]. Good work, eh? Yes, the code could be optimal: gcc could look at the entire function, and not bother moving from eax to ecx, just pushing them directly. But those are small amounts (two or three cycles, I believe), and are *much* better than the 15+ cycles the extra divide would have taken. -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/15/91)
In article <1991Mar14.013109.16636@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: > In article <11964@pasteur.Berkeley.EDU> jbuck@galileo.berkeley.edu (Joe Buck) writes: > >Check out this short program: > > [code deleted] > >It's not optimal: there are two divides. Still, if you write kindly > >I'll bet you could talk RMS into seeing if he could recognize the > >pattern and produce one divide in gcc 2.0 (or for some higher number). > > > foop: | pushl %ebp | movl %esp,%ebp | pushl %ebx | movl 8(%ebp),%eax | cltd | idivl 12(%ebp) | movl %eax,%ecx | movl %edx,%ebx | pushl %ebx | pushl %ecx | pushl $.LC0 | call printf | leal -4(%ebp),%esp | popl %ebx | leave | ret > > It was already done, for gcc 1.3[89]. Good work, eh? Yes, the code could > be optimal: gcc could look at the entire function, and not bother moving > from eax to ecx, just pushing them directly. But those are small amounts > (two or three cycles, I believe), and are *much* better than the 15+ cycles > the extra divide would have taken. This example has far too many loads and stores. Possibly this MIGHT not be too important for a division, but how about something like frexp? The operations may be register-register, in which case all these loads and stores are inappropriate. Also, something this simple should be inlined; if a subroutine call, there is the additional save/restore overhead which has to be done somewhere. The real need is for the languages and compilers to allow the user to introduce idioms, with translation into machine primitives. In the above example, idivl is such a primitive, and should be considered no differently than the various types of subtraction. The relevant idiom in the above example would be q,r = x/y, where the / is overloaded some more. If instead r, x, and y were floating point, and q integer, the code would be quite different, Mathematical notation has been developing over centuries, and we still see many new idioms and overloadings. It is not necessary to have a committee to decide what notation will be allowed and what will not. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
sef@kithrup.COM (Sean Eric Fagan) (03/15/91)
In article <7850@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >| pushl %ebp >| pushl %ebx >| movl 8(%ebp),%eax >| idivl 12(%ebp) >| pushl %ebx >| pushl %ecx >| pushl $.LC0 >| leal -4(%ebp),%esp >| popl %ebx >> >This example has far too many loads and stores. 9 memory references. 4 necessary for the calling sequence gcc conforms to. Three necessary to call another function, since the 'bcs' does not specify calling routines with values in registers. Two more because arguments passed in are not in registers. Leaving a total of 0 unnecessary loads and stores. Could this be improved? Certainly. But not by much. >Possibly this MIGHT not be >too important for a division, but how about something like frexp? I think it was frexp() that I wrote for berkeley using gcc with inline assembly. Uhm... I think it had 7 loads and stores, all but two or three of which would disappear if the function got inlined and optimized. >The >operations may be register-register, in which case all these loads and >stores are inappropriate. Herman: where are you supposed to get the values from? Magic? >Also, something this simple should be inlined; >if a subroutine call, there is the additional save/restore overhead which >has to be done somewhere. Jesus. Guess what, herman: the routine *was* inlined. Take a look at the original source code again. I think you're just complaining because you seem to thing there's something wrong, when there really isn't. -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/15/91)
In article <2378@tuvie.UUCP> alex@vmars.tuwien.ac.at (Alexander Vrchoticky) writes: > hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > > Even the addition of two vectors to > > produce a result should have different C code on different machines. > For the purposes of design diversity? Are you sure you did not want to > type `machine' instead of `C' there? #ifdef VECTOR for (i = 0; i < k; i++) y[i] = neg(x[i + r - k]); for (i = k; i < r; i++) y[i] = x[i - k]; for (i = 0; i < r; i++) x[i] = y[i]; #else for (i = 0; i < k; i++) y[i] = x[i + r - k]; for (i = r - 1; i >= k; i--) x[i] = x[i - k]; for (i = 0; i < k; i++) x[i] = neg(y[i]); #endif Well, okay, this is negacyclic rotation, but the point is the same. ---Dan
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/15/91)
In article <2378@tuvie.UUCP>, alex@vmars.tuwien.ac.at (Alexander Vrchoticky) writes: > hrubin@pop.stat.purdue.edu (Herman Rubin) writes: > > > Even the addition of two vectors to > > produce a result should have different C code on different machines. > > For the purposes of design diversity? Are you sure you did not want to > type `machine' instead of `C' there? My statement is correct as it stands. The optimal C code to add two vectors is different on different machines, strange as it may seem. The different codes do exactly the same thing, IF all one wants to do is to add the vectors. If the index has to be used, or in some other cases, one code may be better than another because of other considerations. > > But the expressions themselves will have considerable > > portability, although if the expression is unknown in the target dialect, a > > dictionary will have to be provided. > > Was hat man im Zusammenhang mit Compilertechnologie unter einem `Dictionary' > zu verstehen? Unter `Portabilitaet von Ausdruecken'? > > [sorry, i could not resist :-)] Ich verstehe Deutsch, aber nicht sehr gut. A human has little difficulty in translating between two computer languages, and not too much problem between "natural" languages. Computer programs seem to have much more of a problem. I have relatively little difficulty in translating between mathematical constructions and HLL or machine constructions, but the current communication channels lack the flexibility for even fairly efficient compilers to take over. The compiler writer has provided the translation between x = y-z and machine code in such a way that the compiler can take into account types, locations, etc., in producing good code. The same type of translation should be available for other constructs. If a programmer must make the detailed translation, or even does this for reasons of efficiency, in each case, the language designers and architects will not see the uses of the constructs. If the programmer instead can use the dictionary, the construct is apparent, rather than its expansion, which is likely to mean little. How many would recognize the code for B'A^(-1)C, A positive definite, without expecting that to be done? -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
rcg@lpi.liant.com (Rick Gorton) (03/16/91)
The example was: foop (int x, int y) { struct div_and_rem t; t = divide(x,y); printf ("%d %d\n", t.quotient, t.remainder); } The 80X86 assembly code (from gcc 1.3) was: > foop: > pushl %ebp PPPP > movl %esp,%ebp PPPP > pushl %ebx PPPP > movl 8(%ebp),%eax > cltd > idivl 12(%ebp) > movl %eax,%ecx XXXX > movl %edx,%ebx XXXX > pushl %ebx XXXX > pushl %ecx XXXX > pushl $.LC0 XXXX > call printf XXXX > leal -4(%ebp),%esp EEEE > popl %ebx EEEE > leave EEEE > ret EEEE > In article <7850@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: >This example has far too many loads and stores. Possibly this MIGHT not be >too important for a division, but how about something like frexp? The >operations may be register-register, in which case all these loads and >stores are inappropriate. Also, something this simple should be inlined; >if a subroutine call, there is the additional save/restore overhead which >has to be done somewhere. Sir: you have TOTALLY missed the boat here. The example was written as a subroutine/function. It was NOT written as a piece of inline code. See the lines marked "XXXX"? Those are for the user specified, machine translatable, machine primitive called "printf". It is a subroutine call to print the quotient and remainders. The lines marked "PPPP" are the subroutine PROLOGUE instructions, which are required because of the nature of the 80X86 architecture and calling conventions, and because this routine is NOT a leaf routine (it calls another routine). The instructions marked "EEEE" are the subroutine EPILOGUE instructions. So after getting to the heart of the matter, it takes 3 instructions to perform the "divide" idiom when the two inputs are ON THE STACK: > movl 8(%ebp),%eax > cltd > idivl 12(%ebp) You have been exceptionally vocal about the poor code quality of compilers, and yet you come up with a complaint like this? Your commentary leads me to believe that you are annoyed that your code executes slowly, but that you haven't made the slightest effort to understand the reasons why. Kind of like someone putting water in the gas tank of their car, and then telling the manufacturer of the engine that they need to improve the efficiency of the engine, because it works so poorly. And yet you want me, the compiler writer to try to fix your problem? Think again. What programming language do you use? What kind of hardware do you execute your programs on? Do you actually WRITE code at all? Or do you just complain about how slowly it executes? >Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 >Phone: (317)494-6054 >hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP) -- Richard Gorton rcg@lpi.liant.com (508) 626-0006 Language Processors, Inc. Framingham, MA 01760 Hey! This is MY opinion. Opinions have little to do with corporate policy.
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/16/91)
In article <1991Mar14.195853.27398@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: > In article <7850@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: ...................... > >This example has far too many loads and stores. > > 9 memory references. 4 necessary for the calling sequence gcc conforms to. > Three necessary to call another function, since the 'bcs' does not specify > calling routines with values in registers. Two more because arguments > passed in are not in registers. Leaving a total of 0 unnecessary loads and > stores. Could this be improved? Certainly. But not by much. > >Possibly this MIGHT not be > >too important for a division, but how about something like frexp? > > I think it was frexp() that I wrote for berkeley using gcc with inline > assembly. Uhm... I think it had 7 loads and stores, all but two or three of > which would disappear if the function got inlined and optimized. > > >The > >operations may be register-register, in which case all these loads and > >stores are inappropriate. > > Herman: where are you supposed to get the values from? Magic? > Computing q,r = a/b should not even consider a subroutine call. The arguments, or at least most of them, are likely to be the results of previous operations, and hence already in registers. The results are likely to be used in proximal instructions, and hence kept in registers rather than being stored. This IS what decent compilers do for the "standard" operations of + - * / ^ | &. Other operations should be treated in the same way, and not as subroutine calls. > >Also, something this simple should be inlined; > >if a subroutine call, there is the additional save/restore overhead which > >has to be done somewhere. > > Jesus. Guess what, herman: the routine *was* inlined. Take a look at the > original source code again. It is inlined, but it is still in the nature of a subroutine call. These "unusual" constructs should be treated as having general arguments, usually not in specified locations. The example given loaded the arguments and stored the results, even if that were inlined. Somewhat better would be to have the arguments in registers specific to the inlining procedure, and the results in other specified registers. This is not what a decent compiler now does for the operations it understands. The expansion should allow adding to THAT set of operations. To summarize, what should be provided is to allow the compiler to accept the idiom producer's insight into the various ways the job can be done using the machine instructions or previous idioms, and optimize using this information. As I understand an inlined subroutine call, it could not merely issue the instruction idivl a,b,q,r or for some machines something similar to idivl a,b,q movl q',r where q' is the register adjacent to q, assuming that things were in the appropriate registers, and only load/store as needed. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)