zrra07@backus (Randall R. Appleton) (07/05/90)
I have a simple question: Does anyone know what sort of speed-up one gets by going from a good implementation of some algorithm in a third generation language (C, Fortran) and a good optimizing compiler to hand-coded assembly?

In other words, if I take my average well written program, compiled with a good optimizing compiler, and re-write it in assembler, what sort of speedup should I expect to see?

Thanks,
Randy
schow@bcarh185.bnr.ca (Stanley T.H. Chow) (07/06/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

The short answer is: Depends.

Okay, you ask, on what does it depend? Well, it depends on just about every single adjective you used in your question: How good is "a good implementation"? Which algorithm is the "some algorithm"? Which "good optimizing compiler"? Whose average? Over which set of programs? "Well written" for what purpose? Who will do the rewriting? For which computer?

Basically, performance is like anything else: the variation between people (using identical tools) is larger than the variation between tools. So the "optimum" strategy is to get the best person for the job.

You really will have to specify more details to get any meaningful answer to your question. For example, there are some Fortran compilers that are extremely good; for many programs, the code is just about optimal. The meaning of "well written" is very context dependent. Some people insist on clear, understandable code; others want it fast; still others want it to be robust and bullet-proof. Which properties does your program have, and which properties do you want to preserve in assembler?
--
Stanley Chow BitNet: schow@BNR.CA
BNR UUCP: ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831 ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh185
Me? Represent other people? Don't make them laugh so hard.
cik@l.cc.purdue.edu (Herman Rubin) (07/06/90)
In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question: Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
>
> In other words, if I take my average well written program, compiled
> with a good optimizing compiler, and re-write it in assembler, what sort
> of speedup should I expect to see?

This only looks simple. The results might even be drastically different on different machines. Even the choice of algorithm might depend quite heavily on the specific machine.

The infamous 4.2BSD frexp() was an example of a problem which can be efficiently (and even very simply) coded in assembler, but which cannot be efficiently handled by such limited languages as those mentioned. The things missing from these languages are the operations of pack and unpack; utterly trivial, but not there. In fact, even less is needed on most machines.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik (UUCP)
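[Editor's note: a minimal C sketch of the "unpack" operation Rubin describes. This is an illustration, not code from the thread; it assumes IEEE-754 64-bit doubles and a 64-bit unsigned long long. On a machine with an unpack instruction this is one or two instructions; portable C must round-trip through a union.]

    #include <stdio.h>

    /* extract the unbiased exponent and the 52 fraction bits of a double */
    static void unpack(double x, int *exp, unsigned long long *frac)
    {
        union { double d; unsigned long long u; } v;
        v.d = x;
        *exp  = (int)((v.u >> 52) & 0x7FF) - 1023;
        *frac = v.u & 0xFFFFFFFFFFFFFULL;
    }

    int main(void)
    {
        int e;
        unsigned long long f;
        unpack(0.75, &e, &f);                 /* 0.75 = 1.5 * 2^-1 */
        printf("exp %d frac %llx\n", e, f);   /* prints exp -1 */
        return 0;
    }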
toma@tekgvs.LABS.TEK.COM (Tom Almy) (07/06/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

Simple Answer: That depends...

If the program implements an algorithm that is well suited for the compiler (standard matrix multiplication in Fortran, for example) or is a program that is a standard benchmark for which *every* compiler writer optimizes, you might not see any improvement at all. In fact it might run slower if you are not as clever as the compiler writer!

On the other hand, if the algorithm cannot be efficiently represented in the language (string manipulation in Fortran, complex numbers in C) or there are special cpu instructions you can take advantage of that the compiler doesn't, then you can get major improvements, even an order of magnitude.

In general, there is only a small portion of the code that uses most of the execution time, and by coding those portions in assembler you can get most of the assembler performance while still enjoying the convenience of a high level language for the bulk of the code.

If you are concerned about code size, there is still a major improvement to be offered by assembler, typically because you can custom tailor subroutine calling conventions and register usage. But then again many modern compilers let you do the same.

(I hope this won't be the start of some "language war" :-)

Tom Almy
toma@tekgvs.labs.tek.com
Standard Disclaimers Apply
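[Editor's note: one common shape of Almy's "assembler for the hot spots" strategy, sketched in C with hypothetical names. saxpy_asm is an assumed hand-written assembly routine, not shown; the #else branch lets the program build and run without it.]

    /* y[i] += a * x[i] -- the hot inner loop, isolated behind one call */
    void saxpy(int n, float a, const float *x, float *y)
    {
    #ifdef HAVE_SAXPY_ASM
        extern void saxpy_asm(int n, float a, const float *x, float *y);
        saxpy_asm(n, a, x, y);      /* hand-coded version */
    #else
        int i;
        for (i = 0; i < n; i++)     /* portable version the compiler sees */
            y[i] += a * x[i];
    #endif
    }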
henry@zoo.toronto.edu (Henry Spencer) (07/06/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

It can be negative. Good optimizing compilers can often do a better job on large bodies of code than humans can. Small code fragments often do better when carefully crafted by humans, although this is a very time-consuming activity.

Improving the algorithm is usually a more cost-effective way of boosting performance.
--
"Either NFS must be scrapped or NFS | Henry Spencer at U of Toronto Zoology
must be changed." -John K. Ousterhout | henry@zoo.toronto.edu utzoo!henry
gillett@ceomax..dec.com (Christopher Gillett) (07/07/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

This is, among other things, an interesting question, but I certainly wouldn't characterize it as "simple" :-). There are a huge number of factors to consider in making this determination, and many of them are subjective and/or incalculable. Here are some things that you should consider:

1. Nature of the program: Something that is fairly linear and spends all its time calling system functions is not likely to be sped up by conversion to assembler. Something fairly complex that uses lots of control transfers, and is reasonably non-obvious (itself a very subjective term), is more likely to be improved.

2. How good is good? The quality of the compiler that you are using is significant. Compilers range from utter trash to seemingly clairvoyant in their abilities to generate efficient code. Since there's no really good metric that you can apply to an arbitrary compiler to determine its effectiveness, you're left making a judgement call about whether or not the compiler is a better assembly language programmer than you.

3. How talented are you? The art of assembly language programming is rapidly dying out. Many people are really gifted high level programmers who fall apart in the assembly world (absolutely no jabs at anybody at all intended here, btw :-) ). Problems are solved much differently in the two different worlds. If you try to do a straight translation from high level to low level, you're not doing much better than the compiler (in fact, probably much worse). However, if you are a gifted machine-level hacker, and you really understand the program you want to convert (like, did you write it yourself?), then you might gain significantly by converting the code.

These are just the first few things that pop into mind. My opinion on the whole matter is that you should consider language issues long before putting down any code. The very nature of the program should dictate which language to use. Further, it's been my observation that code conversion jobs (whether from one high level language to another, or from a high level to a low level, or low to high) are expensive, time consuming, and not very rewarding.

>Thanks,

You're welcome!

>Randy

/Chris
---
Christopher Gillett gillett@ceomax.dec.com
Digital Equipment Corporation {decwrl,decpa}!ceomax.dec.com!gillett
Hudson, Taxachusetts (508) 568-7172
nmm@cl.cam.ac.uk (Nick Maclaren) (07/07/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

It is impossible to predict, in general, but the following rules of thumb may be useful. Consider the minimal program to take unit time (or space); the following multiplicative factors are good guidelines:

    Competence of assembler programmer   1x - 5x
    Competence of 3 GL programmer        1x - 3x
    Suitability of 3 GL for algorithm    1x - 10x
    Competence of compiler's code        1x - 3x
    Degree of optimisation               1x - 10x

You can therefore expect the assembler to be anything from 900 times faster (the four compiled-side factors at their worst: 3 x 10 x 3 x 10 = 900) to 5 times slower (the assembler programmer at his worst)! This excludes even more drastic factors (in either direction), caused by gross mistakes or misunderstandings (e.g. dropping into a floating-point emulator to do 2-D array indexing).

Nick Maclaren
University of Cambridge Computer Laboratory
nmm@cl.cam.ac.uk
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/07/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>... if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

The speedup will vary. A lot. There are rare examples where well-written assembler code was slower.
--
Don D.C.Lindsay
mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) (07/07/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
In other words, if I take my average well written program, compiled
with a good optimizing compiler, and re-write it in assembler, what sort
of speedup should I expect to see?
Lots of good answers, but everyone missed my favorite reason for
compilers generating lousy code - they have a machine that's
ill-matched to the language in question. The best example is to take a
language that really expects there to be a stack, and compile it for a
machine that doesn't have one. An excellent compiler will decide which
routines aren't possibly recursive and compile them into code that
doesn't use the stack, but other routines will have to simulate the
stack in some (probably slow & ugly) manner. I never saw a compiler
that smart, but it's been years since I worked on such a machine.
Given that all routines are compiled as if they might be recursive,
recoding in machine language in such a case - even if the coder is
only moderately competent - to use fixed memory locations for local
variables and pass arguments in registers instead of doing those
things on a fake stack can easily run an order of magnitude faster,
and a factor of 2 smaller.
<mike
--
So this is where the future lies Mike Meyer
In a beer gut belly; In an open fly mwm@relay.pa.dec.com
Brilcreamed, acrylic, mindless boys decwrl!mwm
Punching, kicking, making noise
jlg@lanl.gov (Jim Giles) (07/07/90)
From article <373@e2big.mko.dec.com>, by gillett@ceomax..dec.com (Christopher Gillett):
> [...]
> 3. How talented are you? The art of assembly language programming
> is rapidly dying out. Many people are really gifted high level
> programmers who fall apart in the assembly world (absolutely no
> jabs at anybody at all intended here, btw :-) ). [...]

This is an interesting point which is more complex than it appears. I can do assembly code reasonably well on most RISC-ish type machines (which include Crays - only one addressing mode, simple instructions, etc.). I can't do well at all with CISC machines (I can't see the point in taking the trouble to learn all those modes and such). So, part of your point three is really a reference to personality.

However, you left out target machine as one of your points. The type of hardware you are targeting can have a very strong impact on how well you can do vs. a compiler. On a simple scalar machine, you might not do as well as the compiler, since implementors have long since found ways of generating good code for such machines (besides, there really isn't much you can do - scalar non-pipelined code has only one "degree of freedom", and that is instruction selection - ordering and register allocation are usually obvious). On a vector or parallel machine, one loop that _you_ vectorize but the compiler didn't can gain a _lot_!

J. Giles
oz@yunexus.yorku.ca (Ozan Yigit) (07/07/90)
In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

When I was a young & budding assembler hack, I tried this on several compiler outputs. At that time, VMS fortran and C compilers were considered very good. I was able to out-optimize them by as much as 30%. [As I said, I was young :-)]

These days, compilers like GCC come very close, and while I am convinced that for a moderate chunk of code, a determined hacker can still out-optimize a compiler, I fail to see its value except in some very special cases. [I had to improve a chess program once, by re-writing its bitmap routines (which got called zillions of times) in assembler. It was about 30 lines of code, and made about 40% speed improvement]

oz
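[Editor's note: a hypothetical example of the kind of bitmap routine oz describes -- counting the set bits in a 64-bit board mask. A loop this small, called "zillions of times", is where a 40% win can hide.]

    static int popcount64(unsigned long long b)
    {
        int n = 0;
        while (b) {
            b &= b - 1;   /* clear the lowest set bit */
            n++;
        }
        return n;
    }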
gillies@m.cs.uiuc.edu (07/07/90)
It can be hard to find a speedup on CISC machines, since caching, alignment, scoreboarding, and even paging effects may be hard to predict. For instance, the Motorola 68020 Manual admits that you can only hand-wave guesses at the execution time of an instruction stream -- this forces the dedicated tuner to run hordes of benchmarks.

This is my greatest complaint about CISC architecture -- from an optimization standpoint, I sneer religiously at nondeterministic low-level hardware hacks in high-performance computer design (I don't even like caches, scoreboards, or paging -- they're just fads).

Maybe it's because I think all future computers should run real-time operating systems, and because I do research in real-time system scheduling and optimization, that I hold these views.
cik@l.cc.purdue.edu (Herman Rubin) (07/07/90)
In article <373@e2big.mko.dec.com>, gillett@ceomax..dec.com (Christopher Gillett) writes:
> In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
> >I have a simple question: Does anyone know what sort of speed-up
> >one gets by going from a good implementation of some algorithm in a
> >third generation language (C, Fortran) and a good optimizing compiler
> >to hand-coded assembly?
.........................
> 3. How talented are you? The art of assembly language programming
> is rapidly dying out. Many people are really gifted high level
> programmers who fall apart in the assembly world (absolutely no
> jabs at anybody at all intended here, btw :-) ). Problems are
> solved much differently in the two different worlds. If you try
> to do a straight translation from high level to low level, you're
> not doing much better than the compiler (in fact, probably much
> worse). However, if you are a gifted machine-level hacker, and
> you really understand the program you want to convert (like,
> did you write it yourself?), then you might gain significantly
> by converting the code.

I am inclined to doubt that those suffering from lack of understanding of the relation between architecture and "how to do it efficiently" will even be gifted high level programmers. To produce good code, it is necessary to understand the cost considerations, and I find it difficult to believe that this can be done at the limited level of HLLs.

There are two problems with the present assembler. First, there is the gosh-awful syntax and arbitrariness of them. Suppose that for appropriate x, y, and z, the operation x = y - z can be performed by a single machine instruction. Now I understand what the machine is doing, but how do I communicate it to the assembler? Those who have used more than one assembler are likely to have seen assemblers with the orders of x, y, and z quite different.

There is also the attitude in the computer field that the user need not take these things into account, that the compiler is a great genius and should do it all for you. This is similar to the attitude which existed in the public schools almost universally, and is not completely gone yet, that reading should be learned only by the whole word method, and the idea of sounding out words, especially in English where there are "so many exceptions", should not be taught until much later. I would put the hardware of most machines I know as considerably simpler than any HLL.

> These are just the first few things that pop into mind. My opinion on the
> whole matter is that you should consider language issues long before putting
> down any code. The very nature of the program should dictate which language
> to use. Further, it's been my observation that code conversion jobs (whether
> from one high level language to another, or from a high level to a low level,
> or low to high) are expensive, time consuming, and not very rewarding.

I would disagree somewhat. I have found starting with a Fortran program usually not helpful, but that is not the case with C. Even the talented human can learn from the apparently stupid observations made by the less talented, including the compiler. Others have observed that reverse engineering their own programs can result in improvements. But one is unlikely to produce an artist by having the apprentice color within the lines.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik (UUCP)
tbray@watsol.waterloo.edu (Tim Bray) (07/08/90)
zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

Isn't "average well written program" an oxymoron?

The speedup, as several people have pointed out, is difficult to predict. In general, however, the *cost-effectiveness* is liable to be negative.

First, if the program is used heavily enough that hard-won speedups look attractive, it will require regular maintenance as people's needs change. Performing this maintenance on assembler code is apt to be dramatically more difficult and burn tons of expensive programmer time.

Second, a *really* cost-effective way to make a program faster is to run it on this year's hot chip, probably twice as fast as your current computer if it's > 18 months old. But if you've committed to one architecture by going to assembler, you've closed that door.

Third, as some have already said, but it can't be said too often, the expected performance improvement per unit effort from repeatedly profiling your code and whittling away at the algorithms in the critical areas is typically dramatically greater than that for assembly hacking. And the results are portable.

Yup, assembler hacking is a dying art, thank goodness. But I'm glad I'm old enough to have done it. Doing serious work with VAX assembler was out and out *fun* - I suppose hunting dinner with a stone knife was too.

Cheers, Tim Bray, Open Text Systems
cik@l.cc.purdue.edu (Herman Rubin) (07/08/90)
In article <3300143@m.cs.uiuc.edu>, gillies@m.cs.uiuc.edu writes:
> It can be hard to find a speedup on CISC machines, since caching,
> alignment, scoreboarding, and even paging effects may be hard to
> predict. For instance, the Motorola 68020 Manual admits that you can
> only hand-wave guesses at the execution time of an instruction stream
> -- this forces the dedicated tuner to run hordes of benchmarks.
>
> This is my greatest complaint about CISC architecture -- from an
> optimization standpoint, I sneer religiously at nondeterministic
> low-level hardware hacks in high-performance computer design (I don't
> even like caches, scoreboards, or paging -- they're just fads).

So the fact that a multiplication takes a variable length of time depending on the numbers is a flaw in the machine? I do not think so. There are even clearer examples. I have algorithms for which a tool operation is finding the spacing between the 1's in a random bit stream. Of course this is non-deterministic. Also, it is even necessary to have planned exceptions, which in some cases are rather likely.

There are certainly situations where a non-optimized CISC machine will beat an optimized RISC machine with the same basic components. I have argued for some operations on machines which are deterministic, even, but which will have hardware switches of essentially no cost, and would have conditional transfers in RISC, or even in current CISC.

> Maybe it's because I think all future computers should run real-time
> operating systems, and because I do research in real-time system
> scheduling and optimization, that I hold these views.

There are many situations which can not be done in real time. Also, the operating system should be a very minor part of the software, and do only what an operating system, and not user modules, must.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik (UUCP)
peter@ficc.ferranti.com (Peter da Silva) (07/08/90)
In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> But one is unlikely to produce an artist by having the apprentice color within
> the lines.

Disney, Fritz Freleng, Don Bluth, and other animators would disagree violently. There are some products of the human mind that are best created by teams of more and less talented individuals. Animators and apprentices... or humans and compilers.
--
Peter da Silva. `-_-' +1 713 274 5180. <peter@ficc.ferranti.com>
cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) (07/08/90)
From article <1797@apctrc.UUCP>, by zrra07@backus (Randall R. Appleton):
> I have a simple question: Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?

An example with VS-FORTRAN on an IBM-3090 with a vector processor:

    a. plain fortran matrix          1.0
    b. full optimized fortran        0.3
    c. IBM's ESSL hand coded library 0.1 or better

As another example, you could track down the standard LINPACK benchmarks from net-lib@ornl.gov. The report there gives a good summary of how LINPACK performs under different conditions.

Steve
usenet@nlm.nih.gov (usenet news poster) (07/09/90)
cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>in response to zrra07@backus (Randall R. Appleton):
>> I have a simple question: Does anyone know what sort of speed-up
>> one gets by going from a good implementation of some algorithm in a
>> third generation language (C, Fortran) and a good optimizing compiler
>> to hand-coded assembly?
>
>An example with VS-FORTRAN on an IBM-3090 with a vector processor:
>
>    a. plain fortran matrix          1.0
>    b. full optimized fortran        0.3
>    c. IBM's ESSL hand coded library 0.1 or better

It is my understanding that on the RS/6000, ESSL is implemented in Fortran, and IBM claims that most ports have experienced zero or negative gains in attempting to hand code assembler on the machine.

There have been a number of references to "the ill conceived notions" of higher level languages in this thread, but it is not at all clear to me what these are supposed to be. C was designed as a higher level alternative to assembler. On a RISC machine with a good compiler, a C programmer willing to work on his code ought to be able to do almost anything hand coded assembler can. This may involve looking at assembly output, recoding routines a few times, and testing alternative algorithms, but in the end you still have C (that will run on another processor, even if not as fast).

On a complicated processor (CISC, non-orthogonal registers, segmented addresses, etc.) assembler may do better because the compiler writers made simplifying assumptions about the processor, making some instructions and address constructs inaccessible. It seems like Intel processors will be the last bastion of assembly language.

>Steve

David States
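[Editor's note: the "looking at assembly output" step States mentions is mechanical on most Unix compilers:]

    cc -O -S hot.c    # writes hot.s, the assembly the compiler generated;
                      # read it, adjust the C source, and repeat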
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/09/90)
In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>I have a simple question: Does anyone know what sort of speed-up
>>one gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly?
>
>It can be negative. Good optimizing compilers can often do a better
>job on large bodies of code than humans can. Small code fragments
>often do better when carefully crafted by humans, although this is
>a very time-consuming activity.

Basically, I agree. Given the caveat that good compilers are quite rare, I have seen good compilers consistently outperform handwritten assembly code -- but humans could usually do local tweaks to compiler generated code to speed it up even a little more. IMHO, this is primarily because compilers usually don't do a very good job of modeling irregular or complex instructions, whereas humans often are great at using those, but write inferior code for the more mundane operations (e.g., they do a poorer job of allocating registers, aren't consistent about taking advantage of some properties, etc.). Notice that RISC architectures don't have many "funny" instructions, so compilers for them can do particularly well against humans....

Basically, optimized code generation is a search procedure -- if the compiler knows to search for the right things (which is learned in part by observing what the best human programmers do), the fact that computers can search faster and more consistently than humans will eventually make compiler technology generate better code than a typical human every time. Think about how computers play chess -- at first, clever heuristics were best, but lately the winners have focussed more on very fast search techniques, and now they are generally better than most human chess players. Yes, the very best human chess players still win, but they are a very small group, and each game played so carefully is a major effort (i.e., costly). So it is/will be for compilers versus humans writing assembly language programs....

In any case, for good C and Fortran compilers using 1970s compiler technology and generating code for a PDP11ish target, I'd say a factor of 2.5x slower than carefully written assembly code is typical. For a compiler using 1980s compiler technology (there are very few of these) and targeting a RISC machine, I'd say it's common to be 30% or more *faster* than hand-written code (assuming a dumb, non-optimizing assembler). Of course, there are billions and billions of caveats on these numbers....

-hankd@ecn.purdue.edu
cliffc@erato.rice.edu (Cliff Click) (07/09/90)
In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>But one is unlikely to produce an artist by having the apprentice color within
>the lines.

For once I agree with you! Looking at what some compilers produce (say, at least 6 commercial ones) I can get a factor of 2 speedup by doing "stupid" optimizations that I would expect the compiler to get on its own. To get better than that I cannot color within the compiler's lines, but generally I need to restructure how things are solved - not necessarily a change of algorithms but a change of *structure*.

Cliff Assembly Hacker Click
--
Cliff Click cliffc@owlnet.rice.edu
rmarks@KSP.Unisys.COM (Richard Marks) (07/09/90)
I wrote a version of UUDECODE for the IBM PC. It is used by many internet users. It is written in Turbo Pascal 5.5. I coded it very carefully, with much consideration for speed. Then I rewrote about 5% of the code, the inner loops, in assembler, and got a 50% speed up.

Moral: Consider implementing in your favorite higher level language; but then recode the inner loops in assembler for performance.

Regards,
Richard Marks
rmarks@KSP.unisys.COM
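[Editor's note: for readers who haven't seen a uudecode inner loop, this is the standard transformation -- four encoded characters become three output bytes. It is not Marks's Turbo Pascal code; the point is that a loop this small and this hot is exactly the 5% worth recoding.]

    #define DEC(c) (((c) - ' ') & 077)   /* decode one character */

    static void decode4(const char in[4], unsigned char out[3])
    {
        out[0] = (DEC(in[0]) << 2) | (DEC(in[1]) >> 4);
        out[1] = (DEC(in[1]) << 4) | (DEC(in[2]) >> 2);
        out[2] = (DEC(in[2]) << 6) |  DEC(in[3]);
    }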
atk@boulder.Colorado.EDU (Alan T. Krantz) (07/09/90)
In article <1990Jul8.230954.18881@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>
>>It can be negative. Good optimizing compilers can often do a better
>>job on large bodies of code than humans can. Small code fragments
>>often do better when carefully crafted by humans, although this is
>>a very time-consuming activity.
>
>Basically, I agree. Given the caveat that good compilers are quite
>rare, I have seen good compilers consistently outperform handwritten
>assembly code -- but humans could usually do local tweaks to compiler
>generated code to speed it up even a little more.

I'm not so sure I've seen this very often. It seems that on some of the newer machines (IBM RS/6000), which have very special conditions for optimizing the pipeline, this might be more accurate - but on many machines it isn't. The trick is that when you hand write assembly code you get to optimize register usage across routine calls much better than most compilers can (actually, to be honest, I haven't seen any compilers do global register optimization - though this is an area of active research). Anyway, almost every comment I've seen on compilers vs hand optimization has been on a routine by routine basis - and I still believe (though I could be wrong) that the optimal results from assembly code come from register usage across/through routine calls...

Ho hum - I'm sure someone will set me straight ....
------------------------------------------------------------------
Mail: 1830 22nd Street, Apt 16, Boulder, Co 80302
Email: atk@boulder.colorado.edu
Home: (303) 939-8256 Office: (303) 492-8115
------------------------------------------------------------------
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/09/90)
In article <9730@brazos.Rice.edu> cliffc@erato.rice.edu (Cliff Click) writes:
| Looking at what some compilers produce (say, at least 6 commercial ones) I can
| get a factor of 2 speedup by doing "stupid" optimizations that I would expect
| the compiler to get on its own. To get better than that I cannot color
| within the compiler's lines, but generally I need to restructure how things
| are solved - not necessarily a change of algorithms but a change of *structure*.

Yes to both ideas. I was doing a Mandelbrot program, and when I had done all I could to speed it up, I output the assembler code to a file, optimized it by hand, and got a 45% reduction in CPU time. However, I had already gotten a 60% reduction by improving the algorithm and the implementation in C.

While there is room for notable improvements when you have a small routine called 600 million times, in general the real savings come from improving the overall algorithm to reduce the work to be done.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Stupidity, like virtue, is its own reward" -me
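[Editor's note: the kind of routine Davidsen means -- a Mandelbrot inner loop, sketched hypothetically in C. The algorithmic 60% he mentions comes from rewriting at this level (e.g., reusing the squares) before any assembler is touched.]

    /* iterate z = z*z + c until |z| > 2 or maxit is reached */
    static int mandel_iter(double cr, double ci, int maxit)
    {
        double zr = 0.0, zi = 0.0, zr2 = 0.0, zi2 = 0.0;
        int i;
        for (i = 0; i < maxit && zr2 + zi2 <= 4.0; i++) {
            zi = 2.0 * zr * zi + ci;   /* uses the old zr */
            zr = zr2 - zi2 + cr;
            zr2 = zr * zr;             /* squares reused by the loop test */
            zi2 = zi * zi;
        }
        return i;
    }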
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (07/10/90)
In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question: Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?

This seems to be following along in the thread of compiler optimization, where various people are suggesting that optimizers have or have not replaced human ingenuity in generating efficient code. This discussion loses sight of two things:

1) How hard (long) does the human try? Some years ago I wrote some critical code for an application, part of which ran on a TMS32010 and part on an 8086 (it was a multi-processor system). There were two sections of the program that had to be *very* fast. I worked almost a week to write at most three pages of code. When I was done, I doubt that any optimizer could have beaten that code. I tried the implications of using different registers for different purposes, and so on, until I had what was the fastest possible implementation. But at what cost? I wrote about 150 assembler instructions in 5 days. That is 30 instructions/day, or roughly 1 instruction every 15 minutes.

2) Does portability mean anything to you? The code I wrote was very non-portable. If the processors had been changed, even to other members of the same family (eg. TMS32020 or 80286), then all my optimizations would have been for naught. The new processors would have required a different optimization strategy. The old code would have run, but not optimally.

To me, the argument of which is better, man or machine, depends on what you are trying to accomplish. If time and money are not a problem, then an expert human can probably beat a compiler. If budgets or portability have any importance, the compiler wins hands down.

P.S. To answer your question, I probably saved 50% runtime on the tight portions of my code.
--
Terry Ingoldsby ctycal!ingoldsb@calgary.UUCP
Land Information Services or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
jkrueger@dgis.dtic.dla.mil (Jon) (07/10/90)
cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>    a. plain fortran matrix          1.0
>    b. full optimized fortran        0.3
>    c. IBM's ESSL hand coded library 0.1 or better

But what is this code DOING? Without that information these numbers are useless. It's reporting what happened to the dependent variables without specifying the independent variables.

-- Jon
--
Jonathan Krueger jkrueger@dtic.dla.mil uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!
chip@tct.uucp (Chip Salzenberg) (07/10/90)
According to cik@l.cc.purdue.edu (Herman Rubin):
>There are two problems with the present assembler. First, there is the
>gosh-awful syntax and arbitrariness of them.

Assembler syntax isn't an architectural issue. For example: The Z-80 has two widely used assembler syntaxes, the extended 8080 and the Zilog. Both of them are, to me, difficult to read. Compare, however, the smal2z language which was bundled with an old Intel C compiler. Using a bit of lex and yacc in front of the normal assembler, the smal2z translator accepted C syntax for Z-80 operations:

    A = B;           /* LD A,B */
    B = *HL++;       /* LD B,(HL) | INC HL */
    do {             /* label: */
        *IX++ = 0;   /* LD (IX),0 | INC IX */
    } while (--B);   /* DJNZ label */

My point: Assembler syntax isn't an architectural issue. It's just an arbitrary way to present the architecture to programmers.

>There is also the attitude in the computer field that the user need not take
>these things into account, that the compiler is a great genius and should do
>it all for you.

The attitude I have observed is more like: "I'd rather let the computer do the grunt work for me, than do it myself." Micromanagement produces microsuccess. I prefer macrosuccess.

If you don't like HLLs, don't use them. But don't heckle those of us who would use HLLs as a valuable tool -- even our primary tool.
--
Chip Salzenberg at ComDev/TCT <chip@tct.uucp>, <uunet!ateng!tct!chip>
grunwald@foobar.colorado.edu (Dirk Grunwald) (07/10/90)
The MIPS compiler suite (an old bit of software, too) does global optimization. I think that they still adhere to a set register passing convention, but they do things like eliminate stack manipulation for leaf procedures, etc etc.

Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated the advantages of global register allocation at link time, and it does very well. You don't even need to have runtime performance data to tune the allocation; simple bottom-up coloring appears to work very well.

Someone just needs to hack it into Gnu C. This would raise the common denominator of compiler performance and make companies either provide something of comparable performance or support the Gnu compiler. I still can't believe how many systems are shipped with shitty compilers.
mcdaniel@adi.com (Tim McDaniel) (07/10/90)
Henry Spencer wrote:
It can be negative. ...
hankd@dynamo.ecn.purdue.edu (Hank Dietz) wrote:
Basically, I agree.
atk@boulder.Colorado.EDU (Alan T. Krantz) writes:
I'm not so sure I've seen this very often.
Damn right! Henry Spencer and Hank Dietz AGREEING on something?!?
8-)
--
"I'm not a nerd -- I'm 'socially challenged'."
Tim McDaniel
Internet: mcdaniel@adi.com UUCP: {uunet,sharkey}!puffer!mcdaniel
jon@hitachi.uucp (Jon Ryshpan) (07/10/90)
>>I have a simple question: Does anyone know what sort of speed-up
>>one gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly? ...
>
>Notice that RISC architectures don't have many "funny" instructions, so
>compilers for them can do particularly well against humans....

One of the reasons that it's not so easy to improve on the C compiler for RISC chips is that they were designed to support good C compilers. So if what you want to say (what your algorithm requires) is easy to express in C, you're likely to get a compilation that's hard to improve on.

Even so, if what you want to say is not easy to express in C, like "complex number" or "function in a function" or "static variable bound to a fixed memory location", you may get an object that would benefit from hand coding.

Jonathan Ryshpan <...!uunet!hitachi!jon>
Hitachi America Ltd. M/S 420 (415) 244-7369
2000 Sierra Pt. Pkwy. Brisbane CA 94005-1819
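[Editor's note: what "complex number" costs in C, sketched with a hypothetical struct. K&R/C89 has no complex type, so the programmer spells out the arithmetic by hand; a Fortran compiler sees COMPLEX as a first-class type and can schedule it directly.]

    struct cpx { double re, im; };

    /* (a.re + i*a.im) * (b.re + i*b.im), written out by hand */
    static struct cpx cmul(struct cpx a, struct cpx b)
    {
        struct cpx r;
        r.re = a.re * b.re - a.im * b.im;
        r.im = a.re * b.im + a.im * b.re;
        return r;
    }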
meissner@osf.org (Michael Meissner) (07/10/90)
In article <23285@boulder.Colorado.EDU> grunwald@foobar.colorado.edu (Dirk Grunwald) writes:
| The MIPS compiler suite (an old bit of software, too) does global
| optimization. I think that they still adhere to a set register passing
| convention, but they do things like eliminate stack manipulation for
| leaf procedures, etc etc.
|
| Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated
| the advantages of global register allocation at link time, and it does
| very well. You don't even need to have runtime performance data to
| tune the allocation; simple bottom-up coloring appears to work very
| well.

I've always wondered about the gains. Speaking off the cuff, I would imagine you would get gains if you limit yourself to a Fortran style of coding (no matter what language you use). By Fortran style, I mean little or no recursion, all calls direct (not through pointers), and all routines existing when the program is loaded.

In particular, calls to routines through pointers have to fall back on some sort of fixed rules, such as which set of registers are caller save or callee save, and where the arguments go. Most of the object oriented programs that I'm aware of ultimately call functions through pointers, and this in fact is touted as a good thing, so that a subclass can replace a function as needed. Also, in dynamic languages like LISP, the function being called may not have been created yet, or may have been replaced by the time the call is made. A weaker version of this is if your system has shared libraries -- when the link is made, the shared libraries are not bound in; they are generally bound in either when the program starts up or when an unbound function is called.

I also wonder about the computational complexity of global register analysis. My experience has tended to be that register allocation is one of the more memory intensive pieces of compilers, and the amount of memory needed scales with the number of basic blocks used and the number of distinct pseudo regs requested. I could easily imagine a global register analysis phase thrashing on really large programs, unless severe restrictions were used, and those restrictions would tend to limit optimization.

| Someone just needs to hack it into Gnu C. This would raise the common
| denominator of compiler performance and make companies either provide
| something of comparable performance or support the Gnu compiler. I
| still can't believe how many systems are shipped with shitty
| compilers.
--
Michael Meissner email: meissner@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA
Do apple growers tell their kids money doesn't grow on bushes?
cliffc@sicilia.rice.edu (Cliff Click) (07/11/90)
In article <MEISSNER.90Jul10102037@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
>I also wonder about the computational complexity of global register
>analysis.

Global register allocation is NP-complete.

>My experience has tended to be that register allocation is
>one of the more memory intensive pieces of compilers, and the amount of
>memory needed scales with the number of basic blocks used and
>the number of distinct pseudo regs requested. I could easily imagine a
>global register analysis phase thrashing on really large programs,

Some folks at Rice here, experimenting with a nameless commercial compiler, discovered a simple 2000-line fortran program that took 90+ HOURS to compile on a 10+ Mips, 16 Meg workstation. The program spent roughly 98% of its time thrashing the disk. The algorithm the compiler used required exponential memory space. The same compile took ~15 minutes on a 48 Meg workstation.

Cliff Click
cliffc@rice.edu
--
Cliff Click cliffc@owlnet.rice.edu
nevin@igloo.scum.com (Nevin Liber) (07/11/90)
[followups to comp.lang.misc only, please]

In article <1990Jul8.183551.13891@nlm.nih.gov> states@tech.NLM.NIH.GOV (David States) writes:
>On a RISC machine with a good compiler, a C programmer willing to work
>on his code ought to be able to do almost anything hand coded assembler can.
>This may involve looking at assembly output, recoding routines a few times
>and testing alternative algorithms, but in the end you still have C (that
>will run on another processor even if not as fast).

Bleech!! (This is not intended as a flame; it is just my humble opinion.) If you are going to go to all that bother, you are better off writing that portion of your code directly in assembler.

There are two ways of programming in C:

(1) writing high-level portable code
(2) using it as a macro-assembler

Unfortunately, it is darn near impossible to write code which satisfies both of these goals! And from a maintenance standpoint, assembler itself is better than C for goal #2. For one thing, it separates out the code which needs to be redone during a port. For another, it always does what you expect it to do, even when you change versions of the compiler.
--
NEVIN ":-)" LIBER nevin@igloo.scum.com or ..!gargoyle!igloo!nevin
(708) 831-FLYS
Advertisement: Hire me!
mash@mips.COM (John Mashey) (07/11/90)
Since this discussion pops up here every 6-9 months, maybe it's worth a shot at some generalizations, to put some structure around the flood of anecdotal examples and counter-examples.

First MY anecdote: I wrote all of the C string library in assembler for the R2000 a long while back, because you ALWAYS do that when you do a new UNIX. THEN, after looking at the generated code from the standard C functions, we threw out most of the assembly code, because the compiler was good enough that it wasn't worth QAing the assembly code.

Generalizations: the following factors will encourage one to use assembly code (left) or compiled code (right). I think most of the anecdotes so far fit at least one of these. The interesting questions are:
1) Are there more good generalizations?
2) What kind of experiments can be done to quantify these more specifically?

    ASSEMBLER                            COMPILER
    1. Old, non-optimizing compiler      Modern, global optimizer, at least
                                         In scientific world, vectorizer/parallelizer
                                         if appropriate.
    2. Architecture/language mismatch    Architecture/language well matched
       2a. Architecture weird
       2b. Language not expressive enough
    3. One architecture target, forever  Portability
       (as in some embedded)             (as in UNIX commands)
    4. Absolute speed requirement        Premium on development cost, maintenance
    5. Absolute space requirement        Space costs, but no hard, low limit
    6. Programming time is cheap         Programming time is not cheap

As usual, in a domain-of-discourse-confusion type of discussion, much of the argument would disappear if people would preface their opinion by specifying which domain they're living in (i.e., which of the attributes above apply to what they do). Also, it would probably encourage light, and reduce heat, if people would avoid statements of the form
    I haven't seen it, so it must not exist
in favor of
    In domain X, it doesn't seem to exist. Does it elsewhere?

In this light:
a) People with RISCs, good optimizers, doing general-purpose code, using C or FORTRAN seldom write assembler code.
b) People with CISCs are more likely to go to assembler, especially when using pcc-style compilers. Likewise, people using machines with vector units, special graphics operations, unusual pipelines, or some parallel architectures.
c) People that need operations not easily expressed in popular languages go to assembler more often.
d) Embedded control folks are more likely to use assembler.
etc.

Of course, it is the job of computer system designers to make systems for which the assembler/compiler tradeoff increasingly favors compilers, and the higher the level the better (i.e., do it in shell or awk ... or 4GLs, or whatever, if possible), but this doesn't mean that there are not sometimes good reasons for doing it in assembler some of the time..... Let's just hope that we can make it less often.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) (07/11/90)
In article <759@bbking.KSP.Unisys.COM> rmarks@KSP.Unisys.COM (Richard Marks) writes:
Moral: Consider implementing in your favorite higher level language;
but then recode the inner loops in assembler for performance.
You missed a step - instrument the code to find out where the time is
really being spent, and why. Your intuition about that is probably
wrong. And since no such statement is complete without anecdotal
evidence, here's mine:
I had software that was doing strange floating point calculations.
80+% of the cpu time was spent inside an inner loop. 90+% of the time
in the loop was spent in a system function. Changing the inner loop to
precompute the 50 or so values it needed and using a lookup table
basically eliminated that 90%, giving me a speedup of 3-4 times.
If I'd just rewritten the inner loop in assembler without
instrumenting it, and done an excellent job of it, I'd have improved
the speed of the 10% of the loop by an order of magnitude, giving me
code that ran maybe 8% faster. Instrumenting took less time than an
assembler rewrite and the lookup change was trivial.
Not that rewriting in assembler isn't sometimes the right solution. In
this case, improving the algorithm was the right solution. But the
right first step is always to instrument things so you know you're
solving the right problem.
<mike
--
Cats will be cats and cats will be cool Mike Meyer
Cats can be callous and cats can be cruel mwm@relay.pa.dec.com
Cats will be cats, remember this words! decwrl!mwm
Cats will be cats and cats eat birds.
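[Editor's note: the shape of Meyer's table fix, as a hypothetical C sketch -- the function, table size, and values are invented. The win comes from calling the expensive routine ~50 times at startup instead of millions of times in the loop.]

    #include <math.h>

    #define NVALS 50
    static double table[NVALS];

    static void init_table(void)
    {
        int i;
        for (i = 0; i < NVALS; i++)
            table[i] = exp(i * 0.1);   /* stand-in for the costly system call */
    }

    /* the inner loop then reads table[k] instead of recomputing the value */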
Publius@dg.dg.com (Publius) (07/12/90)
In article <40052@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Generalizations: the following factors will encourage one to use
>assembly code (left) or compiled code (right). I think most of the
>anecdotes so far fit at least one of these. The interesting questions are:
>1) Are there more good generalizations?
>2) What kind of experiments can be done to quantify these more specifically?
>
>    ASSEMBLER                            COMPILER
>    1. Old, non-optimizing compiler      Modern, global optimizer, at least
>                                         In scientific world, vectorizer/parallelizer
>                                         if appropriate.
>    2. Architecture/language mismatch    Architecture/language well matched
>       2a. Architecture weird
>       2b. Language not expressive enough
>    3. One architecture target, forever  Portability
>       (as in some embedded)             (as in UNIX commands)
>    4. Absolute speed requirement        Premium on development cost, maintenance
>    5. Absolute space requirement        Space costs, but no hard, low limit
>    6. Programming time is cheap         Programming time is not cheap

These are fair-minded, gentlemanly, brilliant comments. I only want to add the following:

Many "large" programs spend most of the CPU time in one or a few relatively "tiny" routines. Hand-coding these routines will not cause much maintenance and portability trouble, AS LONG AS they can be isolated. I am not sure if the "comp.arch" is the proper place to talk about software development methodology. However, it is not unusual these days that different pieces of a large-scale software project are written in different languages. If we take a look at the complex source tree of a typical large-scale piece of software, with many levels of makefiles, quite often we will run into an "asm" sub-directory, in which we can find a few ".s" files. The key is not really whether assembler code can be used. The key is whether these assembler-coded routines can be properly isolated and documented.

Many of the optimization problems belong to the "NP-complete" or even the "Turing-uncomputable" categories. Thus, no matter how smart the global optimizer is, as long as it is equivalent to a Turing machine, it cannot achieve FULL optimization for EVERY situation. If a large program spends most of the CPU time in a tiny routine, then go ahead and hand-code it in assembler and do an exhaustive optimization, as long as you isolate it and document it comprehensively.
--
Disclaimer: I speak (and write) only for myself, not my employer.
Publius "Old federalists never die, they simply change their names."
publius@dg-pag.webo.dg.com
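[Editor's note: a hypothetical sketch of the isolation Publius describes -- a suffix rule in the asm/ subdirectory's own makefile, so the hand-coded routines hide behind ordinary object files. All names are invented; recipe lines must begin with a tab.]

    # asm/Makefile -- assemble each hand-coded .s routine into a .o
    .s.o:
            $(AS) -o $@ $<

    # the parent makefile then links asm/*.o like any other objects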
ballen@csd4.csd.uwm.edu (Bruce Allen) (07/12/90)
There is a nice discussion of the relative speedups that can be obtained by various kinds of optimization in "Writing efficient programs" by Jon Bentley (Prentice-Hall Software Series, 1982). In two examples, discussed in detail by Bentley (pgs 26-28 and 130) the speedup obtained in hand-coding in machine-language was a factor of 2 in one case and a factor of 5 in the other case. The latter case was one in which the machine had a particularly well-suited instruction, the former case is more typical. The major lesson that one gets from Bentley's book is that almost all the significant speedups are obtained in designing data structures, algorithms, and general good coding. In this book many examples are given of "typical" code being speeded up by factors of ten or more just by careful rewriting of high-level code. I think this is an excellent, fun-to-read book. Anyone who has not looked at this book should peruse it the next time they're in a good bookstore (if you don't get distracted by "Programming Pearls", by the same author, just next to it!).
rcd@ico.isc.com (Dick Dunn) (07/12/90)
mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) writes:
> rmarks@KSP.Unisys.COM (Richard Marks) writes:
> Moral: Consider implementing in your favorite higher level language;
> but then recode the inner loops in assembler for performance.
>
> You missed a step - instrument the code to find out where the time is
> really being spent, and why. Your intuition about that is probably
> wrong...

I'll add one more step to precede these two: Decide whether performance really matters before you start any tuning.

Mike and Richard may have assumed that this step had already been taken, but it still bears pointing out. First think whether performance is an issue, and if so, whether it will be cost-effective to fix it. Then (as Mike sez) figure out where the performance problems are...and he's right that most people miss the problem areas. Don't tune something that doesn't matter. You'll probably find that most of the time is spent in a tiny area, at which point you apply Richard's advice.
--
Dick Dunn rcd@ico.isc.com -or- ico!rcd (303)449-2870
...Never attribute to malice what can be adequately explained by stupidity.
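[Editor's note: on a typical Unix system of the day, the "instrument first" step both posters urge is a three-command affair; the exact flags vary between systems.]

    cc -pg -o prog prog.c     # compile with profiling hooks
    prog                      # a normal run writes gmon.out
    gprof prog gmon.out       # per-routine times plus call graph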
des@elberton.inmos.co.uk (David Shepherd) (07/12/90)
In article <9743@brazos.Rice.edu>, cliffc@sicilia.rice.edu (Cliff Click) writes:
|> In article <MEISSNER.90Jul10102037@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
|> >I also wonder about the computational complexity of global register
|> >analysis.
|>
|> Global register allocation is NP-complete.
|>
|> Some folks at Rice here, experimenting with a nameless commercial compiler,
|> discovered a simple 2000-line fortran program that took 90+ HOURS to compile
|> on a 10+ Mips, 16 Meg workstation. The program spent roughly 98% of its time
|> thrashing the disk. The algorithm the compiler used required exponential
|> memory space. The same compile took ~15 minutes on a 48 Meg workstation.

When compiling one of the files to build the HOL theorem prover, the C code that AKCL produces from the lisp "compilation" of the original ML code (ok, so we're not talking about normal "user code" here) takes > 24 hours to compile with -O2. Without optimization it takes about 5 mins. I can't remember exactly, but I don't think it was actually thrashing the disk that much -- i.e. the cpu seemed solid at 100%.
--------------------------------------------------------------------------
David Shepherd: des@inmos.co.uk or des@inmos.com tel: 0454-616616 x 529
INMOS ltd, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ
sysmgr@KING.ENG.UMD.EDU (Doug Mohney) (07/12/90)
In article <628@dg.dg.com>, Publius@dg.dg.com (Publius) writes:
>I am not sure if the "comp.arch" is the proper place to talk about software
>development methodology.

Funny, some of the responses I've seen on here sound like ideology.
yam@cbnewsi.att.com (toshihiko.yamakami) (07/13/90)
I have little experience with assembler coding, so I might make rather silly comments.

From article <628@dg.dg.com>, by Publius@dg.dg.com (Publius):
> Many of the optimization problems belong to the "NP-complete" or even
> the "Turing-uncomputable" categories. Thus, no matter how smart the
> global optimizer is, as long as it is equivalent to a Turing machine,
> it cannot achieve FULL optimization for EVERY situation.

I always wonder why human beings can do 'Turing-uncomputable' computation. If human beings can do it, why can't a program simulate it? If anyone is kind enough to mail me the reason, I will appreciate it.

> If a large program spends most of the CPU time in a tiny routine, then
> go ahead and hand-code it in assembler and do an exhaustive optimization,
> as long as you isolate it and document it comprehensively.

This is "comp.arch", so I would like to comment on RISC chips and optimization. The following discussion may not apply to CISC.

I believe it was true in the CISC era that hand-coded assembler was faster than compiled code. However, some RISC chips' instruction sets are so simple that I wonder what people can still do by hand. For example, consider the MIPS R2000 (or R3000). Last year, I tried my first assembler optimization ever. Our group needed enciphering performance of 1 Mbit/sec/MIPS, and to achieve it I was forced to do some hand optimization. To be honest, I don't like assembler coding (I am biased, you know :-) ). I tried to re-write compiler output code on the MC68030, SPARC, and R2000. I could improve performance about three times on the MC68030, 5-10% on the SPARC, and nothing (to be honest, negative) on the R2000. To my surprise, every time I rewrote the R2000 code the performance decreased, and I gave up rewriting. After that, I believe it is almost impossible to improve performance by hand-coding on some types of RISC chips.

I would be happy to hear other assembler experts' hand-coding experience on MIPS or similar RISC chips.
--
Toshihiko YAMAKAMI (NTT, Japan) Resident visitor in Bell Labs until Feb 1991
Room 4G634, AT&T Bell Labs, Crawfords Corner Rd. Holmdel, NJ 07733-1988
Tel: (201)949-5742 e-mail: yam@vax135.att.com (was: yam@nttmhs.ntt.jp)
jkrueger@dgis.dtic.dla.mil (Jon) (07/13/90)
[truths about ways and means of getting the most results out of each
processor second]

It should be mentioned somewhere that for an interesting class of problems,
storing and sharing the results is more of a bottleneck than computing
them -- and that engines and methods that compute results faster only help
somewhat with committing them and making them available faster.
--
Jon
--
Jonathan Krueger    jkrueger@dtic.dla.mil    uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!
karsh@trifolium.esd.sgi.com (Bruce Karsh) (07/13/90)
In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>Many "large" programs spend most of the CPU time in one or a few relatively
>"tiny" routines.

This is often stated.  Is there any evidence for it?  What does the
distribution of time look like for, for instance, a compiler?  I'd expect
that there would be lots of time-sinks.  The lexical analyzer, the symbol
table hash routine, the symbol table lookup routine, the common
subexpression finder, ... etc., could all be candidate time-sinks.

How about a text editor?  I can think of a few dozen potential time-sinks
in a text editor.

When I've profiled programs, I've often found that the time is spent mostly
in a few routines, but this isn't always the case.  How often is it true
that nearly all of the time is spent in only a few routines?  How often are
these routines tiny?

			Bruce Karsh
			karsh@sgi.com
aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/13/90)
>When I've profiled programs, I've often found that the time is spent mostly
>in a few routines, but this isn't always the case.  How often is it true
>that nearly all of the time is spent in only a few routines?  How often
>are these routines tiny?
>
>			Bruce Karsh
>			karsh@sgi.com

I, too, have profiled programs where the 90%/10% rule for routines doesn't
apply.  Well, actually, in a few that I investigated in detail there was
some sort of ratio rule applicable to *types* of code, but not by routine.
For example, I have seen programs where 50% of the time was spent in
function calling and returning, manipulating variables, etc.  The problems
were (1) determining what classifications (time spent in function calls,
time spent in loop overhead, time spent accessing method variables...)
were appropriate, and (2) getting my profiler to tell me how much time was
spent in each classification.

Unfortunately, most profilers do not have a way of indicating "X% of your
time was spent in function calling".  Gprof tends to provide function
granularity (possibly with programmer-provided parts).  That's why I
quickly fall back to "flat" profiling, done at the instruction level, for
then I can classify instructions according to "purpose".  I wish I had
access to a commercial profiler that does this.  Is there one?

Finishing the anecdote... the program that spent 50% of its time in
function calls?  It was one of those well-structured "CS" programs :-) [*]
that had lots of small functions called all over the program.  Inlining
functions with fewer than 10 lines of code in their bodies (actually, back
then I macrofied -- I didn't have access to a compiler with inlining) got
me most of the necessary improvement.

[*] This is just a mild reference to Karsh et al.'s parallel discussion.
I don't want to get involved in that dispute.  I *like* well-structured
programs with lots of subroutines.  I just wish that all of my compilers
did inlining automatically.
--
Andy Glew, aglew@uiuc.edu
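For concreteness, the "macrofication" Glew describes looks roughly like
this -- a minimal sketch in C, with the structure and names invented for
illustration, not taken from the program in question:

	/* A small accessor of the sort that dominated the profile.
	   Each call pays function-call and return overhead. */
	struct vec { double *data; int len; };

	double vec_get(struct vec *v, int i)
	{
	    return v->data[i];
	}

	/* The same accessor "macrofied": the call overhead vanishes,
	   at the cost of losing type checking on the arguments. */
	#define VEC_GET(v, i)  ((v)->data[(i)])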
firth@sei.cmu.edu (Robert Firth) (07/13/90)
In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>To produce good code, it is necessary to
>understand the cost considerations, and I find it difficult to believe that
>this can be done at the limited level of HLLs.

What cost considerations, pray?  The costs that usually matter are the
costs of designing, producing, testing, debugging and maintaining the code,
since these things are very intensive of human labour.  Most people who
build software are very sensitive to these costs, which is the reason they
use high-level languages.

A few people, however, are completely insensitive to most of the costs that
really matter, and focus on the cost of executing the code on a machine.
These people code in assembler, and they can be an expensive luxury as
their tight, efficient programs repeatedly fall behind schedule, fail on
simple test cases, and eat up unbelievable amounts of maintenance time.

I find it very sad that so many of these discussions take the standpoint
that it is the purpose of the human to serve the machine.
henry@zoo.toronto.edu (Henry Spencer) (07/14/90)
In article <64045@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?  What does the
>distribution of time look like for, for instance, a compiler?  I'd expect
>that there would be lots of time-sinks...

My personal feeling is that the "hot spot" principle has become accepted
dogma on the basis of rather few actual cases, the most infamous being the
XPL compiler that spent all of its time skipping trailing blanks in card
images.  I suspect that a lot of hot-spot examples are the result of
something like this: stupid code, or a stupid data format.

Our experience with C News was that there never was a single massive hot
spot.  Typically the profile would show half a dozen major contributors to
execution time and then a long tail of smaller ones.  Often, the way to fix
it was *not* to optimize the profile leaders, but to look at where they
were being called from and why, and streamline the algorithms in the
callers.
--
NFS: all the nice semantics of MSDOS,  | Henry Spencer at U of Toronto Zoology
and its performance and security too.  | henry@zoo.toronto.edu   utzoo!henry
Publius@dg.dg.com (Publius) (07/14/90)
In article <64045@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?

The best-known example is probably the LINPACK benchmark (although it is
not big).  It spends over three quarters of the CPU time in a tiny routine
called daxpy() (saxpy() for the single-precision case).  The same daxpy()
routine, though coded a little differently, is also responsible for most of
the CPU time of one of the SPEC 1.0 Suite benchmarks (matrix300, to be
exact).  Actually, many numerical analysis programs spend big chunks of CPU
time in this daxpy() routine.

Many years ago, Donald Knuth stated that most of the CPU time in batch
business application environments was spent doing sorting.

In practice, the better a large piece of software is modularized, the
easier it is to identify the time-consuming routines.  Also, it is not
difficult to understand that programs written with re-usable code in mind
tend to spend much CPU time in the same routines.
--
Disclaimer: I speak (and write) only for myself, not my employer.
Publius     "Old federalists never die, they simply change their names."
publius@dg-pag.webo.dg.com
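The routine in question really is tiny.  A straightforward C rendering
(the actual BLAS daxpy is Fortran, handles arbitrary strides, and unrolls
the loop by hand; this unit-stride sketch keeps only the essential part):

	/* daxpy: y := a*x + y, the loop that dominates LINPACK. */
	void daxpy(int n, double a, const double *x, double *y)
	{
	    int i;
	    if (n <= 0 || a == 0.0)
	        return;
	    for (i = 0; i < n; i++)
	        y[i] += a * x[i];
	}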
mo@messy.bellcore.com (Michael O'Dell) (07/14/90)
In article <1990Jul13.173331.22918@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>Our experience with C News was that there never was a single massive hot
>spot.  Typically the profile would show half a dozen major contributors
>to execution time and then a long tail of smaller ones.  Often, the
>way to fix it was *not* to optimize the profile leaders, but to look at
>where they were being called from and why, and streamline the algorithms
>in the callers.

Ohh Yes!!  The ol' maxim of:

	There's no substitute for knowin' what the *&$# you're doin'!!!!

I must agree with Henry -- after the trivially-obvious things are
eliminated, you have to actually optimize the work, not the code.

	-Mike
---------------
"The obvious is immediately apparent if you look closely enough."
ken@argus.UUCP (Kenneth Ng) (07/14/90)
In article <64045@sgi.sgi.com>, karsh@trifolium.esd.sgi.com (Bruce Karsh) writes:
: In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
: >Many "large" programs spend most of the CPU time in one or a few relatively
: >"tiny" routines.
: This is often stated.  Is there any evidence for it?  What does the
: distribution of time look like for, for instance, a compiler?  I'd expect
: that there would be lots of time-sinks.  The lexical analyzer, the symbol
: table hash routine, the symbol table lookup routine, the common
: subexpression finder, ... etc., could all be candidate time-sinks.

While there are undoubtedly programs where most of the time is spent in a
few routines, I've been party to some 100K-line programs where the profiler
indicated that the number-one CPU-using routine was taking 2-3% of the
resources, and the rest were in the low 0.5 to 2% range, almost noise
level.  And the top ten users were I/O routines, so I suspect the profiler
was catching them doing I/O to the various devices.  And yet I know of over
a dozen places where routines could be improved by doing a binary instead
of a linear search.  (Whether or not there would be an improvement is
debatable, because then the entries would have to be inserted and kept in
order, and the average table size was only a dozen or so.)
--
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey  07102
uucp !andromeda!galaxy!argus!ken *** NOT ken@bellcore.uucp ***
bitnet (preferred) ken@orion.bitnet or ken@orion.njit.edu
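The tradeoff Ng mentions is easy to see in code.  A sketch in C with
invented names, assuming a table of strings:

	#include <string.h>

	/* Linear search: no ordering requirement; for a dozen
	   entries the simple loop is often cheaper than the
	   bookkeeping below. */
	int find_linear(const char **tab, int n, const char *key)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        if (strcmp(tab[i], key) == 0)
	            return i;
	    return -1;
	}

	/* Binary search: O(log n) probes, but the table must be
	   kept sorted -- the insertion cost mentioned above. */
	int find_binary(const char **tab, int n, const char *key)
	{
	    int lo = 0, hi = n - 1;
	    while (lo <= hi) {
	        int mid = lo + (hi - lo) / 2;
	        int c = strcmp(key, tab[mid]);
	        if (c == 0)
	            return mid;
	        else if (c > 0)
	            lo = mid + 1;
	        else
	            hi = mid - 1;
	    }
	    return -1;
	}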
yodaiken@chelm.cs.umass.edu (victor yodaiken) (07/14/90)
In article <1990Jul13.173331.22918@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>My personal feeling is that the "hot spot" principle has become accepted
>dogma on the basis of rather few actual cases, the most infamous being
>the XPL compiler that spent all of its time skipping trailing blanks in
>card images.  I suspect that a lot of hot-spot examples are the result
>of something like this: stupid code, or a stupid data format.
>
>Our experience with C News was that there never was a single massive hot
>spot.  Typically the profile would show half a dozen major contributors
>to execution time and then a long tail of smaller ones.  Often, the
>way to fix it was *not* to optimize the profile leaders, but to look at
>where they were being called from and why, and streamline the algorithms
>in the callers.

That was my experience in trying to speed up an OS.  Major speed losses
seemed to be the result of propagation of minor speed losses.  Rather than
hot spots, we found long execution paths and repeated recomputation of data
that could have been cached.
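The recomputation fix Yodaiken alludes to is the classic cache-the-answer
pattern.  A minimal sketch in C with hypothetical names (the real OS code
is not shown):

	/* Cache a value that was being recomputed on every call
	   along a long execution path.  expensive_computation() is
	   a hypothetical stand-in. */
	extern long expensive_computation(void);

	static int  value_valid = 0;
	static long value_cache;

	long get_value(void)
	{
	    if (!value_valid) {
	        value_cache = expensive_computation();
	        value_valid = 1;
	    }
	    return value_cache;
	}

	/* Whatever mutates the underlying state must invalidate: */
	void touch_state(void)
	{
	    value_valid = 0;
	}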
henry@zoo.toronto.edu (Henry Spencer) (07/15/90)
In article <1990Jul12.182552.26715@cbnewsi.att.com> yam@cbnewsi.att.com (toshihiko.yamakami) writes:
>> Many of the optimization problems belong to the "NP-complete" or even
>> the "Turing-uncomputable" categories...
>
>I always wonder why human beings can do 'Turing-uncomputable' computation.
>If human beings can do it, why can't a program simulate it?

The answer is that human beings don't solve the entire problem, only a
subset of it.  And really, that is all you need.  Programmers do not write
arbitrary programs.

The NP-completeness or Turing-uncomputability of an optimization problem is
seldom of much interest; all it tells you is that there is no
halfway-efficient fully-optimal fully-general solution.  In practice, the
fully-optimal fully-general solutions usually are only halfway efficient
anyway, and hence nobody uses them.  It's much better to use something that
sacrifices theoretical optimality or full generality but runs fast on real
input.
--
NFS: all the nice semantics of MSDOS,  | Henry Spencer at U of Toronto Zoology
and its performance and security too.  | henry@zoo.toronto.edu   utzoo!henry
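Register allocation, mentioned earlier in the thread, illustrates the
point: optimal coloring of the interference graph is NP-complete, but a
greedy pass is roughly linear and usually good enough on real programs.  A
toy sketch in C -- not any particular compiler's algorithm, sizes invented:

	#define NNODES  64    /* illustrative sizes */
	#define NCOLORS 16    /* e.g. 16 allocatable registers */

	int adj[NNODES][NNODES];   /* interference graph, 0/1 */
	int color[NNODES];         /* NCOLORS means "spilled" */

	void greedy_color(void)
	{
	    int n, m, c;
	    for (n = 0; n < NNODES; n++) {
	        int used[NCOLORS + 1] = {0};
	        /* mark colors taken by already-colored neighbors */
	        for (m = 0; m < n; m++)
	            if (adj[n][m])
	                used[color[m]] = 1;
	        /* take the lowest free color */
	        for (c = 0; c < NCOLORS && used[c]; c++)
	            ;
	        color[n] = c;  /* may be NCOLORS: no register, spill */
	    }
	}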
ssr@taylor.Princeton.EDU (Steve S. Roy) (07/15/90)
Bruce Karsh writes:
>>In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>>
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?

Oh yes.  Naturally it depends on what type of code you're talking about.
There is a strong tendency for a person who works on a particular type of
program to take the things s/he does as completely typical and to assume
their problems are the same as everyone else's.  This is very much not the
case -- the TYPES of bottlenecks people find vary a lot.  Scientific
number-crunchers are not like editors are not like compilers, etc.

Much of my experience is with scientific number-crunchers.  It is VERY VERY
VERY common that >90% of the execution time of such programs is spent in
<<10% of the code, even just a few dozen lines.  It is also VERY VERY VERY
true that speed is often of paramount importance.  For the types of
programs I use, those conditions are true more often than not.

Steve Roy
ssr@acm.princeton.edu
rwallace@vax1.tcd.ie (07/16/90)
In article <64045@sgi.sgi.com>, karsh@trifolium.esd.sgi.com (Bruce Karsh) writes:
> In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
> This is often stated.  Is there any evidence for it?  What does the
> distribution of time look like for, for instance, a compiler?  I'd expect
> that there would be lots of time-sinks.  The lexical analyzer, the symbol
> table hash routine, the symbol table lookup routine, the common
> subexpression finder, ... etc., could all be candidate time-sinks.
>
> How about a text editor?  I could think of a few dozen potential
> time-sinks in a text editor.
>
> When I've profiled programs, I've often found that the time is spent
> mostly in a few routines, but this isn't always the case.  How often is
> it true that nearly all of the time is spent in only a few routines?
> How often are these routines tiny?

Usually this is the case.  The compiler is an exception because it's an
all-round integer CPU-intensive program, but I disagree with you on the
text editor.  I was writing a simple text-editing facility for an
accounting package in Turbo C on an MS-DOS machine a couple of weeks ago.
The editor ended up spending almost all its time either paging or
scrolling.  Paging partly involves setting memory to a particular value,
which ended up as a 3-line loop of C code, so those 3 lines were taking
about 30%-40% of the total execution time.  Now the 80x86 is a pretty foul
processor for memory writes and copies, but by examining the assembly
language output and changing 2 lines of C code I enabled the compiler to
double the speed of the code for that inner loop.  Lo: a 20% speed
improvement by changing 2 lines.

Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie
"To summarize the summary of the summary: people are a problem"
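Wallace doesn't show the two-line change itself.  One plausible
reconstruction, assuming the loop cleared a page buffer a byte at a time
and the fix handed the fill to the library (names and sizes invented):

	#include <string.h>

	#define PAGE_CHARS 4000          /* illustrative: one text page */

	char page[PAGE_CHARS];

	/* Byte-at-a-time clear, roughly the 3-line loop described: */
	void clear_slow(char fill)
	{
	    int i;
	    for (i = 0; i < PAGE_CHARS; i++)
	        page[i] = fill;
	}

	/* The cheap fix: memset, which a PC compiler typically turns
	   into a REP STOS that moves a word at a time -- about the
	   claimed 2X on the inner loop. */
	void clear_fast(char fill)
	{
	    memset(page, fill, PAGE_CHARS);
	}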
staff@cadlab.sublink.ORG (Alex Martelli) (07/19/90)
cavrak@uvm-gen.UUCP (Steve Cavrak, 113 Waterman, 6561483) writes:
>...
>An example with VS-FORTRAN on an IBM-3090 with a vector processor.
>a. plain Fortran matrix             1.0
>b. fully optimized Fortran          0.3
>c. IBM's ESSL hand-coded library    0.1 or better

d. NAG's implementation, fully Fortran but "properly" coded with a
   block-submatrix algorithm: within 10% of ESSL!

That's for linear-algebra stuff... back in '88, I was working in the same
office at the IBM Italy Scientific Center with the NAG guy who was visiting
there and implementing those routines; his performance gains were proving
transferable to other machines with complex storage hierarchies, by the
way, via a simple variation in the "good-block-size" parameter.

On the other hand, I suspect the wondrously fast and precise library of
Fortran intrinsics, coded in assembler with lots of clever table-lookups
and bit-twiddling, would suffer a LOT if it had to be reimplemented in
Fortran itself!  But in most cases, proper algorithms, and, more
specifically, appropriate data-access patterns, are THE key to good
numerical-computing performance -- and Fortran remains adequate for
expressing such algorithms and data-access patterns.
--
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 45, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434;
Fax: ++39 (51) 366964 (work only; any time of day or night).
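The block-submatrix idea is simple to sketch (the NAG code itself is
Fortran and not shown; names and sizes here are invented).  Each NB x NB
tile is reused while it still sits in the fast level of the storage
hierarchy, and NB is the machine-dependent "good-block-size" parameter
mentioned above:

	#define N  256
	#define NB 32          /* tune per machine; must divide N here */

	double a[N][N], b[N][N], c[N][N];   /* assume c starts zeroed */

	void matmul_blocked(void)
	{
	    int i0, j0, k0, i, j, k;
	    for (i0 = 0; i0 < N; i0 += NB)
	      for (j0 = 0; j0 < N; j0 += NB)
	        for (k0 = 0; k0 < N; k0 += NB)
	          /* multiply one NB x NB tile pair into c;
	             i-k-j order keeps the inner accesses stride-1 */
	          for (i = i0; i < i0 + NB; i++)
	            for (k = k0; k < k0 + NB; k++)
	              for (j = j0; j < j0 + NB; j++)
	                c[i][j] += a[i][k] * b[k][j];
	}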
jmd@decwrl.dec.com (John Danskin) (07/27/90)
I just wanted to put my two cents in on this.  A lot of people make
generalizations based on the kind of coding they do, and state that the
kind of coding other people are paid to do is stupid.

I work on low-level 3D graphics for Digital Equipment in Palo Alto.  We
build medium-end 3D workstations (whatever those are).  Somewhere in the
workstation there is always a geometry accelerator (if there isn't, some
other group does the work).  So far, for us, the geometry accelerator is
some kind of RISC chip with hot floating point and a reasonably good C
compiler.  We have used Weitek Accell parts and the Intel i860.

I have been the person who examines chips and says how fast they will go,
so the people who decide how fast we need to go and how much we want to pay
for a chip can make up their minds.  To do this, I spend a few weeks
writing machine/assembler code for the chip for the graphics primitives we
think are interesting, count the bus transactions, and talk to the chip
company and the designers, etc.  Eventually, I have some pretty good pieces
of code that represent a speed of light for the chip and the proposed
memory subsystem and communications bottleneck.  This might take a few
weeks.  We evaluate a lot more chips than we build systems out of.

When we finally decide to build a system, we start out by coding everything
in C (or copying the old C code from the last project).  It runs with
simulated hardware and communications at a fairly hokey level for a while,
while we wait for prototype hardware.  Eventually, we get real hardware and
port to it, and the stuff runs in C.  With pretty good C at this point (the
algorithms are straightforward, well known, and we know them by now), the
code runs between 10 and 30 times slower than the speed of light for the
configuration -- which is, of course, what we are committed to achieving.
The next 90% of the schedule is budgeted to optimization.

We achieve this improvement by recoding in assembler.  Some routines are
recoded several times with slightly different tradeoffs and degrees of
aggressiveness in loop unrolling and inlining.  This is extremely
expensive.  Incidentally, we do start with a few really important expensive
routines, but those unimportant data-shuffling, branching, procedure-
calling routines rapidly get important as we run out of things to do to
those nifty little floating-point routines.  Sometimes we just recode in C
a few times until we hit on a way to get the compiler to generate
reasonable code.  We seem to get about 5X performance out of this phase.

Other things: I wrote a cache-optimizing linker.  Using a call graph and
some human intervention (a hand-generated sed script), it reorders modules
so that there will be no cache misses in a direct-mapped cache (it knows
the size) for code paths that we think are important.  Usually these are
the paths for the graphics primitives, with some of the infrequent stuff
edited out of the call graph, and things that cannot happen in the same
primitive flagged as OK to overlap in the cache.  On the system I developed
this for, we got a 2X performance increase.  (Before, we had about 90% hits
with a 5-cycle penalty for a miss; since one miss implies another if you
are thrashing a direct-mapped cache, the penalty was really 10 cycles.)
10% of operations taking 10 times as long means the whole thing takes twice
as long.

What clued us in to this cache problem was a routine that we recoded in
assembler.
We brought this critical routine from 400 cycles down to 40 cycles, but the
code ran slightly slower.  We looked at the link map, and sure enough, the
resize had caused it to overlap with this other critical routine...  So
recoding in assembly CAN slow down your code, but usually there is a lesson
more interesting than "assembly coding is bad".

Usually we improve the hardware a little bit as we get better data on which
parts *really* are bottlenecks.

Why do we do this?  Basically because we are a hardware company, and people
view the part of the system which we produce as part of the hardware, and
because it seems to be cheaper than including 10X faster hardware on every
system and charging the same.  People do full-custom VLSI too.  If
everybody refused to get their hands dirty and look at some bits, nobody
would build hardware (a much lower-level process than assembly or
microcoding).  The development costs are staggering.  The maintenance costs
are staggering.  But still, it is all quite a bit cheaper than 10000 copies
of 10X faster hardware.

Another reason is that our code is floating-point intensive, but not
exactly vectorizable, and the compilers (MIPS included) do a relatively
poor job of scheduling around multicycle instructions.

The final reason is that if we didn't do it, nobody would buy our boxes,
and I would have to get another job, and I *like* working here.

Incidentally, my favorite language this year is Scheme.  It doesn't run as
fast as C, but sometimes you need to implement programs fast instead of
implementing fast programs.
--
John Danskin				| jmd@decwrl.dec.com or decwrl!jmd
DEC Advanced Technology Development	| (415) 853-6724
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA 94301			| I do not speak for DEC.
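The collision Danskin describes is easy to check for from a link map.  A
minimal sketch, assuming a direct-mapped cache where two routines conflict
whenever their addresses coincide modulo the cache size (cache size,
addresses, and lengths invented for illustration):

	#include <stdio.h>

	#define CACHE_SIZE 65536L   /* assumed: 64K direct-mapped I-cache */

	/* Two routines conflict if their cache-offset intervals
	   intersect.  Ignores wraparound, so it assumes each routine
	   fits between its offset and the end of the cache. */
	int conflicts(long addr_a, long len_a, long addr_b, long len_b)
	{
	    long a = addr_a % CACHE_SIZE;
	    long b = addr_b % CACHE_SIZE;
	    return a < b + len_b && b < a + len_a;
	}

	int main(void)
	{
	    /* The recoded routine shrank, the link map shifted, and
	       it landed on top of another hot routine: */
	    printf("%d\n", conflicts(0x10040L, 400L, 0x30040L, 4000L));
	    return 0;   /* prints 1: same offsets mod 64K */
	}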