[comp.arch] Compiler Costs

zrra07@backus (Randall R. Appleton) (07/05/90)

I have a simple question:  Does anyone know what sort of speed-up
one gets by going from a good implementation of some algorithm in a
third generation language (C, Fortran) and a good optimizing compiler
to hand-coded assembly?

In other words, if I take my average well written program, compiled
with a good optimizing compiler, and re-write it in assembler, what sort
of speedup should I expect to see?

Thanks,
Randy

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?
>
>Thanks,
>Randy

The short answer is: Depends.

OKay, you ask, on what does it depend?

Well, it depends on just about every single adjective you used in
your question:
  How good is "a good implementation"?
  Which algorithm is the "some algorithm"?
  Which "good optimizing compiler"?
  Whose average? Over which set of programs?
  "Well written" for what purpose?
  Who will do the rewriting?
  For which computer?

Basically, performance is like anything else: the variation between
people (using identical tools) is larger than the variation between tools.
So the "optimum" strategy is to get the best person for the job.

You really will have to specify more details to get any meaningful
answer to your question. For example, there are some Fortran compilers
that are extremely good; for many programs, the code is just about
optimal.

The meaning of "well written" is very context dependent. Some people
insist on clear, understandable code; others want it fast; still others
want it to be robust and bullet-proof. Which properties does your
program have and which properties do you want to preserve in assembler?

--
Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh185
Me? Represent other people? Don't make them laugh so hard.

cik@l.cc.purdue.edu (Herman Rubin) (07/06/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 
> In other words, if I take my average well written program, compiled
> with a good optimizing compiler, and re-write it in assembler, what sort
> of speedup should I expect to see?

This only looks simple.  The results might even be drastically different
on different machines.  Even the choice of algorithm might depend quite
heavily on the specific machine.  The infamous 4.2BSD frexp() was an 
example of a problem which can be efficiently (and even very simply) coded
in assembler, but which cannot be efficiently handled by such limited 
languages as those mentioned.  The operations missing from these languages
are pack and unpack; utterly trivial, but not there.
In fact, even less is needed on most machines.
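
For concreteness: the C library exposes these operations as frexp()
and ldexp(), and in assembler each is little more than a mask and a
shift -- but a portable C implementation has to reconstruct the
exponent arithmetically, which is the inefficiency in question.
A minimal sketch of the interface:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
	    int e;
	    /* unpack: 96.0 == 0.75 * 2^7, so m = 0.75 and e = 7 */
	    double m = frexp(96.0, &e);
	    printf("%g * 2^%d\n", m, e);
	    /* pack the pieces back together and check the round trip */
	    return (ldexp(m, e) == 96.0) ? 0 : 1;
	}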
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

toma@tekgvs.LABS.TEK.COM (Tom Almy) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

Simple Answer: That depends...

If the program implements an algorithm that is well suited for the compiler
(standard matrix multiplication in Fortran, for example) or is a program that
is a standard benchmark for which *every* compiler writer optimizes, you
might not see any improvement at all. In fact, it might run slower if you are
not as clever as the compiler writer!

On the other hand, if the algorithm cannot be efficiently represented in
the language (string manipulation in Fortran, complex numbers in C) or there
are special cpu instructions you can take advantage of that the compiler
doesn't, then you can get major improvements, even an order of magnitude.

In general, there is only a small portion of the code that uses most of the
execution time, and by coding those portions in assembler, you can get most
of the assembler performance while still enjoying the convenience of a 
high level language for the bulk of the code.
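
A minimal sketch of that split, in C (inner_loop() and its interface
are invented for illustration; the routine itself would live in a
separate .s file):

	/* inner_loop.s holds the hand-coded hot path */
	extern void inner_loop(unsigned char *buf, int n);

	void process(unsigned char *buf, int n)
	{
	    /* argument checking, I/O, and bookkeeping stay in C */
	    if (buf != NULL && n > 0)
	        inner_loop(buf, n);
	}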

If you are concerned about code size, there is still a major improvement
to be had with assembler, typically because you can custom tailor
subroutine calling conventions and register usage. But then again many
modern compilers let you do the same.

(I hope this won't be the start of some "language war" :-)

Tom Almy
toma@tekgvs.labs.tek.com
Standard Disclaimers Apply

henry@zoo.toronto.edu (Henry Spencer) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

It can be negative.  Good optimizing compilers can often do a better
job on large bodies of code than humans can.  Small code fragments
often do better when carefully crafted by humans, although this is
a very time-consuming activity.

Improving the algorithm is usually a more cost-effective way of boosting
performance.
-- 
"Either NFS must be scrapped or NFS    | Henry Spencer at U of Toronto Zoology
must be changed."  -John K. Ousterhout |  henry@zoo.toronto.edu   utzoo!henry

gillett@ceomax..dec.com (Christopher Gillett) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>one gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>

This is, among other things, an interesting question, but I certainly
wouldn't characterize it as "simple" :-).  There are a huge number of
factors to consider in making this determination, and many of them
are subjective and/or incalculable.  Here are some things that you
should consider:

     1.  Nature of the program:  Something that is fairly linear and 
         spends all its time calling system functions is not likely
         to be sped up by conversion to assembler.  Something fairly
         complex that uses lots of control transfers, and is reasonably
         non-obvious (itself a very subjective term) is more likely to
         be improved.  

     2.  How good is good?  The quality of the compiler that you are
         using is significant.  Compilers range from utter trash to
         seemingly clairvoyant in their abilities to generate efficient
         code.  Since there's no really good metric that you can apply
         to an arbitrary compiler to determine its effectiveness, you're
         left making a judgement call about whether or not the compiler
         is a better assembly language programmer than you.

     3.  How talented are you?  The art of assembly language programming
         is rapidly dying out.  Many people are really gifted high level
         programmers who fall apart in the assembly world (absolutely no
         jabs at anybody at all intended here, btw :-) ).  Problems are
         solved much differently in the two different worlds.  If you try
         to do a straight translation from high level to low level, you're
         not doing much better than the compiler (in fact, probably much
         worse).  However, if you are a gifted machine-level hacker, and
         you really understand the program you want to convert (like,
         did you write it yourself?), then you might gain significantly
         by converting the code. 

These are just the first few things that pop into mind.  My opinion on the
whole matter is that you should consider language issues long before putting
down any code.  The very nature of the program should dictate which language
to use.  Further, it's been my observation that code conversion jobs (whether
from one high level language to another, or from a high level to a low level,
or low to high) are expensive, time consuming, and not very rewarding.

>Thanks,

You're welcome!

>Randy

/Chris
---
Christopher Gillett               gillett@ceomax.dec.com
Digital Equipment Corporation     {decwrl,decpa}!ceomax.dec.com!gillett
Hudson, Taxachusetts              (508) 568-7172

nmm@cl.cam.ac.uk (Nick Maclaren) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?


It is impossible to predict, in general, but the following rules of thumb
may be useful.  Consider the minimal program to take unit time (or space);
the following multiplicative factors are good guidelines:

Competence of assembler programmer    1x  -  5x
Competence of 3 GL programmer         1x  -  3x
Suitability of 3 GL for algorithm     1x  -  10x 
Competence of compiler's code         1x  -  3x
Degree of optimisation                1x  -  10x 

You can therefore expect the assembler to be anything from 900 times
faster to 5 times slower!  This excludes even more drastic factors (in
either direction), caused by gross mistakes or misunderstandings (e.g.
dropping into a floating-point emulator to do 2-D array indexing).
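
(Multiplying out the extremes of the table: the worst HLL case is
3 x 10 x 3 x 10 = 900 against a best-case assembler program, hence
"900 times faster"; the worst assembler case is the factor of 5
against a best-case HLL program, hence "5 times slower".)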

Nick Maclaren
University of Cambridge Computer Laboratory
nmm@cl.cam.ac.uk

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>... if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

The speedup will vary.  A lot. There are rare examples where
well-written assembler code was slower.

-- 
Don		D.C.Lindsay

mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
   In other words, if I take my average well written program, compiled
   with a good optimizing compiler, and re-write it in assembler, what sort
   of speedup should I expect to see?

Lots of good answers, but everyone missed my favorite reason for
compilers generating lousy code - they have a machine that's
ill-matched to the language in question. The best example is to take a
language that really expects there to be a stack, and compile it for a
machine that doesn't have one. An excellent compiler will decide which
routines aren't possibly recursive and compile them into code that
doesn't use the stack, but other routines will have to simulate the
stack in some (probably slow & ugly) manner. I never saw a compiler
that smart, but it's been years since I worked on such a machine.

Given that all routines are compiled as if they might be recursive,
recoding in machine language in such a case - even if the coder is
only moderately competent - to use fixed memory locations for local
variables and pass arguments in registers instead of doing those
things on a fake stack can easily run an order of magnitude faster,
and a factor of 2 smaller.
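
A minimal sketch of the difference, in C (names and sizes invented;
on a real machine the hand-coded version would also keep its argument
in a register):

	static int stack[1024];          /* the compiler's simulated stack */
	static int sp;

	int f_compiled(int x)
	{
	    stack[sp++] = x;             /* "push" the argument */
	    int r = stack[sp - 1] * 2;   /* every access goes through memory */
	    sp--;                        /* "pop" the frame */
	    return r;
	}

	static int f_x;                  /* fixed memory location instead */

	int f_hand(int x)
	{
	    f_x = x;                     /* safe only because f_hand never recurses */
	    return f_x * 2;
	}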

	<mike
--
So this is where the future lies			Mike Meyer
In a beer gut belly; In an open fly			mwm@relay.pa.dec.com
Brilcreamed, acrylic, mindless boys			decwrl!mwm
Punching, kicking, making noise

jlg@lanl.gov (Jim Giles) (07/07/90)

From article <373@e2big.mko.dec.com>, by gillett@ceomax..dec.com (Christopher Gillett):
> [...]
>      3.  How talented are you?  The art of assembly language programming
>          is rapidly dying out.  Many people are really gifted high level
>          programmers who fall apart in the assembly world (absolutely no
>          jabs at anybody at all intended here, btw :-) ).  [...]

This is an interesting point which is more complex than it appears.
I can do assembly code reasonably well on most RISC-ish type machines
(which include Crays - only one addressing mode, simple instructions,
etc.).  I can't do well at all with CISC machines (I can't see the
point in taking the trouble to learn all those modes and such).  So,
part of your point three is really a reference to personality.

However, you left out target machine as one of your points.  The type
of hardware you are targeting can have a very strong impact on how
well you can do vs. a compiler.  On a simple scalar machine, you might
not do as well as the compiler since implementors have long since found
ways of generating good code for such machines (besides, there really
isn't much you can do - the scalar non-pipelined code has only one
"degree of freedom" and that is instruction selection - ordering and
register allocation are usually obvious).  On a vector or parallel
machine, one loop that _you_ vectorize but the compiler didn't can
gain a _lot_!
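
A minimal sketch, in C, of the kind of loop involved (names invented):
without a guarantee that x and y never overlap, a vectorizing compiler
may refuse to vectorize, while the programmer who knows the call sites
can do it by hand in assembler:

	void saxpy(int n, float a, const float *x, float *y)
	{
	    for (int i = 0; i < n; i++)
	        y[i] += a * x[i];   /* vectorizable iff x and y are disjoint */
	}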

J. Giles

oz@yunexus.yorku.ca (Ozan Yigit) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

When I was a young & budding assembler hack, I tried this on several
compiler outputs. At that time, VMS fortran and C compilers were
considered very good. I was able to out-optimize them by as much as 30%.
[As I said, I was young :-)] These days, compilers like GCC come very
close, and while I am convinced that for a moderate chunk of code, a
determined hacker can still out-optimize a compiler, I fail to see its
value except in some very special cases. [I had to improve a chess program
once, by re-writing its bitmap routines (which got called zillions of
times) in assembler. It was about 30 lines of code, and made about 40%
speed improvement] 

oz

gillies@m.cs.uiuc.edu (07/07/90)

It can be hard to find a speedup on CISC machines, since caching,
alignment, scoreboarding, and even paging effects may be hard to
predict.  For instance, the Motorola 68020 Manual admits that you can
only hand-wave guesses at the execution time of an instruction stream
-- this forces the dedicated tuner to run hordes of benchmarks.

This is my greatest complaint about CISC architecture -- from an
optimization standpoint, I sneer religiously at nondeterministic
low-level hardware hacks in high-performance computer design (I don't
even like caches, scoreboards, or paging -- they're just fads).

Maybe it's because I think all future computers should run real-time
operating systems, and because I do research in real-time system
scheduling and optimization, that I hold these views.

cik@l.cc.purdue.edu (Herman Rubin) (07/07/90)

In article <373@e2big.mko.dec.com>, gillett@ceomax..dec.com (Christopher Gillett) writes:
> In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
> >I have a simple question:  Does anyone know what sort of speed-up
> >one gets by going from a good implementation of some algorithm in a
> >third generation language (C, Fortran) and a good optimizing compiler
> >to hand-coded assembly?

			.........................

>      3.  How talented are you?  The art of assembly language programming
>          is rapidly dying out.  Many people are really gifted high level
>          programmers who fall apart in the assembly world (absolutely no
>          jabs at anybody at all intended here, btw :-) ).  Problems are
>          solved much differently in the two different worlds.  If you try
>          to do a straight translation from high level to low level, you're
>          not doing much better than the compiler (in fact, probably much
>          worse).  However, if you are a gifted machine-level hacker, and
>          you have really understand the program you want to convert (like,
>          did you write it yourself?), then you might gain significantly
>          by converting the code. 

I am inclined to doubt that those suffering from lack of understanding of
the relation between architecture and "how to do it efficiently" will even
be gifted high level programmers.  To produce good code, it is necessary to
understand the cost considerations, and I find it difficult to believe that
this can be done at the limited level of HLLs.

There are two problems with present assemblers.  First, there is their
gosh-awful syntax and arbitrariness.  Suppose that for appropriate
x, y, and z, the operation x = y - z can be performed by a single machine
instruction.  Now I understand what the machine is doing, but how do I
communicate it to the assembler?  Those who have used more than one assembler
are likely to have seen assemblers with the orders of x, y, and z quite
different.

There is also the attitude in the computer field that the user need not take
these things into account, that the compiler is a great genius and should do
it all for you.  This is similar to the attitude which existed in the public
schools almost universally, and is not completely gone yet, that reading
should be learned only by the whole word method, and the idea of sounding
out words, especially in English where there are "so many exceptions", should
not be taught until much later.  I would rate the hardware of most
machines I know as considerably simpler than any HLL.

> These are just the first few things that pop into mind.  My opinion on the
> whole matter is that you should consider language issues long before putting
> down any code.  The very nature of the program should dictate which language
> to use.  Further, it's been my observation that code conversion jobs (whether
> from one high level language to another, or from a high level to a low level,
> or low to high) are expensive, time consuming, and not very rewarding.

I would disagree somewhat.  I have found starting with a Fortran program
usually unhelpful, but that is not the case with C.  Even the talented human can
learn from the apparently stupid observations made by the less talented,
including the compiler.  Others have observed that reverse engineering 
their own programs can result in improvements.

But one is unlikely to produce an artist by having the apprentice color within
the lines.  
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

tbray@watsol.waterloo.edu (Tim Bray) (07/08/90)

zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

Isn't "average well written program" an oxymoron?

The speedup, as several people have pointed out, is difficult to predict.  In
general, however, the *cost-effectiveness* is liable to be negative.  If the
program is used heavily enough that hard-won speedups look attractive, it
will require regular maintenance as people's needs change.  Performing this
maintenance on assembler code is apt to be dramatically more difficult and
burn tons of expensive programmer time.

Second, a *really* cost-effective way to make a program faster is to run
it on this year's hot chip, probably twice as fast as your current computer
if it's > 18 months old.  But if you've committed to one architecture by
going to assembler, you've closed that door.

Third, as some have already said, but it can't be said too often, the
expected performance improvement per unit effort from repeatedly profiling
your code and whittling away at the algorithms in the critical areas is
typically dramatically greater than that for assembly hacking.  And the
results are portable.

Yup, assembler hacking is a dying art, thank goodness.  But I'm glad I'm
old enough to have done it.  Doing serious work with VAX assembler was
out and out *fun* - I suppose hunting dinner with a stone knife was too.

Cheers, Tim Bray, Open Text Systems

cik@l.cc.purdue.edu (Herman Rubin) (07/08/90)

In article <3300143@m.cs.uiuc.edu>, gillies@m.cs.uiuc.edu writes:
> 
> It can be hard to find a speedup on CISC machines, since caching,
> alignment, scoreboarding, and even paging effects may be hard to
> predict.  For instance, the Motorola 68020 Manual admits that you can
> only hand-wave guesses at the execution time of an instruction stream
> -- this forces the dedicated tuner to run hordes of benchmarks.
> 
> This is my greatest complaint about CISC architecture -- from an
> optimization standpoint, I sneer religiously at nondeterministic
> low-level hardware hacks in high-performance computer design (I don't
> even like caches, scoreboards, or paging -- they're just fads).

So the fact that a multiplication takes a variable length of time depending
on the numbers is a flaw in the machine?  I do not think so.  There are 
even clearer examples.  I have algorithms for which a tool operation is
finding the spacing between the 1's in a random bit stream.  Of course
this is non-deterministic.  Also, it is even necessary to have planned
exceptions, which in some cases are rather likely.

There are certainly situations where a non-optimized CISC machine will
beat an optimized RISC machine with the same basic components.  I have
argued for some operations on machines which are deterministic, even,
but which will have hardware switches of essentially no cost, and would
have conditional transfers in RISC, or even in current CISC.

> Maybe it's because I think all future computers should run real-time
> operating systems, and because I do research in real-time system
> scheduling and optimization, that I hold these views.

There are many situations which cannot be done in real time.  Also, the
operating system should be a very minor part of the software, and do only
what an operating system, and not user modules, must.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

peter@ficc.ferranti.com (Peter da Silva) (07/08/90)

In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> But one is unlikely to produce an artist by having the apprentice color within
> the lines.  

Disney, Fritz Freleng, Don Bluth, and other animators would disagree violently.
There are some products of the human mind that are best created by teams of
more and less talented individuals. Animators and apprentices... or humans and
compilers.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) (07/08/90)

From article <1797@apctrc.UUCP>, by zrra07@backus (Randall R. Appleton):
> I have a simple question:  Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 

An example with VS-FORTRAN on an IBM-3090 with a vector processor
(relative run times; smaller is faster):

a.  plain fortran matrix		1.0
b.  full optimized fortran		0.3
c.  IBM's ESSL hand coded library	0.1 or better

As another example, you could track down the standard LINPACK
benchmarks from net-lib@ornl.gov.  The report there gives a good
summary of how LINPACK performs under different conditions.

Steve

usenet@nlm.nih.gov (usenet news poster) (07/09/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>in response to zrra07@backus (Randall R. Appleton):
>> I have a simple question:  Does anyone know what sort of speed-up
>> one gets by going from a good implementation of some algorithm in a
>> third generation language (C, Fortran) and a good optimizing compiler
>> to hand-coded assembly?
>
>An example with VS-FORTRAN on an IBM-3090 with a vector processor
>(relative run times; smaller is faster):
>
>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

It is my understanding that on the RS/6000, ESSL is implemented in Fortran,
and IBM claims that most ports have experienced zero or negative gains
in attempting to hand code assembler on the machine.  There have been a 
number of references to "the ill conceived notions" of higher level 
languages in this thread, but it is not at all clear to me what these are
supposed to be.  C was designed as a higher level alternative to assembler.
On a RISC machine with a good compiler, a C programmer willing to work 
on his code ought to be able to do almost anything hand coded assembler can.
This may involve looking at assembly output, recoding routines a few times
and testing alternative algorithms, but in the end you still have C (that
will run on another processor even if not as fast).
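
Most UNIX C compilers will show you the code they generate, which is
what makes this recode-and-compare loop practical:

	$ cc -O -S hot.c        # writes the generated assembly to hot.s
	$ more hot.s            # inspect, adjust the C source, recompile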

On a complicated processor (CISC, non-orthogonal registers, segmented
addresses etc.) assembler may do better because the compiler writers made
simplifying assumptions about the processor, making some instructions and
address constructs inaccessible.  Seems like Intel processors will be the
last bastion of assembly language.

>Steve

David States

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/09/90)

In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>I have a simple question:  Does anyone know what sort of speed-up
>>one gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly?
>
>It can be negative.  Good optimizing compilers can often do a better
>job on large bodies of code than humans can.  Small code fragments
>often do better when carefully crafted by humans, although this is
>a very time-consuming activity.

Basically, I agree.  Given the caveat that good compilers are quite
rare, I have seen good compilers consistently outperform handwritten
assembly code -- but humans could usually do local tweaks to compiler
generated code to speed it up even a little more.  IMHO, this is
primarily because compilers usually don't do a very good job of
modeling irregular or complex instructions, whereas humans often are
great at using those, but write inferior code for the more mundane
operations (e.g., they do a poorer job of allocating registers, aren't
consistent about taking advantage of some properties, etc.).  Notice
that RISC architectures don't have many "funny" instructions, so
compilers for them can do particularly well against humans....

Basically, optimized code generation is a search procedure -- if the
compiler knows to search for the right things (which is learned in
part by observing what the best human programmers do), the fact that
computers can search faster and more consistently than humans will
eventually make compiler technology generate better code than a
typical human every time.  Think about how computers play chess -- at
first, clever heuristics were best, but lately the winners have
focussed more on very fast search techniques, and now they are
generally better than most human chess players.  Yes, the very best
human chess players still win, but they are a very small group, and
each game played so carefully is a major effort (i.e., costly).  So it
is/will be for compilers versus humans writing assembly language
programs....

In any case, for good C and Fortran compilers using 1970s compiler
technology and generating code for a PDP11ish target, I'd say a factor
of 2.5x slower than carefully written assembly code is typical.  For a
compiler using 1980s compiler technology (there are very few of these)
and targeting a RISC machine, I'd say it's common to be 30% or more
*faster* than hand-written code (assuming a dumb, non-optimizing,
assembler).  Of course, there are billions and billions of caveats on
these numbers....

						-hankd@ecn.purdue.edu

cliffc@erato.rice.edu (Cliff Click) (07/09/90)

In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>But one is unlikely to produce an artist by having the apprentice color within
>the lines.  

For once I agree with you!

Looking at what some compilers produce (say, at least 6 commercial ones) I can
get a factor of 2 speedup by doing "stupid" optimizations that I would expect
the compiler to get on its own.  To get better than that I cannot color
within the compiler's lines; generally I need to restructure how things
are solved - not necessarily a change of algorithms but a change of *structure*.

Cliff Assembly Hacker Click

--
Cliff Click                
cliffc@owlnet.rice.edu       

rmarks@KSP.Unisys.COM (Richard Marks) (07/09/90)

I wrote a version of UUDECODE for the IBM PC.  It is used by many 
internet users.  It is written in Turbo Pascal 5.5.  I coded it very
carefully with much consideration for speed.

Then I rewrote about 5% of the code, the inner loops, in assembler and
got a 50% speed up.

Moral:  Consider implementing in your favorite higher level language;
but then recode the inner loops in assembler for performance.

Regards,
Richard Marks
rmarks@KSP.unisys.COM

atk@boulder.Colorado.EDU (Alan T. Krantz) (07/09/90)

In article <1990Jul8.230954.18881@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>
>>It can be negative.  Good optimizing compilers can often do a better
>>job on large bodies of code than humans can.  Small code fragments
>>often do better when carefully crafted by humans, although this is
>>a very time-consuming activity.
>
>Basically, I agree.  Given the caveat that good compilers are quite
>rare, I have seen good compilers consistently outperform handwritten
>assembly code -- but humans could usually do local tweaks to compiler
>generated code to speed it up even a little more. 

I'm not so sure I've seen this very often. It seems that on some of the
newer machines (IBM RS/6000), which have very special conditions for
optimizing the pipeline, this might be more accurate - but on many machines it
isn't. The trick is that when you hand-write assembly code you get to
optimize register usage across routine calls much better than most
compilers can (actually, to be honest, I haven't seen any compilers do
global register optimization - though this is an area of active
research). Anyway, almost every comment I've seen on compilers vs hand
optimization has been on a routine-by-routine case - and I still believe
(though I could be wrong) that the optimal results from assembly code
come from register usage across/through routine calls...

Ho hum - I'm sure someone will set me straight ....


 
------------------------------------------------------------------
|  Mail:    1830 22nd street      Email: atk@boulder.colorado.edu|
|           Apt 16                Vmail: Home:   (303) 939-8256  |
|           Boulder, Co 80302            Office: (303) 492-8115  |
------------------------------------------------------------------

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (07/09/90)

In article <9730@brazos.Rice.edu> cliffc@erato.rice.edu (Cliff Click) writes:

| Looking at what some compilers produce (say, at least 6 commercial ones) I can
| get a factor of 2 speedup by doing "stupid" optimizations that I would expect
| the compiler to get on its own.  To get better than that I cannot color 
| within the compiler's lines; generally I need to restructure how things
| are solved - not necessarily a change of algorithms but a change of *structure*.

  Yes to both ideas. I was doing a Mandelbrot program and when I had
done all I could to speed it up, I output the assembler code to a file,
optimized it by hand, and got a 45% reduction in CPU time. However, I
had already gotten a 60% reduction by improving the algorithm and the
implementation in C.

  While there is room for notable improvements when you have a small
routine called 600 million times, in general the real savings come from
improving the overall algorithm to reduce the work to be done.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (07/10/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> one gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?

This seems to be following along in the thread of compiler optimization,
where various people are suggesting that optimizers have or have not
replaced human ingenuity in generating efficient code.  This discussion
loses sight of two things:
  1) How hard (long) does the human try?
    - Some years ago I wrote some critical code for an application, part
      of which ran on a TMS32010 and part on an 8086 (it was a multi-
      processor system).  There were two sections of the program that
      had to be *very* fast.  I worked almost a week to write at most
      three pages of code.  When I was done, I doubt that any optimizer
      could have beat that code.  I tried the implications of using
      different registers for different purposes, and so on until I
      had what was the fastest possible implementation.  But at what
      cost?  I wrote about 150 assembler instructions in 5 days.
      That is 30 instructions/day, or 1 instruction every 15 minutes.

  2) Does portability mean anything to you?
    - The code I wrote was very non-portable.  If the processors had
      been changed, even to other members of the same family (eg.
      TMS32020 or 80286) then all my optimizations would have been
      for naught.  The new processors would have required a different
      optimization strategy.  The old code would have run, but not
      optimally.

To me, the argument of which is better, man or machine, depends on
what you are trying to accomplish.  If time and money are not a
problem, then an expert human can probably beat a compiler.  If
budgets or portability have any importance, the compiler wins hands
down.

P.S.
 To answer your question, I probably saved 50% runtime on the tight
portions of my code.  


-- 
  Terry Ingoldsby                ctycal!ingoldsb@calgary.UUCP
  Land Information Services                 or
  The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

jkrueger@dgis.dtic.dla.mil (Jon) (07/10/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:

>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

But what is this code DOING?  Without that information these numbers
are useless.  It's reporting what happened to the dependent variables
without specifying the independent variables.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!

chip@tct.uucp (Chip Salzenberg) (07/10/90)

According to cik@l.cc.purdue.edu (Herman Rubin):
>There are two problems with present assemblers.  First, there is their
>gosh-awful syntax and arbitrariness.

Assembler syntax isn't an architectural issue.

For example: The Z-80 has two widely used assembler syntaxes, the
extended 8080 and the Zilog.  Both of them are, to me, difficult to
read.  Compare, however, the smal2z language which was bundled with an
old Intel C compiler.  Using a bit of lex and yacc in front of the
normal assembler, the smal2z translator accepted C syntax for Z-80
operations:

	A = B;			/* LD A,B */
	B = *HL++;		/* LD B,(HL) | INC HL */
	do {			/* label: */
		*IX++ = 0;	/* LD (IX),0 | INC IX */
	} while (--B);		/* DJNZ label */

My point: Assembler syntax isn't an architectural issue.  It's just an
arbitrary way to present the architecture to programmers.

>There is also the attitude in the computer field that the user need not take
>these things into account, that the compiler is a great genius and should do
>it all for you.

The attitude I have observed is more like: "I'd rather let the
computer do the grunt work for me, than do it myself."

Micromanagement produces microsuccess.  I prefer macrosuccess.

If you don't like HLLs, don't use them.  But don't heckle those of us
who would use HLLs as a valuable tool -- even our primary tool.
-- 
Chip Salzenberg at ComDev/TCT     <chip@tct.uucp>, <uunet!ateng!tct!chip>

grunwald@foobar.colorado.edu (Dirk Grunwald) (07/10/90)

The MIPS compiler suite (an old bit of software, too) does global
optimization. I think that they still adhere to a set register passing
convention, but they do things like eliminate stack manipulation for
leaf procedures, etc etc.

Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated
the advantages of global register allocation at link time, and it does
very well.  You don't even need to have runtime performance data to
tune the allocation; simple bottom-up coloring appears to work very
well.
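
A minimal sketch in C of the assignment step at the heart of any
coloring allocator (the interference graph is invented; real bottom-up
allocators order the nodes much more carefully):

	#define N 4     /* virtual registers */
	#define K 2     /* machine registers, i.e. colors */

	/* interfere[i][j] != 0 means i and j are live simultaneously */
	static int interfere[N][N] = {
	    {0, 1, 1, 0},
	    {1, 0, 1, 0},
	    {1, 1, 0, 1},
	    {0, 0, 1, 0},
	};

	int color[N];   /* assigned register, or -1 meaning "spill" */

	void allocate(void)
	{
	    for (int i = 0; i < N; i++) {
	        int used[K] = {0};
	        for (int j = 0; j < i; j++)      /* already-colored neighbors */
	            if (interfere[i][j] && color[j] >= 0)
	                used[color[j]] = 1;
	        color[i] = -1;                   /* spill unless a color is free */
	        for (int c = 0; c < K; c++)
	            if (!used[c]) { color[i] = c; break; }
	    }
	}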

Someone just needs to hack it into Gnu C. This would raise the common
denominator of compiler performance & make companies either provide
something of comparable performance or support the Gnu compiler. I
still can't believe how many systems are shipped with shitty
compilers.

mcdaniel@adi.com (Tim McDaniel) (07/10/90)

Henry Spencer wrote:
   It can be negative. ...

hankd@dynamo.ecn.purdue.edu (Hank Dietz) wrote:
   Basically, I agree.

atk@boulder.Colorado.EDU (Alan T. Krantz) writes:
   I'm not so sure I've seen this very often.

Damn right!  Henry Spencer and Hank Dietz AGREEING on something?!?

8-)

--
"I'm not a nerd -- I'm 'socially challenged'."

Tim McDaniel
Internet: mcdaniel@adi.com             UUCP: {uunet,sharkey}!puffer!mcdaniel

jon@hitachi.uucp (Jon Ryshpan) (07/10/90)

>>I have a simple question:  Does anyone know what sort of speed-up
>>one gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly?
...
>							Notice
>that RISC architectures don't have many "funny" instructions, so
>compilers for them can do particularly well against humans....

One of the reasons that it's not so easy to improve on the C compiler
for RISC chips is that they were designed to support good C compilers.
So if what you want to say (what your algorithm requires) is easy to
express in C, you're likely to get a compilation that's hard to improve
on.

Even so, if what you want to say is not easy to express in C, like
"complex number" or "function in a function" or "static variable bound
to a fixed memory location" you may get an object that would benefit
from hand coding.

Jonathan Ryshpan		<...!uunet!hitachi!jon>

M/S 420				(415) 244-7369  	
Hitachi America Ltd.
2000 Sierra Pt. Pkwy.
Brisbane CA 94005-1819

meissner@osf.org (Michael Meissner) (07/10/90)

In article <23285@boulder.Colorado.EDU> grunwald@foobar.colorado.edu
(Dirk Grunwald) writes:

| The MIPS compiler suite (an old bit of software, too) does global
| optimization. I think that they still adhere to a set register passing
| convention, but they do things like eliminate stack manipulation for
| leaf procedures, etc etc.
| 
| Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated
| the advantages of global register allocation at link time, and it does
| very well.  You don't even need to have runtime performance data to
| tune the allocation; simple bottom-up coloring appears to work very
| well.

I've always wondered about the gains.  Speaking off the cuff, I would
imagine you would get gains if you limit yourself to a Fortran style
of coding (no matter what language you use).  By Fortran style, I mean
little or no recursion, all calls are direct (not through pointers),
and all routines exist when the program is loaded.

In particular, calls to routines through pointers have to fall back on
some sort of fixed rules such as which set of registers are caller
save or callee save, and where the arguments go.  Most of the object
oriented programs that I'm aware of ultimately call functions through
pointers, and this in fact is touted as a good thing, so that a
subclass can replace a function as needed.  Also in dynamic languages
like LISP, the function being called may not have been created yet, or
will have been replaced by the time the call is made.  A weaker
version of this is if your system has shared libraries -- when the
link is made, the shared libraries are not bound in, they are
generally bound in when either the program starts up, or an unbound
function is called.
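
A small C illustration of the function-pointer case (names invented):

	typedef int (*op_fn)(int, int);

	/* The compiler cannot know which function f is at run time, so
	   the arguments and saved registers must follow the fixed ABI
	   convention rather than anything tuned for a particular callee. */
	int apply(op_fn f, int a, int b)
	{
	    return f(a, b);
	}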

I also wonder about the computational complexity of global register
analysis.  My experience has tended to be that register allocation is
one of the more memory intensive pieces of compilers, and the size of
the memory needed scales with the number of basic blocks used and the
number of distinct pseudo regs requested.  I could easily imagine a
global register analysis phase thrashing on really large programs,
unless severe restrictions were used, and those restrictions would
tend to limit optimization.

| Someone just needs to hack it into Gnu C. This would raise the common
| denominator of compiler performance & make companies either provide
| something of comparable performance or support the Gnu compiler. I
| still can't believe how many systems are shipped with shitty
| compilers.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Do apple growers tell their kids money doesn't grow on bushes?

cliffc@sicilia.rice.edu (Cliff Click) (07/11/90)

In article <MEISSNER.90Jul10102037@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
>I also wonder about the computational complexity of global register
>analysis.  

Global register allocation is NP-complete.

>My experience has tended to be that register allocation is
>one of the more memory intensive pieces of compilers, and the size of
>the memory is needed scales with the number of basic blocks used and
>number of distinct pseudo regs requested.  I could easily imagine a
>global register analysis phase thrashing on really large programs,

Some folks here at Rice experimenting with a nameless commercial compiler
discovered a simple 2000-line fortran program that took 90+ HOURS to compile
on a 10+ Mips, 16 Meg workstation.  The program spent roughly 98% of its time
thrashing the disk.  The algorithm the compiler used required exponential
memory space.  The same compile took ~15 minutes on a 48 Meg workstation.

Cliff Click
cliffc@rice.edu

--
Cliff Click                
cliffc@owlnet.rice.edu       

nevin@igloo.scum.com (Nevin Liber) (07/11/90)

[followups to comp.lang.misc only, please]

In article <1990Jul8.183551.13891@nlm.nih.gov> states@tech.NLM.NIH.GOV (David States) writes:
>On a RISC machine with a good compiler, a C programmer willing to work 
>on his code ought to be able to do almost anything hand coded assembler can.
>This may involve looking at assembly output, recoding routines a few times
>and testing alternative algorithms, but in the end you still have C (that
>will run on another processor even if not as fast).

Bleech!!  (This is not intended as a flame; it is just my humble
opinion.)  If you are going to go to all that bother, you are better
off writing that portion of your code directly in assembler.

There are two ways of programming in C:

	(1) writing high-level portable code
	(2) using it as a macro-assembler

Unfortunately, it is darn near impossible to write code which satisfies
both of these goals!  And from a maintenance standpoint, assembler
itself is better than C for goal #2.  For one thing, it separates the
code which needs to be redone during a port.  For another, it always
does what you expect it to do, even when you change versions of the
compiler.
-- 
	NEVIN ":-)" LIBER
	nevin@igloo.scum.com  or  ..!gargoyle!igloo!nevin
	(708) 831-FLYS
Advertisement:  Hire me!

mash@mips.COM (John Mashey) (07/11/90)

Since this discussion pops up here every 6-9 months, maybe it's worth
a shot at some generalizations, to put some structure around the
flood of anecdotal examples and counter-examples. 
First MY anecdote: I wrote all of the C string library in assembler for
the R2000 a long while back, because you ALWAYS do that when you do a new
UNIX.  THEN, after looking at the generated code from the standard C functions,
we threw out most of the assembly code, because the compiler was good
enough that it wasn't worth QAing the assembly code.

Generalizations: the following factors will encourage one to use
assembly code (left) or compiled code (right).  I think most of the
anecdotes so far fit at least one of these.  The interesting questions are:
1) Are there more good generalizations?
2) What kind of experiments can be done to quantify these more specifically?

ASSEMBLER                            COMPILER
1. Old, non-optimizing compiler      Modern, global optimizer at least;
                                     in the scientific world, a
                                     vectorizer/parallelizer if appropriate
2. Architecture/language mismatch    Architecture/language well matched
   2a. Architecture weird
   2b. Language not expressive enough
3. One architecture target, forever  Portability
   (as in some embedded work)        (as in UNIX commands)
4. Absolute speed requirement        Premium on development cost, maintenance
5. Absolute space requirement        Space costs, but no hard, low limit
6. Programming time is cheap         Programming time is not cheap

As usual, in a domain-of-discourse-confusion type of discussion,
many of the arguments would disappear if people would preface their
opinion by specifying which domain they're living in (i.e., which of
the attributes above apply to what they do.)
Also, it would probably encourage light, and reduce heat, if people
would avoid statements of the form
	I haven't seen it so, it must not exist
in favor of
	In domain X, it doesn't seem to exist.  Does it elsewhere?

In this light:
	a) People with RISCs, good optimizers, doing general-purpose
	code, using C or FORTRAN seldom write assembler code.
	b) People with CISCs are more likely to go to assembler,
	especially when using pcc-style compilers. Likewise, people
	using machines with vector units, special graphics operations,
	unusual pipelines or some parallel architectures.
	c) People that need operations not easily expressed in
	popular languages go to assembler more often.
	d) Embedded control folks are more likely to use assembler.
	etc.

Of course, it is the job of computer system designers to make systems
for which the assembler/compiler tradeoff increasingly favors compilers,
and the higher the level the better (i.e., do it in shell or awk ...
or 4GLs, or whatever, if possible), but this doesn't mean
that there are not sometimes good reasons for doing it in assembler
some of the time.....  Let's just hope that we can make it less often.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) (07/11/90)

In article <759@bbking.KSP.Unisys.COM> rmarks@KSP.Unisys.COM (Richard Marks) writes:
   Moral:  Consider implementing in your favorite higher level language;
   but then recode the inner loops in assembler for performance.

You missed a step - instrument the code to find out where the time is
really being spent, and why. Your intuition about that is probably
wrong. And since no such statement is complete without anecdotal
evidence, here's mine:

I had software that was doing strange floating point calculations.
80+% of the cpu time was spent inside an inner loop. 90+% of the time
in the loop was spent in a system function. Changing the inner loop to
precompute the 50 or so values it needed and using a lookup table
basically eliminated that 90%, giving me a speedup of 3-4 times.

If I'd just rewritten the inner loop in assembler without
instrumenting it, and done an excellent job of it, I'd have improved
the speed of the 10% of the loop by an order of magnitude, giving me
code that ran maybe 8% faster. Instrumenting took less time than an
assembler rewrite and the lookup change was trivial.
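
A minimal sketch of that rewrite, in C (the function and table size
are invented stand-ins for the real computation):

	#include <math.h>

	#define NVALS 50
	static double table[NVALS];

	void init_table(void)               /* run once, before the loop */
	{
	    for (int i = 0; i < NVALS; i++)
	        table[i] = sin(i * 0.1);    /* the formerly repeated call */
	}

	/* the inner loop then reads table[i] instead of calling sin() */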

Not that rewriting in assembler isn't sometimes the right solution. In
this case, improving the algorithm was the right solution. But the
right first step is always to instrument things so you know you're
solving the right problem.

	<mike
--
Cats will be cats and cats will be cool			Mike Meyer
Cats can be callous and cats can be cruel		mwm@relay.pa.dec.com
Cats will be cats, remember this words!			decwrl!mwm
Cats will be cats and cats eat birds.

Publius@dg.dg.com (Publius) (07/12/90)

In article <40052@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Generalizations: the following factors will encourage one to use
>assembly code (left) or compiled code (right).  I think most of the
>anecdotes so far fit at least one of these.  The interesting questions are:
>1) Are there more good generalizations?
>2) What kind of experiments can be done to quantify these more specifically?
>ASSEMBLER                            COMPILER
>1. Old, non-optimizing compiler      Modern, global optimizer at least;
>                                     in the scientific world, a
>                                     vectorizer/parallelizer if appropriate
>2. Architecture/language mismatch    Architecture/language well matched
>   2a. Architecture weird
>   2b. Language not expressive enough
>3. One architecture target, forever  Portability
>   (as in some embedded work)        (as in UNIX commands)
>4. Absolute speed requirement        Premium on development cost, maintenance
>5. Absolute space requirement        Space costs, but no hard, low limit
>6. Programming time is cheap         Programming time is not cheap


These are fair-minded, gentlemanly, brilliant comments.  I only want to add
the following:

Many "large" programs spend most of the CPU time in one or a few relatively
"tiny" routines.  Hand-coding these routines will not cause much maintenance
and portability problem, AS LONG AS they can be isolated.

I am not sure if "comp.arch" is the proper place to talk about software
development methodology.  However, it is not unusual these days that
different pieces of a large-scale software project are written in different
languages.  If we take a look at the complex source tree of a typical
large-scale software project, with many levels of makefiles, quite often we will
run into an "asm" sub-directory, in which we can find a few ".s" files.
The key is not really whether assembler code can be used.  The key is whether
these assembler-coded routines can be properly isolated and documented.

Many of the optimization problems belong to the "NP-complete" or even
the "Turing-uncomputable" categories.  Thus, no matter how smart the
global optimizer is, as long as it is equivalent to a Turing machine,
it cannot achieve FULL optimization for EVERY situation.

If a large program spends most of the CPU time in a tiny routine, then,
go ahead and hand-code it in assembler and do an exhaustive optimization,
as long as you isolate it and document it comprehensively.



-- 
Disclaimer: I speak (and write) only for myself, not my employer.

Publius     "Old federalists never die, they simply change their names."
publius@dg-pag.webo.dg.com

ballen@csd4.csd.uwm.edu (Bruce Allen) (07/12/90)

There is a nice discussion of the relative speedups that can be obtained
by various kinds of optimization in "Writing efficient programs"
by Jon Bentley (Prentice-Hall Software Series, 1982).

In two examples, discussed in detail by Bentley (pgs 26-28 and 130)
the speedup obtained in hand-coding in machine-language was a factor
of 2 in one case and a factor of 5 in the other case.  The latter
case was one in which the machine had a particularly well-suited
instruction; the former case is more typical.

The major lesson that one gets from Bentley's book is that almost
all the significant speedups are obtained in designing data
structures, algorithms, and general good coding.  In this book
many examples are given of "typical" code being speeded up by
factors of ten or more just by careful rewriting of high-level
code.

I think this is an excellent, fun-to-read book. Anyone who has not
looked at this book should peruse it the next time they're in a 
good bookstore (if you don't get distracted by "Programming Pearls",
by the same author, just next to it!).

rcd@ico.isc.com (Dick Dunn) (07/12/90)

mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) writes:
> rmarks@KSP.Unisys.COM (Richard Marks) writes:
>    Moral:  Consider implementing in your favorite higher level language;
>    but then recode the inner loops in assembler for perfromance.
> 
> You missed a step - instrument the code to find out where the time is
> really being spent, and why. Your intuition about that is probably
> wrong...

I'll add one more step to precede these two:  Decide whether performance
really matters before you start any tuning.  Mike and Richard may have
assumed that this step had already been taken, but it still bears pointing
out.  First think whether performance is an issue, and if so, whether it
will be cost-effective to fix it.  Then (as Mike sez) figure out where the
performance problems are...and he's right that most people miss the problem
areas.  Don't tune something that doesn't matter.  You'll probably find
that most of the time is spent in a tiny area, at which point you apply
Richard's advice.
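
On most UNIX systems the instrumenting step itself is cheap, e.g.:

	$ cc -pg -o prog prog.c    # compile with profiling hooks
	$ ./prog                   # a normal run; leaves gmon.out behind
	$ gprof prog gmon.out      # per-routine times and call counts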
-- 
Dick Dunn     rcd@ico.isc.com  -or-  ico!rcd          (303)449-2870
   ...Never attribute to malice what can be adequately explained by stupidity.

des@elberton.inmos.co.uk (David Shepherd) (07/12/90)

In article <9743@brazos.Rice.edu>, cliffc@sicilia.rice.edu (Cliff Click) writes:
|> In article <MEISSNER.90Jul10102037@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
|> >I also wonder about the computational complexity of global register
|> >analysis.  
|> 
|> Global register allocation is NP-complete.
|> Some folks here at Rice experimenting with a nameless commercial compiler
|> discovered a simple 2000-line fortran program that took 90+ HOURS to compile
|> on a 10+ Mips, 16 Meg workstation.  The program spent roughly 98% of its time
|> thrashing the disk.  The algorithm the compiler used required exponential
|> memory space.  The same compile took ~15 minutes on a 48 Meg workstation.

When compiling one of the files to build the HOL theorem prover, the C code
that AKCL produces from the lisp "compilation" of the original ML code (ok, so
we're not talking about normal "user code" here) takes > 24 hours to compile
at -O2.  Without optimization it takes about 5 minutes.  I can't remember
exactly, but I don't think it was actually thrashing the disk that much --
i.e. the cpu seemed solid at 100%.


--------------------------------------------------------------------------
David Shepherd: des@inmos.co.uk or des@inmos.com    tel: 0454-616616 x 529
                INMOS ltd, 1000 Aztec West, Almondsbury, Bristol, BS12 4SQ

sysmgr@KING.ENG.UMD.EDU (Doug Mohney) (07/12/90)

In article <628@dg.dg.com>, Publius@dg.dg.com (Publius) writes:
>I am not sure if "comp.arch" is the proper place to talk about software
>development methodology.  

Funny, some of the responses I've seen on here sound like ideology.

yam@cbnewsi.att.com (toshihiko.yamakami) (07/13/90)

I have little experience with assembler coding,
so I might make rather silly comments.

From article <628@dg.dg.com>, by Publius@dg.dg.com (Publius):
> Many of the optimization problems belong to the "NP-complete" or even
> the "Turing-uncomputable" categories.  Thus, no matter how smart the
> global optimizer is, as long as it is equivalent to a Turing machine,
> it can not achieve the FULL optimization for EVERY situation.

I always wonder why human beings can do 'Turing-uncomputable' computation.
If human beings can do it, why can't a program simulate it?
If anyone is kind enough to mail me the reason, I would appreciate it.

> If a large program spends most of the CPU time in a tiny routine, then,
> go ahead and hand-code it in assembler and do an exhaustive optimization,
> as long as you isolate it and document it comprehensively.

It is "comp.arch", so I would like to comment on RISC chips
and optimization. Following discussion may not be applied to CISC.
I believe it was true that hand-code in assembler was fast
than compiler code in CISC era.
however, some of RISC chip's instruction set is so simple,
so I wonder what people can do in hand on such a code.
For example, let's consider MIPS R2000(or R3000).
Last year, I tried to my first assembler optimization in my life.
Our group needed the enciphering performance of 1 Mbit/sec/MIPS.
To achieve it, I was forced to do some hand optimization.
To be honest, I don't like assembler coding(I am biased, you know :-) ).
I tried to re-write compiler output code on MC68030, SPARC, R2000.
I could improve performance about three times in MC68030, 5-10% on SPARC,
nothing(to be honest, negativbe) on R2000.
To my surprise, every time I rewrote the R2000 code, the performance
was decreased, and I gave up rewriting.
After that, I believe it is almost impossible to improve
performance in hand-coding on some types of RISC chips.

I would be happy to hear other assembler experts' hand-coding experiences
on MIPS or similar RISC chips.

-- 
Toshihiko YAMAKAMI(NTT, Japan) Resident visitor in Bell Labs until Feb 1991
 Room 4G634, AT&T Bell Labs, Crawfords Corner Rd. Holmdel, NJ 07733-1988
 Tel:(201)949-5742	e-mail: yam@vax135.att.com (was: yam@nttmhs.ntt.jp)

jkrueger@dgis.dtic.dla.mil (Jon) (07/13/90)

[truths about ways and means of getting the most results
out of each processor second]

Should mention somewhere that for an interesting class of problems,
storing and sharing the results is more of a bottleneck than computing
them.  And that engines and methods that compute them faster only help
somewhat for committing them and making them available faster.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!

karsh@trifolium.esd.sgi.com (Bruce Karsh) (07/13/90)

In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:

>Many "large" programs spend most of the CPU time in one or a few relatively
>"tiny" routines.

This is often stated.  Is there any evidence for it?  What does the
distribution of time look like for, say, a compiler?  I'd expect
that there would be lots of time-sinks.  The lexical analyzer, the
symbol table hash routine, the symbol table lookup routine, the common
subexpression finder, etc., could all be candidate time-sinks.

How about a text editor?  I could think of a few dozen potential time-sinks
in a text editor.

When I've profiled programs, I've often found that the time is spent mostly
in a few routines, but this isn't always the case.  How often is it true that
nearly all of the time is spent in only a few routines?  How often are these
routines tiny?

			Bruce Karsh
			karsh@sgi.com

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/13/90)

>When I've profiled programs, I've often found that the time is spent mostly
>in a few routines, but this isn't always the case.  How often is it true that
>nearly all of the time is spent in only a few routines?  How often are these
>routines tiny?
>
>			Bruce Karsh
>			karsh@sgi.com

I, also, have profiled programs where the 90%/10% for routines rule
doesn't apply.  Well, actually, in a few that I have investigated in
detail there was some sort of ratio rule applicable to *types* of
code, but not according to routine. For example, I have seen programs
where 50% of the time was spent in function calling and returning,
manipulating variables, etc. The problems were (1) determining what
types of classifications (time spent function calling, time spent in
loop overhead, time accessing method variables...) were appropriate,
and (2) getting my profiler to tell me how much time was spent
according to each classification.
    Unfortunately, most profilers do not have a way of indicating "X%
of your time was spent in function calling". Gprof tends to provide
function granularity (possibly with programmer-provided parts).
That's why I quickly fall back to "flat" profiling, done at the
instruction level, for then I can classify instructions according to
"purpose".  I wish I had access to a commercial profiler that does
this. Is there one?
    Finishing the anecdote... the program that spent 50% of its time
in function calls? --- It was one of those well-structured "CS"
programs :-) [*] that had lots of small functions called all over the
program.  Inlining (actually, back then I macrofied - didn't have
access to a compiler with inlining) functions with fewer than 10 lines
of code in their bodies got me most of the necessary improvement.
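
For a flavor of what that macrofication looked like (a hypothetical
example, not the actual program): take a tiny accessor that dominates
the profile purely through call/return overhead and turn it into a
macro, so the call disappears entirely:

    struct point { int x, y; };

    /* function version: every use pays call/return overhead */
    int point_x(struct point *p) { return p->x; }

    /* macrofied version: the "call" compiles to a single load */
    #define POINT_X(p) ((p)->x)

An inlining compiler makes the same transformation automatically,
which is exactly the wish in the footnote below.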


[*] This is just a mild reference to Karsh et al's parallel discussion.
    I don't want to get involved in that dispute.  I *like* well-structured
    programs with lots of subroutines.  I just wish that all of my compilers
    did inlining automatically.

--
Andy Glew, aglew@uiuc.edu

firth@sei.cmu.edu (Robert Firth) (07/13/90)

In article <2319@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>To produce good code, it is necessary to
>understand the cost considerations, and I find it difficult to believe that
>this can be done at the limited level of HLLs.

What cost considerations, pray?  The costs that usually matter are
the costs of designing, producing, testing, debugging, and
maintaining the code, since these things are very intensive
in human labour.  Most people who build software are very
sensitive to these costs, which is the reason they use high-level
languages.

A few people, however, are completely insensitive to most of the
costs that really matter, and focus on the cost of executing the
code on a machine.  These people code in assembler, and they can
be an expensive luxury as their tight, efficient programs
repeatedly fall behind schedule, fail on simple test cases, and
eat up unbelievable amounts of maintenance time.

I find it very sad that so many of these discussions take the
standpoint that it is the purpose of the human to serve the
machine.

henry@zoo.toronto.edu (Henry Spencer) (07/14/90)

In article <64045@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?  What does the
>distribution of time look like for, say, a compiler?  I'd expect
>that there would be lots of time-sinks...

My personal feeling is that the "hot spot" principle has become accepted
dogma on the basis of rather few actual cases, the most infamous being
the XPL compiler that spent all of its time skipping trailing blanks in
card images.  I suspect that a lot of hot-spot examples are the result
of something like this:  stupid code, or a stupid data format.

Our experience with C News was that there never was a single massive hot
spot.  Typically the profile would show half a dozen major contributors
to execution time and then a long tail of smaller ones.  Often, the
way to fix it was *not* to optimize the profile leaders, but to look at
where they were being called from and why, and streamline the algorithms
in the callers.
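
A classic small-scale illustration of that caller-side fix (not from
C News itself): strlen() shows up as a profile leader, but the real
bug is a caller that re-scans the string on every loop iteration.

    #include <string.h>

    void upcase_slow(char *s)
    {
        size_t i;
        for (i = 0; i < strlen(s); i++)   /* strlen run n times: O(n^2) */
            if (s[i] >= 'a' && s[i] <= 'z')
                s[i] -= 'a' - 'A';
    }

    void upcase_fast(char *s)
    {
        size_t i, n = strlen(s);          /* run once: O(n) overall */
        for (i = 0; i < n; i++)
            if (s[i] >= 'a' && s[i] <= 'z')
                s[i] -= 'a' - 'A';
    }

Hand-tuning strlen() itself would have missed the point entirely.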
-- 
NFS:  all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology
and its performance and security too.  |  henry@zoo.toronto.edu   utzoo!henry

Publius@dg.dg.com (Publius) (07/14/90)

In article <64045@sgi.sgi.com> karsh@trifolium.sgi.com (Bruce Karsh) writes:
>In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?


The best-known example is probably the LINPACK benchmark (although it is
not big).  It spends over three quarters of the CPU time in a tiny routine
called daxpy() (saxpy() for the single-precision case).  The same daxpy()
routine, though coded a little differently, is also responsible for most of
the CPU time of one of the SPEC 1.0 Suite benchmarks (matrix300, to be exact).
Actually, many numerical analysis programs spend big chunks of CPU time in
this daxpy() routine.
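
For reference, daxpy computes y := a*x + y over a pair of vectors; a
simplified unit-stride C rendering (the real BLAS routine also takes
stride arguments and is unrolled) shows just how tiny the routine is:

    /* y := a*x + y -- the inner loop of much of LINPACK */
    void daxpy(int n, double a, double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
    }

Nearly all of the benchmark's arithmetic funnels through that one loop.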

Many years ago, Donald Knuth stated that most of the CPU time in batch
business application environments was spent doing sorting.

In practice, the better a large piece of software is modularized, the easier
it is to identify the time-consuming routines.  Also, it is not difficult to
understand that programs written with re-usable code in mind tend to
spend much of their CPU time in the same routines.

-- 
Disclaimer: I speak (and write) only for myself, not my employer.

Publius     "Old federalists never die, they simply change their names."
publius@dg-pag.webo.dg.com

mo@messy.bellcore.com (Michael O'Dell) (07/14/90)

In article <1990Jul13.173331.22918@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>Our experience with C News was that there never was a single massive hot
>spot.  Typically the profile would show half a dozen major contributors
>to execution time and then a long tail of smaller ones.  Often, the
>way to fix it was *not* to optimize the profile leaders, but to look at
>where they were being called from and why, and streamline the algorithms
>in the callers.



Ohh Yes!! The ol' maxim of 

	There's no substitute for knowin' what the *&$# you're doin'!!!!

I must agree with Henry - after the trivially-obvious things are eliminated,
you have to actually optimize the work, not the code.

	-Mike

---------------

"The obvious is immediately apparent if you look closely enough."

ken@argus.UUCP (Kenneth Ng) (07/14/90)

In article <64045@sgi.sgi.com>, karsh@trifolium.esd.sgi.com (Bruce Karsh) writes:
: In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
: >Many "large" programs spend most of the CPU time in one or a few relatively
: >"tiny" routines.
: This is often stated.  Is there any evidence for it?  What does the
: distribution of time look like for, say, a compiler?  I'd expect
: that there would be lots of time-sinks.  The lexical analyzer, the
: symbol table hash routine, the symbol table lookup routine, the common
: subexpression finder, etc., could all be candidate time-sinks.

While there are undoubtedly programs where most of the time is spent in a
few routines, I've been party to some 100K-line programs where the profiler
indicated that the number one CPU-using routine was taking 2-3% of the
resources, and the rest were in the 0.5 to 2% range, almost noise level.
And the top ten users were I/O routines, so I suspect the profiler was
catching them doing I/O to the various devices.  And yet I know of over
a dozen places where routines could be improved by doing a binary instead
of a linear search.  (Whether or not there would be an improvement is
debatable, because then the entries would have to be inserted and moved
in order, and the average size was only a dozen or so.)

-- 
Kenneth Ng: Post office: NJIT - CCCC, Newark New Jersey  07102
uucp !andromeda!galaxy!argus!ken *** NOT ken@bellcore.uucp ***
bitnet(prefered) ken@orion.bitnet  or ken@orion.njit.edu

yodaiken@chelm.cs.umass.edu (victor yodaiken) (07/14/90)

In article <1990Jul13.173331.22918@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>
>My personal feeling is that the "hot spot" principle has become accepted
>dogma on the basis of rather few actual cases, the most infamous being
>the XPL compiler that spent all of its time skipping trailing blanks in
>card images.  I suspect that a lot of hot-spot examples are the result
>of something like this:  stupid code, or a stupid data format.
>
>Our experience with C News was that there never was a single massive hot
>spot.  Typically the profile would show half a dozen major contributors
>to execution time and then a long tail of smaller ones.  Often, the
>way to fix it was *not* to optimize the profile leaders, but to look at
>where they were being called from and why, and streamline the algorithms
>in the callers.
>-- 

That was my experience in trying to speed up an o.s. Major speed losses
seemed to be the result of propagation of minor speed losses.  Rather than
hot spots, we found long execution paths and repeated recomputation of
data that could be cached. 
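
A minimal sketch of that caching pattern (names hypothetical, not from
the actual o.s.): a value recomputed on every trip down a long
execution path gets computed once and invalidated only when its inputs
change.

    extern long recompute(void);    /* hypothetical: the expensive path */

    static int cache_valid = 0;
    static long cached;

    long get_value(void)
    {
        if (!cache_valid) {
            cached = recompute();   /* formerly done on every call */
            cache_valid = 1;
        }
        return cached;
    }

    void invalidate(void)           /* call when the inputs change */
    {
        cache_valid = 0;
    }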

henry@zoo.toronto.edu (Henry Spencer) (07/15/90)

In article <1990Jul12.182552.26715@cbnewsi.att.com> yam@cbnewsi.att.com (toshihiko.yamakami) writes:
>> Many of the optimization problems belong to the "NP-complete" or even
>> the "Turing-uncomputable" categories...
>
>I always wonder why human beings can do 'Turing-uncomputable' computation.
>If human beings can do it, why can't a program simulate it?

The answer is that human beings don't solve the entire problem, only a
subset of it.  And really, that is all you need.  Programmers do not
write arbitrary programs.  The NP-completeness or Turing-uncomputability
of an optimization problem is seldom of much interest; all it tells you
is that there is no halfway-efficient fully-optimal fully-general solution.
In practice, the fully-optimal fully-general solutions usually are only
halfway efficient anyway, and hence nobody uses them.  It's much better
to use something that sacrifices theoretical optimality or full generality
but runs fast on real input.
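
Register allocation, mentioned earlier in this thread, is a handy
example of that tradeoff: optimal coloring of the interference graph
is NP-complete, but a greedy pass is cheap and usually close enough.
A toy sketch (no real compiler is this naive, but the shape is right):

    #define MAXVARS 64
    #define NREGS   8

    /* interfere[i][j] != 0 means variables i and j are live at the
       same time (the matrix is assumed symmetric).  On return, reg[i]
       is a register number, or -1 if variable i was spilled to memory.
       Returns the number of spills. */
    int greedy_alloc(int nvars, char interfere[MAXVARS][MAXVARS],
                     int reg[MAXVARS])
    {
        int i, j, r, spills = 0;

        for (i = 0; i < nvars; i++) {
            char used[NREGS];
            for (r = 0; r < NREGS; r++)
                used[r] = 0;
            for (j = 0; j < i; j++)       /* registers held by neighbors */
                if (interfere[i][j] && reg[j] >= 0)
                    used[reg[j]] = 1;
            reg[i] = -1;
            for (r = 0; r < NREGS; r++)   /* first free register wins */
                if (!used[r]) { reg[i] = r; break; }
            if (reg[i] < 0)
                spills++;
        }
        return spills;
    }

No guarantee of optimality, but the running time is proportional to
the size of the interference matrix rather than exponential in it.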
-- 
NFS:  all the nice semantics of MSDOS, | Henry Spencer at U of Toronto Zoology
and its performance and security too.  |  henry@zoo.toronto.edu   utzoo!henry

ssr@taylor.Princeton.EDU (Steve S. Roy) (07/15/90)

Bruce Karsh writes:

>>In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
>>
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
>
>This is often stated.  Is there any evidence for it?

Oh yes.

Naturally it depends on what type of code you're talking about.  There
is a strong tendency for a person who works on a particular type of
program to take the things s/he does as completely typical and to
assume their problems are the same as everyone else's.  This is very
much not the case -- the TYPES of bottlenecks people find vary a lot.
Scientific number-crunchers are not like editors are not like
compilers, etc.

Much of my experience is with scientific number crunchers.  It is VERY
VERY VERY common that >90% of the execution time of such programs is
spent in <<10% of the code, even just a few dozen lines.  It is also
VERY VERY VERY true that speed is often of paramount importance.

For the types of programs I use, those conditions are true more often
than not.

Steve Roy
ssr@acm.princeton.edu

rwallace@vax1.tcd.ie (07/16/90)

In article <64045@sgi.sgi.com>, karsh@trifolium.esd.sgi.com (Bruce Karsh) writes:
> In article <628@dg.dg.com> publius@dg-pag.webo.dg.com (Publius) writes:
> 
>>Many "large" programs spend most of the CPU time in one or a few relatively
>>"tiny" routines.
> 
> This is often stated.  Is there any evidence for it?  What does the
> distribution of time look like for, say, a compiler?  I'd expect
> that there would be lots of time-sinks.  The lexical analyzer, the
> symbol table hash routine, the symbol table lookup routine, the common
> subexpression finder, etc., could all be candidate time-sinks.
> 
> How about a text editor?  I could think of a few dozen potential time-sinks
> in a text editor.
> 
> When I've profiled programs, I've often found that the time is spent mostly
> in a few routines, but this isn't always the case.  How often is it true that
> nearly all of the time is spent in only a few routines?  How often are these
> routines tiny?
> 

Usually this is the case.  The compiler is an exception because it's an
all-round integer-CPU-intensive program, but I disagree with you on the text
editor.  I was writing a simple text-editing facility for an accounting
package in Turbo C on an MS-DOS machine a couple of weeks ago.  The editor
ended up spending almost all its time either paging or scrolling.  Paging
partly involves setting memory to a particular value, which ended up as a
3-line loop of C code, so those 3 lines of code were taking about 30%-40%
of the total execution time.  Now the 80x86 is a pretty foul processor for
memory writes and copies, but by examining the assembly language output and
changing 2 lines of C code I enabled the compiler to double the speed of
the code for that inner loop.  Lo: a 20% speed improvement by changing 2 lines.
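
The original code isn't shown, but a plausible reconstruction of that
kind of two-line change: a byte-at-a-time fill loop, versus handing
the fill to memset(), which the compiler or library can implement with
the 8086 block-store string instructions.

    #include <string.h>

    #define PAGESZ 2000
    static char page[PAGESZ];

    void clear_page_slow(void)
    {
        int i;
        for (i = 0; i < PAGESZ; i++)   /* one byte per loop iteration */
            page[i] = ' ';
    }

    void clear_page_fast(void)
    {
        memset(page, ' ', PAGESZ);     /* one REP STOS-style block fill */
    }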

Russell Wallace, Trinity College, Dublin
rwallace@vax1.tcd.ie
"To summarize the summary of the summary: people are a problem"

staff@cadlab.sublink.ORG (Alex Martelli) (07/19/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>...
>An example with VS-FORTRAN on an IBM-3090 with a vector processor.
>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

d.  NAG's implementation, fully Fortran but "properly" coded with
    a block-submatrix algorithm: within 10% of ESSL!

That's for linear-algebra stuff...  Back in '88, I was working in the
same office at the IBM Italy Scientific Center as the NAG guy who was
visiting there and implementing those routines; their performance gains
were proving transferable to other machines with complex storage
hierarchies, by the way, via a simple variation in the "good-block-size"
parameter.  On the other hand, I suspect the wondrously fast and precise
library of Fortran intrinsics, coded in assembler with lots of clever
table lookups and bit-twiddling, would suffer a LOT if it had to be
reimplemented in Fortran itself!  But in most cases, proper algorithms,
and, more specifically, appropriate data-access patterns, are THE key
to good numerical-computing performance -- and Fortran remains adequate
for expressing such algorithms and data-access patterns.
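
To make the data-access point concrete, here is a sketch (in C rather
than Fortran, and not the NAG code itself) of the block-submatrix idea:
the same O(n^3) matrix multiply, restructured so that each BLOCK x BLOCK
tile is reused while it is still resident in the fast level of the
storage hierarchy.  BLOCK is the tunable "good-block-size" parameter
mentioned above.

    #define N     512   /* must be a multiple of BLOCK */
    #define BLOCK 32    /* tune per machine's storage hierarchy */

    /* c += a * b; the caller zeroes c first */
    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
    {
        int i0, k0, j0, i, k, j;

        for (i0 = 0; i0 < N; i0 += BLOCK)
            for (k0 = 0; k0 < N; k0 += BLOCK)
                for (j0 = 0; j0 < N; j0 += BLOCK)
                    for (i = i0; i < i0 + BLOCK; i++)
                        for (k = k0; k < k0 + BLOCK; k++)
                            for (j = j0; j < j0 + BLOCK; j++)
                                c[i][j] += a[i][k] * b[k][j];
    }

Retuning for another storage hierarchy is then a one-line change to
BLOCK.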
-- 
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 45, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only; any time of day or night).

jmd@decwrl.dec.com (John Danskin) (07/27/90)

I just wanted to put my two cents in on this.  A lot of people make
generalizations based on the kind of coding they do, and state that
the kind of coding other people are paid to do is stupid.

I work on low-level 3D graphics for Digital Equipment in Palo Alto.
We build medium-end 3D workstations (whatever those are).  Somewhere in
the workstation, there is always a geometry accelerator (if there
isn't, some other group does the work).  So far, for us, the geometry
accelerator is some kind of RISC chip with hot floating point and a
reasonably good C compiler.  We have used Weitek Accell parts and the
Intel i860.

I have been the person who examines chips, and says how fast they will
go, so the people who decide how fast we need to go and how much we
want to pay for a chip can make up their minds.

To do this, I spend a few weeks writing machine/assembler code for the
chip for the graphics primitives we think are interesting, and count
the bus transactions, and talk to the chip company and the designers
etc.  Eventually, I have some pretty good pieces of code that
represent a speed of light for the chip and the proposed memory
subsystem and communications bottleneck. This might take a few weeks.
We evaluate a lot more chips than we build systems out of.

When we finally decide to build a system, we start out by coding
everything in C (or copying the old C code from the last project). It
runs with simulated hardware and communications at a fairly hokey
level for a while while we wait for prototype real hardware.
Eventually, we get real hardware, port to it, and the stuff runs in C.
With pretty good C at this point (the algorithms are straightforward,
well known, and we know them by now), the code runs between 10 and 30
times slower than the speed of light for the configuration, which is,
of course, what we are committed to achieving.  The next 90% of the
schedule is budgeted to optimization.  We achieve this improvement by:

	recoding in assembler.  Some routines are recoded several times with
	slightly different tradeoffs and degrees of aggressiveness in loop
	unrolling and inlining (see the sketch after this list).  This is
	extremely expensive.
	
	Incidentally, we do start with a few really important expensive
	routines, but those unimportant data shuffling, branching,
	procedure calling routines rapidly get important as we run out
	of things to do to those nifty little floating point routines.

	Sometimes we just recode in C a few times until we hit on a way to
	get the compiler to generate reasonable code.

	We seem to get about 5X performance out of this phase.

	Other things.  I wrote a cache-optimizing linker.  Using a call
	graph and some human intervention (a hand-generated sed script),
	it reorders modules so that there will be no cache misses
	in a direct-mapped cache (it knows the size) for code paths
	that we think are important.  Usually these are the paths for
	the graphics primitives, with some of the infrequent stuff edited
	out of the call graph, and things that cannot happen in the same
	primitive flagged as OK to overlap in the cache.

	On the system I developed this for, we got a 2X performance increase.
	(Before, we had about 90% hits, with a 5-cycle penalty for a miss;
	since one miss implies another if you are thrashing a direct-mapped
	cache, the penalty was really 10 cycles.)  10% of operations taking
	10 times as long means the whole thing takes twice as long
	(0.9 + 0.1*10 = 1.9).

	What clued us in to this cache problem was a routine that we recoded
	in assembler.  We brought this critical routine from 400 cycles to 40
	cycles, but the code ran slightly slower.  We looked at the link map,
	and sure enough, the resize had caused it to overlap with this other
	critical routine...  So recoding in assembly CAN slow down your code,
	but usually there is a lesson more interesting than "assembly coding
	is bad".

	Usually we improve the hardware a little bit as we have better
	data on which parts *really* are bottlenecks.
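
	As promised above, a C-level sketch of the unrolling tradeoff
	(illustrative only -- the real work was in assembler): unrolling
	gives the scheduler several independent floating-point operations
	to overlap, at the cost of code size, which is exactly what bit
	us in the cache story above.

	    void scale(float *v, float s, int n)
	    {
	        int i;

	        for (i = 0; i + 4 <= n; i += 4) {  /* unrolled by four */
	            v[i]     *= s;
	            v[i + 1] *= s;
	            v[i + 2] *= s;
	            v[i + 3] *= s;
	        }
	        for (; i < n; i++)                 /* leftover elements */
	            v[i] *= s;
	    }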


Why do we do this?
	Basically because we are a hardware company, and people
	view the part of the system which we produce as part of the
	hardware, and because it seems to be cheaper than including
	10X faster hardware on every system and charging the same.

	People do full custom VLSI too.  If everybody refused to get
	their hands dirty and look at some bits, nobody would build
	hardware (a much lower-level process than assembly or
	microcoding).

	The development costs are staggering.  The maintenance costs
	are staggering, but still it is all quite a bit cheaper than
	10000 copies of 10X faster hardware.


	Another reason is that our code is floating-point intensive,
	but not exactly vectorizable, and the compilers (MIPS included)
	do a relatively poor job of scheduling around multicycle
	instructions.


	The final reason is that if we didn't do it, nobody would buy
	our boxes, and I would have to get another job, and I *like*
	working here.


Incidentally, my favorite language this year is Scheme.  It doesn't
run as fast as C, but sometimes you need to implement programs fast
instead of implementing fast programs.
--
John Danskin				| jmd@decwrl.dec.com  or decwrl!jmd
DEC Advanced Technology Development	| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94301			| I do not speak for DEC.