[comp.lang.misc] Compiler Costs

zrra07@backus (Randall R. Appleton) (07/05/90)

I have a simple question:  Does anyone know what sort of speed-up
one gets by going from a good implementation of some algorithm in a
third generation language (C, Fortran) and a good optimizing compiler
to hand-coded assembly?

In other words, if I take my average well written program, compiled
with a good optimizing compiler, and re-write it in assembler, what sort
of speedup should I expect to see?

Thanks,
Randy

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?
>
>Thanks,
>Randy

The short answer is: Depends.

Okay, you ask, on what does it depend?

Well, it depends on just about every single adjective you used in
your question:
  How good is "a good implementation"?
  Which algorithm is the "some algorithm"?
  Which "good optimizing compiler"?
  Whose average? Over which set of programs?
  "Well written" for what purpose?
  Who will do the rewriting?
  For which computer?

Basically, performance is like anything else: the variation between
people (using identical tools) is larger than the variation between tools.
So the "optimum" strategy is to get the best person for the job.

You really will have to specify more details to get any meaningful
answer to your question. For example, there are some Fortran compilers
that are extremely good; for many programs, the code is just about
optimal.

The meaning of "well written" is very context dependent. Some people
insist on clear, understandable code; others want it fast, still others
want it to be robust and bullet-proof. Which properties does your
program have and which properties do you want to preserve in assembler?

--
Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh185
Me? Represent other people? Don't make them laugh so hard.

cik@l.cc.purdue.edu (Herman Rubin) (07/06/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> on gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 
> In other words, if I take my average well written program, compiled
> with a good optimizing compiler, and re-write it in assembler, what sort
> of speedup should I expect to see?

This only looks simple.  The results might even be drastically different
on different machines.  Even the choice of algorithm might depend quite
heavily on the specific machine.  The infamous 4.2BSD frexp() was an 
example of a problem which can be efficiently (and even very simply) coded
in assembler, but which cannot be efficiently handled by such limited 
languages as those mentioned.  The operations missing from these languages
are pack and unpack; utterly trivial, but not there.
In fact, even less is needed on most machines.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

toma@tekgvs.LABS.TEK.COM (Tom Almy) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

Simple Answer: That depends...

If the program implements an algorithm that is well suited for the compiler
(standard matrix multiplication in Fortran, for example) or is a program that
is a standard benchmark for which *every* compiler writer optimizes, you
might not see any improvement at all. In fact it might run slower if you are
not as clever as the compiler writer!

On the other hand, if the algorithm cannot be efficiently represented in
the language (string manipulation in Fortran, complex numbers in C) or there
are special cpu instructions you can take advantage of that the compiler
doesn't, then you can get major improvements, even order of magnitude.

In general, there is only a small portion of the code that uses most of the
execution time, and by coding those portions in assembler, you can get most
of the assembler performance while still enjoying the convenience of a 
high level language for the bulk of the code.

If you are concerned about code size, there still is a major improvement
to be offered with assembler, typically because you can custom tailor 
subroutine calling conventions and register usage. But then again many
modern compilers let you do the same.

(I hope this won't be the start of some "language war" :-)

Tom Almy
toma@tekgvs.labs.tek.com
Standard Disclaimers Apply

henry@zoo.toronto.edu (Henry Spencer) (07/06/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?

It can be negative.  Good optimizing compilers can often do a better
job on large bodies of code than humans can.  Small code fragments
often do better when carefully crafted by humans, although this is
a very time-consuming activity.

Improving the algorithm is usually a more cost-effective way of boosting
performance.
-- 
"Either NFS must be scrapped or NFS    | Henry Spencer at U of Toronto Zoology
must be changed."  -John K. Ousterhout |  henry@zoo.toronto.edu   utzoo!henry

gillett@ceomax..dec.com (Christopher Gillett) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>

This is, among other things, an interesting question, but I certainly
wouldn't characterize it as "simple" :-).  There are a huge number of
factors to consider in making this determination, and many of them
are subjective and/or incalculable.  Here are some things that you
should consider:

     1.  Nature of the program:  Something that is fairly linear and 
         spends all its time calling system functions is not likely
         to be sped up by conversion to assembler.  Something fairly
         complex that uses lots of control transfers, and is reasonably
         non-obvious (itself a very subjective term) is more likely to
         be improved.  

     2.  How good is good?  The quality of the compiler that you are
         using is significant.  Compilers range from utter trash to
         seemingly clairvoyant in their abilities to generate efficient
         code.  Since there's no really good metric that you can apply
         to an arbitrary compiler to determine its effectiveness, you're
         left making a judgement call about whether or not the compiler
         is a better assembly language programmer than you.

     3.  How talented are you?  The art of assembly language programming
         is rapidly dying out.  Many people are really gifted high level
         programmers who fall apart in the assembly world (absolutely no
         jabs at anybody at all intended here, btw :-) ).  Problems are
         solved much differently in the two different worlds.  If you try
         to do a straight translation from high level to low level, you're
         not doing much better than the compiler (in fact, probably much
         worse).  However, if you are a gifted machine-level hacker, and
         you really understand the program you want to convert (like,
         did you write it yourself?), then you might gain significantly
         by converting the code. 

These are just the first few things that pop into mind.  My opinion on the
whole matter is that you should consider language issues long before putting
down any code.  The very nature of the program should dictate which language
to use.  Further, it's been my observation that code conversion jobs (whether
from one high level language to another, or from a high level to a low level,
or low to high) are expensive, time consuming, and not very rewarding.

>Thanks,

You're welcome!

>Randy

/Chris
---
Christopher Gillett               gillett@ceomax.dec.com
Digital Equipment Corporation     {decwrl,decpa}!ceomax.dec.com!gillett
Hudson, Taxachusetts              (508) 568-7172

patrick@convex.com (Patrick F. McGehearty) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question: ...
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

Since I have done this for a living before, I will provide my answer:
Which is: "it all depends"
(1) How much algorithm tuning have you already done (more gain is
	usually available here than any amount of assembly tuning)
(2) How much code/language/machine tuning have you done?
	This step is frequently overlooked.  Look at the output of
        the optimizing compiler, revise the high level language by the
        use of declarations or temporaries or code rearrangement to
        generate better assembly (a short C sketch of this follows the list).
(3) How irregular is the architecture you are compiling for?
	i.e. it took a lot longer to have "good" optimizing compilers
        for the 80x86 line than it did for the 680x0 line.
(4) How good is a "good" optimizing compiler?
(5) How good are you at writing excellent assembly for the architecture
	in question?  Do you know the details and quirks of the
	timings of the architecture?  What about future generations
	of the same machine?
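
To make item (2) concrete, here is a hedged, editorial C sketch (the
struct and all names are invented, not taken from any post in this
thread).  Giving the compiler explicit temporaries for loop-invariant
values is the kind of source-level tuning being described:

        struct img { double *data; int base; double scale; };

        /* Before: a weak optimizer may reload p->scale and recompute
           p->base + i on every trip through the loop. */
        void scale_before(struct img *p, double *out, int n)
        {
            int i;
            for (i = 0; i < n; i++)
                out[i] = p->data[p->base + i] * p->scale;
        }

        /* After: the temporaries make the invariants explicit, so even
           a mediocre compiler keeps them in registers. */
        void scale_after(struct img *p, double *out, int n)
        {
            const double s = p->scale;
            const double *d = p->data + p->base;
            int i;
            for (i = 0; i < n; i++)
                out[i] = d[i] * s;
        }

With a strong optimizer the two versions compile to the same loop; the
rewrite only pays off when the compiler is weak, which is the point of
step (2).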

It is more common than people realize to have assembly code which is
much slower than the best high level language program as well as
much harder to maintain and enhance in the future.  Going to assembly
too early makes it harder to radically revise an algorithm.

In the best of cases, 0-4% gain might be obtained.
For many combinations of compilers and assembly code writers, 10-20%
	is available.
For the real turkey compilers and really wizard tuners, >100% has been seen.

So, I repeat, "it all depends"

tif@doorstop.austin.ibm.com (Paul Chamberlain) (07/07/90)

In article <2317@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
>> In other words, if I take my average well written program, compiled
>> with a good optimizing compiler, and re-write it in assembler, what sort
>> of speedup should I expect to see?
>This only looks simple.

Yes, yes, yes.  I'm sure he knows that the real answer is "it depends."
I'd say between 1 and 10 times faster, probably between 2 and 3.
Anyone will tell you though that you usually only need to
rewrite a couple dozen lines to get your 2-3 times faster.

P.S.	Another good example is parity.  Takes about 25 instructions
	in C, about 1 in assembly (using intel anyway).
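
As an editorial illustration (not from the original post; the function
name is invented), this is roughly the portable C being described: a
straightforward loop costs a few instructions per bit, which is where
an instruction count in the twenties comes from.

        unsigned parity_naive(unsigned x)
        {
            unsigned p = 0;
            while (x) {             /* a shift, mask, and xor per bit */
                p ^= x & 1u;
                x >>= 1;
            }
            return p;
        }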

Paul Chamberlain | I do NOT represent IBM	  tif@doorstop, sc30661@ausvm6
512/838-7008	 | ...!cs.utexas.edu!ibmaus!auschs!doorstop.austin.ibm.com!tif

nmm@cl.cam.ac.uk (Nick Maclaren) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>I have a simple question:  Does anyone know what sort of speed-up
>on gets by going from a good implementation of some algorithm in a
>third generation language (C, Fortran) and a good optimizing compiler
>to hand-coded assembly?
>
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?


It is impossible to predict, in general, but the following rules of thumb
may be useful.  Consider the minimal program to take unit time (or space);
the following multiplicative factors are good guidelines:

Competence of assembler programmer    1x  -  5x
Competence of 3 GL programmer         1x  -  3x
Suitability of 3 GL for algorithm     1x  -  10x 
Competence of compiler's code         1x  -  3x
Degree of optimisation                1x  -  10x 

You can therefore expect the assembler to be anything from 900 times
faster to 5 times slower!  This excludes even more drastic factors (in
either direction), caused by gross mistakes or misunderstandings (e.g.
dropping into a floating-point emulator to do 2-D array indexing).

Nick Maclaren
University of Cambridge Computer Laboratory
nmm@cl.cam.ac.uk

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>... if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

The speedup will vary.  A lot. There are rare examples where
well-written assembler code was slower.

-- 
Don		D.C.Lindsay

mwm@raven.pa.dec.com (Mike (Real Amigas have keyboard garages) Meyer) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
   In other words, if I take my average well written program, compiled
   with a good optimizing compiler, and re-write it in assembler, what sort
   of speedup should I expect to see?

Lots of good answers, but everyone missed my favorite reason for
compilers generating lousy code - they have a machine that's
ill-matched to the language in question. The best example is to take a
language that really expects there to be a stack, and compile it for a
machine that doesn't have one. An excellent compiler will decide which
routines aren't possibly recursive and compile them into code that
doesn't use the stack, but other routines will have to simulate the
stack in some (probably slow & ugly) manner. I never saw a compiler
that smart, but it's been years since I worked on such a machine.

Given that all routines are compiled as if they might be recursive,
recoding in machine language in such a case - even if the coder is
only moderately competent - to use fixed memory locations for local
variables and pass arguments in registers instead of doing those
things on a fake stack can easily run an order of magnitude faster,
and a factor of 2 smaller.
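
A hedged, editorial source-level illustration of the same trade-off
(the routine and names are invented, not from the post): a routine
known to be non-recursive can keep its locals in fixed storage instead
of a simulated stack frame, at the cost of reentrancy.

        int sum_squares(const int *v, int n)
        {
            static int i, acc;      /* fixed locations, no fake stack frame */
            acc = 0;
            for (i = 0; i < n; i++)
                acc += v[i] * v[i];
            return acc;             /* but the routine is no longer reentrant */
        }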

	<mike
--
So this is where the future lies			Mike Meyer
In a beer gut belly; In an open fly			mwm@relay.pa.dec.com
Brilcreamed, acrylic, mindless boys			decwrl!mwm
Punching, kicking, making noise

ssr@taylor.Princeton.EDU (Steve S. Roy) (07/07/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> on gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 
> In other words, if I take my average well written program, compiled
> with a good optimizing compiler, and re-write it in assembler, what sort
> of speedup should I expect to see?


It depends A LOT on the processor involved, the particular memory
system, the i/o and so on.

Much of the point of the whole RISC thing is to make it easier for
compilers to get a reasonable fraction of the speed of a hand-coded
code fragment.  I'm told that the MIPS compiler generates code that is
nearly as good as a human would write, and this is largely due to the
fact that the assembly language is tuned for the compiler.

To illustrate the opposite end of the spectrum, I've been working on
writing high speed codes for the i860 recently, and current compiler
technology is VERY FAR from equaling hand coding on that style of
processor.  There are a couple of reasons for this, but they come down
to the need for large scale restructuring (strip mining and other
cache handling things), dealing with a fairly small register set, and
a plethora of pipelines (read from memory, write to memory, multiply,
add), and a couple of instructions that have been left out.  The end
result is that there is often a SPEEDUP FACTOR OF 10 OR MORE to be
gained for compute intensive applications with intelligent hand coding.

Steve

oz@yunexus.yorku.ca (Ozan Yigit) (07/07/90)

In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

When I was a young & budding assembler hack, I tried this on several
compiler outputs. At that time, VMS fortran and C compilers were
considered very good. I was able to out-optimize them by as much as 30%.
[As I said, I was young :-)] These days, compilers like GCC come very
close, and while I am convinced that for a moderate chunk of code, a
determined hacker can still out-optimize a compiler, I fail to see its
value except in some very special cases. [I had to improve a chess program
once, by re-writing its bitmap routines (which got called zillions of
times) in assembler. It was about 30 lines of code, and made about 40%
speed improvement] 

oz

amull@Morgan.COM (Andrew P. Mullhaupt) (07/07/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> on gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 
> In other words, if I take my average well written program, compiled
> with a good optimizing compiler, and re-write it in assembler, what sort
> of speedup should I expect to see?

On many machines, like the CDC 6600 and the R/6000, the FORTRAN compilers
are going to get as much or more than you are. You will normally
produce very slow code on a 6600 unless you really understand the
scheduling of the functional units. The FORTRAN compiler for the
R/6000 produces such good code that IBM has decided to implement
ESSL (a major assembly code on the 3090) in FORTRAN on the R/6000.
Their reason is that they can't beat the FORTRAN code.

There are machines like the i860 where hand coded assembler is a lot
faster than most compiled code, but if people still have i860 compilers
ten years from now, you should expect the compiler code to be faster
than hand generated, since the i860 is so complicated. Once you figure
out how to take advantage of the weird hardware, the compiler _always_
takes advantage. The assembler language programmer is not as
consistent.

Later,
Andrew Mullhaupt

tbray@watsol.waterloo.edu (Tim Bray) (07/08/90)

zrra07@backus (Randall R. Appleton) writes:
>In other words, if I take my average well written program, compiled
>with a good optimizing compiler, and re-write it in assembler, what sort
>of speedup should I expect to see?

Isn't "average well written program" an oxymoron?

The speedup, as several people have pointed out, is difficult to predict.  In
general, however, the *cost-effectiveness* is liable to be negative.  If the
program is used heavily enough that hard-won speedups look attractive, it
will require regular maintenance as people's needs change.  Performing this
maintenance on assembler code is apt to be dramatically more difficult and
burn tons of expensive programmer time.

Second, a *really* cost-effective way to make a program faster is to run
it on this year's hot chip, probably twice as fast as your current computer
if it's > 18 months old.  But if you've committed to one architecture by
going to assembler, you've closed that door.

Third, as some have already said, but it can't be said too often, the
expected performance improvement per unit effort from repeatedly profiling
your code and whittling away at the algorithms in the critical areas is
typically dramatically greater than that for assembly hacking.  And the
results are portable.

Yup, assembler hacking is a dying art, thank goodness.  But I'm glad I'm
old enough to have done it.  Doing serious work with VAX assembler was
out and out *fun* - I suppose hunting dinner with a stone knife was too.

Cheers, Tim Bray, Open Text Systems

cik@l.cc.purdue.edu (Herman Rubin) (07/08/90)

In article <103726@convex.convex.com>, patrick@convex.com (Patrick F. McGehearty) writes:
> In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
> >I have a simple question: ...
> >In other words, if I take my average well written program, compiled
> >with a good optimizing compiler, and re-write it in assembler, what sort
> >of speedup should I expect to see?

			.............................

> It is more common than people realize to have assembly code which is
> much slower than the best high level language program as well as
> much harder to maintain and enhance in the future.  Going to assembly
> too early makes it harder to radically revise an algorithm.
> 
> In the best of cases, 0-4% gain might be obtained.
> For many combinations of compilers and assembly code writings 10-20%
> 	is available.
> For the real turkey compilers and really wizard tuners, >100% has been seen.

There are very simple situations where no compiler, unless it has already been
provided with the assembler code, has a hope of doing a reasonable job in the
language.  Since the 4.2BSD code, written in portable C, is a classic example,
I will remind the readers of it, as well as adding a few comments at the end.

The function is frexp(x,&n).  The argument x is a double precision floating
point number.  The result is to be that number scaled down by a power of 
2 so that the absolute value is between .5 and 1, with the power stored in
the location &n.

Now I submit it will take a real turkey machine language user to match the
slowness of any program written in C or Fortran, USING ONLY THE OPERATIONS
OF THOSE LANGUAGES.  Also, inlining, with n being placed in the register
where it is to be used, is likely to cut the running time by more than
half.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) (07/08/90)

From article <1797@apctrc.UUCP>, by zrra07@backus (Randall R. Appleton):
> I have a simple question:  Does anyone know what sort of speed-up
> on gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?
> 

An example with VS-FORTRAN on an IBM-3090 with a vector processor.

a.  plain fortran matrix		1.0
b.  full optimized fortran		0.3
c.  IBM's ESSL hand coded library	0.1 or better

As another example, you could track down the standard LINPACK
benchmarks from net-lib@ornl.gov.  The report there gives a good
summary of how LINPACK performs under different conditions.

Steve

usenet@nlm.nih.gov (usenet news poster) (07/09/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>in response to zrra07@backus (Randall R. Appleton):
>> I have a simple question:  Does anyone know what sort of speed-up
>> on gets by going from a good implementation of some algorithm in a
>> third generation language (C, Fortran) and a good optimizing compiler
>> to hand-coded assembly?
>
>An example with VS-FORTRAN on an IBM-3090 with a vector processor.
>
>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

It is my understanding that on the RS/6000, ESSL is implemented in Fortran,
and IBM claims that most ports have experienced zero or negative gains
in attempting to hand code assembler on the machine.  There have been a 
number of references to "the ill conceived notions" of higher level 
languages in this thread, but it is not at all clear to me what these are
supposed to be.  C was designed as a higher level alternative to assembler.
On a RISC machine with a good compiler, a C programmer willing to work 
on his code ought to be able to do almost anything hand coded assembler can.
This may involve looking at assembly output, recoding routines a few times
and testing alternative algorithms, but in the end you still have C (that
will run on another processor even if not as fast).

On a complicated processor (CISC, non-orthogonal registers, segmented
addresses etc.) assembler may do better because the compiler writers made
simplifying assumptions about the processor, making some instructions and
address constructs inaccessible.  Seems like Intel processors will be the
last bastion of assembly language.

>Steve

David States

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/09/90)

In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>I have a simple question:  Does anyone know what sort of speed-up
>>on gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly?
>
>It can be negative.  Good optimizing compilers can often do a better
>job on large bodies of code than humans can.  Small code fragments
>often do better when carefully crafted by humans, although this is
>a very time-consuming activity.

Basically, I agree.  Given the caveat that good compilers are quite
rare, I have seen good compilers consistently outperform handwritten
assembly code -- but humans could usually do local tweaks to compiler
generated code to speed it up even a little more.  IMHO, this is
primarily because compilers usually don't do a very good job of
modeling irregular or complex instructions, whereas humans often are
great at using those, but write inferior code for the more mundane
operations (e.g., they do a poorer job of allocating registers, aren't
consistent about taking advantage of some properties, etc.).  Notice
that RISC architectures don't have many "funny" instructions, so
compilers for them can do particularly well against humans....

Basically, optimized code generation is a search procedure -- if the
compiler knows to search for the right things (which is learned in
part by observing what the best human programmers do), the fact that
computers can search faster and more consistently than humans will
eventually make compiler technology generate better code than a
typical human every time.  Think about how computers play chess -- at
first, clever heuristics were best, but lately the winners have
focussed more on very fast search techniques, and now they are
generally better than most human chess players.  Yes, the very best
human chess players still win, but they are a very small group, and
each game played so carefully is a major effort (i.e., costly).  So it
is/will be for compilers versus humans writing assembly language
programs....

In any case, for good C and Fortran compilers using 1970s compiler
technology and generating code for a PDP11ish target, I'd say a factor
of 2.5x slower than carefully written assembly code is typical.  For a
compiler using 1980s compiler technology (there are very few of these)
and targeting a RISC machine, I'd say it's common to be 30% or more
*faster* than hand-written code (assuming a dumb, non-optimizing
assembler).  Of course, there are billions and billions of caveats on
these numbers....

						-hankd@ecn.purdue.edu

peter@ficc.ferranti.com (Peter da Silva) (07/09/90)

In article <2324@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> The function is frexp(x,&n).  The argument x is a double precision floating
> point number.  The result is to be that number scaled down by a power of 
> 2 so that the absolute value is between .5 and 1, with the power stored in
> the location &n.

I suspect that something like this will work:

	double frexp(x,n)
	double x;
	int *n;
	{
		x = NORMALISE(x);
		*n = ((long *)&x)[EXPONENT_WORD] & EXPONENT_BITS;
		((long *)&x)[EXPONENT_WORD] &= ~EXPONENT_BITS;
		return x;
	}

You didn't say it had to be portable C. For most machines this will come
down to 2 or 3 instructions, given a decent compiler.
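
An editorial restatement of the same idea in modern terms, assuming
IEEE 754 doubles; the function name and constants are assumptions, and
zeros, denormals, infinities and NaNs are ignored.  It is a sketch, not
the 4.2BSD implementation:

        #include <stdint.h>
        #include <string.h>

        double frexp_sketch(double x, int *n)
        {
            uint64_t bits;
            memcpy(&bits, &x, sizeof bits);           /* reinterpret the double     */
            *n = (int)((bits >> 52) & 0x7FF) - 1022;  /* unbias: |result| in [.5,1) */
            bits = (bits & ~(0x7FFULL << 52))         /* clear the exponent field   */
                 | (0x3FEULL << 52);                  /* and set it to 2^-1         */
            memcpy(&x, &bits, sizeof x);
            return x;
        }
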
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

cik@l.cc.purdue.edu (Herman Rubin) (07/09/90)

In article <1196@s8.Morgan.COM>, amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> > I have a simple question:  Does anyone know what sort of speed-up
> > on gets by going from a good implementation of some algorithm in a
> > third generation language (C, Fortran) and a good optimizing compiler
> > to hand-coded assembly?
> > 
> > In other words, if I take my average well written program, compiled
> > with a good optimizing compiler, and re-write it in assembler, what sort
> > of speedup should I expect to see?
> 
> On many machines, like the CDC 6600 and the R/6000, the FORTRAN compilers
> are going to get as much or more than you are. You will normally
> produce very slow code on a 6600 unless you really understand the
> scheduling of the functional units. The FORTRAN compiler for the
> R/6000 produces such good code that IBM has decided to implement
> ESSL (a major assembly code on the 3090) in FORTRAN on the R/6000.
> Their reason is that they can't beat the FORTRAN code.

In the far distant future, there may be languages in which algorithms as
I can express them NOW can be reasonably coded.  I happen to be familiar
with the scheduling of the functional units on the 6600, and would very
definitely change my algorithms to take advantage of that.  It is quite
likely that the HLLs do not have the capabilities of recognizing the
modified algorithms.

The programmer who is not given that scheduling information cannot produce
near-optimal code.  The compiler is not going to be able to revise the
algorithm other than trivially.  An intelligent human with an optimizing
_assembler_ which provides feedback will do the best.

> There are machines like the i860 where hand coded assembler is a lot
> faster than most compiled code, but if people still have i860 compilers
> ten years from now, you should expect the compiler code to be faster
> than hand generated, since the i860 is so complicated. Once you figure
> out how to take advantage of the weird hardware, the compiler _always_
> takes advantage. The assembler language programmer is not as
> consistent.

The compiler cannot take advantage of the weird hardware to change the
algorithm.  The best practical solution is to enable the programmer to
provide some method of trying out many (maybe even thousands or millions)
algorithmic versions and seeing which turn out better.  There probably 
should be interactions between the processor and the programmer as to
whether certain modifications can be made.  But these are things the
compiler alone cannot do.


-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

rmarks@KSP.Unisys.COM (Richard Marks) (07/09/90)

I wrote a version of UUDECODE for the IBM PC.  It is used by many 
internet users.  It is written in Turbo Pascal 5.5.  I coded it very
carefully with much consideration for speed.

Then I rewrote about 5% of the code, the inner loops, in assembler and
got a 50% speed up.

Moral:  Consider implementing in your favorite higher level language;
but then recode the inner loops in assembler for performance.

Regards,
Richard Marks
rmarks@KSP.unisys.COM

atk@boulder.Colorado.EDU (Alan T. Krantz) (07/09/90)

In article <1990Jul8.230954.18881@ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>In article <1990Jul6.161158.1297@zoo.toronto.edu>:
>>In article <1797@apctrc.UUCP> zrra07@backus (Randall R. Appleton) writes:
>>
>>It can be negative.  Good optimizing compilers can often do a better
>>job on large bodies of code than humans can.  Small code fragments
>>often do better when carefully crafted by humans, although this is
>>a very time-consuming activity.
>
>Basically, I agree.  Given the caveat that good compilers are quite
>rare, I have seen good compilers consistently outperform handwritten
>assembly code -- but humans could usually do local tweaks to compiler
>generated code to speed it up even a little more. 

I'm not so sure I've seen this very often. It seems that on some of the
newer machines (IBM RS/6000) which have very special conditions for
optimizing the pipeline this might be more accurate - but on many machines it
isn't. The trick is when you hand write assembly code you get to
optimize register usage across routine calls much better then most
compilers can (actually, to be honest I haven't seen any compilers do
global register optimization - though this is an area of active
research). Anyways, almost every comment I've seen on compilers vs hand
optimization has been on a routine by routine case - and I still believe
(though I could be wrong) that the optimal results from assembly code
come from register usage across/through routine calls...

Ho hum - I'm sure someone will set me straight ....


 
------------------------------------------------------------------
|  Mail:    1830 22nd street      Email: atk@boulder.colorado.edu|
|           Apt 16                Vmail: Home:   (303) 939-8256  |
|           Boulder, Co 80302            Office: (303) 492-8115  |
------------------------------------------------------------------

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (07/10/90)

In article <1797@apctrc.UUCP>, zrra07@backus (Randall R. Appleton) writes:
> I have a simple question:  Does anyone know what sort of speed-up
> on gets by going from a good implementation of some algorithm in a
> third generation language (C, Fortran) and a good optimizing compiler
> to hand-coded assembly?

This seems to be following along in the thread of compiler optimization,
where various people are suggesting that optimizers have or have not
replaced human ingenuity in generating efficient code.  This discussion
loses sight of two things:
  1) How hard (long) does the human try?
    - Some years ago I wrote some critical code for an application, part
      of which ran on a TMS32010 and part on an 8086 (it was a multi-
      processor system).  There were two sections of the program that
      had to be *very* fast.  I worked almost a week to write at most
      three pages of code.  When I was done, I doubt that any optimizer
      could have beat that code.  I tried the implications of using
      different registers for different purposes, and so on until I
      had what was the fastest possible implementation.  But at what
      cost?  I wrote about 150 assembler instructions in 5 days.
      That is 30 instructions/day, or 1 instruction every 15 minutes.

  2) Does portability mean anything to you?
    - The code I wrote was very non-portable.  If the processors had
      been changed, even to other members of the same family (eg.
      TMS32020 or 80286) then all my optimizations would have been
      for naught.  The new processors would have required a different
      optimization strategy.  The old code would have run, but not
      optimally.

To me the argument of which is better is, man or machine, depends on
what you are trying to accomplish.  If time and money are not a
problem, then an expert human can probably beat a compiler.  If
budgets or portability have any importance the compiler wins hands
down.

P.S.
 To answer your question, I probably saved 50% runtime on the tight
portions of my code.  


-- 
  Terry Ingoldsby                ctycal!ingoldsb@calgary.UUCP
  Land Information Services                 or
  The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

jkrueger@dgis.dtic.dla.mil (Jon) (07/10/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:

>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

But what is this code DOING?  Without that information these numbers
are useless.  It's reporting what happened to the dependent variables
without specifying the independent variables.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!

grunwald@foobar.colorado.edu (Dirk Grunwald) (07/10/90)

The MIPS compiler suite (an old bit of software, too) does global
optimization. I think that they still adhere to a set register passing
convention, but they do things like eliminate stack manipulation for
leaf procedures, etc etc.

Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated
the advantages of global register allocation at link time, and it does
very well.  You don't even need to have runtime performance data to
tune the allocation; simple bottom-up coloring appears to work very
well.

Someone just needs to hack it into Gnu C. This would raise the common
denominator of compiler performance & make companies either provide
something of comparable performance or support the Gnu compiler. I
still can't believe how many systems are shipped with shitty
compilers.

mcdaniel@adi.com (Tim McDaniel) (07/10/90)

Henry Spencer wrote:
   It can be negative. ...

hankd@dynamo.ecn.purdue.edu (Hank Dietz) wrote:
   Basically, I agree.

atk@boulder.Colorado.EDU (Alan T. Krantz) writes:
   I'm not so sure I've seen this very often.

Damn right!  Henry Spencer and Hank Dietz AGREEING on something?!?

8-)

--
"I'm not a nerd -- I'm 'socially challenged'."

Tim McDaniel
Internet: mcdaniel@adi.com             UUCP: {uunet,sharkey}!puffer!mcdaniel

jon@hitachi.uucp (Jon Ryshpan) (07/10/90)

>>I have a simple question:  Does anyone know what sort of speed-up
>>on gets by going from a good implementation of some algorithm in a
>>third generation language (C, Fortran) and a good optimizing compiler
>>to hand-coded assembly?
...
>							Notice
>that RISC architectures don't have many "funny" instructions, so
>compilers for them can do particularly well against humans....

One of the reasons that it's not so easy to improve on the C compiler
for RISC chips is that they were designed to support good C compilers.
So if what you want to say (what your algorithm requires) is easy to
express in C, you're likely to get a compilation that's hard to improve
on.

Even so, if what you want to say is not easy to express in C, like
"complex number" or "function in a function" or "static variable bound
to a fixed memory location" you may get an object that would benefit
from hand coding.

Jonathan Ryshpan		<...!uunet!hitachi!jon>

M/S 420				(415) 244-7369  	
Hitachi America Ltd.
2000 Sierra Pt. Pkwy.
Brisbane CA 94005-1819

meissner@osf.org (Michael Meissner) (07/10/90)

In article <23285@boulder.Colorado.EDU> grunwald@foobar.colorado.edu
(Dirk Grunwald) writes:

| The MIPS compiler suite (and old bit of software, too) does global
| optimization. I think that they still adhere to a set register passing
| convention, but they do things like eliminate stack manipulation for
| leaf procedures, etc etc.
| 
| Several people (Steinkist & Hennessy, Wahl, Wallace) have investigated
| the advantages of global register allocation at link time, and it does
| very well.  You don't even need to have runtime performance data to
| tune the allocation; simple bottom-up coloring appears to work very
| well.

I've always wondered about the gains.  Speaking off the cuff, I would
imagine you would get gains if you limit yourself to a Fortran style
of coding (no matter what language you use).  By Fortran style, I mean
little or no recursion, all calls are direct (not through pointers),
and all routines exist when the program is loaded.

In particular, calls to routines through pointers have to fall back on
some sort of fixed rules such as which set of registers are caller
save or callee save, and where the arguments go.  Most of the object
oriented programs that I'm aware of ultimately call functions through
pointers, and this in fact is touted as a good thing, so that a
subclass can replace a function as needed.  Also in dynamic languages
like LISP, the function being called may not have been created yet, or
will have been replaced by the time the call is made.  A weaker
version of this is if your system has shared libraries -- when the
link is made, the shared libraries are not bound in, they are
generally bound in when either the program starts up, or an unbound
function is called.

I also wonder about the computational complexity of global register
analysis.  My experience has tended to be that register allocation is
one of the more memory intensive pieces of compilers, and the size of
the memory is needed scales with the number of basic blocks used and
number of distinct pseudo regs requested.  I could easily imagine a
global register analysis phase thrashing on really large programs,
unless severe restrictions were used, and those restrictions would
tend to limit optimization.

| Someone just needs to hack it into Gnu C. This would raise the common
| denominator of compiler performance & make companies either provide
| something of comparable performance or support the Gnu compiler. I
| still can't believe how many systems are shipped with shitty
| compilers.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Do apple growers tell their kids money doesn't grow on bushes?

cliffc@sicilia.rice.edu (Cliff Click) (07/11/90)

In article <MEISSNER.90Jul10102037@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
>I also wonder about the computational complexity of global register
>analysis.  

Global register allocation is NP-complete.

>My experience has tended to be that register allocation is
>one of the more memory intensive pieces of compilers, and the size of
>the memory is needed scales with the number of basic blocks used and
>number of distinct pseudo regs requested.  I could easily imagine a
>global register analysis phase thrashing on really large programs,

Some folks here at Rice, experimenting with a nameless commercial compiler,
discovered a simple 2000-line Fortran program that took 90+ HOURS to compile
on a 10+ Mips, 16 Meg workstation.  The program spent roughly 98% of its time
thrashing the disk.  The algorithm the compiler used required exponential
memory space.  The same compile took ~15 minutes on a 48 Meg workstation.

Cliff Click
cliffc@rice.edu

--
Cliff Click                
cliffc@owlnet.rice.edu       

mikeb@salmon.ee.ubc.ca (Mike Bolotski) (07/11/90)

In article <2681@awdprime.UUCP>, tif@doorstop.austin.ibm.com (Paul
Chamberlain) writes:
> 
> P.S.	Another good example is parity.  Takes about 25 instructions
> 	in C, about 1 in assembly (using intel anyway).

Deja voodoo. Lookup table.  3-4 instructions. Maybe less, depending on
addressing modes available. 
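
An editorial sketch of the table approach (all names invented, not from
the post).  Once the table is built, byte parity is a single indexed
load:

        unsigned char parity_table[256];

        void init_parity_table(void)        /* call once at startup */
        {
            int b, v;
            for (b = 0; b < 256; b++) {
                unsigned char p = 0;
                for (v = b; v; v >>= 1)
                    p ^= (unsigned char)(v & 1);
                parity_table[b] = p;
            }
        }

        #define BYTE_PARITY(b)  (parity_table[(unsigned char)(b)])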

---
Mike Bolotski, Department of Electrical Engineering,
               University of British Columbia, Vancouver, Canada 
mikeb@salmon.ee.ubc.ca             | mikeb%salmon.ee.ubc.ca@relay.ubc.ca

preston@titan.rice.edu (Preston Briggs) (07/11/90)

In article <2681@awdprime.UUCP> tif@doorstop.austin.ibm.com (Paul Chamberlain) writes:

>>> In other words, if I take my average well written program, compiled
>>> with a good optimizing compiler, and re-write it in assembler, what sort

>Anyone will tell you though that you usually only need to
>rewrite a couple dozen lines to get your 2-3 times faster.

Anyone?  Not me.  I think you're exaggerating the speedup
and minimizing the amount of rewriting.

The programs I work on aren't huge -- certainly all less than 10,000 lines.
But there aren't 24 lines that consume 50 to 67 percent of the time
(assuming you could replace them with assembler that executes in 0 time).
There's a lot of work being done by many different pieces of the code,
each consuming some fairly small percentage of the total time.

A smaller program doing something simple (sort this file, format this
text, or whatever) might be sped up by a significant amount.
For large programs (particularly "well written," that is, using the right
algorithms), I'd be surprised if you can get a factor of 2 without
major effort.

Thinking in terms of the 80-20 "rule", 20% of 10,000 lines suggests
we might look for the 2,000 most executed lines.  That's a lot of rewriting!

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

nevin@igloo.scum.com (Nevin Liber) (07/11/90)

[followups to comp.lang.misc only, please]

In article <1990Jul8.183551.13891@nlm.nih.gov> states@tech.NLM.NIH.GOV (David States) writes:
>On a RISC machine with a good compiler, a C programmer willing to work 
>on his code ought to be able to do almost anything hand coded assembler can.
>This may involve looking at assembly output, recoding routines a few times
>and testing alternative alogorithms, but in the end you still have C (that
>will run on another processor even if not as fast).

Bleech!!  (This is not intended as a flame; it is just my humble
opinion.)  If you are going to go to all that bother, you are better
off writing that portion of your code directly in assembler.

There are two ways of programming in C:

	(1) writing high-level portable code
	(2) using it as a macro-assembler

Unfortunately, it is darn near impossible to write code which satisfies
both of these goals!  And from a maintenance standpoint, assembler
itself is better than C for goal #2.  For one thing, it separates the
code which needs to be redone during a port.  For another, it always
does what you expect it to do, even when you change versions of the
compiler.
-- 
	NEVIN ":-)" LIBER
	nevin@igloo.scum.com  or  ..!gargoyle!igloo!nevin
	(708) 831-FLYS
Advertisement:  Hire me!

jlg@lanl.gov (Jim Giles) (07/12/90)

From article <1319@fs1.ee.ubc.ca>, by mikeb@salmon.ee.ubc.ca (Mike Bolotski):
> In article <2681@awdprime.UUCP>, tif@doorstop.austin.ibm.com (Paul
> Chamberlain) writes:
>> P.S.	Another good example is parity.  Takes about 25 instructions
>> 	in C, about 1 in assembly (using intel anyway).
> Deja voodoo. Lookup table.  3-4 instructions. Maybe less, depending on
> addressing modes available. 

Works just fine (but still slower than the assembler - by a lot) - provided
that you are doing parity checking of something _small_, like a single
byte.  The table gets kind of large for a 32-bit integer.  Similarly,
population counts and leading (trailing) zero (one) counts are not the
kind of thing that you'd want to do with table look-ups.  Of course,
no HLL standard I know of lets you inline assembly - the best solution.  _SOME_
implementations of C, Pascal, Fortran, even BASIC allow such inlining,
but the capability is not standard (makes your code - shudder - machine
dependent).

J. Giles

ballen@csd4.csd.uwm.edu (Bruce Allen) (07/12/90)

There is a nice discussion of the relative speedups that can be obtained
by various kinds of optimization in "Writing efficient programs"
by Jon Bentley (Prentice-Hall Software Series, 1982).

In two examples, discussed in detail by Bentley (pgs 26-28 and 130)
the speedup obtained in hand-coding in machine-language was a factor
of 2 in one case and a factor of 5 in the other case.  The latter
case was one in which the machine had a particularly well-suited
instruction, the former case is more typical.

The major lesson that one gets from Bentley's book is that almost
all the significant speedups are obtained in designing data
structures, algorithms, and general good coding.  In this book
many examples are given of "typical" code being speeded up by
factors of ten or more just by careful rewriting of high-level
code.

I think this is an excellent, fun-to-read book. Anyone who has not
looked at this book should peruse it the next time they're in a 
good bookstore (if you don't get distracted by "Programming Pearls",
by the same author, just next to it!).

nevin@igloo.scum.com (Nevin Liber) (07/12/90)

Let me try and summarize my thoughts on this subject.  (My apologies to
those who inspired my conclusions; news expires too quickly around here
to be able to quote them directly.)

Humans are better than compilers at optimizing small pieces of code.
In this sense, programming is more of an art than a science.  Most
comparisons of compiler output of a single function to the hand-coded
version shows the hand-coded version to be better.

However, humans are not very good at CONSISTENTLY (or correctly, for
that matter) applying their optimization techniques to all of their code.
Most don't even know exactly how they do it.  Compilers are better at
applying the same types of optimizations over and over.

And compilers SHOULD be better at that, since it is merely a procedural
(albeit complex) task.  What the humans need to do is understand how
they perform optimizations, so they can design them in to their compilers,
and move the "art" of programming to a higher level of abstraction.


Just some of my thoughts, that's all.  I welcome your comments.
-- 
	NEVIN ":-)" LIBER
	nevin@igloo.scum.com  or  ..!gargoyle!igloo!nevin
	(708) 831-FLYS
Advertisement:  Hire me!

diamond@tkou02.enet.dec.com (diamond@tkovoa) (07/12/90)

In article <2617@igloo.scum.com> nevin@igloo.scum.com (Nevin Liber) writes:

-There are two ways of programming in C:
-	(1) writing high-level portable code
-	(2) using it as a macro-assembler
-Unfortunately, it is darn near impossible to write code which satisfies
-both of these goals!

True.  However:

-And from a maintenance standpoint, assembler
-itself is better than C for goal #2.

Some people at Bell Labs felt exactly the opposite of the way you feel.
Because of their feelings, they invented C.

-- 
Norman Diamond, Nihon DEC     diamond@tkou02.enet.dec.com
This is me speaking.  If you want to hear the company speak, you need DECtalk.

peter@ficc.ferranti.com (Peter da Silva) (07/16/90)

In article <56704@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
[lookup table]
> Works just fine (but still slower than the assemble - by a lot) - provided
> that you are doing parity checking of something _small_, like a single
> byte.  The table gets kind of large for a 32-bit integer.

Luckily, parity is conserved. Thus you can use the byte lookup table 4 times
over the word. If things are getting tight enough and you're working on a hot
spot, then break out the assembler (but leave the original code ifdeffed in).

Harder to do with checksums, though.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.
<peter@ficc.ferranti.com>

njk@diku.dk (Niels J|rgen Kruse) (07/16/90)

peter@ficc.ferranti.com (Peter da Silva) writes:

}In article <56704@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
}[lookup table]
}> Works just fine (but still slower than the assemble - by a lot) - provided
}> that you are doing parity checking of something _small_, like a single
}> byte.  The table gets kind of large for a 32-bit integer.

}Luckily, parity is conserved. Thus you can use the byte lookup table 4 times
}over the word.

Better yet, xor the bytes together and then use the fine lookup table.
Parity is just the xor of all the bits in any old order, remember.
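
Combining the two suggestions in an editorial sketch (the byte table is
assumed to be the 256-entry table sketched earlier in the thread): xor
the halves of the word together, then do one lookup on the low byte.

        extern unsigned char parity_table[256]; /* byte-parity table, built elsewhere */

        unsigned word_parity(unsigned long w)   /* a 32-bit word is assumed      */
        {
            w ^= w >> 16;                       /* xor preserves parity          */
            w ^= w >> 8;                        /* low byte = xor of all 4 bytes */
            return parity_table[w & 0xFF];
        }
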
-- 
Niels J|rgen Kruse 	DIKU Graduate 	njk@diku.dk

sra@ecs.soton.ac.uk (Stephen Adams) (07/17/90)

In article <1990Jul16.163019.3933@diku.dk> njk@diku.dk (Niels J|rgen Kruse) writes:

 > peter@ficc.ferranti.com (Peter da Silva) writes:
 > 
 > }In article <56704@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
 > }[lookup table]
 > }> Works just fine (but still slower than the assemble - by a lot) - provided
 > }> that you are doing parity checking of something _small_, like a single
 > }> byte.  The table gets kind of large for a 32-bit integer.
 > 
 > }Luckily, parity is conserved. Thus you can use the byte lookup table 4 times
 > }over the word.
 > 
 > Better yet, xor the bytes together and then use the fine lookup table.
 > Parity is just the xor of all the bits in any old order, remember.

Indeed, you can use this property and `fold' the byte in
half using shift and xor.  Then fold the 4-bit word in half
and so on until you have just one bit which is the parity.
The other bits in the word become junk but that doesn't
matter because you just ignore them.  This seems to work
quite fast in practice and doesn't require a lookup table.
On a RISC it should be about as fast as a lookup table as
all the work is done in registers.

	x = datum;
	x^=(x>>4);
	x^=(x>>2);
	x^=(x>>1);
	parity_bit = x&1;

It extends naturally to the 32-bit case:

	x = datum;
	x^=(x>>16);
	x^=(x>>8);
	x^=(x>>4);
	x^=(x>>2);
	x^=(x>>1);
	parity_bit = x&1;

If you need to use the parity bit in the correct bit
position then shift the other way (bytes):

	x = datum;
	x^=(x<<4);
	x^=(x<<2);
	x^=(x<<1);
	/*parity_bit = x&0x80;*/
	x^=(x&0x80);   /* force even parity */


--
Stephen Adams                        S.Adams@uk.ac.soton.ecs (JANET)
Computer Science                     S.Adams@ecs.soton.ac.uk (Bitnet)
Southampton SO9 5NH, UK              S.Adams@sot-ecs.uucp    (uucp)

njk@diku.dk (Niels J|rgen Kruse) (07/18/90)

I wrote:
> > Better yet, xor the bytes together and then use the fine lookup table.
> > Parity is just the xor of all the bits in any old order, remember.

sra@ecs.soton.ac.uk (Stephen Adams) writes:
>Indeed, you can use this property and `fold' the byte in
>half using shift and xor.  [and so on]
>On a RISC it should be about as fast as a lookup table as
>all the work is done in registers.

I really doubt that is true in general.

>       x = datum;
>       x^=(x>>4);
>       x^=(x>>2);
>       x^=(x>>1);
>       parity_bit = x&1;

I count 7 operations here, which on a typical RISC require 7 clocks.

Typical cost of a load from cache is 2 clocks.  At worst, a
table lookup require 2 loads, one of which is loading the
address of the table from the constant pool.  We can expect the
compiler to move the latter out of surrounding loops.  It is not
unreasonable to expect an often used and small lookup table to
be in cache.

I suspect you are thinking of a RISC like the Archimedes, which
doesn't have a cache and has an unusual instruction set that allows
shifting an operand in the same instruction as another operation.
(I am not too familiar with the ARM.)
-- 
Niels J|rgen Kruse 	DIKU Graduate 	njk@diku.dk

staff@cadlab.sublink.ORG (Alex Martelli) (07/19/90)

cavrak@uvm-gen.UUCP (Steve Cavrak,113 Waterman,6561483,) writes:
>...
>An example with VS-FORTRAN on an IBM-3090 with a vector processor.
>a.  plain fortran matrix		1.0
>b.  full optimized fortran		0.3
>c.  IBM's ESSL hand coded library	0.1 or better

d.  NAG's implementation, fully Fortran but "properly" coded with
    block-submatrices algorithm: within 10% of ESSL!

That's for linear-algebra stuff...  back in '88, I was working in the
same office at the IBM Italy Scientific Center with the NAG guy who was
visiting there and implementing those routines, whose performance gains
were proving transferable to other machines with complex storage
hierarchies, by the way, via a simple variation in the "good-block-size"
parameter.  On the other hand, I suspect the wondrously fast and precise
library of Fortran intrinsics, coded in assembler with lots of clever
table-lookups and bit-twiddling, would suffer a LOT if it had to be
reimplemented in Fortran itself!  But in most cases, proper algorithms,
and, more specifically, appropriate data-access patterns, are THE key
to good numerical-computing performance - and Fortran remains adequate 
for expressing such algorithms and data-access patterns.
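
For readers who have not seen the blocking idea Alex describes, here is
a hedged editorial sketch in C (the block size and names are invented).
The inner loops work on sub-blocks sized to the fast level of the
storage hierarchy, which is the tunable "good-block-size" parameter he
mentions:

        #define BS 64   /* tune so three BS x BS blocks fit in fast memory */

        /* C = C + A*B, all n x n, row-major: an illustrative blocked version. */
        void matmul_blocked(int n, const double *A, const double *B, double *C)
        {
            int ii, jj, kk, i, j, k;
            for (ii = 0; ii < n; ii += BS)
              for (kk = 0; kk < n; kk += BS)
                for (jj = 0; jj < n; jj += BS)
                  for (i = ii; i < n && i < ii + BS; i++)
                    for (k = kk; k < n && k < kk + BS; k++) {
                        double a = A[i*n + k];
                        for (j = jj; j < n && j < jj + BS; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
        }
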
-- 
Alex Martelli - CAD.LAB s.p.a., v. Stalingrado 45, Bologna, Italia
Email: (work:) staff@cadlab.sublink.org, (home:) alex@am.sublink.org
Phone: (work:) ++39 (51) 371099, (home:) ++39 (51) 250434; 
Fax: ++39 (51) 366964 (work only; any time of day or night).