[comp.lang.misc] Common subexpression optimization

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (02/02/90)

In article <2217@sunset.MATH.UCLA.EDU> pmontgom@sonia.math.ucla.edu (Peter Montgomery) writes:

   In article <PCG.90Jan29131938@rupert.cs.aber.ac.uk>
   pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

   >The same is true for common subexpression elimination; if there
   >are common subexpressions, and it is *important* that they be
   >shared, the programmer has better make those explicit, because
   >the common subexpression usually has an interesting reason for being
   >common.

   In my programs, common subexpressions often appear
   as a result of macro expansions.  Others are not visible at the
   source level, but arise when producing the object code.

Of course. But this is the result, most often, of insufficient
language design, both as to the need to use macro expansion, and
as to the hidden code problem (as I try to demonstrate later on).

Let me give one, but quite general, example: common subexpressions
in fortran and other languages often arise from repeated subscript
calculations. Well, this most often happens because the language
does not have slices, or, even better, array iterators.

Consider the usual fragment for matrix product (dimensions are
[m,o] [o,n]), written in some pseudo language:

	for i in 1..m
	for j in 1..n
	  c[i,j] = 0
	  for k in 1..o
	    c[i,j] = c[i,j] + a[i,k]*b[k,j]

Of course there are ample opportunities for common subexpression
elimination here, and other hairy optimizations. What about instead:

	for i in 1..m
	  let s[] alias c[i,],
	      fast ronly h[] alias a[i,]
	  for j in 1..n
	    let fast e alias s[j] initially 0,
		fast ronly v[] alias b[,j]
	    for fast k in 1..o
	      e += h[k]*v[k]

(which can be readily written in Algol 68, which was *carefully*
designed for easy, efficient code generation) which is vastly
more descriptive as well, as it makes obvious we are using a
vector product on one hand and that we *know* that e,h,v,k are
the variables to cache, the others are irrelevant, and that the
compiler need not bother "optimize" ?

	In a previous discussion on a similar theme, we compared
	C code for this, written both ways, the first subjected
	to a very highly optimizing compiler. Well, the second version
	was faster by two or three times.
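
A C rendition of the hand-cached style would look something like this
(a sketch only, not the code we actually compared; the dimensions are
assumed here to be compile time constants M, N, O):

        /* sketch only: assumes double a[M][O], b[O][N], c[M][N]      */
        /* declared elsewhere, plus int i, j                          */
        for (i = 0; i < M; i++) {
            double *s = c[i];                /* s[] alias c[i,]        */
            double *h = a[i];                /* h[] alias a[i,], ronly */
            for (j = 0; j < N; j++) {
                register double e = 0.0;     /* the accumulator, cached */
                double *v = &b[0][j];        /* v[] alias b[,j], ronly */
                register int k;
                for (k = 0; k < O; k++)
                    e += h[k] * v[k*N];      /* v steps down column j  */
                s[j] = e;
            }
        }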

Then the following version

	for s[] over c[*,], ronly h[] over a[*,]
	for p over s[*], ronly v[] over b[,*]
	  p = h innerprod v

	(ronly h[]) innerprod (ronly v[]) is
	  let fast d initially 0
	  for fast ronly h_i over h[*], fast ronly v_j over v[*]
	    d += h_i*v_j;
	  return d

would be in many ways less procedural and highly informative,
and allow for a lot of safe optimization (coupled with some heuristic
on expected access frequencies), especially on vector machines.

Somebody may claim that the style I advocate is "unnatural", because
the first version is more similar to the habits of a mathematically
trained person. Let me argue that:

1) As to this particular example, I think that matrix product as
usually described uses poor notation.  I am all in favour of
using more abstract, terse notation in mathematics.  There are
unfortunately many people that love lots of hanging indexes...

2) Programming is not the same as mathematics. You often have to
change *radically* technique when writing a program. Risch-Norman
for symbolic integration is totally unsuitable for human
execution, and has no resemblance to "by parts", etc...
Programming languages of the sort used for time critical
applications are not "non procedural" yet, and I think never will
be, and *the programmer* must adapt to them, not vice versa.

3) If the hot spot is localized, and is *important*, one doesn't care
much about naturalness, even if I maintain that often one can still
do a clean job.

Now back to your example. Your code contains:

		    MUL2(a1, b1, c1, d1)
		    MUL2(a1, b2, c1, d2)
		    MUL2(a2, b1, c2, d1)
		    MUL2(a2, b2, c2, d2) .

and you comment that

	    On the SPARC, the macro is being defined not to produce the
    first expansion but rather is defined so MUL2(a, b, c, d)
    becomes (TRI(a+b) + TRI(c+d)) - (TRI(a) + TRI(c)) - (TRI(b) + TRI(d)),
    where TRI is a table of triangular numbers (i.e., TRI(n) = n*(n+1)/2).

Ahhhhhh.... So. You want to avoid slow SPARC multiplications and you use
a table. This is highly unobvious; you add that

    It is easy to verify that this is functionally equivalent to
    the first expansion, if the TRI table is sufficiently large
    and if no integer overflow occurs.

which means that this highly machine dependent (because your
space vs. time *tradeoff* is such) code must also be supported by
easy but non obvious boundary conditions and mathematical reasoning.

In other words, *major* wizardry, which should be shouted about, not
hidden within different MUL2 expansions.
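
For other readers: the identity behind the trick is
TRI(a+b) - TRI(a) - TRI(b) = a*b, so MUL2 as quoted computes
a*b + c*d. In C it would read something like this (my reconstruction,
not necessarily Peter's code; TABSIZE is whatever covers the largest
a+b or c+d):

        /* TRI[n] = n*(n+1)/2, filled once at startup, e.g.           */
        /*     for (n = 1; n < TABSIZE; n++)                          */
        /*         TRI[n] = TRI[n-1] + n;                             */
        static long TRI[TABSIZE];

        #define MUL2(a, b, c, d) \
            ((TRI[(a)+(b)] + TRI[(c)+(d)]) \
           - (TRI[(a)] + TRI[(c)]) - (TRI[(b)] + TRI[(d)]))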

    After the macro is expanded, the compiler sees the four expressions

	    (TRI(a1+b1)+TRI(c1+d1)) - (TRI(a1)+TRI(c1)) - (TRI(b1)+TRI(d1))
	    (TRI(a1+b2)+TRI(c1+d2)) - (TRI(a1)+TRI(c1)) - (TRI(b2)+TRI(d2))
	    (TRI(a2+b1)+TRI(c2+d1)) - (TRI(a2)+TRI(c2)) - (TRI(b1)+TRI(d1))
	    (TRI(a2+b2)+TRI(c2+d2)) - (TRI(a2)+TRI(c2)) - (TRI(b2)+TRI(d2)).

    As previously mentioned, this is a time-critical loop, and the
    macro definition has been carefully written so the subexpressions
    (TRI(a1)+TRI(c1)), (TRI(b1)+TRI(d1)), (TRI(a2)+TRI(c2)),
    (TRI(b2)+TRI(d2)) appear twice in the resultant expansions.

    The module which has the four references to MUL2 does not know
    about this; it would defeat information hiding if I expanded
    everything there and removed the common subexpressions
    explicitly.

I think that a different choice of abstraction boundaries would do it;
probably MUL2 is not the abstraction layer you should care about,
but rather the block of four successive calls. Here you are playing into
my hands; in your application, what is central is indeed the effect
achieved by the block of four calls, not the way it is achieved by each
multiplication step.

    In practice, a1, a2, c1, and c2 are loop invariants
    (but b1, b2, d1, and d2 vary on each iteration).

But this is very interesting information again, which should be
duly obvious to the reader of your code. Whoever reads your code, be
it human or program, should not have to discover this important
property by him/her/itself.

    [... because of expensive array indexing on SPARC maybe ...]
    I want the compiler to optimize the array indexing, perhaps by
    computing the addresses TRI, TRI+4*a1, TRI+4*a2, TRI+4*c1,
    TRI+4*c2 outside the loop, with the inner loop using 4*b1, 4*b2,
    4*d1, and 4*d2 as offsets.  This requires the compiler to rewrite
    the address TRI+4*(a1+b1) as (TRI+4*a1)+(4*b1), recognizing that
    TRI+4*a1 is a loop invariant and that 4*b1 is also part of the
    address TRI+4*b1.  I cannot make these optimizations at the
    source level.

In C you can, for example, step a pointer thru the arrays. This
would be a way to make obvious that you are building your result
by sequential scan of the tables. Again, e.g. for virtual memory,
if the application is really critical, this might be a machine
dependent valuable piece of information, or not. As another
example, on many cached machines it is *vital* to do memory to
memory copies that are aligned on cache line boundaries in the
destination. This is something that is highly interesting, as it
makes clear some unobvious design choices.
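
To be concrete about the pointer stepping: something along these
lines (a sketch, with r11..r22 and more_digits made up here) makes
the hoisted bases and the shared subexpressions explicit, and relies
on C scaling the pointer arithmetic for you:

        /* a1, a2, c1, c2 are loop invariant, so these are too          */
        register long *ta1 = TRI + a1, *ta2 = TRI + a2;  /* scaled bases  */
        register long *tc1 = TRI + c1, *tc2 = TRI + c2;
        register long ac1 = TRI[a1] + TRI[c1];           /* invariant sums */
        register long ac2 = TRI[a2] + TRI[c2];
        while (more_digits) {                  /* only b1, b2, d1, d2 vary */
            register long bd1 = TRI[b1] + TRI[d1];
            register long bd2 = TRI[b2] + TRI[d2];
            r11 = ta1[b1] + tc1[d1] - ac1 - bd1;   /* MUL2(a1, b1, c1, d1) */
            r12 = ta1[b2] + tc1[d2] - ac1 - bd2;   /* MUL2(a1, b2, c1, d2) */
            r21 = ta2[b1] + tc2[d1] - ac2 - bd1;   /* MUL2(a2, b1, c2, d1) */
            r22 = ta2[b2] + tc2[d2] - ac2 - bd2;   /* MUL2(a2, b2, c2, d2) */
            ...
        }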

    Even on systems using the first macro expansion (in terms of
    multiplications), there may be hidden common subexpressions.  
    For example, suppose a1*b1 + c1*d1 is done by floating point 
    hardware, with all variables being converted from integer to floating 
    beforehand.

If this is important, I think you could well be doing it manually, just
like the avoidance of multiplications above. It would also make things
clearer to whoever reads your code, and make the hazards involved
apparent (floating point has quite different properties from integer
arithmetic).

    This conversion to floating point does not appear in my 
    source code, but I would want the compiler to recognize that variables 
    appearing repeatedly in products need be converted to floating only once, 
    and that a1, a2, c1, c2 can be converted outside the loop.

This will be most easily done if you let the readers of your program
know that those are loop invariants. Then it just takes a good code generator
to choose the most efficient strategy for carrying out your wishes. Of
course, if the application is really time critical, you also want to
second guess your compiler and architecture. Unavoidable.

In summary, I think your example is helping my argument; you take great
pains to do manual micro optimization (eschewing multiplies, for example),
but then you do this at what I deem too low a level, and you don't
include important information as to whether some variables are invariant,
and as to whether intermediate passage to floating point is safe.

Well, it's mostly a difference of programming style. You may not like mine,
but I think it can still cope with problems like this. On the other hand,

	I cannot make these optimizations at the source level.

which is a limit of current languages, one that breeds compilers that
are too clever by half (many of the optimizations you want your
compiler to do above require non obvious reasoning -- all too
many optimizers generate optimized code that is not equivalent to
unoptimized code). As such, we are all poor victims of the status
quo.

Eventually, my conclusion is that if one wants smart compilers,
one had better have languages that make that easier, less
procedural and more descriptive. For languages like C or Fortran,
for which the compiler must do painful and often incorrect
deductive work if second guessing the programmer is desired, I
hold that careful manual optimization of the few relevant lines
is both preferable and more effective, and safer, and the compiler
should help with that (that's why I think that "register" in C
is such a big win).

On the other hand both are expensive in terms of man power, so we
see monstrosities such as the many hundreds of millions of dollars
that have been sunk in vectorizing compilers for fortran that try
to second guess programmers that wrote for long dead
architectures.  The thought of using more appropriate language
technology seems to be heretical.

Well, dusty decks have a morbid fascination. Many could not do
without their bugs (and don't remember what Knuth wisely observed
on this matter in his volume 1 about rewriting being better -- of
course if you have the manpower to do it).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@titan.rice.edu (Preston Briggs) (02/02/90)

In article <PCG.90Feb1164820@rupert.cs.aber.ac.uk> 
pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

	many interesting things...

He also gives some examples of how we might notate matrix multiplication
so that a comparatively simple compiler can generate good code.

>	for i in 1..m
>	  let s[] alias c[i,],
>	      fast ronly h[] alias a[i,]
>	  for j in 1..n
>	    let fast e alias s[j] initially 0,
>		fast ronly v[] alias b[,j]
>	    for fast k in 1..o
>	      e += h[k]*v[k]
>
>(which can be readily written in Algol 68, which was *carefully*
>designed for easy, efficient code generation) which is vastly
>more descriptive as well, as it makes obvious we are using a
>vector product on one hand and that we *know* that e,h,v,k are
>the variables to cache, the others are irrelevant, and that the
>compiler need not bother "optimize" ?

My impression has been that the generality of Algol-68 prohibits
efficient code.  Nevertheless, the example is interesting.
Note that we'd also like to see m*4 in a register
(assuming row major array ordering and 4 bytes/word) and k itself
eliminated.  The point being that the "fast" hint has some
perhaps unexpected side-effects.

>Then the following version
>
>	for s[] over c[*,], ronly h[] over a[*,]
>	for p over s[*], ronly v[] over b[,*]
>	  p = h innerprod v
>
>	(ronly h[]) innerprod (ronly v[]) is
>	  let fast d initially 0
>	  for fast ronly h_i over h[*], fast ronly v_j over v[*]
>	    d += h_i*v_j;
>	  return d
>
>would be in many ways less procedural and highly informative,
>and allow for a lot of safe optimization (coupled with some heuristic
>on expected access frequencies), especially on vector machines.

If you'd throw out the "ronly" and "fast" hints, I'd say these
forms aren't too bad.  Why do you need to say "ronly" when it's
fairly obvious to the compiler that h and h_i are not assigned?
These hints seem like non-abstract intrusions into the middle
of otherwise fairly declarative code.

Both these examples are written in a relatively high-level notation.
My approach to compiling either notation would be to translate
quickly into a fairly low-level form with very little analysis
of the original code.  To simplify my life, I'd make the translation
as straightforward as possible, generating deliberately naive code and
concentrating on producing a correct translation.
Then I'd optimize, using a few powerful, general purpose optimizations,
(i.e. partial redundancy elimination, strength reduction,
dead code elimination, value numbering).  Finally, I'd generate
object code from the optimized intermediate code.
I'd use a global register allocator to assign registers to
the important variables and temporaries.
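
To make that concrete: the inner statement of the matrix multiply
example, c[i,j] = c[i,j] + a[i,k]*b[k,j], might be lowered to
something like the following (written as C over flat arrays, with
t1..t8 standing for compiler temporaries -- the names are invented
here):

        t1 = i*N + j;       /* index of c[i][j], row major, N = row length */
        t2 = c[t1];
        t3 = i*O + k;       /* index of a[i][k]                            */
        t4 = a[t3];
        t5 = k*N + j;       /* index of b[k][j]                            */
        t6 = b[t5];
        t7 = t4 * t6;
        t8 = t2 + t7;
        c[t1] = t8;

Value numbering notices that the second use of c's index is the same
t1; strength reduction turns the i*N, i*O, and k*N multiplies into
additions as the loops advance; redundancy elimination and the
register allocator clean up the rest.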

Note the separation of concerns.  Enforcing the language in the front
end, improvement (without changing the meaning of the program)
in the middle, and concern for the machine details in the back end.
So what happened to the "fast" hints (and register declarations in C)?
They get thrown out.  It's easier, more effective, and more complete to
discover opportunities for optimization during the latter stages
of compilation.

But let's imagine a slick compiler that can make the code you'd like
to see for either of these examples.  There exist Fortran compilers
that can produce faster code.  Why?  They can make fuller use of
the machine's resources and they're willing to generate the ugly
code necessary.

You mustn't misunderstand me and think that I code in Fortran;
but I do recognize that Fortran compilers know a lot about
DO-loops and arrays.  If other languages had similar restrictions
on their arrays and loop structures, their compilers could
achieve similar levels of optimization.  It's also possible 
that the notation outlined by PCG would support fairly deep 
analysis; I'm not sure from the examples.

Preston Briggs
preston@titan.rice.edu

jlg@lambda.UUCP (Jim Giles) (02/02/90)

From article <PCG.90Feb1164820@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [...]
> Consider the usual fragment for matrix product (dimensions are
> [m,o] [o,n]), written in some pseudo language:
> 
> 	for i in 1..m
> 	for j in 1..n
> 	  c[i,j] = 0
> 	  for k in 1..o
> 	    c[i,j] = c[i,j] + a[i,k]*b[k,j]
> 
> Of course there are ample opportunities for common subexpression
> elimination here, and other hairy optimizations. What about instead:
> 
> 	for i in 1..m
> 	  let s[] alias c[i,],
> 	      fast ronly h[] alias a[i,]
> 	  for j in 1..n
> 	    let fast e alias s[j] initially 0,
> 		fast ronly v[] alias b[,j]
> 	    for fast k in 1..o
> 	      e += h[k]*v[k]
> 
> (which can be readily written in Algol 68, which was *carefully*
> designed for easy, efficient code generation) which is vastly
> more descriptive as well, [...]

Yes, it is more descriptive.  But, it is descriptive of all the wrong
things.  Here, you are describing things for optimization purposes
which the programmer (with today's state-of-the-art) RIGHTLY assumes
the compiler should be able to find for itself.  In fact, the only
reason that programmers accept the above Fortran-ish loop notation
is that (up to now) the compiler technology hasn't been capable of
doing better.  The mathematical notation for this is simply:

     C = A B

Now, you can convince a user that this is ambiguous: is it element-wise
array multiplication or matrix multiplication?  Well, this requires a
rewrite.  The experienced mathematician will then suggest:

      n   o n
     C = A B
      m   m o

This uses a convention that upper- and lower-indices that match on the
right are summed (sometimes called the 'Einstein summation convention').
Well, now you might convince the programmer that the above notation is
ill suited to keyboard/terminal usage since these devices are (so far)
not well suited to the two-dimensional layout given.  So, you might end
up with:

     C(m,n) = A(m,o) * B(o,n)

This is fine, except that now it matches the notation for a simple scalar
multiply using the current values of m,n, and o.  There needs to be some
way of explicitly representing the fact that m, n, and o are iteration
variables and not simple subscripts.  Hence, the loop notation that began
this message.  But, the mathematical programmer has legitimate reason to
expect that, as compiler/language technology improves, the notation will
converge back to the original mathematical notation.  So, by today's
state-of-the-art, you might expect a language to accept AND OPTIMIZE
the following:

     C(#m,#n) = A(#m,%o) * B(%o,#n)

Where the sharp (#) indices are iterated over and the percent (%)
indices are summed over.  This may be the best we can expect until
keyboards and fonts improve (which is already happening).  Still, the
correct direction for programming languages to move is toward notation
for _people_ to use, not toward notation to make (already standard)
optimizations easier.

> [...]
> 2) Programming is not the same as mathematics. You often have to
> change *radically* technique when writing a program. [...]

Quite true.  But, the key phrase in what you've said is "have to".
You shouldn't change your notation (radically or not) unless you
have to.  Even then you should change as little as you have to.
As time goes on, it is legitimate to expect that these occasions
when I have to change notation will become less frequent.  It is
toward that goal that language design should be turned.

> [... "register" attribute in C "a big win" ...]

The only difference between a register variable and any other local
variable in C is that you can't use the address-of (&) operator on
register variables.  In languages which don't have pointers (like
Fortran 77) this is a moot distinction.  In languages in which
pointers can only point to dynamic memory and are the only way to 
do so (like Pascal and Modula2) this is also an irrelevant distinction.
In fact, the only use of the "register" attribute is to be able to
declare variables that can't be aliased - an irrelevant distinction
in MOST languages where variables have that property by default.

J. Giles

preston@titan.rice.edu (Preston Briggs) (02/02/90)

In article <4466@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <PCG.90Feb1164820@rupert.cs.aber.ac.uk> 

>>	for s[] over c[*,], ronly h[] over a[*,]
>>	for p over s[*], ronly v[] over b[,*]
>>	  p = h innerprod v
>>
>>	(ronly h[]) innerprod (ronly v[]) is
>>	  let fast d initially 0
>>	  for fast ronly h_i over h[*], fast ronly v_j over v[*]
>>	    d += h_i*v_j;
>>	  return d

>If you'd throw out the "ronly" and "fast" hints, I'd say these
>forms aren't too bad.  Why do you need to say "ronly" when it's
>fairly obvious to the compiler that h and h_i are not assigned?

Following up my own posting, ...

I'd like a compiler to make code to make the computer do what I want.
Unfortunately, it can't read minds, so I have to write a program
to give it a hint.  Since it's so stupid, I have to be fairly 
specific about how it's to do things like multiplying matrices.
On the other hand, an optimizing compiler allows me to avoid
too much tedium.

Strength reducers are good enough that I don't need to mess
with pointer variables.  Various forms of redundancy elimination
are good enough that I needn't bother about every little common
subexpression.  Vectorizers are good enough so that I need not
manually insert vector primitives.  Register allocators are
good enough that I need not scatter register declarations about.
Constant propagation is strong enough so that I can afford
to declare some constants in convenient places, without bothering about
folding every constant expression by hand.  And all these optimizations
work exceptionally well on the inner loops that consume so much time.
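
A trivial C illustration of the constant propagation point:

        double buf[288];
        int i;
        int rows = 24, cols = 12;    /* declared where it reads best     */
        int cells = rows * cols;     /* the compiler propagates and      */
        for (i = 0; i < cells; i++)  /* folds this to 288, then treats   */
            buf[i] = 0.0;            /* the loop bound as a constant     */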

The optimizer is written once, by the compiler writer.
I, on the other hand, write many programs and I'm able to
take advantage of the optimizer in all my programs.
It costs compile time, but it's cheaper to use the computer
to optimize my code than to do it by hand.

Optimizers can't do everything.  They don't parallelize Fortran
very well, for example.  A better language would probably be
a good alternative.  But why should we do what the computer
can do well?

Preston Briggs
preston@titan.rice.edu

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (02/06/90)

In article <14225@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

   > [... "register" attribute in C "a big win" ...]

   The only difference between a register variable and any other local
   variable in C is that you can't use the address-of (&) operator on
   register variables.

Well, there is also actually a strong *hint* that the variable will
be heavily *dynamically* used across statement boundaries, in the
current scope. The compiler can use this hint very effectively.

A careful programmer will never declare a variable for a scope larger
than that in which it is used, and will never reuse a variable with the
same name for two different roles. Very few people avoid the sloppy
practice of declaring a bunch of variables at the beginning of a major
scope, some of which are only used in some section of it.

The compiler could do a liveness analysis on such variables, but
the source code remains obscure, because the source misrepresents to
the reader what goes on. The compiler or a human can second guess the
author, but this had better not happen, because it means the source is
not informative enough.

For example, when I heavily use, in one section of code, a variable
that is otherwise little used, I will create a new scope for that
section of code, declare a register variable in it into which I copy
the outer one, and copy it back at the end of the paragraph. In this
way I am conveying, to both human and compiler reading my code, the
important information that I reckon the variable is being heavily
used. A human reader can find this information just as valuable as a
compiler does.
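
In C the idiom looks something like this (a sketch; buf, len, target
and count are made up names):

        {
            register long n = count;        /* cache it explicitly, and  */
            register char *p;               /* say so to the reader too  */
            for (p = buf; p < buf + len; p++)
                if (*p == target)
                    n++;                    /* all the heavy use is on n */
            count = n;                      /* copy back at the end      */
        }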

	Of course this is because I am eccentric enough to think
	that the performance characterization of a program is
	part of its design and pragmatics should be as obvious
	as semantics.

   In languages which don't have pointers (like
   Fortran 77) this is a moot distinction.  In languages in which
   pointers can only point to dynamic memory and are the only way to 
   do so (like Pascal and Modula2) this is also an irrelevant distinction.

Let me differ. Fortran has had equivalence and common forever,
and even if dirty tricks are prohibited in theory, most compilers
cannot assume users are well behaved (hey, most fortran programs
depend on subroutine local variables keeping their values between
invocations!). Pascal has 'var' parameters and, more murkily, variant
records without a discriminant.

   In fact, the only use of the "register" attribute is to be able to
   declare variables that can't be aliased

While it is true that guaranteeing the absence of aliases makes
the compiler's job easier in caching variables safely across
statements, the two things are not related. The abominable
'noalias' (and the slightly less horrid 'volatile') proposed
extensions to Classic C had the same purpose, but unsafely, in that
the absence or presence of 'noalias' (and 'volatile') does change the
semantics of a program, while for 'register' this is not true; 'register'
also gives a *positive* hint on usage frequency.

A compiler could watch a function body for cases where a
variable's address is never taken, and thus deduce the 'register'
attribute by itself. 'register' is better, because it allows one
pass compilation, and documents the assumptions made by the
author (e.g. frequency of usage) to both human and compiler.

With 'register', safety (no aliasing) is a side effect, but a
clever one, in more than one respect.
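
A small C illustration of the point:

        int sum(p, n)
            int *p, n;
        {
            register int acc = 0;     /* &acc is illegal, so no pointer  */
            register int i;           /* or call can alias it: it may be */
            for (i = 0; i < n; i++)   /* cached across every statement   */
                acc += p[i];          /* without any analysis at all     */
            return acc;
        }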

   - an irrelevant distinction in MOST languages where variables
   have that property by default.

Actually I am hard pressed to think of widely adopted languages
with side effects in which variables can be safely assumed not to
be aliased. Aleph and Euclid come to mind, but not many more (I have
not thought about Ada, and the exact semantics of 'IN OUT' parameters).

What ruins the alias show for most languages is either separate compilation
or parameter passing, even where pointers are not present. And even if
this were not true, the compiler would still have to guess which of the
safe variables are actually worth caching.

	Naturally this is a moot point on many of today's architectures,
        where register optimization is entirely unnecessary, as there are
        usually far more registers available than variables (unless you really have
	something like multiple threads, e.g. superscalar, or vector, but
	then you are really speaking of multiple register sets).
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (02/06/90)

In article <4466@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

   In article <PCG.90Feb1164820@rupert.cs.aber.ac.uk> 
   pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

   He also gives some examples of how we might notate matrix multiplication
   so that a comparatively simple compiler can generate good code.

Here is the keyword: *simple*. I don't like complex compilers
that second guess me. I like *simple*, obedient, compilers.
Simplicity is the path to reliability and economy. These seem to
be lost esoteric arts :-(.

   My impression has been that the generality of Algol-68 prohibits
   efficient code.  Nevertheless, the example is interesting.

Algol68 is difficult to *parse*, but it has been carefully
designed to make efficient code possible, and indeed even easy.
Array slicing, for example, is there for this reason, as are
other things. Too bad that something like 'fast' is missing (but
'ronly' is there).

   Note that we'd also like to see m*4 in a register
   (assuming row major array ordering and 4 bytes/word) and k itself
   eliminated.  The point being that the "fast" hint has some
   perhaps unexpected side-effects.

In C this is done by stepping a pointer thru an array. Well, I
was assuming a more conventional language. The idea that C
pointer arithmetic is automatically scaled is another win.
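
For instance (a sketch, assuming double a[M][O], b[O][N], c[M][N]):

        register double  e  = 0.0;
        register double *hp = &a[i][0], *hend = hp + O;
        register double *vp = &b[0][j];
        while (hp < hend) {
            e += *hp++ * *vp;   /* ++ advances by sizeof(double) for free */
            vp += N;            /* down column j: the "N*8" never appears */
        }
        c[i][j] = e;

Here k has disappeared into the pointers, and the row-length scaling
that Preston wants kept in a register is folded into the pointer
increments.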

   If you'd throw out the "ronly" and "fast" hints, I'd say these
   forms aren't too bad.  Why do you need to say "ronly" when it's
   fairly obvious to the compiler that h and h_i are not assigned?
   These hints seem like non-abstract intrusions into the middle
   of otherwise fairly declarative code.

I like them because they document assumptions by the programmer
both to human and compiler reader. Both can check them and
exploit them. A program should also document such assumptions,
not just the algorithm. This is the major problem I have for
example with Dijkstra; his books are really about algorithm
design, thinly veiled as programs, not programming.

Programming is a hands on job. Economics *are* important. A
"programmer" will carefully shape an SQL query (and SQL as a
language is quite non procedural and pure) so that it will be the
most efficient of those with equivalent results, taking into
account the query optimization strategies of the front end and
the performance characterization of the DB engine and of the
configuration of the machine. The query optimizer will tend to do
a good job, but for *important* large queries, the
programmer will know better, having more information and
insight (hopefully!).

The question is where the point of diminishing returns lies, not
whether the compiler can extract a further X% speedup from a
source that the programmer has not described carefully as to
pragmatics.

If a *simple* compiler can get 90% of source lines fast enough,
and there is a steep cost in compiler complexity to have it
optimize an extra 5%, and the remaining 5% requires real
intelligence, the tradeoff is clear.

	[ ... about compiling a high level notation into
	a lower level one and applying optimizations to this ... ]

   Note the separation of concerns.  Enforcing the language in the front
   end, improvement (without changing the meaning of the program)
   in the middle, and concern for the machine details in the back end.

Not bad. In outline I tend to agree.

   So what happened to the "fast" hints (and register declarations in C)?
   They get thrown out.  It's easier, more effective, and more complete to

Let me disagree. "more effective" is a strong statement. Armchair
evidence tends to point out that current highly optimizing
compilers get better results with "register" declarations (some
flimsy data points: dhrystones, the matrix multiplication example
recoded in C, etc...).

This tallies with the expectation that most of the hairy
optimization is about irrelevant pieces of code anyhow, and that
current widely available optimizer technology does not take into
account *dynamic* usage, and thus does not know which are the hot
spots.

There are heuristic or profiler-feedback optimizers around
already, but I tend to delude myself that programmers can do the
same job more intelligently and reliably, in the few spots where it
is necessary.

   discover opportunities for optimization during the latter stages
   of compilation.

On this let me disagree too. The hints should not get thrown out;
they are valuable, *quality* information. The compiler writer
should not believe that the compiler can (reliably!) second guess the
author as to the semantic/pragmatic properties of the program.

The author should describe them as clearly as possible, and the
compiler should treasure such information. Of course the author
should be spared (not always! if it is *really* important,
machine dependent considerations should enter into the design of
a program) the work of machine dependent optimization (which is
actually just competent code generation), because there the
compiler is in charge.

In other words, this means that most compiler optimizations
should be about space (which is what a code generator has
direct control over and knowledge of), instruction reordering, and
other architecture dependent details.

I'd claim that this approach results in clearer programs, and in
simpler, faster, more reliable compilers, and object speed need not
suffer, quite the opposite.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (02/06/90)

In article <4467@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

   I'd like a compiler to make code to make the computer do what I want.
   Unfortunately, it can't read minds, so I have to write a program
   to give it a hint.  Since it's so stupid, I have to be fairly 
   specific about how it's to do things like multiplying matrices.

Agreed. We are speaking of *implementation* languages.

   On the other hand, an optimizing compiler allows me to avoid
   too much tedium.

Let me be stroppy: I'd say "allows me sloppy, opaque coding".

   Strength reducers are good enough that I don't need to mess
   with pointer variables. Various forms of redundancy elimination
   are good enough that I needn't bother about every little common
   subexpression.  Vectorizers are good enough so that I need not
   manually insert vector primitives.  Register allocators are
   good enough that I need not scatter register declarations about.

About the "good enough" I am somewhat skeptical (especially about
the register allocators -- but then again, many modern machines
have more registers than variables...). Also, most of the time
these optimizations are irrelevant.

   Constant propagation is strong enough so that I can afford
   to declare some constants in convenient places, without bothering about
   folding every constant expression by hand.

This is *not* an optimization. It does not involve second
guessing the program's semantics. It is just good code
generation. Like removing redundant code, inlining, reordering
expression's operands (where allowed by language rules), etc...

   And all these optimizations work exceptionally well on the
   inner loops that consume so much time.

Not entirely true. Again, at least some armchair evidence tends
to support the idea that hand care does a lot of good. And the
inner loops are rare enough you can devote your attention to them.

   The optimizer is written once, by the compiler writer.
   I, on the other hand, write many programs and I'm able to
   take advantage of the optimizer in all my programs.

This argument remains true for good code generation, and the
machine independent programming it affords. It is even true for
SQL query optimizers, most of the time, and they do *very* high
level planning and strategies; but not always. Even with SQL (or
similar things) you have to do hand optimization in critical
cases.

   It costs compile time, but it's cheaper to use the computer
   to optimize my code than to do it by hand.

It encourages less precise coding, and makes the compiler many
times larger, and these are *very* bad news, and very costly too.
All this to optimize very little pieces of code, which could be
manually sorted out, and with improvement on their information
content for both humans and programs that read them.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@titan.rice.edu (Preston Briggs) (02/07/90)

In article <PCG.90Feb5183407@rupert.cs.aber.ac.uk> 
pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

>Let me disagree. "more effective" is a strong statement. Armchair
>evidence tends to point out that current highly optimizing
>compilers get better results with "register" declarations (some
>flimsy data points: dhrystones, the matrix multiplication example
>recoded in C, etc...).

Which reminds me...
I'm still hoping to see somebody recode matrix multiply in C, using 
all the source hacks they want, so we can compare it to what we get
using Fortran and our (not mine, friends') latest transformations.

We'd run it on a MIPS M/120, with 32 integer registers and 16 fp regs,
so you've got plenty of room for register declarations, pointer variables,
and such.  64K data cache, so plenty of room there.

The Fortran source is

        subroutine mm(A, B, C, n)
        integer n
        real A(n,n), B(n,n), C(n,n)
        do 1 i = 1, n
          do 1 j = 1, n
            A(i, j) = 0.0
            do 1 k = 1, n
1             A(i, j) = A(i, j) + B(i, k) * C(k, j)
        end


Preston Briggs
preston@titan.rice.edu

chase@Ozona.orc.olivetti.com (David Chase) (02/07/90)

In article <PCG.90Feb5185112@rupert.cs.aber.ac.uk> pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:
>   Strength reducers are good enough that I don't need to mess
>   with pointer variables. Various forms of redundancy elimination
>   are good enough that I needn't bother about every little common
>   subexpression.  Vectorizers are good enough so that I need not
>   manually insert vector primitives.  Register allocators are
>   good enough that I need not scatter register declarations about.
>
>About the "good enough" I am somewhat skeptical (especially about
>the register allocators -- but then again, many modern machines
>have more registers than variables...). Also, most of the time
>these optimizations are irrelevant.

Interesting assertion -- do you have references to support this?
Especially, do you have evidence that vectorization is usually
irrelevant?

>It [optimization] encourages less precise coding, and makes the
>compiler many times larger, and these are *very* bad news, and very
>costly too.  All this to optimize very little pieces of code, which
>could be manually sorted out, and with improvement on their
>information content for both humans and programs that read them.

(1) Large costly compilers are only "bad news" if few programs are
developed -- once a compiler is widely used, the extra time spent in
its development is paid back many times over.  (Of course, first you
have to get the compiler into wide use.)

(2) Why do you assume that "precise coding" is good?  Does it get a
program written more quickly, or make the program more maintainable,
or more likely to be correct, or increase its portability?  In
general, the answers are NO, NO, NO, and NO.  At best, "precise
coding" will make code run faster on one machine with one particular
compiler.  A good optimizing compiler can do just as well, with less
time spent doing it.  (Note -- you perhaps regard optimizers as
program-manglers; for C, you have no grounds for complaint, since
(last I heard) the language standard is not yet official -- if you
program in a language with no official definition, you should expect
some variation in the implemented semantics.  Nobody forced that
choice on you.)

A high-powered optimizer does allow programmers to get a program
running efficiently with less time spent thinking about it.  This is
bad?  We should program more slowly?  *Some* people use the extra time
to worry about higher-level issues, like correctness and algorithmic
complexity.  Some programmers aren't that thoughtful -- would you
trust such a person to hand-optimize code?

David

jlg@lambda.UUCP (Jim Giles) (02/07/90)

From article <PCG.90Feb5185112@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [... constant folding ...]
> This is *not* an optimization. It does not involve second
> guessing the program's semantics. It is just good code
> generation. Like removing redundant code, inlining, reordering
> expression's operands (where allowed by language rules), etc...

Now I understand your problem.  You have a confused notion of the meaning
of "semantics" and "optimization".  Semantics refers to the _meaning_ of
the code.  That is: for a given input, what does the program generate
as output?  Optimization is the use of program transformations to
improve code performance _WITHOUT_CHANGING_SEMANTICS_!  If your optimizer
changes the semantics of your code - it's a broken optimizer.*  The
transformations in your list above are _all_ optimizations.  So are
strength reduction, common subexpression elimination, pipelining,
etc..  NONE of these code transformations
involve changes to the program's semantics.  ALL of these techniques fit
into the category of "just good code generation".

Footnote *: The only exception to this is the so-called associative
operators (+, *), which are not, in general, _computationally_
associative.  Languages which allow the compiler to reorder such
operators generally say so _explicitly_ in their documentation and
generally also provide an efficient way to force evaluation order
if it is important.  One could argue that reordering such operators
doesn't change the semantics of the code since such vagueness is part
of the semantics of the operators in the first place.
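
A small C example of what "not computationally associative" means:

        double big = 1.0e20;
        double x = (big + -big) + 1.0;    /* == 1.0                     */
        double y = big + (-big + 1.0);    /* == 0.0: the 1.0 is lost in */
                                          /* rounding -big + 1.0        */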

> [...]
>    It costs compile time, but it's cheaper to use the computer
>    to optimize my code than to do it by hand.
> 
> It encourages less precise coding, and makes the compiler many
> times larger, and these are *very* bad news, and very costly too.

No, coding should _always_ be precise.  Doing common subexpression
elimination by hand is _NOT_ more precise - it is merely more verbose.
Both versions would have _exactly_ the same _meaning_.

> All this to optimize very little pieces of code, which could be
> manually sorted out, and with improvement on their information
> content for both humans and programs that read them.

Manually sorting it out costs programmer time, which _may_ be more
expensive than any possible saving over the life of the program.
Further, being more verbose is not, necessarily, an improvement
in content for the human reader (or, sometimes, not even for
the computer - see my next post).  In fact, the kinds of "sorting
out" that you are recommending won't save _any_ time at all on
state-of-the-art compilers.  I would recommend spending your human
resources on types of optimization that compilers _CAN'T_ yet
do automatically.

J. Giles
> Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
> Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
> Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

jlg@lambda.UUCP (Jim Giles) (02/07/90)

From article <PCG.90Feb5183407@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [...]
> Here is the keyword: *simple*. [...]

I agree.  But it is clear that we have a different definition of the word
"simple".  Adding obfuscating detail to a program in order for the compiler
to optimize it better is _NOT_ a way of simplifying the code - quite the
reverse.  It is sometimes (_only_ sometimes) necessary to do so in order to
meet the efficiency criteria for the program, but I don't have to like it.

> [...]                          I don't like complex compilers
> that second guess me.

If, by "second guess", you mean compilers which change the _meaning_ of
your code, then I agree!  This kind of defect in a compiler is called
"bad code generation" and is widely regarded as the worst defect a
compiler can have.

On the other hand, if you mean that you don't like compilers which
generate efficient code in ways you hadn't thought of - so what?

> [...]                  I like *simple*, obedient, compilers.


The simplest compiler I know of consists of front panel switches (ie. NO
compiler at all).  I like mine a little more capable than that.  As
compiler technology improves, the set of things regarded as _simple_
will increase.  This is all to the good - it means I can concentrate
more on getting the code correct, portable, maintainable, etc..  It means
that the frequency of occasions that I have to hand-optimize for efficiency
will decrease.  In short, it allows me to keep my source code _simple_.

> [...]
> Simplicity is the path to reliability and economy. These seem to
> be lost esoteric arts :-(.

I agree - see above.

> [... array indexing within a loop - strength reduction ...]
> In C this is done by stepping a pointer thru an array. Well, I
> was assuming a more conventional language. The idea that C
> pointer arithmetic is automatically scaled is another win.

Consider the following two code fragments (side by side):

      int a[200][2];                    int a[200][2]; int *p; int *q;
      ...                               ...
      ...                               p = &a[0][0]; q = p+1;
      for (i=0;i<200;i++){              for (i=0;i<200;i++){
        a[i][0] = expr1(i);               *p = expr1(i);
        a[i][1] = expr2(i);               *q = expr2(i);
      };                                  p += 2; q += 2;
                                        };

The first is clearly more readable than the second.  Modern compilers
can easily find the optimizations necessary to make the first run as
fast as the second.  The first also has the important advantage of
clearly stating something that the second obscures: there are no
dependencies between the two assignments!  The algorithm is stepping
through two independent vectors.  Compilers (and people) can easily
see this on the first example and can (for example) vectorize the
first more easily.  The general problem of analysing dependencies
in the second kind of code is VERY difficult - it's clearly and
explicitly stated in the first code fragment.  (I know, I made the
problem harder by looping on the first index instead of the second.
But the other way around won't make general dependency analysis
any easier for the pointered case.)

This is a case where your verbose approach actually decreases the
information content of the code for _both_ the programmer and the
compiler.

J. Giles

raulmill@usc.edu (R D R) (02/08/90)

Preston Briggs writes:
;> I'm still hoping to see somebody recode matrix multiply in C, using 
;> all the source hacks they want, so we can compare it to what we get
;> using Fortran and our (not mine, friends') latest transformations.
;> . . .

While we're on the topics of recoding, and how well source code maps
onto a machine, why not consider a few other mainstream languages
besides C and Fortran??????

I mean, C++ seems to rectify a large number of the problems one sees
with C.  (It allows inline functions, for example.)  You still have
pointers, but g++ (gnu's C++) allows one to declare that a function
has no side effects, allowing the optimizer to fold calls to it.  (gcc
has both of these features too, and for "hello world" programs
generates smaller code, but I'm getting tired of only hearing about c
vs FORTRAN.) 

On the other hand, there are languages like APL which with a few
changes (like variable scoping, and function rank declaration) ought
to make very nice compiling languages.  /* APL implementation of
matrix multiply:  A <- B +.x C */.  From what I've heard, it is
possible to DEVELOP code in APL about 6 or 7 times faster than what is
usual with FORTRAN.  (Weren't the two big advantages of FORTRAN over c
supposed to be that FORTRAN is faster to write, and FORTRAN allows
more optimizations in compilation, especially as machine architectures
change???)

--
Raul Rockwell
INTERNET:   raulmill@usc.edu                       !
UUCP:       ...uunet!usc!raulmill                  !  55 mph = 82 nc
U.S.SNAIL:  721 E Windsor #4,  GLENDALE CA  91205  !

brnstnd@stealth.acf.nyu.edu (02/08/90)

Piercarlo is entirely correct. If you care about ``information hiding''
and the ``purity'' of your code more than efficiency, you can't expect
fast code. If you can't bear to put your original code in comments and
write an efficient version with temporary variables, you don't deserve
fast code.

In article <2217@sunset.MATH.UCLA.EDU> pmontgom@math.ucla.edu (Peter Montgomery) writes:
> 	In my programs, common subexpressions often appear 
> as a result of macro expansions.  Others are not visible at the
> source level, but arise when producing the object code.

When efficiency is important, write out the expressions. If the efficient
version is different on different machines, write it out separately for
each. There is a limit to what the compiler can optimize; a competent
programmer assumes merely that the compiler understands scheduling and
register allocation, and does the rest of the job by hand.

> 	There was recently a discussion in comp.arch about the lack of
> an integer multiply instruction on the SPARC, and how to circumvent it.
> In one of my programs (for multiple precision arithmetic), the most 
> time-critical inner loop requires four sums of products (all variables integer)
  [ the expressions he wants, and how he implements them on the SPARC
    with macros referencing a table of triangular numbers ]

Actually, the code would be faster if you used squaring rather than
triangular numbers. Do you expect the compiler to figure this out, or
should the programmer perform this optimization?

Using squaring as the foundation for multiprecision multiplication
(i.e., to multiply arrays a and b, square a+b, square a-b, subtract, and
divide by 4) saves a lot of space (and code). Do you expect the compiler
to figure this out, or should the programmer perform this optimization?
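
For concreteness, the word-level version of the squaring idea: since
(a+b)^2 - (a-b)^2 = 4ab, one table of squares replaces the triangular
numbers. A sketch, with SQ a hypothetical table where SQ[n] = n*n:

        /* requires a >= b >= 0 and a+b within the table */
        #define MUL(a, b)  ((SQ[(a)+(b)] - SQ[(a)-(b)]) >> 2)

The array-level version described above is the same identity applied
to the multiprecision numbers themselves.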

> it would defeat information hiding if I expanded everything
> there and removed the common subexpressions explicitly.

Tough. Efficiency demands code space. You can leave the ``pure''
expressions in the code, commented out by #ifdef SLOW_N_PORTABLE.
Supposedly optimizing compilers don't remove the burden of thought
from programmers.

  [ more optimizations, based on loop invariants ]
> I cannot make these optimizations at the source level.

Again, you can and should perform this optimization by hand, sacrificing
your macros. Feel free to leave the original code there, but don't pretend
that the compiler can figure things out better than you can.

> 	Even on systems using the first macro expansion 
> (in terms of multiplications), there may be hidden common subexpressions.  

So don't trust Cray's compiler to figure out that it should store the fp
variables. How can the compiler tell the difference between a bottleneck
and a loop that's executed only twice? How can the compiler know whether
variable a is used more than variable b in a function? How can the compiler
know whether you prefer code space, compiling effort, memory, or time?

---Dan

nick@lfcs.ed.ac.uk (Nick Rothwell) (02/08/90)

In article <838.18:06:33@stealth.acf.nyu.edu>, brnstnd@stealth writes:
>Piercarlo is entirely correct. If you care about ``information hiding''
>and the ``purity'' of your code more than efficiency, you can't expect
>fast code. If you can't bear to put your original code in comments and
>write an efficient version with temporary variables, you don't deserve
>fast code.

This is rubbish, check out the performance of New Jersey ML
(high-level functional language, efficient optimised native code
generation).

>---Dan

		Nick.
--
Nick Rothwell,	Laboratory for Foundations of Computer Science, Edinburgh.
		nick@lfcs.ed.ac.uk    <Atlantic Ocean>!mcvax!ukc!lfcs!nick
~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~ ~~
		       ...als das Kind, Kind war...

firth@sei.cmu.edu (Robert Firth) (02/08/90)

In article <838.18:06:33@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>Piercarlo is entirely correct. If you care about ``information hiding''
>and the ``purity'' of your code more than efficiency, you can't expect
>fast code. If you can't bear to put your original code in comments and
>write an efficient version with temporary variables, you don't deserve
>fast code.

As somebody who in his time has had to write and maintain substantial
quantities of numerical software, I emphatically disagree.

We are talking here about five or ten page subroutines containing
some pretty complicated algebra.  Radar antenna design programs,
for instance, can easily have expressions extending over three or
four lines of text.

This code is hard to maintain at best, but one precondition for
readability and maintainability is that, as far as possible, the
formula written in Algol (or whatever) look as much like the formula
in the book as possible.  No other way can you read it, check it,
or understand it.

If this means that I call sin(a[i,j-1]) six times in a single
expression, so be it.  Optimising the code, extracting common
subexpressions, calling pure functions only once, and all that
stuff, is the job of the compiler.  That's why it's there.  That's
why the first major higher-level language was called FORmula TRANslator.

Don't make people fit machines; make machines fit people.

cks@white.toronto.edu (Chris Siebenmann) (02/09/90)

brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
| When efficiency is important, write out the expressions. If the
| efficient version is different on different machines, write it out
| separately for each. There is a limit to what the compiler can
| optimize; a competent programmer assumes merely that the compiler
| understands scheduling and register allocation, and does the rest of
| the job by hand.

 I would say instead that a competent programmer would first get it
correct, then measure things to see whether or not they need speeding
up, and then look at the various mechanisms for doing so. Always
know where your bottlenecks are; it makes little sense to shave
milliseconds off a loop when every iteration spends half a second
doing a system call.

 It's true that the compiler system can do noticeably better with hints
of some sort from the user. However, there is little need for these
hints to be "register" declarations and unrolled expressions and so
on; feed the compiler system profile run output instead (see, for
example, the DECWRL paper(s) on link-time register allocation,
especially the one that compares it to register windows).

 Does anyone have numbers or hard information on the quality of
current compilers (especially C compilers) and how much one can
improve things by tweaking the source code?

--
	"I shall clasp my hands together and bow to the corners of the world."
			Number Ten Ox, "Bridge of Birds"
cks@white.toronto.edu		   ...!{utgpu,utzoo,watmath}!utcsri!white!cks

jlg@lambda.UUCP (Jim Giles) (02/09/90)

From article <838.18:06:33@stealth.acf.nyu.edu>, by brnstnd@stealth.acf.nyu.edu:
> Piercarlo is entirely correct. If you care about ``information hiding''

"Information hiding" is NOT part of the subject matter of this thread
in the newsgroup.  We can talk about "information hiding" if you like,
but it is a different subject.  ("Information hiding" refers to various
scoping rules which allow codes to share data without making the data
_global_.  In C, 'static' variables which are declared at the beginning
of a file are shared by all the procedures within that file and are not
seen outside the file - THIS is a form of information hiding.)

On the other hand: if, by "information hiding", you really mean
"obfuscating code" - I can't think of a quicker way of doing it
than putting in the kinds of optimizations recommended by Peter
Grandi.

> [...]
> and the ``purity'' of your code more than efficiency, you can't expect
> fast code.

Agreed.  If clarity, correctness, and maintainability are more important
than speed, then simple, clear code is the proper choice.  However, I CAN
still expect better speed performance from a good compiler than from the
kind Peter Grandi recommends.  In fact, for the types of optimizations
that we have been discussing, modern compilers can be expected to do as
well as the hand optimized stuff.  There are, indeed, other types of
optimization which compilers cannot yet do (at least not well) - these
are the proper targets of programmer effort (if anything is).

> [...]       If you can't bear to put your original code in comments and
> write an efficient version with temporary variables, you don't deserve
> fast code.

Temporary variables aren't, necessarily, the best way to achieve
improvement.  As compilers get better, they may become even _more_
irrelevant.  Furthermore, fast code is _NOT_ the only criterion
for measuring the quality of the programming.  Everyone has heard
(or should have) of the 90%/10% rule - 90% of the execution time
is in 10% of the code.  Programmer effort in that 10% might be
cost effective.  But to try to optimize the last cycle out of the
whole code is almost certainly NOT cost effective.  Programmers
generally cost about $100-$200 a day!  If it takes a day to optimize
a procedure which saves an hour of SUN workstation time over the
life of the code - I've WASTED a couple of hundred bucks!

For MOST things, the important thing is to get the code written and
correct.  Speed is, almost always, a secondary consideration.  (This
is not to say that speed isn't important - it _usually_ isn't - but
when it is, it REALLY IS.  Unfortunately, when speed REALLY IS
important, it is usually more cost effective to switch directly
to assembly than to mess around with half-way measures in a high-
level language.  The speed sensitive code is usually a small part
of the whole program and can be recoded in assembly with gratifying
results.)

J. Giles

brnstnd@stealth.acf.nyu.edu (02/09/90)

I've answered the ``maintainability'' objection before: Don't throw out
your original code after you've optimized it. Stick it inside #ifdef
SLOW_N_NATURAL or whatever your favorite language uses for comments.
If you change the original, ``natural'' code, throw away the efficient
version and rewrite it from scratch.

In article <6004@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
> This code is hard to maintain at best, but one precondition for
> readability and maintainability is that, as far as possible, the
> formula written in Algol (or whatever) look as much like the formula
> in the book as possible.  No other way can you read it, check it,
> or understand it.
> 
> If this means that I call sin(a[i,j-1]) six times in a single
> expression, so be it.

	{ register double t = sin(a[i][j-1]);
	  x = t * foo(t,t+1) + bar(t,t-1) - t; }
#ifdef SLOW_N_NATURAL
	x = sin(a[i][j-1]) * foo(sin(a[i][j-1]),sin(a[i][j-1])+1)
	     + bar(sin(a[i][j-1]),sin(a[i][j-1])-1) - sin(a[i][j-1]);
#endif

As I understand the efficient code much better than the slow 'n' natural
code, I would leave the original out entirely. In fact, I doubt any
programmer would need to duplicate more than one percent of a program
in this fashion.

> Optimising the code, extracting common
> subexpressions, calling pure functions only once, and all that
> stuff, is the job of the compiler.  That's why it's there.  That's
> why the first major higher-level language was called FORmula TRANslator.

The compiler is allowed to help but it simply isn't as good as a human.

> Don't make people fit machines; make machines fit people.

Unfortunately, researchers in artificial intelligence report that we
have absolutely no idea how to make machines fit people. 

---Dan

preston@titan.rice.edu (Preston Briggs) (02/09/90)

In article <5453:23:28:32@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>In article <2115@castle.ed.ac.uk> nick@lfcs.ed.ac.uk (Nick Rothwell) writes:
>> This is rubbish, check out the performance of New Jersey ML
>> (high-level functional language, efficient optimised native code
>> generation).

>Can it figure out, say, Knuth's algorithm for computing the delta2 table
>in Boyer-Moore string searching, starting from a dumb algorithm for the
>same job?
>
>Humans can optimize better than computers. This will be true for a long
>time. I chose the above example randomly, but it'll serve well enough as

You've made a big step from eliminating common subexpressions to
understanding Knuth (anything by Knuth).  
One is a trivial task; the other requires "thought."  
Computers are good at trivial tasks;
humans, on the other hand, tend to fumble trivial tasks.

Optimization in a compiler isn't finding a good algorithm.
"Optimization" is a misnomer.  It's transforming the program to run 
more quickly.  You never get asymptotic improvements.  
You don't see discovery of new algorithms.

On the other hand, optimization lets me avoid having to express 
tedious details that can be discovered automatically (trivially).  
It lets me write matrix multiply (or whatever) in a short time, 
with good performance, instead of trying to squeeze 
the last ounce out of the implementation.

Hopefully you can put the saved time to use discovering better algorithms.
Good old separation of concerns -- the computer does what it's good at
and I do what I'm good at.

As to "optimizing better than compilers", have you started working on 
a version of matrix multiply that will beat our Fortran?  Good luck!

Preston Briggs
preston@titan.rice.edu

brnstnd@stealth.acf.nyu.edu (02/09/90)

In article <2115@castle.ed.ac.uk> nick@lfcs.ed.ac.uk (Nick Rothwell) writes:
> In article <838.18:06:33@stealth.acf.nyu.edu>, brnstnd@stealth writes:
> >Piercarlo is entirely correct. If you care about ``information hiding''
> >and the ``purity'' of your code more than efficiency, you can't expect
> >fast code. If you can't bear to put your original code in comments and
> >write an efficient version with temporary variables, you don't deserve
> >fast code.
> 
> This is rubbish, check out the performance of New Jersey ML
> (high-level functional language, efficient optimised native code
> generation).

Are you saying that your compiler is smarter than you are or just that
it's not as lazy?

Can it figure out, say, Knuth's algorithm for computing the delta2 table
in Boyer-Moore string searching, starting from a dumb algorithm for the
same job?

Humans can optimize better than computers. This will be true for a long
time. I chose the above example randomly, but it'll serve well enough as
a goal that I won't live to see computers reach.

---Dan

jlg@lambda.UUCP (Jim Giles) (02/09/90)

From article <5453:23:28:32@stealth.acf.nyu.edu>, by brnstnd@stealth.acf.nyu.edu:
> [... Good ML compiler ...] 
> Can it figure out, say, Knuth's algorithm for computing the delta2 table
> in Boyer-Moore string searching, starting from a dumb algorithm for the
> same job?
> 
> Humans can optimize better than computers. This will be true for a long
> time. I chose the above example randomly, but it'll serve well enough as
> a goal that I won't live to see computers reach.


Good example.  It demonstrates the point I've been making all along!
There are optimizations that the compiler can't yet do well.  There _may_
be optimizations that compilers will _never_ be able to do.  It is these
types of things that programmers should spend their effort on.

However, your example does NOT support your claim that programmers should
do strength-reduction or array indexing by hand.  These are things that
compilers of today CAN do reasonably well.  They are also things which
decrease the clarity of the code.

All your example shows is that good choice of algorithm is worth more
than penny-pinching optimizations.  This old platitude is the commonplace
of programming.  So, what have you said that's new?

J. Giles
> 
> ---Dan

pcg@aber-cs.UUCP (Piercarlo Grandi) (02/10/90)

In article <48942@ricerca.UUCP> David Chase <chase@orc.olivetti.com> writes:
    >About the "good enough" I am somewhat skeptical (especially about
    >the register allocators -- but then again, many modern machines
    >have more registers than variables...). Also, most of the time
    >these optimizations are irrelevant.
    
    Interesting assertion -- do you have references to support this?

Plenty. Several sources, most recently J. Mashey discussing
exception handling on the MIPS in comp.arch, state that it is
rare to need more than 24-28 registers to cache *all* variables
(a particularly funny example is the Pyramid, which has over
fifty registers; some researcher found that in all the programs
they tried never more than half could ever be used).

Some references (the DEC WRL ones for the Titan) demonstrate
that if you cache only the frequently *used* variables, the knee
of the curve is at the 8-12 register level at most.

There are many papers that confirm the idea that most
(traditionally 80%) of a program has a negligible or small
enough effect on the total running time that optimizing it is
irrelevant, either by hand or mechanically.

    Especially, do you have evidence that vectorization is usually
    irrelevant?

I never said so. On the other hand I think that vectorization is
a program transformation problem, and the 'compiler' proper
should have nothing to do with it. Vectorization is an issue
only because for historical reasons people want to continue
using 'implementation' languages as 'application' ones. 

Dusty decks rule the waves, unfortunately. On the other hand,
Knuth's observations on this problem, in Vol. 1 of the Art of
Comp. Prog., are still valid, I think (in short: rewrite!).

    (1) Large costly compilers are only "bad news" if few programs are
    developed -- once a compiler is widely used, the extra time spent in
    its development is paid back many times over.  (Of course, first you
    have to get the compiler into wide use.)

First you have to get the compiler generating correct code. I
may be wrong :-), but the number of bugs in a complicated piece
of software grows more than linearly with its size. A compiler
is possibly one of the most critical components of your
toolset.

I am a (young) fogey, and tend to think that in such cases
prudence advises against making such a crucial component as
complex as your address space allows you to.
    
    (2) Why do you assume that "precise coding" is good?  Does
    it get a program written more quickly, or make the program
    more maintainable, or more likely to be correct, or
    increase its portability?  In general, the answers are NO,
    NO, NO, and NO.

Let me differ. A program that requires the reader to second
guess the author is not going to be very maintainable, is not
very likely to be correct, is not likely to be very portable.
It may be *coded* more quickly, oh yes :-(.

Making all your assumptions obvious, documenting them in the
program text itself (comments can be dangerous), is one of the
tenets of disciplined software engineering. I used to like the
'lint' directive (e.g. /* NOTREACHED */) approach, for example.
A great aid, both for lint and for the human reader.
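
For instance (a small sketch only, using the classic directives; the
function itself is invented):

	#include <stdio.h>
	#include <stdlib.h>

	/* ARGSUSED */
	void fatal(char *msg, int code_unused)
	{
	    (void) fprintf(stderr, "fatal: %s\n", msg);
	    exit(1);
	    /* NOTREACHED */
	}

The /* ARGSUSED */ and /* NOTREACHED */ lines state the assumptions
right where they matter, in a form that both lint and the reader
understand.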

    Some programmers aren't that thoughtful -- would you
    trust such a person to hand-optimize code?

If you reckon that your optimizing compiler is less buggy and
more intelligent than your programmer, all that I say blows up.
I have to admit that this is usually the case, though. On the
other hand, I think that the problem is better solved thru
programmer training (but we all know about the realities of
deskilling) or via programmer's assistants, which should not be
made parts of the compiler -- for the same reason for which
'lint' is not part of the compiler. I would dearly love to have
an "active" lint-like tool.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (02/10/90)

In article <RAULMILL.90Feb7094338@alcor.usc.edu> raulmill@usc.edu (R D R) writes:
    
    On the other hand, there are languages like APL which with a few
    changes (like variable scoping, and function rank declaration) ought
    to make very nice compiling languages.  /* APL implementation of
    matrix multiply:  A <- B +.x C */.  From what I've heard, it is
    possible to DEVELOP code in APL about 6 or 7 times faster than what is
    usual with FORTRAN.

Well, this may be too optimistic, as APL has some obscurities
as well. On the other hand, APL is an application level
language, and here I have no qualms about optimizing it, just
as with SQL in another application context.

However, even languages at the correct level of abstraction for
the application, for which the compiler can use interesting
strategies for optimization, occasionally require the
programmer to pay special attention, if the programmer knows
better than the optimizer. But thank goodness this is rarer.

There are many ways to code matrix multiplication; I'd rather
not write one and then expect the compiler to transform it into
another one that it thinks more appropriate, but just use the
high level concept and then the compiler can, more safely, do
what it wants. Unless of course I am the implementor of the
matrix multiplication module, where I want the compiler to
respect my code obediently.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (02/10/90)

In article <6004@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
    
    We are talking here about five or ten page subroutines containing
    some pretty complicated algebra.  Radar antenna design programs,
    for instance, can easily have expressions extending over three or
    four lines of text.
    
    This code is hard to maintain at best, but one precondition for
    readability and maintainability is that, as far as possible, the
    formula written in Algol (or whatever) look as much like the formula
    in the book as possible.  No other way can you read it, check it,
    or understand it.

Let me suggest that your book's math is poorly notated. The
problems I was talking about exist in maths as well -- a lot of
maths is very poorly notated: it is hard to read, and what's
going on is less obvious than it should be. Mathematicians are
not the only ones that think that a pageful of index-rich
formulas is macho. Physicists seem to be more concerned about
the quality of the notation they use, which I appreciate.

Too bad that, because of this, maths is often as full of bugs
(even typographical ones) as programs are, or more so, even
though it is supposedly more 'abstract'.

    If this means that I call sin(a[i,j-1]) six times in a single
    expression, so be it.

Most probably that term also has a meaning in the formula in
the book, and it should have been made evident, to aid
comprehension.
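
Something along these lines, say (only a sketch; the name `phase' is a
stand-in for whatever the term really means, and foo/bar/a/x are
borrowed from the example earlier in the thread):

	{
	    double phase = sin(a[i][j-1]);  /* the term, named for what it means */

	    x = phase * foo(phase, phase+1) + bar(phase, phase-1) - phase;
	}

The reader then sees at once that the six calls were one and the same
quantity, and roughly what that quantity stands for.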

Moreover, again, if all else fails, maybe a programmer's
assistant could be employed profitably. I emphatically don't
think that it should be part of the code generator, though.

-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (02/10/90)

In article <2115@castle.ed.ac.uk> nick@lfcs.ed.ac.uk (Nick Rothwell) writes:
    In article <838.18:06:33@stealth.acf.nyu.edu>, brnstnd@stealth writes:
    >Piercarlo is entirely correct. If you care about ``information hiding''
    >and the ``purity'' of your code more than efficiency, you can't expect
    >fast code. If you can't bear to put your original code in comments and
    >write an efficient version with temporary variables, you don't deserve
    >fast code.
    
    This is rubbish, check out the performance of New Jersey ML
    (high-level functional language, efficient optimised native code
    generation).

Hey, this is not fair! You are comparing a cleanly designed,
high level language with C and Fortran here.

If you can roll your own, and make it clean, everything changes.
I don't like functional languages, but ML is pretty good. I used
to like Aleph very much, a CWI language based on CDL, which is in
turn based on two-level/attribute grammars. Very clever, very
clean design.
Algol 68, for many applications, is not bad either. APL, SQL,
ICON, etcetera.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

ske@pkmab.se (Kristoffer Eriksson) (02/10/90)

In article <4616@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>I'm still hoping to see somebody recode matrix multiply in C, using 
>all the source hacks they want, so we can compare it to what we get
>using Fortran and our (not mine, friends') latest transformations.
>
>We'd run it on a MIPS M/120, with 32 integer registers and 16 fp regs,

>The Fortran source is
>
>	subroutine mm(A, B, C, n)
>	do 1 i = 1, n
>	  do 1 j = 1, n
>	    A(i, j) = 0.0
>	    do 1 k = 1, n
>1	      A(i, j) = A(i, j) + B(i, k) * C(k, j)
>	end

I'm not a Fortran programmer, so I am not sure exactly how this routine
works. I guess that A, B, and C are arrays and n is the maximum index in
these arrays that is to participate in the multiply, and that the array
bounds are automatically passed to the routine, without explicit mention.

That is not "natural" in C, thus we have the problem of what to do instead...

Most C compilers don't accept variable size array arguments. The simplest
way to code exactly the same functionality without that facility, I think,
is to use a macro. The compiler will have the information about the array
bounds available for the arrays the macro is used on. Each of the three
arrays can have a different size, and they need not coincide with n. Maybe
you never would use it that way, but it is the same as what I suppose
the Fortran routine does.

----cut----
#define mm(A, B, C, n)					\
    {							\
	register idxsize k, j, i;			\
	register float acc;				\
							\
	for(i = 0; i < n; ++i) {			\
	    for(j = 0; j < n; ++j) {			\
		acc = 0.0;				\
		for(k = 0; k < n; ++k)			\
			acc += B[i][k] * C[k][j];	\
		A[i][j] = acc;				\
	    }						\
	}						\
    }

/* Simple test case */
typedef short idxsize;
float a[22][16], b[17][32], c[45][16];
main()
{
    mm(a,b,c,16);
}
----cut----

I don't bother using pointers (or reversing the looping direction). I don't
see any reason why the compiler should not have the possibility of optimizing
the array indexing on its own. Probably many compilers don't do that, but I
think it should in principle be possible for the C compiler to optimize this
routine in the same way as the Fortran compiler does. (This being a macro,
the C compiler will even generate in-line code and use constant array sizes,
even without global code analysis.) I don't know about MIPS' compilers.

If I instead decide that n gives the array size (maximum indexes) for all
arrays, I can write a simple function for it:

----cut----
/* Simple test case */
typedef short idxsize;
float a[16][16], b[16][16], c[16][16];
main()
{
	mm(&a[0][0], &b[0][0], &c[0][0], 16);
}

/* Matrix multiply. */
mm(A, B, C, n)
    register n;
    register float *B, *C;
    float *A;
{
    register idxsize k, j, i;
    register float acc;

    for(i = 0; i < n; ++i) {
	for(j = 0; j < n; ++j) {
	    acc = 0.0;
	    for(k = 0; k < n; ++k)
		acc += B[i * n + k] * C[k * n + j];
	    A[i * n + j] = acc;
	}
    }
}
----cut----

I could write a function that takes differing array sizes too. It isn't
very hard, but there are many ways to do it to choose among.

If nothing else works, it should be possible to explicitly code most
of the things the Fortran compiler might do automatically. That's the
character of C, for good or worse.
-- 
Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden
Phone: +46 19-13 03 60  !  e-mail: ske@pkmab.se
Fax:   +46 19-11 51 03  !  or ...!{uunet,mcvax}!sunic.sunet.se!kullmar!pkmab!ske

ske@pkmab.se (Kristoffer Eriksson) (02/10/90)

In article <14229@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>From article <PCG.90Feb5183407@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
>> [... array indexing within a loop - strength reduction ...]
>> In C this is done by stepping a pointer thru an array. ... The idea that C
>> pointer arithmetic is automatically scaled is another win.
>
>Consider the following two code fragments (side by side):
>
>      int a[200][2];                    int a[200][2]; int *p; int *q;
>      ...                               ...
>      ...                               p = &a; q = p+1;
>      for (i=0;i<200;i++){              for (i=0;i<200;i++){
>        a[i][0] = expr1(i);               *p = expr1(i);
>        a[i][1] = expr2(i);               *q = expr2(i);
>      };                                  p += 2; q += 2;
>                                        };
>
>The first is clearly more readible than the second.  Modern compilers
>can easily find the optimizations necessary to make the first run as
>fast as the second.  The first also has the important advantage of
>clearly stateing something that the second obscures: there are no
>dependencies between the two assignments!  The algorithm is stepping
>through two independent vectors.

Not that I necessarily agree with the first poster about always using
pointers for array loops, but why would you use *two* pointers? It
would be much more "natural" and less obscure this way:

int a[200][2]; int *p;
p = &a[0][0];			/* Start from the first int of a. */
for (i = 0; i < 200; i++) {
    *(p++) = expr1(i);
    *(p++) = expr2(i);
};

or this way:

int a[200][2];  /* Array of 200 arrays of two int. */
int (*p)[2];	/* Pointer to array of two int. */
p = a;		/* Start with the first array of two int of a. */
for (i = 0; i < 200; i++) {
    (*p)[0] = expr1(i);
    (*p)[1] = expr2(i);
    p++;	/* Note automatic scaling. */
}

>  (I know, I made the
>problem harder by looping on the first index instead of the second.

Did you? I rather think it is the other index that would be somewhat
harder, and could possibly motivate the use of two pointers.

>This is a case where your verbose approach actually decreases the
>information content

No need to be just that verbose...
-- 
Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden
Phone: +46 19-13 03 60  !  e-mail: ske@pkmab.se
Fax:   +46 19-11 51 03  !  or ...!{uunet,mcvax}!sunic.sunet.se!kullmar!pkmab!ske

chase@Ozona.orc.olivetti.com (David Chase) (02/10/90)

In article <1625@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>If you reckon that your optimizing compiler is less buggy and
>more intelligent than your programmer, all that I say blows up.
>I have to admit that this is usually the case, though. 

Thanks -- that's one point on which lots of this rests.  I think that
one root of our apparent difference of opinion is cultural -- I've
got a background in compiling, and in how optimizations Ought To Be Done
(correctly, to start with).  You seem to enjoy programming in C, and
(in my experience, though things appear to be improving lately) (1) C
compilers have bugs (2) the bugs go unfixed (3) C has been defined by
its compilers, so one person's "bug" is another person's "undocumented
feature".  IF there's someone (the vendor, perhaps?) responsible for
repairing the compiler when bugs are found, THEN it is probably more
cost-effective to repair the compiler than to (try to) educate an army
of programmers in the way of "precise coding".  Now, if you live in a
world of perpetually broken and useless optimizers, then C is not such
a bad way to go, but I'd rather repair the compilers.

Note, too, that people have been writing compilers for some time, and
there have actually been improvements in technology.  There's
table-driven instruction selection (not as useful on a RISC, but
there's still those Other Chips out there), and the algorithms for
doing things like (global) constant propagation and loop invariant
code motion are getting cleaner and less ad hoc.  There's reasonable
algorithms for optimal instruction scheduling (Simons and someone
else, POPL 1990) for a reasonable class of machines.  (Note -- latency
and block loads are one thing that will suck up a good many registers.
"Vectorization" isn't just for vector machines.)

The best argument I know against hand-optimizing your code is that
programmers often don't know what runs fast on the machine that
they're compiling to, and they certainly don't know what will run fast
on the machine that their program will run on 5 years from now.  The
people at Ardent spent some time working on analysis to "deoptimize"
those fucked-up little loops that C programmers are so fond of
writing, so that they could vectorize them (Allen and someone, SIGPLAN
PLDI, 1988).
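
By way of illustration only (this is not code from that paper, just the
flavour of the thing):

	/* Hand-"optimized" pointer form: harder for a vectorizer
	   to analyze.                                             */
	void scale_ptr(float *x, float *y, int n, float c)
	{
	    while (n-- > 0)
	        *y++ = c * *x++;
	}

	/* Plain subscripted form: trivially recognizable as a
	   vector operation.                                       */
	void scale_vec(float *x, float *y, int n, float c)
	{
	    int i;

	    for (i = 0; i < n; i++)
	        y[i] = c * x[i];
	}

The analysis has to recover the second form from the first before it
can vectorize anything.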

David

brnstnd@stealth.acf.nyu.edu (02/10/90)

In article <14235@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
> Good example.  It demonstrates the point I've been making all along!
> Ther are optimizations that the compiler can't yet do well.  There _may_
> be optimizations that compilers will _never_ be able to do.  It is these
> types of things that programmers should spend their effort on.

Finally, we agree! The argument is one of degree: how much time to invest
in optimizing various ``levels'' of a program. You trust your compilers
to do a good job of strength reduction, array indexing, common subexpression
elimination, and so on, so you don't worry about them. I'm more cynical,
so I write things like #ifdef HANDOPT ... #else ...  #endif in critical
sections of a program. (As long as I remain consistently better at this
than my machine's optimizer, I'll keep doing it.)

In any case, neither of us bothers with trivial optimizations, and both
of us worry about major changes.

  [ hand optimizations ]
> They are also things which
> decrease the clarity of the code.

How? 

In one program using FFT-related methods I use a reverse(n,bits) function.
n can be very large, up to 2^30. First I wrote out the obvious code. Now
it's in comments, replaced by very fast code and a precomputed 128K table.
You might just leave the slow version, as it only took 10% of the program's
time before optimization.
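
The fast version is built on the usual table trick. In sketch form only --
this is not the actual code, and the 16-bit split and the names are just
for illustration:

	#include <stdio.h>

	/* Precompute the bit reversal of every 16-bit value once:
	   64K shorts, i.e. the 128K table mentioned above.        */
	static unsigned short revtab[65536L];

	static void init_revtab(void)
	{
	    unsigned long i, r;
	    int j;

	    for (i = 0; i < 65536L; i++) {
	        r = 0;
	        for (j = 0; j < 16; j++)
	            if (i & (1L << j))
	                r |= 1L << (15 - j);
	        revtab[i] = (unsigned short) r;
	    }
	}

	/* Reverse the low `bits' bits of n (bits <= 32): look up the
	   two 16-bit halves and swap them.                           */
	static unsigned long reverse(unsigned long n, int bits)
	{
	    unsigned long full = ((unsigned long) revtab[n & 0xffff] << 16)
	                       | revtab[(n >> 16) & 0xffff];

	    return full >> (32 - bits);
	}

	int main(void)
	{
	    init_revtab();
	    printf("%lu\n", reverse(6UL, 3));   /* 110 reversed is 011: prints 3 */
	    return 0;
	}

The table is filled once; after that every call is two lookups, a
couple of shifts, and an OR.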

How is my version any less clear than the original? If you want to
understand the ``real'' workings of the code, the original is still
there for you to see. It's in comments, but it's there.

You shouldn't criticize hand optimization. You should criticize the
poor coding practice of throwing away the unoptimized version.

---Dan

sommar@enea.se (Erland Sommarskog) (02/11/90)

Kristoffer Eriksson (ske@pkmab.se) writes:
 >Jim Giles (jlg@lambda.UUCP) writes:
 >>      int a[200][2];                    int a[200][2]; int *p; int *q;
 >>      ...                               ...
 >>      ...                               p = &a; q = p+1;
 >>      for (i=0;i<200;i++){              for (i=0;i<200;i++){
 >>        a[i][0] = expr1(i);               *p = expr1(i);
 >>        a[i][1] = expr2(i);               *q = expr2(i);
 >>      };                                  p += 2; q += 2;
 >>                                        };
 >>
 >>The first is clearly more readible than the second.  ...
 >
 >Not that I necessarily agree with the first poster about always using
 >pointers for array loops, but why would you use *two* pointers? It
 >would be much more "natural" and less obscure this way:
 >
 >int a[200][2]; int *p;            
 >p = &a[0][0];			/* Start fromthe first int of a. */
 >for (i = 0; i < 200; i++) {
 >    *(p++) = expr1(i);
 >    *(p++) = expr2(i);
 >};

OK, I don't speak C, maybe this is a standard approach once you
know the language, but I find Kristoffer's proposal to be at
least as obscure as Jim Giles' right-hand example. I fail to
see any problems with the left-hand fragment which is the obvious
way to write it. The other two look as if the programmer's been
training for that obfuscated (or whatever the word is) C contest.

 >or this way:
 >
 >int a[200][2];  /* Array of 200 arrays of two int. */
 >int (*p)[2];	/* Pointer to array of two int. */
 >p = a;		/* Start with the first array of two int of a. */
 >for (i = 0; i < 200; i++) {
 >    (*p)[0] = expr1(i);
 >    (*p)[1] = expr2(i);
 >    p++;	/* Note automatic scaling. */
 >}

Slightly more understandable, but still obscure. It seems to assume
that the dimensions are stored in a certain order. Although this
probably is defined by the language, it still seems unnecessary
to rely on such a fact. Anyway, the human reader has to think
a while before he can convince himself that the fragment is correct.
-- 
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
Unix is a virus.

ske@pkmab.se (Kristoffer Eriksson) (02/13/90)

In article <767@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>Kristoffer Eriksson (ske@pkmab.se) writes:
> >Not that I necessarily agree with the first poster about always using
> >pointers for array loops, but why would you use *two* pointers? It
> >would be much more "natural" and less obscure this way:
> >
> >int a[200][2]; int *p;
> >p = &a[0][0];			/* Start from the first int of a. */
> >for (i = 0; i < 200; i++) {
> >    *(p++) = expr1(i);
> >    *(p++) = expr2(i);
> >};
>
>OK, I don't speak C, maybe this is a standard approach once you
>know the language,

Yes, it is.

> but I find Kristoffer's proposal to be at
>least as obscure as Jim Giles' right-hand example.

Jim complained that his own example obscured the fact that expr1() is
stored only in a...[0], and expr2() only in a...[1], or at least that
is what I think he meant. But the only thing that obscured that (in
my view) was his use of two independent pointers to step through [0]
and [1] respectively. It's hard to see that they don't step on each
other. That is what I perceived as his main point with the example.
So I showed that the example was badly written.

In my example above, one can easily see that expr1() and expr2() will never
be stored on top of each other, since the loop just steps sequentially
through the whole a[][] array (that should be easy to see for any C
programmer at least). Not as easy as in Jim's left-hand example, but
*that was not the point*!

(And if you found the initial assignment to p (p = &a[0][0]) hard to
read, that's no argument against my example, because Jim's should have
read so too. Just saying "p = &a" is wrong. And besides, "&a[0][0]" is
not that hard to read. It just takes the address of the first element
of a.)

(And if you find "++" hard to read, that only shows you are not used to C.)

> I fail to
>see any problems with the left-hand fragment which is the obvious
>way to write it.

You apparently did not read my very first sentence:
>>Not that I necessarily agree with the first poster about always using
>>pointers...

I, too, find the left-hand example slightly more readable. But Jim's
example wasn't a fair demonstration of that. Wasn't it you who, in another
newsgroup not long ago, said you could object to unfair arguments without
necessarily disagreeing with the conclusion?... (swnet.politik)

> The other two look as the programmer's been
>training for that obsfucated (or whatever the word is) C contest.

You wouldn't win a bit with those examples.

> >[ second example]
>
>Slightly more understandable, but still obscure. It seems to assume
>that the dimension are stored in a certain order.

It is basically the same as the example above it, just written another
way. And doesn't make any more assumptions than either the one above it
or Jim's right-hand one.

> Although this probably is defined by the language,

In C arrays are written name [index1] [index2] [and_so_on], not
name [index1, index2, and_so_on], which seems to indicate a slight
difference in the conception of arrays...

> it still seems unnecessary to rely on such a fact.

This is C, after all.

> Anyway, the human reader have to think
>a while before he can ensure that the fragment is correct.

Normally, I would use Jim's right-hand example, if for no other reason
than that it uses fewer variables, and is shorter.
-- 
Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden
Phone: +46 19-13 03 60  !  e-mail: ske@pkmab.se
Fax:   +46 19-11 51 03  !  or ...!{uunet,mcvax}!sunic.sunet.se!kullmar!pkmab!ske

jlg@lambda.UUCP (Jim Giles) (02/13/90)

From article <2844@pkmab.se>, by ske@pkmab.se (Kristoffer Eriksson):
> [...]
> int a[200][2];  /* Array of 200 arrays of two int. */
> int (*p)[2];	/* Pointer to array of two int. */
> p = a;		/* Start with the first array of two int of a. */
> for (i = 0; i < 200; i++) {
>     (*p)[0] = expr1(i);
>     (*p)[1] = expr2(i);
>     p++;	/* Note automatic scaling. */
> }

Still more obscure than using 'a' directly.  Furthermore, using the array
notation directly removes the need for _ANY_ pointer increment in the loop
(whether it is automatically scaled or not).  Other than that, the code
is now nearly identical to the code I gave as best - the array indexing
is still done _automatically_ by the compiler.  Since the supposed advantage
of using pointers was _not_ to rely on the compiler for optimizing the
index calculations, this version doesn't support Peter Grandi's claims.

J. Giles

dts@quad.uucp (David T. Sandberg) (02/13/90)

In article <767@enea.se> sommar@enea.se (Erland Sommarskog) writes:
:Kristoffer Eriksson (ske@pkmab.se) writes:
: >Jim Giles (jlg@lambda.UUCP) writes:
: >>  int a[200][2]; int *p; int *q;
: >>  ...
: >>  p = &a; q = p+1;
: >>  for (i=0;i<200;i++){
: >>    *p = expr1(i);
: >>    *q = expr2(i);
: >>    p += 2; q += 2;
: >>  };
: >
: >would be much more "natural" and less obscure this way:
: >
: >  int a[200][2]; int *p;            
: >  p = &a[0][0];			/* Start fromthe first int of a. */
: >  for (i = 0; i < 200; i++) {
: >      *(p++) = expr1(i);
: >      *(p++) = expr2(i);
: >  };
:
:OK, I don't speak C, maybe this is a standard approach once you
:know the language, but I find Kristoffer's proposal to be at
:least as obscure as Jim Giles' right-hand example.

You're right: you don't speak C.  Kristoffer's version is a very
common way to do operations like this in C.  I've written several
bits of code nearly identical to that in the last week.  And as a
C programmer, I found Kristoffer's version instantly comprehensible,
whereas Jim Giles' right hand example required a second look.  And
I daresay that Kristoffer's version would compile into more efficient
code.

BTW, you don't even need the parens in the lines which do the
assignment and increment to pointer *p, because it evaluates that
way anyway, so I would have written it as:

      *p++ = expr1( i );

-- 
                "Now beat it, or the red guy screams again!"
David Sandberg, dts@quad.sialis.mn.org or ..uunet!rosevax!sialis!quad!dts

root@kunivv1.sci.kun.nl (Privileged Account) (02/13/90)

(Dan Bernstein) writes:
>(Nick Rothwell) writes:
>> (Dan Bernstein) writes:
>> >Piercarlo is entirely correct. If you care about ``information hiding''
>> >and the ``purity'' of your code more than efficiency, you can't expect
>> >fast code. If you can't bear to put your original code in comments and
>> >write an efficient version with temporary variables, you don't deserve
>> >fast code.
>> 
>> This is rubbish, check out the performance of New Jersey ML
>> (high-level functional language, efficient optimised native code
>> generation).
>
>Are you saying that your compiler is smarter than you are or just that
>it's not as lazy?
>
>Can it figure out, say, Knuth's algorithm for computing the delta2 table
>in Boyer-Moore string searching, starting from a dumb algorithm for the
>same job?
>
After some discussion with my colleague Norbert V\"olker (who doesn't
read news), I can come up with the following reply:

- first, Knuth's algorithm as published in [1] is *WRONG*. This already
shows that not even DEK, who is to my knowledge an *extremely* competent
programmer, was able to get the algorithm right. See [2] for the
corrected algorithm.

- second, N. V\"olker is currently working on a (transformational) derivation
of the delta-algorithm for BM pattern matching (with H. Partsch). Their
version of the algorithm is going to be different from the delta2
algorithm in [2]; it uses more space and less time.
It is Norbert's feeling, however, that in the future compilers will be
able to come up with the delta2 algorithm, since all that is involved in
a derivation by hand is finite differencing (aka strength reduction).
The reason that current compilers are not able to do this is that they
do not ``know'' what lists (strings, sequences, or whatever) *are*. Cf.
[3] for the basic laws about strings.

My view of this whole issue is the following:
- there are two kinds of optimisations:
  routine ones and smart ones.
  Routine optimisations are:
  strength reduction, code motion, loop merging (`bout everything you
  heard about in your course on Optimizing Compilers); a small C
  sketch of strength reduction follows this list.
  Smart optimisations are inventions of new algorithms, like the BM
  pattern matching algorithm *itself*, the non-trivial sorting
  algorithms (cf. Darlington's paper `A Synthesis of Several Sorting
  Algorithms')
  The only problem I don't know for sure how to place is register
  allocation.
- the routine optimisations can be done (eventually) by a compiler; all
the examples of things that `you'd better do yourself because your
compiler is too stupid' mentioned so far fall in this class.
- the smart optimisations can only be done by humans because they
involve ``eureka's'' and design decisions; it is, however, possible
to limit the number of eureka's by inventing strategies (cf. some work
by D.R.Smith and A.Pettorossi) and by gaining knowledge about your
problem domain first (e.g. lists, cf. [3]).
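
To make the routine kind concrete, here is a minimal C sketch (invented
names) of finite differencing / strength reduction: the multiplication
hidden in the subscript becomes a running sum.

	/* Before: one multiplication per iteration. */
	long column_sum(int *a, int n, int stride, int j)
	{
	    long s = 0;
	    int i;

	    for (i = 0; i < n; i++)
	        s += a[i * stride + j];
	    return s;
	}

	/* After finite differencing: the offset is advanced by stride
	   on every trip round the loop, so the multiplication is gone. */
	long column_sum_fd(int *a, int n, int stride, int j)
	{
	    long s = 0;
	    int i, off;

	    for (i = 0, off = j; i < n; i++, off += stride)
	        s += a[off];
	    return s;
	}

This is exactly the kind of step a compiler can be trusted with; the
delta2 derivation is, in Norbert's view, the same kind of step applied
to a less familiar data structure.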

References:
[1] Knuth, Morris, Pratt: Fast Pattern Matching in Strings, SIAM J.
Comput. 6, 1977.
[2] Rytter: A correct preprocessing algorithm for Boyer-Moore
string-searching, SIAM J. Comput. 9, 1980.
[3] R.S. Bird: An introduction to the theory of lists, in: M. Broy (ed.):
Logic of Programming and Calculi of Discrete Design, ASI Series Vol.
F36, Berlin: Springer 1987.

I'm very sorry for you, Mr. Bernstein, that you chose an example that
proves my (and Nick Rothwell's) point.


Eerke Boiten, Department of Informatics, K.U.Nijmegen
Toernooiveld, 6525 AD Nijmegen, The Netherlands.
Tel. +31-80-612236.     eerke@cs.kun.nl

jlg@lambda.UUCP (Jim Giles) (02/14/90)

From article <2858@pkmab.se>, by ske@pkmab.se (Kristoffer Eriksson):
> [...]
> (And if you found the initial assignment to p (p = &a[0][0]) hard to
> read, that's no argument against my example, because Jim's should have
> read so too. Just saying "p = &a" is wrong. [...]

I'll have to look at my example again, but I don't think I said just:
"p = &a".  What I thought I said was: "p = (int *) &a", which is
correct.  If I forgot the cast, I apologize.

J. Giles

brnstnd@stealth.acf.nyu.edu (02/14/90)

The argument is whether hand optimizations are bad. My point is that
there are optimizations that current compilers can't do at all in
practice, even though there are no barriers in theory; if those must
be done by hand, what's wrong with doing simpler ones by hand? You
can't draw the line between trivial optimizations and so-called linear
programming (inductive generalization, trading space for time, or
whatever you want to call it).

I'm answering this article because the author was the only respondent
who realized that my example is, in principle, mechanically solvable.
Others tried to draw that line, saying (in essence) that only humans
could figure out Knuth's delta2 algorithm. But there's no reason that
compilers can't do linear programming---the necessary AI is old hat.
They can insert one temporary variable; why not allocate some memory
and insert n of them?

In article <1041@kunivv1.sci.kun.nl> root@kunivv1.sci.kun.nl (Privileged Account) writes:
> (Dan Bernstein) writes:
> >Can it figure out, say, Knuth's algorithm for computing the delta2 table
> >in Boyer-Moore string searching, starting from a dumb algorithm for the
> >same job?

> - first, Knuth's algorithm as published in [1] is *WRONG*.

I'm not talking about Knuth's delta2 algorithm as presented in his
addendum to the 1976 Knuth-Morris-Pratt paper on the subject. I'm
talking about Knuth's delta2 algorithm, the perfectly correct example
of a solution to a linear programming problem. Don't misinterpret.
(If you don't understand this, read Knuth's ACP historical comments on
Euclid's algorithm, and think about what we call Euclid's algorithm.)

> It is Norbert's feeling, however, that in the future compilers will be
> able to come up with the delta2 algorithm, since all that is involved in
> a derivation by hand is finite differencing (aka strength reduction).

Exactly! I agree completely with Volker. This all falls more generally
under linear programming. That's why I chose the example: Calculating
the delta2 table quickly is a (conceptually) simple job of inserting
appropriate temporary variables.

> My view of this whole issue is the following:
> - there are two kinds of optimisations:
>   routine ones and smart ones.

Volker just erased this line. Now you're turning around and redrawing it?

>   Smart optimisations are inventions of new algorithms, like the BM
>   pattern matching algorithm *itself*,

I'll bet that a few hundred years from now, compilers will be able to
figure this out. After all, the basic idea is just to sort a certain
three-dimensional comparison network into an efficient order. (Didn't
think of it that way, didja.) You can't make this distinction.

>   The only problem I don't know for sure how to place is register
>   allocation.

I place it at the very bottom: it's the only thing that can't be done
by hand, because the language has no support for it. (This is about the
only unarguable point in this discussion.) So the compiler had better
do a good job.

> I'm very sorry for you, mr. Bernstein, that you chose an example that
> proves my (and Nick Rothwell's) point.

Not at all. Have you lost track of what the point is? Read again.
Everything you say above ``my view is'' supports my argument; what
you say below ``my view is'' has nothing to do with my example.

---Dan

rang@cs.wisc.edu (Anton Rang) (02/15/90)

In article <449@quad.uucp> dts@quad.uucp (David T. Sandberg) writes:
>You're right: you don't speak C.  Kristoffer's version is a very
>common way to do operations like this in C.  I've written several
>bits of code nearly identical to that in the last week.

  Of course, if the array is redeclared to a different size, you get
interesting side-effects in the pointer versions, which can be rather
annoying.  For instance, adding a third element to each row (say, to
hold a precomputed function of the other two elements later on) will
screw up every loop like this.
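
For instance (a self-contained sketch; expr1/expr2 are just stand-ins):

	#define N 200

	static int expr1(int i) { return i; }
	static int expr2(int i) { return -i; }

	int a[N][3];	/* a third element added to each row */

	int main(void)
	{
	    int i;
	    int *p = &a[0][0];

	    /* The old pointer loop still compiles, but its stride of two
	       no longer matches the three-int rows, so the stores quietly
	       drift off the intended columns.                             */
	    for (i = 0; i < N; i++) {
	        *p++ = expr1(i);
	        *p++ = expr2(i);
	    }

	    /* The subscripted form needs no change at all. */
	    for (i = 0; i < N; i++) {
	        a[i][0] = expr1(i);
	        a[i][1] = expr2(i);
	    }
	    return 0;
	}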

		Anton
   
+---------------------------+------------------+-------------+
| Anton Rang (grad student) | rang@cs.wisc.edu | UW--Madison |
+---------------------------+------------------+-------------+

preston@titan.rice.edu (Preston Briggs) (02/15/90)

In article <3668:04:34:52@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>The argument is whether hand optimizations are bad. My point is that
>there are optimizations that current compilers can't do at all in
>practice, even though there are no barriers in theory; if those must
>be done by hand, what's wrong with doing simpler ones by hand? You
>can't draw the line between trivial optimizations and so-called linear
>programming (inductive generalization, trading space for time, or
>whatever you want to call it).

Sure we can draw a line.  If the compiler can, in practice,
save us some work, it should.  Extending your argument will 
have us adding on our fingers again

	I don't need the machine to do this!
	After all, it's not my equal in number theory.

Better seems

	I don't need to do this!
	I've got machines to do it for me.


(Do you suppose we could modify an Eliza to carry on these arguments for us?)

Preston Briggs
preston@titan.rice.edu

ske@pkmab.se (Kristoffer Eriksson) (02/15/90)

In article <14239@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>From article <2858@pkmab.se>, by ske@pkmab.se (Kristoffer Eriksson):
>> (And if you found the initial assignment to p (p = &a[0][0]) hard to
>
>I'll have to look at my example again, but I don't think I said just:
>"p = &a".  What I thought I said was: "p = (int *) &a", which is
>correct.

Correct, yes, but you defeat any type-checking by the compiler, as the
cast overrides all checking. It hides the type of "a", both from the human
reader and from the compiler. If "a" were accidentally replaced by, say,
a character string, the compiler would still accept and compile it.

Explicitly taking the address of the first element of the array exactly
states what you are doing, and automatically gives you the type of that
element. If it doesn't match the type of the variable it is being assigned
to, the compiler will tell you.

An alternative to the complicated expression "&a[0][0]" is the simple
"a[0]", but I think that one is more confusing to the reader, as it can
mean several different things.
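
A small compile-time demonstration (the variables are invented; the
commented-out line is the one the compiler would object to):

	int a[200][2];
	char *s = "oops";

	int *p1 = (int *) &a;	/* accepted: the cast silences all checking   */
	int *p2 = (int *) &s;	/* also accepted: the mistake slips through   */
	int *p3 = &a[0][0];	/* accepted, and genuinely type-checked       */
	/* int *p4 = &s[0]; */	/* would draw a complaint: char * is not int * */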
-- 
Kristoffer Eriksson, Peridot Konsult AB, Hagagatan 6, S-703 40 Oerebro, Sweden
Phone: +46 19-13 03 60  !  e-mail: ske@pkmab.se
Fax:   +46 19-11 51 03  !  or ...!{uunet,mcvax}!sunic.sunet.se!kullmar!pkmab!ske