[comp.lang.misc] Clever programming tricks wanted

wald-david@CS.YALE.EDU (david wald) (01/13/89)

In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
>I'm looking for "clever" programming tricks.

[modified version of (x &= x-1) trick for clearing lsb deleted]

>I would like to hear about similar tricks that people have invented/
>discovered.  In particular,...[I'm] trying to find a description of a
>mythical technique for creating doubly-linked lists using single
>pointers and XOR operations.

I've divided this up for clarity:

======================================================================
The doubly-linked list technique simply consists of storing the
exclusive-or of the back and front pointers for each node.  When
searching, you always keep a trailing pointer, which allows you to
decode the XORed pointers.  Thus, your structure might be (pardon my
C.  This requires a language with either no typing or a capability for
type punning.  Or, I suppose, exclusive or on pointers, but that's a
good deal more unusual.):

struct node {
    int data;
    struct node *ptr;  /* contains XOR of back & front pointers */
    };


Then, assuming "pointer" points to a node and "trailer" to the
previous node (or NULL if you're at the beginning of the list), you
can move the pointers forward in the list as follows:

  struct node *temp;

  temp = pointer;
  pointer = (struct node *)((long)(pointer->ptr) ^ (long)trailer);
  trailer = temp;


Similarly, to go backward:

  struct node *temp;

  temp = trailer;
  trailer = (struct node *)((long)(trailer->ptr) ^ (long)pointer);
  pointer = temp;
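
Putting the fragments together, a complete toy program (mine, for
illustration; it assumes, as the casts above already do, that a
pointer fits in an unsigned long) that builds a three-node XOR list
and walks it forward:

#include <stdio.h>

struct node {
    int data;
    struct node *ptr;       /* XOR of previous and next node addresses */
};

/* type-punning helper; assumes a pointer fits in an unsigned long */
#define XOR(a, b) ((struct node *)((unsigned long)(a) ^ (unsigned long)(b)))

int main()
{
    struct node a, b, c;    /* the list a <-> b <-> c */
    struct node *pointer, *trailer, *temp;

    a.data = 1;  a.ptr = XOR(NULL, &b);     /* no predecessor */
    b.data = 2;  b.ptr = XOR(&a, &c);
    c.data = 3;  c.ptr = XOR(&b, NULL);     /* no successor */

    trailer = NULL;
    pointer = &a;
    while (pointer != NULL) {               /* forward walk: 1 2 3 */
        printf("%d\n", pointer->data);
        temp = pointer;
        pointer = XOR(pointer->ptr, trailer);
        trailer = temp;
    }
    return 0;
}

Note that you can only enter such a list at an end, or at a node whose
neighbor's address you already have, since decoding a node's ptr field
always requires the address of an adjacent node.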

======================================================================
This technique is similar to one for swapping two values without using
a temporary variable.  For two variables a and b, simply do the
following:

	a = a ^ b;
	b = a ^ b;
	a = a ^ b;

To see that this does the right thing, let A and B be the initial
values of a and b.  The sequence is then equivalent to:

	a = A ^ B;
	b = (A ^ B) ^ B;   /* which equals A ^ (B ^ B), or A */
	a = (A ^ B) ^ A;   /* which similarly equals B */
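
One caveat, easy to check against the algebra above: if a and b are
actually the same object, the first assignment computes a ^ a == 0 and
the value is destroyed rather than swapped.  A minimal demonstration:

#include <stdio.h>

int main()
{
    int a = 5;
    int *p = &a, *q = &a;   /* two names for the same variable */

    *p = *p ^ *q;           /* a ^ a == 0: the 5 is already gone */
    *q = *p ^ *q;
    *p = *p ^ *q;
    printf("%d\n", a);      /* prints 0, not 5 */
    return 0;
}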

======================================================================
In fact, I suppose we could combine the two techniques, and come up
with a pointer advancement that looks like (type punning omitted):

	trailer = pointer ^ trailer ^ pointer->ptr;
	pointer = pointer ^ trailer;
	trailer = pointer ^ trailer;

Well, presumably XOR is fast.

============================================================================
David Wald                                              wald-david@yale.UUCP
waldave@yalevm.bitnet                                 wald-david@cs.yale.edu
"A monk, a clone and a ferengi decide to go bowling together..."
============================================================================

wald-david@CS.YALE.EDU (david wald) (01/13/89)

In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
>I'm looking for "clever" programming tricks.

A year or so ago someone showed me a paper in which the author described
a semi-automated hack generator (my phrasing, not the author's).
Essentially, the generator would be given a description of a common
function (say, min() or max() or some such), and would then attempt to
generate the shortest-length sequence of assembly language instructions
(or perhaps, shortest sequence with fewest branch instructions) that
would compute the function.  Yes, I know this is uncomputable in
general.  The point was that 1) the functions given were all both simple
and reasonably common, thus being perhaps worth a certain amount of
effort (from the point of view of writers of code generators), and 2)
the program didn't actually try to prove its answers.  It would test its
generated functions for equality with the supplied function at (say) 100
scattered points.  If the hack worked on those points it would be
returned to the user, who could then attempt to prove it correct.
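
For concreteness, here is a toy in the same spirit (my sketch, not the
program from the paper): it enumerates straight-line programs over a
small invented instruction set, shortest first, and accepts the first
one that agrees with a reference abs() at a handful of scattered
points.  The instruction set, register conventions, and test values
are all made up for illustration; as above, a survivor is only a
candidate, to be proven correct by hand.

#include <stdio.h>

#define NREGS  3        /* r0 holds the input x */
#define MAXLEN 3
#define NTESTS 8

/* SAR31 is unary (a sign mask, i.e. arithmetic >>31 on a 32-bit
   machine); its s2 field is simply ignored. */
enum op { ADD, SUB, AND, XOR, SAR31, NOPS };

struct insn { int op, dst, s1, s2; };

static struct insn prog[MAXLEN];
static long tests[NTESTS] = { 0, 1, -1, 7, -13, 100, -100, 123456 };

static long reference(long x) { return x < 0 ? -x : x; }   /* abs() */

static long run(int len, long x)
{
    long r[NREGS];
    int i;

    r[0] = x; r[1] = 0; r[2] = 0;
    for (i = 0; i < len; i++) {
        long a = r[prog[i].s1], b = r[prog[i].s2];

        switch (prog[i].op) {
        case ADD:   r[prog[i].dst] = a + b; break;
        case SUB:   r[prog[i].dst] = a - b; break;
        case AND:   r[prog[i].dst] = a & b; break;
        case XOR:   r[prog[i].dst] = a ^ b; break;
        case SAR31: r[prog[i].dst] = a < 0 ? -1L : 0L; break;
        }
    }
    return r[2];            /* convention: the result is read from r2 */
}

static int search(int pos, int len)
{
    int op, dst, s1, s2, t;

    if (pos == len) {
        for (t = 0; t < NTESTS; t++)       /* test, don't prove */
            if (run(len, tests[t]) != reference(tests[t]))
                return 0;
        return 1;
    }
    for (op = 0; op < NOPS; op++)
        for (dst = 1; dst < NREGS; dst++)  /* never clobber the input */
            for (s1 = 0; s1 < NREGS; s1++)
                for (s2 = 0; s2 < NREGS; s2++) {
                    prog[pos].op = op;  prog[pos].dst = dst;
                    prog[pos].s1 = s1;  prog[pos].s2 = s2;
                    if (search(pos + 1, len))
                        return 1;
                }
    return 0;
}

int main()
{
    static char *name[] = { "add", "sub", "and", "xor", "sar31" };
    int len, i;

    for (len = 1; len <= MAXLEN; len++)    /* shortest first */
        if (search(0, len)) {
            printf("candidate of length %d:\n", len);
            for (i = 0; i < len; i++)
                printf("  r%d = %s(r%d, r%d)\n", prog[i].dst,
                       name[prog[i].op], prog[i].s1, prog[i].s2);
            return 0;
        }
    printf("no candidate up to length %d\n", MAXLEN);
    return 0;
}

Run, it reports a three-instruction sign-mask abs(); nothing shorter
survives these eight test points.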

Some of the code generated by this thing was extremely non-intuitive,
even when it was correct, making use of obscure flag settings and such,
and certainly qualified for the "hack" title.

Does anyone recognize this work?  I'll try to dig up the reference.



============================================================================
David Wald                                              wald-david@yale.UUCP
waldave@yalevm.bitnet                                 wald-david@cs.yale.edu
"A monk, a clone and a ferengi decide to go bowling together..."
============================================================================

tim@crackle.amd.com (Tim Olson) (01/13/89)

In article <47380@yale-celray.yale.UUCP> wald-david@CS.YALE.EDU (david wald) writes:
| In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
| >I'm looking for "clever" programming tricks.
| 
| A year or so ago someone showed me a paper in which the author described
| a semi-automated hack generator (my phrasing, not the author's).
|
| Some of the code generated by this thing was extremely non-intuitive,
| even when it was correct, making use of obscure flag settings and such,
| and certainly qualified for the "hack" title.

A lot of the trickiness was using the carry bit in strange ways.

| Does any one recognize this work?  I'll try to dig up the reference.

"Superoptimizer: A look at the Smallest Program", H. Massalin, ASPLOS II
Proceedings.


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

djones@megatest.UUCP (Dave Jones) (01/13/89)

From article <47380@yale-celray.yale.UUCP), by wald-david@CS.YALE.EDU (david wald):
) In article <4061@hubcap.UUCP) mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
))I'm looking for "clever" programming tricks.
) 
) A year or so ago someone showed me a paper in which the author described
) a semi-automated hack generator (my phrasing, not the author's).
) 
) [description of the generator, and of its testing candidate code at
) scattered points rather than proving it, deleted]
) 
) Does anyone recognize this work?  I'll try to dig up the reference.

I recall having read such a paper in some computer journal.
But as I recall, this paper claimed to prove that the result correctly
computed the function, and was optimal.  I found a counterexample (a
shorter instruction sequence computing one of the functions -- min(),
I think it was).  I mailed the counterexample to the author, but I
never received a response.

haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (01/13/89)

In article <47380@yale-celray.yale.UUCP> wald-david@CS.YALE.EDU (david wald) writes:
> Essentially, the generator would be given a description of a common
> function (say, min() or max() or some such), and would then attempt to
> generate the shortest-length sequence of assembly language instructions
> (or perhaps, shortest sequence with fewest branch instructions) that
> would compute the function.

> Does any one recognize this work?  I'll try to dig up the reference.

the super-optimizer.  from the asplos 2 conference, summer 87 i think.
proceedings published as an issue of sigplan notices.

landman%hanami@Sun.COM (Howard A. Landman) (01/13/89)

In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
>I would like to hear about similar tricks that people have invented/
>discovered.  In particular, I'm looking for fast ways to count the
>number of 1 bits in a word (on machines that don't have this as a
>machine instruction :-)

int CountBitsInWord(rw)
int      rw;
{
int      tmp;
	/* This uses a clever algorithm I picked up somewhere.  */
	/* Try a few examples and you'll see why it works.      */
	/* Phase 1 - add up pairs of bits. */
	tmp = rw & 0xAAAAAAAA;
	rw &= 0x55555555;
	rw += tmp >> 1;
	/* Phase 2 - add up 2-bit sums to get 3-bit sums in 4 bits. */
	tmp = rw & 0xCCCCCCCC;
	rw &= 0x33333333;
	rw += tmp >> 2;
	/* Phase 3 - add up 3-bit sums to get 4-bit sums in 8 bits. */
	tmp = rw & 0xF0F0F0F0;
	rw &= 0x0F0F0F0F;
	rw += tmp >> 4;
	/* Phase 4 - add up 4-bit sums to get 5-bit sums in 16 bits. */
	tmp = rw & 0xFF00FF00;
	rw &= 0x00FF00FF;
	rw += tmp >> 8;
	/* Phase 5 - add up 5-bit sums to get 6-bit sum in 32 bits. */
	tmp = rw; /* & 0xFFFF0000 not needed due to shift below */
	rw &= 0x0000FFFF;
	rw += tmp >> 16;
	/* All done. */
	return (int) rw;
}

	Howard A. Landman
	landman@hanami.sun.com

trt@rti.UUCP (Thomas Truscott) (01/14/89)

In article <85188@sun.uucp>, landman%hanami@Sun.COM (Howard A. Landman) writes:
> In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
> > ...  In particular, I'm looking for fast ways to count the
> >number of 1 bits in a word (on machines that don't have this as a
> >machine instruction :-)

If your machine has a fast integer multiply,
it can be used as follows:

: int CountBitsInWord(rw)
: int      rw;
: {
: int      tmp;
: 	/* This uses a clever algorithm I picked up somewhere.  */
: 	/* Try a few examples and you'll see why it works.      */
: 	/* Phase 1 - add up pairs of bits. */
: 	tmp = rw & 0xAAAAAAAA;
: 	rw &= 0x55555555;
: 	rw += tmp >> 1;
: 	/* Phase 2 - add up 2-bit sums to get 3-bit sums in 4 bits. */
: 	tmp = rw & 0xCCCCCCCC;
: 	rw &= 0x33333333;
: 	rw += tmp >> 2;
: 	/* Phase 3 - add up 3-bit sums to get 4-bit sums in 8 bits. */
: 	tmp = rw & 0xF0F0F0F0;
: 	rw &= 0x0F0F0F0F;
: 	rw += tmp >> 4;
#ifdef MULTIPLY
	/* Phase 4,5 add up sums */
	rw = (rw * 0x01010101) >> 24;
#else
: 	/* Phase 4 - add up 4-bit sums to get 5-bit sums in 16 bits. */
: 	tmp = rw & 0xFF00FF00;
: 	rw &= 0x00FF00FF;
: 	rw += tmp >> 8;
: 	/* Phase 5 - add up 5-bit sums to get 6-bit sum in 32 bits. */
: 	tmp = rw; /* & 0xFFFF0000 not needed due to shift below */
: 	rw &= 0x0000FFFF;
: 	rw += tmp >> 16;
#endif
: 	/* All done. */
: 	return (int) rw;
: }

If the count is guaranteed never to exceed 15 then phases 3,4,5
can be done with "rw = ((unsigned)(rw * 0x11111111)) >> 28;"
I thought of this trick (as did others, no doubt) a decade
ago when working on a computer chess program.
But we never used it since table lookup seemed faster yet.
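
For completeness, the table-lookup version might look like this (a
sketch for 32-bit words, not the chess program's actual code):

#include <stdio.h>

static unsigned char bits[256];     /* bits[i] = number of 1 bits in i */

static void init_bits(void)
{
    int i;

    bits[0] = 0;
    for (i = 1; i < 256; i++)
        bits[i] = bits[i >> 1] + (i & 1);
}

static int count_bits(unsigned long w)
{
    return bits[w & 0xFF] + bits[(w >> 8) & 0xFF] +
           bits[(w >> 16) & 0xFF] + bits[(w >> 24) & 0xFF];
}

int main()
{
    init_bits();
    printf("%d\n", count_bits(0xF0F0F0F0UL));   /* prints 16 */
    return 0;
}
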
	Tom Truscott

dan-hankins@cup.portal.com (Daniel B Hankins) (01/14/89)

Re a 'superoptimizer' to optimize a function to the shortest sequence of
machine instructions:

     This is fairly simple.  Construct a genetic algorithm in which the
survival function is:

      code length / sum-of-squares distance from real function

     The genetic string itself would of course be the machine code.  This
should come up with a very good answer quite rapidly.
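
Presumably the intent is that short, accurate candidates survive.  One
literal-minded sketch of such a scoring function (run_candidate is a
hypothetical hook for executing a genome, and combining length and
error as the inverse of their product is my guess at the intent; the
ratio exactly as posted would reward long code):

/* Hypothetical survival score: higher is fitter.  Short code and a
 * small sum-of-squares distance from the real function both help. */
double survival(int code_length,
                long (*run_candidate)(long),    /* executes the genome */
                long (*real_function)(long),
                long *tests, int ntests)
{
    double err = 0.0;
    int i;

    for (i = 0; i < ntests; i++) {
        double d = (double)(run_candidate(tests[i]) - real_function(tests[i]));
        err += d * d;                   /* sum-of-squares distance */
    }
    return 1.0 / ((double)code_length * (1.0 + err));
}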


Dan Hankins

seanf@sco.COM (Sean Fagan) (01/15/89)

In article <47380@yale-celray.yale.UUCP> wald-david@CS.YALE.EDU (david wald) writes:
>Some of the code generated by this thing was extremely non-intuitive,
>even when it was correct, making use of obscure flag settings and such,
>and certainly qualified for the "hack" title.

Yes, this is the "SuperOptimiser."  I believe the author had gotten it
to work on 8086's and 6800's (not 68000's, but 00's).

We used to have the paper written on it, but it seems to have disappeared
from the office.

BTW, the code it generated assumed you didn't care about which registers you
trashed.

-- 
Sean Eric Fagan | "May the forces of evil become confused on the way to 
seanf@sco.UUCP  |    your house."  -- George Carlin
(408) 458-1422  | Any opinions expressed are my own, not my employers'.

rjh@ihlpa.ATT.COM (Herber) (01/15/89)

In article <85188@sun.uucp> landman@sun.UUCP (Howard A. Landman) writes:
.In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
.>I would like to hear about similar tricks that people have invented/
.>discovered.  In particular, I'm looking for fast ways to count the
.>number of 1 bits in a word (on machines that don't have this as a
.>machine instruction :-)
.
.int CountBitsInWord(rw)
|int CountBitsInWord(arg)
.int      rw;
|int      arg;
.{
.int      tmp;
|unsigned long      tmp, rw = (unsigned long) arg & 0xFFFFFFFF;
.	/* This uses a clever algorithm I picked up somewhere.  */
.	/* Try a few examples and you'll see why it works.      */
.	/* Phase 1 - add up pairs of bits. */
.	tmp = rw & 0xAAAAAAAA;
.	rw &= 0x55555555;
.	rw += tmp >> 1;
.	/* Phase 2 - add up 2-bit sums to get 3-bit sums in 4 bits. */
.	tmp = rw & 0xCCCCCCCC;
.	rw &= 0x33333333;
.	rw += tmp >> 2;
.	/* Phase 3 - add up 3-bit sums to get 4-bit sums in 8 bits. */
.	tmp = rw & 0xF0F0F0F0;
.	rw &= 0x0F0F0F0F;
.	rw += tmp >> 4;
.	/* Phase 4 - add up 4-bit sums to get 5-bit sums in 16 bits. */
.	tmp = rw & 0xFF00FF00;
.	rw &= 0x00FF00FF;
.	rw += tmp >> 8;
.	/* Phase 5 - add up 5-bit sums to get 6-bit sum in 32 bits. */
.	tmp = rw; /* & 0xFFFF0000 not needed due to shift below */
.	rw &= 0x0000FFFF;
.	rw += tmp >> 16;
.	/* All done. */
.	return (int) rw;
.}
.
.	Howard A. Landman
.	landman@hanami.sun.com

On some machines, right shift of signed integers propagates the sign bit.
Also, on some machines an int is not at least 32 bits.

	Randolph J. Herber, Amdahl Sr Sys Eng, ..!att!ihlpa!rjh,
	(312) 979-6553, IH 6X213, AT&T Bell Labs, Naperville, IL 60566
	@ home: {att|amdahl|mcdchg|clout|obdient}!yclept!{rjh|root}

w-colinp@microsoft.UUCP (Colin Plumb) (01/16/89)

seanf@scolex.UUCP (Sean Fagan) wrote:
> Yes, this is the "SuperOptimiser."  I believe the author had gotten to work
> on 8086's and 6800's (not 68000's, but 00's).

No, 68000s.  In fact, the timings the thing used were for 68020s.
The 8086 version didn't have a hand-tuned inner loop, and was a lot
slower, so most of the examples were for 68000s.

> BTW, the code it generated assumed you didn't care about which registers you
> trashed.

True, but just trashing registers arbitrarily takes cycles, and so the
optimiser won't trash many more than necessary.  The rest can easily
be diddled by hand, or by a post-pass that allocates non-conflicting
values to the same register.
-- 
	-Colin (uunet!microsof!w-colinp)

pratt@polya.Stanford.EDU (Vaughan R. Pratt) (01/20/89)

In article <4061@hubcap.UUCP> mcvax!etive.ed.ac.uk!gvw@uunet.UU.NET (MOBY) writes:
>I'm looking for "clever" programming tricks.  For example,
>consider the following problem:
>
>    Given a computer word W in which one or more bits are set,
>  .....
>
>Greg Wilson
>-- 
>Edinburgh Concurrent Supercomputer Project

Many programming tricks for bit-vector machines can be found in the
following article.

Pratt, V.R., and L.J. Stockmeyer, "A Characterization of the Power of
Vector Machines", JCSS, 12, 2, 1976, 198-221.  (A preliminary version
appeared in STOC-74.)

The machine model is a bit-vector machine with the following
register-to-register instructions.  (This is freely paraphrased from the
original to make it look more like C.)

	Ri = Rj << Rk, Ri = Rj >> Rk;		Rk in 2's complement
	Ri op= Rj;		for any of the 16 binary Boolean ops
	Ri = 0;
	if (Ri == 0) goto L;	Any instruction may be labeled

Note the absence of arithmetic operations; see the comments on this below.

The machine in the paper had infinite-to-the-left registers, but we
should have made the point more clearly that our results applied
equally well to finite registers provided one puts up with some of the
algorithms working only for inputs of size smaller than a word.  Thus
although n-bit numbers can be added in n-bit registers, their
multiplication requires n^2-bit registers.  These limitations are reflected in
the second column in the table below.

We also considered random access memory (accessed by indirect reference
*Ri, with Ri interpreted as unsigned binary) and nondeterminism.
However, none of the following algorithms requires either of these
capabilities, and all can be implemented using only a handful of
registers.

Here are some of the functions we computed, along with the length of register
(in bits) needed, and the time needed.  All logs to base 2.  Constant factors
omitted (but all reasonably small, i.e. nothing in the 100's).

Data structures such as lists of numbers and matrices are stored in
single registers, e.g. a 4x4 matrix of 4-bit numbers would be stored in
a single 64-bit register.  The expectation was that these algorithms
would be at their most useful in a machine with very long words.

To keep things simple I'll assume that all numbers are n bits, all
lists are of length n and consist of n-bit numbers (so require
registers with n^2 bits), and all matrices are nxn and consist of n-bit
numbers (unless referred to as Boolean matrices), so require registers
with n^3 bits (n^2 for Boolean matrices).  Algorithms may require
longer registers than needed to hold the inputs, e.g. matrix
multiplication needs an additional factor of n.

Operation			Length of reg.  Time

Count 1's				n	log n
Ri = Rj+Rk				n	log n
Add n numbers				n^2	log n
Ri = Rj*Rk				n^2	log n
Boolean matrix transposition		n^2	log n
Boolean matrix multiplication		n^3	log n
Boolean matrix transitive closure	n^3	log^2 n
Integer matrix addition			n^3	log n
Integer matrix multiplication		n^4	log n
Simulation of nondeterministic
	    space S Turing machine	2^(2S)	S^2

Various other programming tricks appear in the paper as subroutines for the
above.  A particularly useful trick is the construction of masks Mi
having a 1 at bit j just when j in binary has a 1 at bit i; we showed how
to get the low-order 2^i bits of Mi in i steps, and then to construct
each of M(i-1), M(i-2), ..., M0 in constant time.
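
For 32-bit words these masks are M0 = 0xAAAAAAAA, M1 = 0xCCCCCCCC,
M2 = 0xF0F0F0F0, M3 = 0xFF00FF00 and M4 = 0xFFFF0000 (the complements
of the constants in the bit-counting code earlier in this thread).  A
sketch of the constant-time step; the recurrence below is my reading
of the construction, not a quote from the paper:

#include <stdio.h>

int main()
{
    unsigned long M[5];
    int i;

    M[4] = 0xFFFF0000UL;    /* ones exactly where bit 4 of the position is 1 */
    for (i = 4; i > 0; i--)             /* each step is constant time */
        M[i-1] = M[i] ^ (M[i] >> (1 << (i-1)));
    for (i = 0; i < 5; i++)
        printf("M%d = 0x%08lx\n", i, M[i]);
    return 0;
}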

We also showed that a deterministic space T^2 Turing machine could
simulate a nondeterministic time T vector machine, with or without
random access memory, under the restriction that registers containing
shift distances could themselves only be shifted by constant distances
(to prevent the sort of blowup you get when you iterate
Ri = Ri << Ri).  (This restriction was removed several years later by
Janos Simon using an ingenious compact coding of the enormous numbers
that this iteration generated.)

For polynomial space (PSPACE) on Turing machines, nondeterminism does
not help.  That is, PSPACE = NPSPACE (Savitch's theorem).  Since the
above simulations are to within a polynomial (namely a quadratic), it
follows that, on bit vector machines, P = NP.

This paper was the first of a series showing this relationship between serial
and parallel computers for a variety of parallel architectures, leading to
a general thesis that time on parallel computers behaved like space on serial
computers, to within a square and independently of determinism or
nondeterminism.  It is wide open whether anything better than quadratic
is possible in either direction of the simulation.

Since addition takes time log n, many programs would run faster if the
instruction set were expanded to include an add instruction.  However,
the examples above of addition of n numbers, multiplication of two
numbers, and multiplication of nxn arithmetic matrices show that this
overhead of log n can often be absorbed into the other log n's, so the
advantage of addition as a primitive is not as great as it might have
seemed.
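
To make the first table entries concrete: addition really can be done
in log n steps using nothing but shifts and Boolean operations, by
computing all the carries with a parallel-prefix (carry-lookahead)
recurrence.  A sketch for 32-bit words (this is the standard
Kogge-Stone formulation, which may differ in detail from the paper's
own construction):

#include <stdio.h>

unsigned add(unsigned a, unsigned b)
{
    unsigned g = a & b;     /* positions that generate a carry  */
    unsigned p = a ^ b;     /* positions that propagate a carry */
    int s;

    for (s = 1; s < 32; s <<= 1) {      /* 5 = log 32 steps */
        g |= p & (g << s);
        p &= p << s;
    }
    return (a ^ b) ^ (g << 1);          /* sum = inputs xor all carries */
}

int main()
{
    printf("%u\n", add(12345u, 67890u));    /* prints 80235 */
    return 0;
}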

We did not track down the history of the log n solution to the problem
of counting bits.  Peter Wegner informed us that he had been party to a
1960 paper describing this solution, though I don't have a reference.
	Vaughan Pratt
-- 
		_______________________________________________________
		Vaughan Pratt   pratt@polya.stanford.edu   415-494-2545
		=======================================================