[comp.arch] Rubin debate

ok@cs.mu.oz.au (Richard O'Keefe) (10/07/89)

In article <1637@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> I admit I am not familiar with .il.  How would you do the following:
> 	fabs(x) implemented as		x &~ CLEARSIGN;
> where this clears the sign bit, and may depend on the precision of x.  No
> moving of x to specific registers, etc., should be used.

To start with, we have to realise that this CANNOT be done in a machine-
independent way.  For example, some machines (/370, MC68020+MC68881, PR1ME,
80386+80387) have two sets of registers: floating point and integer.
Most such machines do not provide bitwise operations on floating-point
registers.  The /370 does not provide moves between integer registers
and floating-point registers, so if you want to do this you have to
store the floating-point value in memory, load part of it into an integer
register, store it back in memory, and reload it into the floating-point
registers.  Rather silly, really, when you might have used the
load-absolute-float instruction.

Here's how I would obtain the requisite effect on a 68020 using .il:
	.inline	_fabs, 8	# function name, #argument bytes
	movl	sp@+,d0		# argument is assumed to be on the
	movl	sp@+,d1		# stack, result must be left in (d0,d1)
	andl	#0x7FFFFFFF,d0	# clear the sign bit
	.end
Ah, but Rubin demanded that "no moving of x to specific registers, etc.,
should be used".  Haven't I violated that?  Not at all.  .il expansion
is followed by an assembly-code optimisation pass, which is quite able
to eliminate the use of the stack, and can also optimise some register
operations.

I think that it is worth stressing that bit-twiddling like this can make
the code WORSE.  I have already cited the /370.  But the 68881 has a set
of fabs<size> instructions and the Sun FPA has fpabs<s,d> instructions.
If you do something like fabs(x*y), you have the result of x*y in a
floating-point register.  If you let the compiler generate an
f[p]absd instruction, the operation takes place in a single instruction
without any data movement between CPU and coprocessor.  If you insist on
using a bit mask, you force data movement between the 68881 co-processor
(or the FPA) and the 68020 CPU, and that means moving the intermediate
value to memory!

Leaving aside the fact that bit-twiddling may be slower than letting the
compiler do its thing (even if that is a conditional branch around a
load-negative-float), it is worth noting that the location of the sign
bit differs from machine to machine.  For example, if you load a VAX-11
floating-point number using integer instructions, you will not find the
sign bit where you would find the sign bit of an integer.  Assumptions
about the location of the sign bit are not even portable among machines
which obey the IEEE 754 standard; that says where the bits are in
_registers_, but says nothing about the format of a number stored in
memory.
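
To make the machine-dependence concrete, here is a C sketch of the
bit-twiddling fabs.  The names are mine, and it assumes a 64-bit IEEE 754
double whose sign occupies the top bit when the bytes are reinterpreted as
a native unsigned 64-bit integer -- true on most machines you are likely
to meet, but, as argued above, NOT something the standard promises:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit-twiddling fabs.  Assumes the sign of a 64-bit
   double is the top bit of its integer image.  Note that the memcpy
   calls are exactly the trip through memory complained about above:
   on a machine with separate FP registers, the compiler's own fabs
   avoids this data movement entirely. */
double bit_fabs(double x)
{
    unsigned long long bits;
    memcpy(&bits, &x, sizeof bits);   /* FP register -> memory -> int */
    bits &= 0x7FFFFFFFFFFFFFFFULL;    /* clear the assumed sign bit   */
    memcpy(&x, &bits, sizeof x);      /* and back again               */
    return x;
}
```

Note that even when this works, it is the slow path: a load-absolute-float
or f[p]absd instruction does the same job with no memory traffic at all.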

This kind of consideration is what makes the .il approach (or similar
approaches) so attractive:  the machine-specific code is not there in
the main source file.  I can hand-tune certain functions for a particular
machine if I wish, without requiring compilers for _other_ machines to
support the particular instructions I use.

> And suppose I want to use << and >> on double longs?  How do I get this in
> without using any unnecessary memory references, and keeping things which
> are already in registers there?

On many popular machines it simply can't be done, for the simple reason that
there are no instructions to shift floating-point registers, and to get a
number where you can shift it, you have to use memory references (/370,
PR1ME 50 series, 68020+68881, others).  And once you have done that, it's
surprising the number of machines that haven't got double-length integer
shifts.  I must admit that it isn't perfectly clear to me what shifting
a double would be good for:  shifting a float left one bit would move the
top bit of the biased exponent into the sign bit, and the second bit of
the significand (after the hidden bit) into the exponent.  (On a /370 the
effect would be rather different.)

On the assumption that it did make sense, here's how to shift a double
left one bit with the aid of .il:
	.inline	_lsldbl, 8
	movl	sp@+,d0
	movl	sp@+,d1
	lsll	#1,d1
	roxl	#1,d0
	.end
Again, the stack movement may be optimised away by later compiler passes.
Other shifts are left as an exercise for the reader.
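
On the same assumption that it makes sense, the lsll/roxl pair above is
just a double-length shift of the bit image, which a single 64-bit shift
expresses in C.  A sketch (names mine, again assuming a 64-bit double):

```c
#include <assert.h>
#include <string.h>

/* Shift the bit image of a double left one place.  Purely
   illustrative: as noted above, the result is rarely a meaningful
   number -- exponent bits march into the sign, significand bits
   into the exponent. */
double lsldbl(double x)
{
    unsigned long long bits;
    memcpy(&bits, &x, sizeof bits);
    bits <<= 1;                  /* one 64-bit shift = lsll + roxl */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```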

Suppose that you have the full set of floating-point operations
required, recommended, or suggested in the IEEE 754 and 854 standards.
What bit-twiddling operations are then required, and what are they good for?

> I want to be able to define a heavily overloaded power operator with a
> reasonable notation such as a!b or something similar.

Well then, *DO* it!

> I see no more reason
> why I should have to use prenex form for this than I should have to write
> sum(a,b) instead of a+b.  To paraphrase, why should I be a slave to the 
> compiler's notation rather than using familiar mathematical notation?

Why, because you *choose* to be!  Nut out a syntax you like, knock together
a parser using Lex and Yacc (public-domain versions of which are readily
available) or talk a CS undergraduate into doing it for you, and write in
the language of your dreams, letting the preprocessor take care of
generating sum(_,_) or whatever.  The University of Wisconsin-Madison had
a rather interesting Fortran preprocessor which they used to support
interval and triplex arithmetic in a variety of ways.  The Linpack algorithms
were developed using a preprocessor called TAMPR.  I believe the ToolPack
project has a preprocessor which isn't hard to obtain; I don't know what it
is capable of.

Another possibility:  get hold of the source code for gcc or g++ (all it
takes is FTP access or a small cheque to the Free Software Foundation),
and add a few operators to it (or get a CS graduate to do it for you).
Of course, that compiler may not have been ported yet to the machine that
Rubin is using (what machine _is_ that, by the way?), but if it has been,
I think that modifying gcc or g++ would be far and away the most profitable
thing for Rubin to do.

I wrote
> > I don't know precisely what Rubin means by returning a list;

> When computers first came out, most of them would multiply an integer by
> an integer, giving a most and least significant part.  Most of them would
> divide a double length integer by an integer, giving a quotient and a
> remainder.

Ah yes.  Now I remember.  A year or so ago I posted a file to the net called
muldivrem.il, which yielded in-line code for
	mul_div_rem(A, B, C, D, *Q, *R) is
		T := A*B+C;
		Q, R := T div D, T mod D
That was a follow-up to a posting from Rubin.  I believe that I pointed
out in that debate that Common Lisp has multiple-value-return and other
related operations, e.g.
	(multiple-value-bind (Q R) (truncate (+ (* A B) C) D)
	    ;; Q is (A*B+C) div D, R is (A*B+C) mod D
	    ;; no intermediate data-structure was involved
	    )
There is a program called CGOL which provides you with an extensible
operator-based syntax for Lisp.
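
For readers without Common Lisp to hand, the effect of mul_div_rem can be
sketched in plain C with output parameters (a portable illustration, not
the in-line code from muldivrem.il; the names are mine):

```c
#include <assert.h>

/* T = A*B + C; *q = T div D, *r = T mod D.  A double-length
   intermediate keeps A*B+C from overflowing for 32-bit operands.
   Caveat: C's / and % truncate toward zero, so they agree with
   mathematical div/mod only for nonnegative operands. */
void mul_div_rem(long a, long b, long c, long d, long *q, long *r)
{
    long long t = (long long)a * b + c;  /* double-length product+carry */
    *q = (long)(t / d);
    *r = (long)(t % d);
}
```

The point, of course, is that the two results come back together with no
intermediate data structure, which is just what the in-line version gave.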


I wrote
> > I normally use a language which runs 2 to 4 times
> > slower than C; I do this gladly because the reduction in _language_
> > complexity means that I can manage more complex _algorithms_, and win
> > back O(n) speed that way.

> Getting O(n) may or may not be appropriate.

What I meant was an O(N) **SPEED-UP**.  As a very crude example of what I
mean, I once saw a very complicated Fortran subroutine published in a
journal.  After fighting through the details for about half an hour, I
discovered that it was an O(N**3) algorithm for evaluating a polynomial
(and no, it wasn't optimising or taking special precautions to avoid
overflow, underflow, or cancellation).  Rewriting it the obvious way
without any attempt at bit-level optimisation gave an O(N) algorithm.
In the same way, I am glad to pay an O(1) cost in speed _when_ that
means that I can manage more complex data structures and handle the
development of an O(N) or O(N lg N) algorithm instead of an O(N**2) one.
It is not an uncommon thing for me to see highly tweaked C code for an
O(N**2) algorithm when a simpler but more expressive language makes it
easy to write an O(N) algorithm for the same problem.
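
The polynomial case is easy to make concrete: Horner's rule gets the job
done in N multiplies and N adds with no bit-level cleverness at all.  A
sketch in C (my own, not the published Fortran):

```c
#include <assert.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x**(n-1) by Horner's rule:
   O(N) operations, versus the O(N**3) routine described above. */
double horner(const double *c, int n, double x)
{
    double acc = 0.0;
    for (int i = n - 1; i >= 0; i--)
        acc = acc * x + c[i];
    return acc;
}
```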


> Two of my operations are transferring on a bit and
> "removing" the bit whether or not the transfer occurs, and finding the
> distance to the next one in a bit stream and "removing" all bits scanned. 
> These are tools, as basic as addition.

    Hypothetical C code (assume bitno is a constant):

	if (x & (1 << bitno)) { x &= ~(1 << bitno); goto Foo; }

Simple-minded 680x0 code:

	btst	#bitno,x	# Z set iff the bit is zero
	jeqs	L1		# bit clear: skip
	bclr	#bitno,x
	jra	Foo
    L1:

Optimised 680x0 code:
	bclr	#bitno,x	# Z set iff the bit WAS zero
	jne	Foo		# so branch when it was set

This is at most a factor of 2, and the optimisation is well within the
reach of a peep-hole optimiser.  If gcc doesn't do it, and it probably
doesn't, it would be about as hard to add the optimisation to the compiler
as it would be to write the hand-optimised code once.  (I repeat that I
think it would be very useful to Rubin to get a copy of gcc.)

I hardly think that either of these operations is "as basic as addition".
Bear in mind that while some machines have a "find first one" instruction,
many don't.  Also, I do read Applied Statistics and similar journals from
time to time, and it's amazing what people have been able to contribute
in the way of random number generators and such without needing more than
Fortran gives them.  (Of course, they were trying to write code that
other people could use even if they didn't have the same kind of computer.)
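
As for "distance to the next one in a bit stream": on a machine without a
find-first-one instruction the effect is still only a few lines of
portable C.  A sketch (name and interface are mine; whether this is "as
basic as addition" the reader may judge):

```c
#include <assert.h>

/* Count the zeros up to the next one bit in *w, then "remove" the
   scanned zeros and the one itself by shifting them out of the
   stream word.  Caveat: loops forever if *w is zero, so the caller
   must check for an exhausted stream first. */
int next_one(unsigned long *w)
{
    int d = 0;
    while (!(*w & 1UL)) {
        *w >>= 1;       /* discard a scanned zero */
        d++;
    }
    *w >>= 1;           /* discard the one bit that was found */
    return d;
}
```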

> In a university, it is almost impossible to get programming assistance
> in mathematics and statistics.

I know.  

> Also, these supposedly highly optimizing compilers do not seem to be available.

Try gcc.

> I doubt that I would have any more difficulty in writing a decent
> machine language than any HLL, because a decent machine language would
> be weakly typed and overloaded operator.

There are, then, remarkably few "decent machine languages".  Seriously,
the examples that Rubin has given are extremely machine-specific.  If
code is not portable, there is no disgrace in writing it in assembler.

> > This is not unlike the RISC/CISC debate.  I want simple regular
> > languages (C is a monster of complexity) which I can master completely.
> > I don't care whether it's a ROMP or a WM or a /360 underneath.  I
> > particularly don't want to _have_ to know.

> When I design an algorithm, I _have_ to know.
> I will use different algorithms on different computers,
> and have no problem keeping them straight.  I find it no problem whatever
> in seeing how the hardware interacts with the algorithm and exploiting it.

But then you can only use the machines you have special hacks for.
If you are content to write different algorithms for different machines,
why should you demand to use the same language to write them?

> O'Keefe wants to master
> his language completely, but he is a slave to the language, whether or
> not he knows it.

As I have written programs in several dozen languages and have written
assembly code on nearly a dozen machines, I don't think I'm a slave to
any one language.

> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
                         ^^^^^^^^^^
My statistics programming was done mainly in GLIM.