[comp.lang.c] Multiplying two shorts...

dan@srs.UUCP (Dan Kegel) (08/11/88)

Sun's compilers seem to let you multiply two short integers and get all 
32 product bits, if you code like this:
    register short x, y;
    register long z;

    z = x * y;

If the compiler is nice, it emits the correct (16x16=32) multiply instruction.
If it isn't nice, it emits the slower (32x32=64) instruction and throws
away the top 32 bits.

Do most modern compilers perform this optimization?
It would be nice to be able to depend on it.
-- 
  Dan Kegel   "We had to get it passed before the columnists attacked!"
  srs!dan@cs.rochester.edu  rochester!srs!dan dan%srs.uucp@harvard.harvard.edu

ark@alice.UUCP (08/11/88)

In article <948@srs.UUCP>, dan@srs.UUCP writes:

> Sun's compilers seem to let you multiply two short integers and get all 
> 32 product bits, if you code like this:
>     register short x, y;
>     register long z;

>     z = x * y;

> If the compiler is nice, it emits the correct (16x16=32) multiply instruction.
> If it isn't nice, it emits the slower (32x32=64) instruction and throws
> away the top 32 bits.

> Do most modern compilers perform this optimization?
> It would be nice to be able to depend on it.

The effect of the statement above is undefined.

To get the right answer, you should write

	z = (long) x * (long) y;

Actually, it is enough to cast only one of x and y, so you
can say

	z = (long) x * y;

but that may be a little confusing.  If your compiler is clever,
it will realize that it can use the multiply instruction that
takes short operands into a long result.  If it's not clever,
you'll still get the right answer.

Wouldn't you rather risk having your program get the right answer
slowly rather than risk getting the wrong answer outright?

cik@l.cc.purdue.edu (Herman Rubin) (08/11/88)

In article <8101@alice.UUCP>, ark@alice.UUCP writes:
> In article <948@srs.UUCP>, dan@srs.UUCP writes:
> 
> > Sun's compilers seem to let you multiply two short integers and get all 
> > 32 product bits, if you code like this:
> >     register short x, y;
> >     register long z;
> 
> >     z = x * y;
> 
> > If the compiler is nice, it emits the correct (16x16=32) multiply instruction.
> > If it isn't nice, it emits the slower (32x32=64) instruction and throws
> > away the top 32 bits.
> 
> > Do most modern compilers perform this optimization?
> > It would be nice to be able to depend on it.
> 
> The effect of the statement above is undefined.
> 
> To get the right answer, you should write
> 
> 	z = (long) x * (long) y;
> 
> Actually, it is enough to cast only one of x and y, so you
> can say
> 
> 	z = (long) x * y;
> 
> but that may be a little confusing.  If your compiler is clever,
> it will realize that it can use the multiply instruction that
> takes short operands into a long result.  If it's not clever,
> you'll still get the right answer.
> 
> Wouldn't you rather risk having your program get the right answer
> slowly rather than risk getting the wrong answer outright?

It is a rare compiler which can recognize that it can ignore type
casting and use a different instruction to get a result.  It is hard
enough to get a compiler to do the obvious.  I find the attitude of
"ark" execrable; this is what has produced our current bad software
and also bad hardware.  Unfortunately, hardware manufacturers do not
realize that people like dan and me know how to use machine instructions
which the language gurus have not seen fit to put into the language, and
that hardware constructs beyond the scope of the language can be useful.

We need the type of operator overloading which dan is using.  If the
hardware supports 16 x 16 -> 32 multiplication, it should be done with
that instruction.  On the VAX or PYRAMID or CRAY or CYBER, it can only
be done the way "ark" suggests.  I suggest that the compiler implementer
be required to enable dan to put in his improvement even if the implementer
had not thought of it.

Another example of the neglect is the problem of the power operator.  This
has been discussed in this group, and I will not elaborate.  However, the
minimalists (do it using the current C tools) do not even meet "ark"s
requirement of getting the correct answer slowly.

How about the use of long long?  I needed to compute a function which
required unpacking a double into an exponent field and a mantissa (long long),
doing integer and boolean arithmetic starting with the mantissa, etc.,
and putting things together to get a double result.  Needless to say, I
had to do this in assembler.  BTW, my biggest difficulty was the fact that
hex was not the default.
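
For concreteness, a sketch of the kind of unpacking I have in mind (it assumes
an IEEE 754 double and an 8-byte unsigned long long -- exactly the type I am
asking for -- and the bit positions are properties of that format, not anything
the language itself guarantees):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            double x = 6.75;
            unsigned long long bits, mantissa;
            int sign, exponent;

            /* assumes sizeof(double) == sizeof(unsigned long long) == 8 */
            memcpy(&bits, &x, sizeof bits);
            sign     = (int) (bits >> 63);
            exponent = (int) ((bits >> 52) & 0x7FF) - 1023;         /* remove the bias */
            mantissa = (bits & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);  /* implicit leading 1
                                                                       (normal numbers only) */

            printf("sign=%d exponent=%d mantissa=%#llx\n", sign, exponent, mantissa);
            return 0;
        }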

How about multiplying two longs to get a long long, both signed and unsigned?
Few machines have the unsigned version, cheap in hardware but expensive in
software.  And how about division?  One much needed instruction, cheap in
hardware and expensive in software, is dividing a float (of any length) by
a float and getting an integer quotient and a floating remainder.  And there
was the long discussion in this group about how integer/integer should be
treated if they are not positive.
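
The float-divide operation, for instance, has to be spelled out through the
library, at exactly the kind of software cost I mean; a rough sketch using
nothing beyond <math.h>:

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            double x = 22.0, y = 7.0;
            double q = floor(x / y);    /* integer-valued quotient, held as a double */
            double r = x - q * y;       /* floating remainder consistent with that quotient */

            printf("quotient=%ld remainder=%g\n", (long) q, r);
            return 0;
        }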

We need software which allows those who understand their machine to exploit
the capabilities.  We need hardware which provides the capabilities.  We do
not need to be told that "this is not portable; you must not do it."
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

braner@batcomputer.tn.cornell.edu (braner) (08/12/88)

[]

I ran into a related bug in Laser C on the Atari ST.  The (legal) code was:
	struct foo *p;		/* 32-bit pointers */
	int n;			/* 16-bit ints <<<<<<<<<<<<<<<<<<<< */
				/* (In my view the right thing on a 68000) */
	...
	p += n;			/* <<<< here was the bug */
The compiler, it turns out, compiled that last statement as follows:
	load n
	load sizeof(struct foo)
	multiply using the 68000 mulu instruction (16*16=32 bits)
	truncate to an int (16 bits)	<<<<<<<<<< !!!!!!
	extend to a long (32 bits)
		(actually the last two were done together as ext.l d#)
	long-add to p
Of course, for (n > 64K/sizeof(struct foo)) the address thus calculated
was wrong.  The ironic part of it was that the compiler had the 32 bit
result right there, and had to stomp on it with the extra "ext.l"...
When coding the related C expression
	&p[n]
the same compiler did it correctly -- but slowly, as:
	load n
	extend to long
	load sizeof(struct foo) as a LONG constant
	call the long-multiply library function (32*32=32 bits)
	long-add result to p
this time not using the perfectly adequate "mulu" machine instruction,
which could be recognized as adequate since 'n' was a 16-bit int and
sizeof(struct foo), a constant known at compile time, was 6 bytes (< 64K).

I was forced to replace "p+=n;" with "p=&p[n];" and take the performance loss.
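
Here is the pattern in isolation, for anyone who wants to try it on their own
compiler (the struct members are made up; only the 6-byte size and the two
spellings of the same address computation matter).  A correct compiler prints
"same address" for both forms; Laser C got the first one wrong:

        #include <stdio.h>

        struct foo { short a, b, c; };          /* 6 bytes on the compilers in question */

        int main(void)
        {
            static struct foo table[20000];
            struct foo *p = table, *q1, *q2;
            int n = 15000;                      /* n * sizeof(struct foo) is well over 64K */

            q1 = p;
            q1 += n;                            /* the form that was miscompiled */
            q2 = &p[n];                         /* the workaround: the same address */

            printf("%s\n", q1 == q2 ? "same address" : "different address");
            return 0;
        }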

Happy debugging!

- Moshe Braner

(Try debugging on a parallel machine :-)

bill@proxftl.UUCP (T. William Wells) (08/12/88)

In article <8101@alice.UUCP> ark@alice.UUCP writes:
: In article <948@srs.UUCP>, dan@srs.UUCP writes:
:
: > Sun's compilers seem to let you multiply two short integers and get all
: > 32 product bits, if you code like this:
: >     register short x, y;
: >     register long z;
:
: >     z = x * y;
:
: > If the compiler is nice, it emits the correct (16x16=32) multiply instruction.
: > If it isn't nice, it emits the slower (32x32=64) instruction and throws
: > away the top 32 bits.
:
: > Do most modern compilers perform this optimization?
: > It would be nice to be able to depend on it.
:
: The effect of the statement above is undefined.

That is not true.  The bit of code above is equivalent to:

	register short x, y;
	register long z;
	int     tmp1;
	int     tmp2;
	int     tmp3;

	tmp1 = (int)x;
	tmp2 = (int)y;
	tmp3 = tmp1 * tmp2;
	z = (long)tmp3;

What this means is that if the product of the values of the two
shorts will fit in an int, the code will work as expected.  If
not, then the result IS undefined.  To make this statement work
do:

:       z = (long) x * y;

as suggested.  And, as long as the product of the two shorts will
fit in a long, this will work.
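
A small sketch of the difference, assuming a machine where short and int are
16 bits and long is 32 (as on the machines being discussed); there the first
multiply is done at int width and can overflow, while the cast forces the
second one to long width:

        #include <stdio.h>

        int main(void)
        {
            short x = 1000, y = 1000;   /* the product, 1000000, needs more than 16 bits */
            long  overflowed, widened;

            overflowed = x * y;         /* multiplied at int width; undefined if int is 16 bits */
            widened    = (long) x * y;  /* the cast forces a long multiply */

            printf("%ld %ld\n", overflowed, widened);
            return 0;
        }

On a machine with 32 bit ints both assignments happen to print 1000000, which
is exactly why this bug travels so well.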


---
Bill
novavax!proxftl!bill

radford@calgary.UUCP (Radford Neal) (08/14/88)

In article <948@srs.UUCP>, dan@srs.UUCP writes:

> Sun's compilers seem to let you multiply two short integers and get all 
> 32 product bits, if you code like this:
>     register short x, y;
>     register long z;
>
>     z = x * y;

In article <8101@alice.UUCP>, ark@alice.UUCP writes:

> The effect of the statement above is undefined.
> 
> To get the right answer, you should write
> 
> 	z = (long) x * (long) y;
> ...
> Wouldn't you rather risk having your program get the right answer
> slowly rather than risk getting the wrong answer outright?

In article <864@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:

> It is a rare compiler which can recognize that it can ignore type
> casting and use a different instruction to get a result.  It is hard
> enough to get a compiler to do the obvious.  I find the attitude of
> "ark" execrable...

Very puzzling. Why is informing people that their programs may break
when moved to another compiler such a crime? If you want to argue that
the C language _ought_ to be defined so that the product of two shorts
is computed to long precision, that's another issue, though some might
object to the code this would imply for:

       int a, b, c, d;

       a = (a*b*c*d) % 10;

By your logic, the intermediate result (a*b*c*d) should be computed to
128-bit precision, assuming int's are 32 bits.

> We need software which allows those who understand their machine to exploit
> the capabilities.

We already have such software. It's called an assembler.

> We need hardware which provides the capabilities...

If you're willing to pay enough, I'm sure someone will oblige. Actually, 
your best bet in this case would seem to be RISC machines with software
multiplication and division, then you can do whatever you want.

> We do not need to be told that "this is not portable; you must not do it."

Any professional programmer ought to realize when a programming technique
is not portable. Whether they choose to do it anyway is another matter.

In fact, "ark" was arguing that including casts got you both - a portable
program, and a fast one if the compiler is smart. Contorting your program
to get fast code out of dumb compilers is a technique of limited usefulness.

     Radford Neal

throopw@xyzzy.UUCP (Wayne A. Throop) (08/16/88)

> From: cik@l.cc.purdue.edu (Herman Rubin)
>> ark@alice.UUCP writes:
>> Wouldn't you rather risk having your program get the right answer
>> slowly rather than risk getting the wrong answer outright?

> [...] I find the attitude of
> "ark" execrable; this is what has produced our current bad software
> and also bad hardware.

"Thud."  Sorry, my jaw just hit the floor.  It is hard to conceive of
anybody finding ark's "attitude" anything perjorative, let long
"excrable".  Look, folks, let me clue you in.  I don't care how fast
your CPU, how clever your legions of coders and microcoders, nor do I
care how many femtoseconds can be shaved by using the software that
results upon such a CPU.  If the silly thing gets the WRONG ANSWER, it
just doesn't matter how fast it is delivered.  I simply don't need a
hyper-fast pile of junk driven by oh-so-clever bit-twiddlers telling me
that pi is 3.  (Or, to use real examples, drawing multiple copies of my
cursor, or aborting a login session because one program is ill, or
crashing a system because one program uses "too much" memory, or any
other of the "wrong answers" I am given every day in the name of
"speed").

Hmpf.

--
Nothing is impossible.  Some things are just less likely than others.
                        --- Jonathan Winters in "The Twilight Zone"
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

Paul_L_Schauble@cup.portal.com (08/17/88)

> From: cik@l.cc.purdue.edu (Herman Rubin)
>> ark@alice.UUCP writes:
>> Wouldn't you rather risk having your program get the right answer
>> slowly rather than risk getting the wrong answer outright?

> [...] I find the attitude of
> "ark" execrable; this is what has produced our current bad software
> and also bad hardware.

Gee. If you don't mind getting the wrong answer, I can make ANY program
run a lot faster.

        RESULT = 0.
        RETURN
        END
 
  --PLS

jfh@rpp386.UUCP (The Beach Bum) (08/17/88)

In article <576@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes:
>What this means is that if the product of the values of the two
>shorts will fit in an int, the code will work as expected.  If
>not, then the result IS undefined.

since this discussion originally involved 16x16=32 multiply instructions,
i thought it might be nice to point out that any two 16-bit numbers can
be multiplied, and the result will always fit in 32 bits, without 
overflow.

unsigned:	FFFF x FFFF = FFFE0001
signed:		7FFF x 7FFF = 3FFF0001
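
a quick check of those worst cases, done entirely in long arithmetic so that
no int promotion gets in the way:

        #include <stdio.h>

        int main(void)
        {
            /* worst-case 16x16 products: both fit comfortably in 32 bits */
            unsigned long umax = 0xFFFFUL * 0xFFFFUL;   /* 0xFFFE0001 */
            long          smax = 0x7FFFL * 0x7FFFL;     /* 0x3FFF0001 */

            printf("unsigned: %#lx\nsigned:   %#lx\n", umax, (unsigned long) smax);
            return 0;
        }
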
-- 
John F. Haugh II                 +--------- Cute Chocolate Quote ---------
HASA, "S" Division               | "USENET should not be confused with
UUCP:   killer!rpp386!jfh        |  something that matters, like CHOCOLATE"
DOMAIN: jfh@rpp386.uucp          |         -- apologies to Dennis O'Connor

jlg@beta.lanl.gov (Jim Giles) (08/20/88)

In article <1810@vaxb.calgary.UUCP>, radford@calgary.UUCP (Radford Neal) writes:
>        int a, b, c, d;
> 
>        a = (a*b*c*d) % 10;
> 
> By your logic, the intermediate result (a*b*c*d) should be computed to
> 128-bit precision, assuming int's are 32 bits.

That's not necessarily true.  Your particular example does not require
any precision above 32 bits.  The exact answer for the given statement
is the same as for:

	int a, b, c, d;

	a = ((a%10) * (b%10) * (c%10) * (d%10)) % 10;

or (with 64-bit intermediates):

	a = (((a*b)%10) * ((c*d)%10)) % 10;

Of course, neither of these really works in C because parentheses
don't force execution order.  But the compiler is free to do the
computation either way (or any of several other ways).  The same
is true if some of the multiplies are replaced by adds or subtracts.
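
A quick way to convince yourself of the equivalence (with non-negative
operands, so the sign of % never enters into it); the exact product at the
end is computed in a 64-bit unsigned long long purely as a check:

        #include <stdio.h>

        int main(void)
        {
            /* values whose full product overflows 32 bits */
            long a = 50021, b = 40013, c = 3001, d = 97;

            /* reduce each factor mod 10 first: every intermediate stays small */
            long reduced = ((a % 10) * (b % 10) * (c % 10) * (d % 10)) % 10;

            /* exact product in 64 bits, purely as a check */
            unsigned long long exact = 1ULL * a * b * c * d;

            printf("reduced: %ld   exact %% 10: %llu\n", reduced, exact % 10);
            return 0;
        }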

I tend to agree with those who claim a 'correct' answer is better
than a fast answer.  The problem is that C doesn't specify what the
'correct' answer is.  If C explicitly said that all integer arithmetic
should be carried out modulo 2^(wordsize), then (a*b*c*d)%10 should be
done the obvious fast way - people who naively assume that the program
should give the mathematically 'correct' answer would deserve what they
got.  On the other hand, if C explicitly said that all integer arithmetic
was mathematically exact (as long as the answer didn't overflow), then
one of the more careful methods of evaluating the result should be used.

This whole problem arises because neither K&R C nor the ANSI standard
says how the arithmetic should be done.

J. Giles