JBS@IBM.COM (02/21/91)
Herman Rubin says:

  Do you mean to tell me that computations involving both integers and
  floating-point numbers are not important? Or that dividing floats by
  floats, obtaining an integer quotient and a floating remainder,
  likewise? That particular step is the first step of any trigonometric
  or exponential function computation when it is not known in advance
  that the argument is small. There are other periodic and related
  functions for which this is useful. It would also speed up
  interpolation, etc.

Since argument reduction for the trigonometric functions is done in
practice by multiplying by 1/pi I do not see how it would benefit from
your proposed instruction. I believe this instruction would have little
general utility.

An efficient way to do the fortran "x=dint(y)" is important. This is
lacking on the Risc System 6000 as was pointed out in a Los Alamos
report. By the way is it true that it is impossible to reasonably
express this in ADA?

Peter Montgomery says:

  Yes, most programs are written in these languages. As Dan says, the
  language designers made mistakes. During one review period for
  Fortran 90, I requested an operation which takes four nonnegative
  integers a, b, c, n with a < n or b < n (c is optional and defaults
  to 0). The requested operation returns q and/or r, where

        a*b + c = q*n + r   and   0 <= r < n

Why didn't you ask for an integer*8 data type? Wouldn't that give you
what you want and have much more general utility?

I would like an instruction which counts the number of ones in the
binary representation of an integer but I find it hard to argue that
this would be widely used.
                               James B. Shearer
dik@cwi.nl (Dik T. Winter) (02/21/91)
In article <9102210042.AA12291@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
 >> Herman Rubin says:
 >> Or that dividing floats by floats, obtaining an integer quotient and
 >> a floating remainder, likewise? That particular step is the first
 >> step of any trigonometric or exponential function computation when
 >> it is not known in advance that the argument is small.
 > Since argument reduction for the trigonometric functions is done in
 > practice by multiplying by 1/pi I do not see how it would benefit
 > from your proposed instruction.
You must have a very limited experience. You do it only by multiplying
by 1/pi if you are willing to forego a lot of precision. (Check your
favorite hand-held calculator. If sin(pi) = 0, the implementation is
wrong!)
 >> Peter Montgomery says:
 >> I requested an operation which takes four nonnegative integers
 >> a, b, c, n with a < n or b < n (c is optional and defaults to 0).
 >> The requested operation returns q and/or r, where
 >>       a*b + c = q*n + r   and   0 <= r < n
 > Why didn't you ask for an integer*8 data type? Wouldn't that give
 > you what you want and have much more general utility?
No. That would require that the hardware has some support for 64 bit
ints. What Peter Montgomery asks is: given three ints (a, b, c),
multiply a and b together to a double size int; add c to it; calculate
quotient and remainder when dividing by n. Most 32 bit processors
provide the instructions to do just that. Asking for an integer*8 data
type would be asking for a 64*64 multiply which would then be used, and
would be much slower.
 > I would like an instruction which counts the number of ones in the
 > binary representation of an integer but I find it hard to argue that
 > this would be widely used.
Ask Seymour Cray why that instruction was put in the Cray-1 as an
afterthought.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/21/91)
In article <9102210042.AA12291@ucbvax.Berkeley.EDU>, JBS@IBM.COM writes:
> > Herman Rubin says:
> > Do you mean to tell me that computations involving both integers
> > and floating-point numbers are not important? Or that dividing
> > floats by floats, obtaining an integer quotient and a floating
> > remainder, likewise? That particular step is the first step of any
> > trigonometric or exponential function computation when it is not
> > known in advance that the argument is small. There are other
> > periodic and related functions for which this is useful. It would
> > also speed up interpolation, etc.
>
> Since argument reduction for the trigonometric functions is done in
> practice by multiplying by 1/pi I do not see how it would benefit
> from your proposed instruction. I believe this instruction would
> have little general utility.

There is somewhat greater loss of accuracy in this, and it is still
needed to extract the integer part to an integer register and the
fractional part. Thus, it needs at least three operations, instead of
one. Also, if one is calculating something like x - ln(1+x), a natural
operation in certain problems, the computing problems become a little
larger than one would expect. In fact, to avoid a loss of accuracy in
some quite usual situations, it would even be a good idea to have
Boolean operations on floats, and unnormalized floating arithmetic.
Floating arithmetic, apart from some preliminaries, and normalization
problems, is exactly the same as integer, so why should there even be a
separate arithmetic unit?

> An efficient way to do the fortran "x=dint(y)" is important. This is
> lacking on the Risc System 6000 as was pointed out in a Los Alamos
> report. By the way is it true that it is impossible to reasonably
> express this in ADA?
> Peter Montgomery says:
> Yes, most programs are written in these languages. As Dan says, the
> language designers made mistakes. During one review period for
> Fortran 90, I requested an operation which takes four nonnegative
> integers a, b, c, n with a < n or b < n (c is optional and defaults
> to 0). The requested operation returns q and/or r, where
>
>       a*b + c = q*n + r   and   0 <= r < n
>
> Why didn't you ask for an integer*8 data type? Wouldn't that give
> you what you want and have much more general utility?

Yes and no. It would have much more general utility, but it would do an
abysmally inefficient job in this situation. You would need to have a
way of indicating that the product of two integer*4s is an integer*8,
which I do not know how to do in any language with which I am familiar
without writing it as a function, and I do not think that one should
have to write mult48(a,b) instead of a*b. In addition, how would you
write the operation which, when applied to a*b+c and n, yields both q
and r? Is the reason it is difficult to get double length
multiplication, and division with quotient and remainder, that the
language designers left these out?

As far as adding this type, why not allow the user to add any type?
This was present for up to three additional types in Fortran VI(?) on
the CDC 3600.

> I would like an instruction which counts the number of ones in the
> binary representation of an integer but I find it hard to argue that
> this would be widely used.

Cray seems to find this one important enough to include on his
machines, which have a real paucity of instructions.

I think you will find that those who want to add "bizarre" instructions
in both hardware and software to do what the present stuff does not do
well understand the problems of both, and have some idea of what is
feasible at reasonable cost.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (02/22/91)
In article <2998@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
favorite hand-held calculator. If sin(pi) = 0, the implementation is
wrong!)
No. Since that is what people expect, many library writers go to
special pains to make sure that this case works out "right".
Historically, Suns (for example) provided user selectable pi precision
(see the Numerical Computation Guide for details) .... this hasn't
proved to be very useful because few know or seem to care; but has
caused a bit of grief once or twice.
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
new@ee.udel.edu (Darren New) (02/22/91)
How about this? A language in which it is possible to write functions
in assembler and have them inlined automatically, and to have the
compiler smart enough to do dataflow analysis on the resultant code and
such. Possibly some syntax close to what PCC uses could be used inside
such functions. I would imagine that GCC is close to capable of doing
this if it isn't already. The minor problem of efficient multi-variable
returns might need to be worked out.

Then, the only stumbling block would be the flexible syntax for
procedure calls, which could be addressed with a preprocessor rather
than an entire language.

Am I missing something here, or is this problem really as easy to solve
as it seems?           -- Darren
--
--- Darren New --- Grad Student --- CIS --- Univ. of Delaware ---
----- Network Protocols, Graphics, Programming Languages,
Formal Description Techniques (esp. Estelle), Coffee, Amigas -----
=+=+=+ Let GROPE be an N-tuple where ... +=+=+=
JBS@IBM.COM (02/22/91)
Regarding my comment:
> Since argument reduction for the trigonometric functions is done in
> practice by multiplying by 1/pi I do not see how it would benefit
> from your proposed instruction.
Dik Winter says:
> You must have a very limited experience. You do it only by
> multiplying by 1/pi if you are willing to forego a lot of precision.
> (Check your favorite hand-held calculator. If sin(pi) = 0, the
> implementation is wrong!)

I don't understand this comment. This problem can occur with the divide
with remainder instruction as well. The problem is that approximating
x mod pi by x mod y, where y is the machine representation of pi, may
lose all accuracy.

Herman Rubin says:
> There is somewhat greater loss of accuracy in this, and it is still
> needed to extract the integer part to an integer register and the
> fractional part. Thus, it needs at least three operations, instead
> of one. Also, if one is calculating something like x - ln(1+x), a
> natural operation in certain problems, the computing problems become
> a little larger than one would expect. In fact, to avoid a loss of
> accuracy in some quite usual situations, it would even be a good
> idea to have Boolean operations on floats, and unnormalized floating
> arithmetic. Floating arithmetic, apart from some preliminaries, and
> normalization problems, is exactly the same as integer, so why
> should there be a separate arithmetic unit even?

I don't understand the comments about loss of accuracy. As noted above
some or all of the bits of the remainder will be bogus because pi is
not a machine number. Accurate argument reduction requires first
finding the integer part, n, of the quotient then computing x-n*pi
carefully using pi to more than machine precision. It is more
convenient to do this if n is in floating format. Counting operations
is a poor test of speed when floating divide typically takes 5 times or
more as long as floating multiply.

I don't understand the reference to x-ln(1+x). This will always lose
accuracy if evaluated in this form for x near 0. It is usually possible
to perform boolean operations on floats although it may be a little
awkward. I presume the reason for a separate floating point unit is
that this leads to faster machines.

Regarding my suggestion for an integer*8 type Herman Rubin comments:
> Yes and no. It would have much more general utility, but it would do
> an abysmally inefficient job in this situation. You would need to
> have a way of indicating that the product of two integer*4s is an
> integer*8, which I do not know how to do in any language with which
> I am familiar without writing it as a function, and I do not think
> that one should have to write mult48(a,b) instead of a*b. In
> addition, how would you write the operation which, when applied to
> a*b+c and n, yields both q and r?

Regarding a function to get the 8 byte product of 2 4 byte integers I
don't see why this is any worse than Montgomery's function. In any case
it is not needed. Let i4,j4 be 4 byte and i8,j8,k8 be 8 byte. Then
write
      i8=i4
      j8=j4
      k8=i8*j8
A sufficiently intelligent compiler will do the right thing. It may
work to write
      k8=i4*j4
This sometimes works in the analogous case where k8 is 4 bytes and i4
and j4 are 2 bytes. However I am not sure strictly speaking that it
should. What does the Fortran standard say?

As for getting both q and r this is easy, just write
      iq=ia/ib
      ir=mod(ia,ib)
A reasonable compiler will only generate 1 divide with remainder. I
will confess however that while this works when everything is 4 bytes I
don't quite see how to make it work when ia is 8 bytes (since it seems
unreasonable to define the quotient of an 8 byte integer by a 4 byte
integer to be 4 bytes, and if it is defined to be 8 bytes it is then
unsafe to use the usual 8 byte by 4 byte giving 4 byte quotient
instruction).
                               James B. Shearer
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/22/91)
new@ee.udel.edu (Darren New) wrote:
>How about this? A language in which it is possible to write functions
>in assembler and have them inlined automatically, and to have the
>compiler smart enough to do dataflow analysis on the resultant code and
>such. Possibly some syntax close to what PCC uses could be used inside
>such functions. I would imagine that GCC is close to capable of doing
>this if it isn't already. The minor problem of efficient multi-variable
>returns might need to be worked out.

GCC is already. I saw a 68881 floating-point include file for gcc that
was nothing more than

static inline const double sin(register double x)
{
	register double ans;

	asm("fsin %1,%0" : "=f" (ans) : "f" (x));
	return ans;
}

The syntax is a bit cryptic to express everything you can tell the
optimizer about, but basically, the first string is a pattern, and the
colon-separated parts are output and input operands (you can add
another section for clobbered registers). "=f" means it needs an "f"
type operand (which the machine description uses for floating point
args) and it's assigned to. "f" means it's not assigned to.

Gcc will merge your code in with the caller, hoist sin() out of loops
it's invariant in (const in the function type says you can do that with
this function, although once it's inlined, gcc can see that for
itself), and generally behave as if sin() was an intrinsic rather than
a library call. You can also express the idea that a given input is not
read until after the outputs are written, that an input and output have
to be in the same place, and other things.

>Then, the only stumbling block would be the flexible syntax for
>procedure calls, which could be addressed with a preprocessor rather
>than an entire language.
>
>Am I missing something here, or is this problem really as easy to
>solve as it seems? -- Darren

Well, the syntax is clumsy for really large blocks, but it works well
for dropping, for example, an atomic test-and-set or compare-and-swap
into some C code. I believe the reason it isn't more commonly done is
that few compilers have the strong separation gcc does between the
compiler and the machine description, so you can't easily add a
description of an instruction at run time and have the optimizer work
with it smoothly.
--
	-Colin
khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (02/23/91)
In article <15485@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
approximation you use). Any user who needs trig functions should be
_expected_ to know that PI is not a rational number and that floating
Tell that to a spreadsheet user. It is impossible to legislate what
people _will_ expect. We can pontificate on what they should, but
unless we are working in the education field we can't expect it work.
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/23/91)
In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> > I would like an instruction which counts the number of ones in
> > the binary representation of an integer but I find it hard to argue
> > that this would be widely used.
> Ask Seymour Cray why that instruction was put in the Cray-1 as an
> afterthought.

I haven't seen Seymour posting here, and I don't often meet him. Now
that I am suitably impressed, would you share with us what he would
tell us if we did ask him?

> --
> dik t. winter, cwi, amsterdam, nederland
> dik@cwi.nl

dan herrick
herrickd@iccgcc.decnet.ab.com
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/23/91)
new@ee.udel.edu (Darren New) wrote:
>How about this? A language in which it is possible to write functions
>in assembler and have them inlined automatically, and to have the
>compiler smart enough to do dataflow analysis on the resultant code and
>such. Possibly some syntax close to what PCC uses could be used inside
>such functions. I would imagine that GCC is close to capable of doing
>this if it isn't already. The minor problem of efficient multi-variable
>returns might need to be worked out.

Here's a (sort of) working example using gcc 1.39 on a uVAX. "g" is
the "general" operand class that covers all the VAX addressing modes.

int main()
{
	int i, j, k, l;

	i = 12;
	for (k = 0; k < 10; k++) {
		asm("foo %1,%0": "=g" (j) : "g" (i));
		asm("bar %1,%0": "=g" (l) : "g" (i));
	}
	printf("%d\n", j);
	return 0;
}

Note that foo is loop-invariant, while bar is dead.
gcc -O -fstrength-reduce produces:

#NO_APP
gcc_compiled.:
.text
LC0:
	.ascii "%d\12\0"
	.align 1
.globl _main
_main:
	.word 0x0
#APP
	foo $12,r1
#NO_APP
	movl $9,r0
L4:
	decl r0
	jgeq L4
	pushl r1
	pushab LC0
	calls $2,_printf
	clrl r0
	ret

I'm slightly annoyed by gcc's failure to delete the empty loop.
Testing reveals it fails to delete even trivially empty loops. At
least it's not a common case.

Interestingly, if we make bar depend on the loop index k, it's still
dead code, and still deleted, but it inhibits the loop optimisation to
count-down form. I guess it's the order in which optimisations are
applied. The code is deleted after the loop is generated in the
count-up form:

int main()
{
	int i, j, k, l;

	i = 12;
	for (k = 0; k < 10; k++) {
		asm("foo %1,%0": "=g" (j) : "g" (i));
		asm("bar %1,%0": "=g" (l) : "g" (k));
	}
	printf("%d\n", j);
	return 0;
}

#NO_APP
gcc_compiled.:
.text
LC0:
	.ascii "%d\12\0"
	.align 1
.globl _main
_main:
	.word 0x0
	clrl r0
#APP
	foo $12,r1
#NO_APP
L4:
	incl r0
	cmpl r0,$9
	jleq L4
	pushl r1
	pushab LC0
	calls $2,_printf
	clrl r0
	ret

Oh, well, 2.0 is coming, and there has to be something to fix...
--
	-Colin
sef@kithrup.COM (Sean Eric Fagan) (02/23/91)
In article <3386.27c55e18@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>I haven't seen Seymour posting here, and I don't often meet him.
>Now that I am suitably impressed, would you share with us what
>he would tell us if we did ask him?

Pop count. Count the set bits in a word. CXi on the CDC Cyber series.
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
glew@pdx007.intel.com (Andy Glew) (02/23/91)
> > > I would like an instruction which counts the number of ones in
> > > the binary representation of an integer but I find it hard to
> > > argue that this would be widely used.
> > Ask Seymour Cray why that instruction was put in the Cray-1 as an
> > afterthought.
>
> I haven't seen Seymour posting here, and I don't often meet him.
> Now that I am suitably impressed, would you share with us what
> he would tell us if we did ask him?

I don't know about Cray, but I'm told that POP count was put in the
Gould machines (which tended to be used for things like telemetry, i.e.
listening to and decoding Russian phone calls and missile launches) for
the security boys.
--
Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, Hillsboro, Oregon 97124-6497
vhs@britesun.radig.de (Volker Herminghaus-Shirai) (02/24/91)
JBS@IBM.COM writes:
> I would like an instruction which counts the number of ones in
> the binary representation of an integer but I find it hard to argue
> that this would be widely used.

It does seem to be widely used indeed; as far as I know the inmos
transputer T800 has this instruction, along with other instructions
like inverting the sequence of the bits in an int etc. I think those
are used for CRC calculation and picture processing, though I'm not
sure about that.
--
Volker Herminghaus-Shirai (vhs@britesun.radig.de)
panic: 80x86! Trying to vomit...
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/24/91)
In article <6503@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>I think you will find that those who want to add "bizarre" instructions in
>both hardware and software to do what the present stuff does not do well
>understand the problems of both, and have some idea of what is feasible
>at reasonable cost.

In fact, the old literature (and old machines!) are full of "bizarre"
instructions that proved ill-advised. A few examples:

- "uninterruptible" instructions which complicated the memory interface
  to the point of slowing down normal accesses.

- addressing modes which studies could not find a single use of.
  (PDP-11 autodecrement deferred - omitted from the VAX.)

- my personal favorite, the Star-100 instruction which took Fortran
  source (a vector of characters), and returned a vector with: comments
  and sequence numbers stripped; continuation lines catenated; and
  non-significant blanks elided. This was done because of the
  then-famous observation that the XPL compiler spent more time in
  "get next character" than in any other routine. The drawbacks,
  however, are serious. How to add new compiler directives? How to map
  line numbers back to the original line numbers, so as to have decent
  error messages? And most important, why bother, since modern
  compilers now spend their time elsewhere.

- The Symbol machine, which put the entire compiler in hardware. Do
  not be fooled by all the laudatory retrospective articles. They were
  written by the machine's designers, and the machine's gross failure
  is underdescribed in the literature.

I collect these little bird droppings of history. Any other favorites
(that hold lessons)?
--
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
pmk@craycos.com (Peter Klausler) (02/24/91)
In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> > I would like an instruction which counts the number of ones in
> > the binary representation of an integer but I find it hard to argue
> > that this would be widely used.
> Ask Seymour Cray why that instruction was put in the Cray-1 as an
> afterthought.

Almost right. A scalar population count instruction was not new to the
CRAY-1, since the CDC 6600 and descendants all supported it ("CXi Xk",
opcode 47). What was missing on some CRAY-1As and CRAY-1Bs was a
vector population count instruction. The S and M models do support
that. The CRAY-1 (and X-MP/Y-MP follow-ons) *is* missing a vector
leading zero count instruction, though it's there in scalar form. You
need a CRAY-2 for the vector version. Both operations are quite useful,
as has been noted in comp.arch often in the past.
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/25/91)
In article <9102220245.AA14853@ucbvax.Berkeley.EDU>, JBS@IBM.COM writes:
	................
> Regarding a function to get the 8 byte product of 2 4 byte integers
> I don't see why this is any worse than Montgomery's function. In any
> case it is not needed. Let i4,j4 be 4 byte and i8,j8,k8 be 8 byte.
> Then write
>       i8=i4
>       j8=j4
>       k8=i8*j8
> A sufficiently intelligent compiler will do the right thing. It may
> work to write
>       k8=i4*j4
> This sometimes works in the analogous case where k8 is 4 bytes and
> i4 and j4 are 2 bytes. However I am not sure strictly speaking that
> it should. What does the Fortran standard say?

The first approach would require a very intelligent compiler, indeed.
The second approach is the "right" one. However, every standard I have
read would try the first. Since i8*j8 might not even exist on the
machine, at best an inlined function would have to be brought in
instead of the single hardware instruction for k8=i4*j4. I have never
seen any language description in which the type of the result in a
replacement statement affected what operation is performed. The closest
I know is the C cast, which may or may not work: k8 = (int8)i4*j4.

> As for getting both q and r this is easy, just write
>       iq=ia/ib
>       ir=mod(ia,ib)
> A reasonable compiler will only generate 1 divide with remainder. I
> will confess however that while this works when everything is 4
> bytes I don't quite see how to make it work when ia is 8 bytes
> (since it seems unreasonable to define the quotient of an 8 byte
> integer by a 4 byte integer to be 4 bytes, and if it is defined to
> be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte
> giving 4 byte quotient instruction).
>                               James B. Shearer

I do not see why the case of ia being 8 bytes is any worse. Sure, there
is the problem of overflow, but so what?
Also, it should never be necessary to do something as complicated as
the mod(ia,ib) notation for basic operators, or even not so basic ones,
any more than it should be necessary to write add(a,b) for a+b. But
what is really needed in notation is something like iq,ir = ia/ib.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
peter@ficc.ferranti.com (Peter da Silva) (02/25/91)
In article <12071@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> - addressing modes which studies could not find a single use of.
>   (PDP-11 autodecrement deferred - omitted from the VAX.)

Oh, that's not nearly as exciting as autodecrementing the PC.
(Immediate and immediate indirect mode were implemented as
autoincrement and autoincrement deferred on the PC.)

	TST -(PC)

(On the other hand, I would expect that autodecrement deferred would
have cost some to leave out, given the PDP-11 instruction coding.)
--
Peter da Silva.  `-_-'  peter@ferranti.com +1 713 274 5180.
                 'U`  "Have you hugged your wolf today?"
augustss@cs.chalmers.se (Lennart Augustsson) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes:
>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
>
>stuff deleted
>
>> As for getting both q and r this is easy just write
>> iq=ia/ib
>> ir=mod(ia,ib)
>>A reasonable compiler will only generate 1 divide with remainder.
>
>I have done numerical analysis work and multi-precision arithmetic on
>many different machines and operating systems.
>
>Only ONCE have I ever seen a compiler bright enough to do the above
>optimization. (Burroughs 6700; a stack machine; Algol 68)

I've seen another one! Well, it's not a compiler, it's the MIPS
assembler which recognises what you are doing. I think this was very
nice, since it meant that I didn't have to build this optimisation into
my compiler; I got it for free from the assembler.
--
Lennart Augustsson
[This signature is intentionally left blank.]
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes:
>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
>... what one usually wants is (A*B + C)/D and (A*B + C) mod D.
>Even on machines that support double length integer multiplies, one
>cannot put the above operations into HLL because the compiler will not
>generate the double length multiply (say 32 x 32 --> 64) nor will it
>then do the (64 / 32 --> 32 bit quotient & remainder). Since A*B can
>overflow 32 bits one is FORCED to call assembler routines to do this.

Ah yes. Clearly the following does not work....

/*
 * return quotient and remainder from (a*b + c) divrem d
 */
#if 0
/*
 * This is the way we would like to do it, but gcc emits one extra
 * instruction, as it is not smart enough to completely eliminate the
 * addressing on r (it uses a register for r, rather than a pointer,
 * but never quite goes all the way).
 */
static __inline int divrem(int a, int b, int c, int d, int *r)
{
	int q;
	double tmp;	/* force reg pair allocation */

	asm("emul %1,%2,%3,%0" : "=g"(tmp) : "g"(a), "g"(b), "g"(c));
	asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(*r) : "r"(tmp), "g"(d));
	return q;
}
#else
/*
 * So instead we will use rather peculiar gcc syntax.
 * Note that the macro uses a, b, c, d, q, and r exactly once each,
 * and thus side effects (*p++, etc.) are safe.
 */
#define divrem(q, r, a, b, c, d) ({ \
	double divrem_tmp; \
	asm("emul %1,%2,%3,%0" : "=g"(divrem_tmp) : \
		"g"(a), "g"(b), "g"(c)); \
	asm("ediv %3,%2,%0,%1" : "=g"(q), "=g"(r) : \
		"r"(divrem_tmp), "g"(d)); \
})
#endif

int a[100], b[100], c[100], d[100];
int q[100], r[100];

void doit(int n)
{
	int i;

	for (i = 0; i < n; i++) {
#if 0
		q[i] = divrem(a[i], b[i], c[i], d[i], &r[i]);
#else
		divrem(q[i], r[i], a[i], b[i], c[i], d[i]);
#endif
	}
}

But wait! Maybe, just *maybe*, we should try it out before dismissing
it. Well goll-ee, it seems to work!
When compiled on a Tahoe (the Tahoe is a `RISC'---a `Reused Instruction
Set Computer'; its emul and ediv are just like those on the VAX) with
`gcc -O -S' this compiles to (compiler comments and other drek
stripped):

_doit:
	.word 0x3c0
	movl 4(fp),r4
	clrl r2
	cmpl r2,r4
	jgeq L13
	movab _a,r9
	movab _b,r8
	movab _c,r7
	movab _q,r6
	movab _r,r5
	movab _d,r3
L12:
	emul (r9)[r2],(r8)[r2],(r7)[r2],r0
	ediv (r3)[r2],r0,(r6)[r2],(r5)[r2]
	incl r2
	cmpl r2,r4
	jlss L12
L13:
	ret

(Note that the Tahoe does not have auto-increment addressing modes, and
this is in fact the best that can be done.) On the VAX the loop changes
to (gcc 1.37.1, -fstrength-reduce -mgnu):

L12:
	emul (r2)+,(r3)+,(r4)+,r0	# the registers
	ediv (r5),r0,r1,r0		# are allocated in
	movl r1,(r6)+			# a different order.
	movl r0,(r7)+
	addl2 $4,r5
	jaoblss r9,r8,L12

Apparently the machine-dependent part has not been taught to combine
`ediv' properly; it should be:

L12:
	emul (r2)+,(r3)+,(r4)+,r0
	ediv (r5)+,r0,(r6)+,(r7)+
	jaoblss r9,r8,L12

A bit of work on vax.md should fix it.

This has its drawbacks: the syntax is distinctly un-pretty, and it
requires gcc, and it is machine-dependent. It does, however, work.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
bs@faron.mitre.org (Robert D. Silverman) (02/26/91)
In article <10244@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
:In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org
:(Robert D. Silverman) writes:
:>In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes:
:>... what one usually wants is (A*B + C)/D and (A*B + C) mod D.
:>Even on machines that support double length integer multiplies, one
:>cannot put the above operations into HLL because the compiler will not
:>generate the double length multiply (say 32 x 32 --> 64) nor will it
:>then do the (64 / 32 --> 32 bit quotient & remainder). Since A*B can
:>overflow 32 bits one is FORCED to call assembler routines to do this.
:
:Ah yes. Clearly the following does not work....
:
 stuff deleted...
:
:This has its drawbacks: the syntax is distinctly un-pretty, and it
:requires gcc, and it is machine-dependent. It does, however, work.

Ah yes. A universal solution. Machine dependent. Ugly syntax. Only
works with one compiler.

Furthermore, it does nothing to contradict the assertions I made above
that the arithmetic could not be done in the HLL and that assembler
routines had to be called. It does inline the subroutine, however.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
terry@venus.sunquest.com (Terry R. Friedrichsen) (02/26/91)
Donald Lindsay writes: >In fact, the old literature (and old machines!) are full of "bizarre" >instructions that proved ill-advised. A few examples: > > [ some good examples removed] > >- addressing modes which studies could not find a single use of. > (PDP-11 autodecrement deferred - omitted from the VAX.) This example differs somewhat from his others, which were basically deliberate design decisions that are not now generally seen as the wise way to go. But the autodecrement-deferred addressing mode arose more out of symmetry (== simple instruction decoding) than from somebody saying "gee, I'll bet that would be a useful addressing mode". The pdp-11 had autoincrement/autodecrement, and it had deferred/non-deferred. The "ill-advised" combination of autodecrement deferred was just a result of symmetrical instruction-decode logic. DEC's pdp-10 instruction set had the same oddities. The machine had a plethora of no-op-equivalent instructions (some of which referenced memory and some of which did not), not because the designers thought it would be advisable to have lots of no-ops, but because the very simple, highly gate-efficient instruction decode logic took advantage of instruction-set symmetries. (I think I have notes around here that say the instruction decode for the original instruction set was done in something like 700 gates - but don't quote this; I may have the number wrong ...) The whole point is just that unused instructions or opcodes don't necessarily equate to stupidity or lack of foresight on the part of the designers. Sometimes they're really a result of cleverness :-). Terry R. Friedrichsen terry@venus.sunquest.com (Internet) uunet!sunquest!terry (Usenet) terry@sds.sdsc.edu (alternate address; I live in Tucson) Quote: "Do, or do not. There is no 'try'." - Yoda, The Empire Strikes Back
hassey@matrix.rtp.dg.com (John Hassey) (02/26/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >In article <9102220245.AA14853@ucbvax.Berkeley.EDU> JBS@IBM.COM writes: >> > >stuff deleted > >> As for getting both q and r this is easy just write >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. > >I have done numerical analysis work and multi-precision arithmetic on many >different machines and operating systems. > >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) > >Furthermore, what one usually wants is (A*B + C)/D and (A*B + C) mod D. >(see Peter Montgomery's requests) >Even on machines that support double length integer multiplies, one >cannot put the above operations into HLL because the compiler will not >generate the double length multiply (say 32 x 32 --> 64) nor will it >then do the (64 /32 --> 32 bit quotient & remainder). Since A*B can overflow >32 bits one is FORCED to call assembler routines to do this. Calling an >assembler routine for simple arithmetic is EXPENSIVE because of subroutine >call overhead. > > I disagree ! Below is the output produced by gcc on a 386. It looks like gcc is able to handle both of these cases. It does cheat a bit: in the multiply it uses only the lower 32 bits of the result (even though the machine produces 64). This would be a fairly simple problem to fix. 
===== sample 1 =====

int i,j,k,l;

main()
{
	i = k/l;
	j = k%l;
}

	.file	"showem.c"
gcc_compiled.:
.text
	.align 4
.globl main
main:
	pushl %ebp
	movl %esp,%ebp
	subl $4,%esp
	movl k,%eax
	cltd
	movl %eax,k
	idivl l
	movl %eax,i
	movl %edx,j
	leave
	ret
	.comm l,4
	.comm k,4
	.comm j,4
	.comm i,4

==== sample 2 ====

int a,b,c,d,j,k;

main()
{
	j = (a*b+c)/d;
	k = (a*b+c)%d;
}

	.file	"showem2.c"
gcc_compiled.:
.text
	.align 4
.globl main
main:
	pushl %ebp
	movl %esp,%ebp
	movl a,%eax
	imull b,%eax
	addl c,%eax
	cltd
	idivl d
	movl %eax,j
	movl %edx,k
	leave
	ret
	.comm k,4
	.comm j,4
	.comm d,4
	.comm c,4
	.comm b,4
	.comm a,4

======
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
[Bob Silverman asks for a way to compute (A * B + C) divrem D and claims that `Even on machines that support double length integer multiplies, one cannot put the above operations into HLL' (because the compiler cannot use 32x32=>64 bit operations) and that `one is FORCED to call assembler routines to do this'.] In article <10244@dog.ee.lbl.gov> I demonstrate that the last claim is false. >>This has its drawbacks: the syntax is distinctly un-pretty, and it >>requires gcc, and it is machine-dependent. It does, however, work. In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes: >Ah yes. A universal solution. Machine dependent. Ugly syntax. Only >works with one compiler. Do you want a machine-independent solution for using the machine instruction(s) that compute(s) (A * B + C) divrem D? This is clearly impossible, since some machines have no such instructions. Do you want a machine-independent (but language-dependent) syntax for expressing `(A * B + C) divrem D', and one which does not require a subroutine call per operation? This is what I provided. If you believe you can improve on the syntax, go to it; knock yourself out. (Incidentally, the only reason it `only works with one compiler' is that only one compiler compiles the particular language GCC implements. GCC is not a C compiler [well, it is if you run it with `-ansi -pedantic']. There is, however, no constraint that says that GCC's language may not be adopted by other compilers.) I dare say you do not want a language-independent syntax, since `syntax' and `language' are tightly-interwoven concepts. Do you want machine-independent semantics, given some syntax, for (A * B + C) divrem D? I believe this not to be the case (since I believe that you would prefer 64-bit A,B,C,D with 128-bit intermediate, which I believe you believe is not available on many machines). 
>Furthermore, it does nothing to contradict the assertions I made above >that the arithmetic could not be done in the HLL and that assembler >routines had to be called. It does inline the subroutine, however.

I have only a vague idea what `could not be done in the HLL' means. Did I use a construct that is not provided by the compiler? Or is `assembler routines had to be called' the key phrase? Do you mean `machine dependent instructions had to be emitted'? Of course this is the case. Do you mean `the rules for emitting the machine dependent instructions were not built into the compiler'? This depends on what one means by `built in'. Is the following sin() routine `built in' to the compiler?:

	static __inline const double sin(double x) {
		double result;
		__asm("sin %1,%0" : "=f"(result) : "f"(x));
		return result;
	}

I claim that if this appears in <math.h>, it *is* built-in to the compiler, despite the fact that the compiler reads these rules from an external file rather than having them embedded directly in its executable. I further claim that if you require an <mp.h> to exist and to define a mul_add_div_rem operation, and make constructing a proper <mp.h> part of `building the compiler' (just as constructing a proper <math.h> and <string.h> and <stdlib.h> is already part of building any hosted ANSI C compiler), that your mul_add_div_rem operation will be `built in'.

Or maybe this is not what you wanted? What exactly was it you *did* want? I know what *I* would want: I would want a syntax I could use to write mul_add_div_rem operations such that:

 - they were not `fragile' (i.e., could be inserted anywhere);
 - they could directly use machine instructions that implement the
   operation wherever such are available;
 - they did not impose extra overhead;

and this is exactly what I constructed. Perhaps I could have done a better job; but perhaps I would have spent more than an hour on it.... 
-- In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
dik@cwi.nl (Dik T. Winter) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > [Bob Silverman asks for a way to compute (A * B + C) divrem D and > claims that `Even on machines that support double length integer > multiplies, one cannot put the above operations into HLL' (because > the compiler cannot use 32x32=>64 bit operations) and that `one is > FORCED to call assembler routines to do this'.] etc. Bob Silverman complains about the solution; Chris Torek writes: > I further claim that if you require an <mp.h> to exist and > to define a mul_add_div_rem operation, and make constructing a proper > <mp.h> part of `building the compiler' (just as constructing a proper > <math.h> and <string.h> and <stdlib.h> is already part of building any > hosted ANSI C compiler), that your mul_add_div_rem operation will be > `built in'. And this is exactly what Bob Silverman wants (if I read his mind correctly, that is). As it is now, every user has to write his own for every platform he encounters (and I have written some 40 up to now, though through the use of subroutine calls, not through inlining in gcc). And that was also the complaint of Peter Montgomery, who has proposed such a thing for the Fortran standard, and it is not there. The current state of affairs is that it is not in the standard for most languages. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
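A minimal sketch of what such an <mp.h> entry could look like, written here purely as an illustration (the name mul_add_div_rem follows the thread's usage; no actual header is being quoted). It assumes a 64-bit unsigned type for the intermediate product, which an inlining compiler could map onto a 32x32->64 multiply and a 64/32 divide on machines that have them:

```c
#include <stdint.h>

/* Hypothetical mul_add_div_rem: given a, b, c, n with a < n or b < n,
 * produce q and r such that a*b + c == q*n + r and 0 <= r < n.
 * The precondition keeps q representable in 32 bits. */
static void mul_add_div_rem(uint32_t a, uint32_t b, uint32_t c, uint32_t n,
                            uint32_t *q, uint32_t *r)
{
    uint64_t t = (uint64_t)a * b + c;   /* double-length intermediate */

    *q = (uint32_t)(t / n);
    *r = (uint32_t)(t % n);
}
```

On hardware with the extended multiply and divide, the body reduces to two instructions; everywhere else it still gives the right answer, just more slowly.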
JBS@IBM.COM (02/26/91)
In discussing what a compiler should do with i8=j4*k4 where i8 is integer*8, j4,k4 are integer*4, Herman Rubin says:

     I have never seen any language description in which the type of the
     result in a replacement statement affected what operation is performed.

I would not propose this. I would propose defining the type of an integer*4 x integer*4 multiplication to be integer*8. Then the right thing will be done whether i8 is 8 bytes or 4 bytes long.

Several people have discussed various instruction sequences with the assumption that all fixed point instructions take the same amount of time. While on many machines most fixed point instructions are 1 cycle, this is generally not true of fixed point multiply and especially not true of fixed point divide. Thus a compiler which uses 2 divides for

     iq=ia/ib
     ir=mod(ia,ib)

may be wasting the equivalent of 20+ typical fixed point instructions. In fact if you know the compiler will do this, the sequence

     iq=ia/ib
     ir=ia-iq*ib

will be much faster on many machines. The 11 instruction pop subroutine used a divide with remainder. On many machines this instruction will be considerably slower than the other 10 combined. If one is doing many divides by the same integer p it will often pay to replace them with multiplies by some "magic constant" based on the binary representation of 1/p. Regarding wrongheaded instructions I believe there are numerous examples where a complicated instruction was implemented in such a way that a sequence of simpler instructions doing exactly the same thing would execute faster. Of course this may be an implementation problem rather than an architecture problem.
                   James B. Shearer
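The divide-once trick reads, in C (a sketch, assuming positive operands so that truncating division makes the identity exact; the function name is made up for illustration):

```c
/* One hardware divide; the remainder falls out of a multiply and a
 * subtract instead of a second divide. */
static void divrem(int ia, int ib, int *iq, int *ir)
{
    *iq = ia / ib;
    *ir = ia - *iq * ib;   /* same value as ia % ib for these operands */
}
```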
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > Do you want machine-independent semantics, given some syntax, for > (A * B + C) divrem D? Yes. That's exactly the problem here: given an operation glorp, we need to express glorp in a portable way with machine-independent semantics. Worse than that: when a chip supports glorp, we should be able to write the code so that it will be compiled the right way for that chip---but we shouldn't have to lose portability. Worse than that: the language designer probably never heard of glorp. No matter how nearsighted the language designer was, the chip designer has to be able to show his support for glorp. And no matter how nearsighted the language designer and chip designer were, the programmer has to be able to write glorp portably yet efficiently. Is this too much to ask? I don't think so. See my previous article for my thoughts on how this can and should be done. > I further claim that if you require an <mp.h> to exist and > to define a mul_add_div_rem operation, and make constructing a proper > <mp.h> part of `building the compiler' (just as constructing a proper > <math.h> and <string.h> and <stdlib.h> is already part of building any > hosted ANSI C compiler), that your mul_add_div_rem operation will be > `built in'. That's exactly the wrong answer. Programmers can't wait for chip designers, and nobody can wait for language designers. It has to be possible to set up a good enough structure beforehand that (1) chip designers can make innovations available without changing the language, (2) programmers can use those innovations without changing the chips or the language, and (3) programmers don't have to sacrifice portability for efficiency. I believe this is possible, and requires only a small addition to languages and a bit of discipline on the part of compiler writers. ---Dan
joeo@sharp..westford.ccur.com (Joe Orost) (02/26/91)
In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes: > >I have only a vague idea what `could not be done in the HLL' means. >Did I use a construct that is not provided by the compiler? Or is >`assembler routines had to be called' the key phrase? Do you mean >`machine dependent instructions had to be emitted'? Of course this is >the case. Do you mean `the rules for emitting the machine dependent >instructions were not built into the compiler'? This depends on what >one means by `built in'. Is the following sin() routine `built in' >to the compiler?: > > static __inline const double sin(double x) { > double result; > __asm("sin %1,%0" : "=f"(result) : "f"(x)); > return result; > } > >I claim that if this appears in <math.h>, it *is* built-in to the >compiler, despite the fact that the compiler reads these rules from an >external file rather than having them embedded directly in its >executable.

Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler have the transcendentals built-in. This means much more than just compiling the calls inline. Built-in means that:

 1. Constant evaluation is performed at compile time:
    SIN(0.0) is replaced by 0.0, etc.
 2. Common subexpressions and loop invariant code motion are
    performed on these operations at compile time:
    SIN(A) ... SIN(A) replaced by one call using a temporary.
 3. Special-case optimizations are performed:
    A**0.5 -> SQRT(A)
    A**1.0 -> A
 4. The compiler knows the argument profile for the routines.
    In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).
 5. The calls generate inline code.

regards, joe
--
Full-Name: Joseph M. Orost
Email: joeo@tinton.ccur.com
Phone: (908) 758-7284
Fax: (908) 758-7113
US Mail: MS 322; Concurrent Computer Corporation; 106 Apple St
         Tinton Falls, NJ 07724
marc@marc.watson.ibm.com (Marc Auslander) (02/26/91)
The C compiler for the Risc System/6000 produces one divide for i/j and i%j. -- Marc Auslander <marc@ibm.com>
gah@mt-xia.watson.ibm.com (Gary A. Hoffman) (02/26/91)
In fact, RS/6000 xlc compiler does have builtins. -- g
don@dgbt.doc.ca (Donald McLachlan) (02/26/91)
>In article <2998@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes: >> > I would like an instruction which counts the number of ones in >> > the binary representation of an integer but I find it hard to argue >> > that this would be widely used. >> Ask Seymour Cray why that instruction was put in the Cray-1 as an >> afterthought. > >I haven't seen Seymour posting here, and I don't often meet him. >Now that I am suitably impressed, would you share with us what >he would tell us if we did ask him? >> -- >> dik t. winter, cwi, amsterdam, nederland >> dik@cwi.nl > >dan herrick >herrickd@iccgcc.decnet.ab.com

I don't know what Seymour Cray would say, but I have an application that would really benefit from this. When performing digital communication over HF (a highly error-prone medium), major problems are encountered:

1. Most HF modems are synchronous.
2. Due to the error-proneness of the medium, you will rarely get a hard
   (exact) sync match.
3. You can't go to a short sync sequence because of the high probability
   of false sync.

The solution requires that you write a correlator to count the number of bits in error in the sync sequence, and only accept it as a valid sync sequence if the distance is less than a predetermined limit. The distance is simply the number_of_ones(received_seq XOR sync_pattern).

I must admit that this is a very specific problem, but one that must be performed once every bit time, and it would benefit greatly from having such an instruction in hardware.

Donald McLachlan
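A sketch of that correlator test in C; the function names are illustrative, and the bit count uses the portable clear-lowest-bit loop that a hardware population-count instruction would replace:

```c
#include <stdint.h>

/* count the one bits; one instruction on a machine with popcount */
static int number_of_ones(uint32_t x)
{
    int n = 0;

    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* accept sync only if the Hamming distance is under the limit */
static int sync_ok(uint32_t received_seq, uint32_t sync_pattern, int limit)
{
    return number_of_ones(received_seq ^ sync_pattern) < limit;
}
```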
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/26/91)
joeo@tinton.ccur.com (Joe Orost) wrote:
>In article <10278@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>	static __inline const double sin(double x) {
>>		double result;
>>		__asm("sin %1,%0" : "=f"(result) : "f"(x));
>>		return result;
>>	}
>>
>>I claim that if this appears in <math.h>, it *is* built-in to the
>>compiler, despite the fact that the compiler reads these rules from an
>>external file rather than having them embedded directly in its
>>executable.
>
>Not so fast. Our C3Ada-68K compiler and our Fortran VII-3200 compiler have
>the transcendentals built-in. This means much more than just compiling the
>calls inline. Built-in means that:
>
> 1. Constant evaluation is performed at compile time:
>    SIN(0.0) is replaced by 0.0, etc.

Not done by the above; you're one up.

> 2. Common subexpressions and loop invariant code motion are
>    performed on these operations at compile time:
>    SIN(A) ... SIN(A) replaced by one call using a temporary.

Done by the above; that's what the "const" qualifier on the function does, and after inlining, gcc can see that there are no side effects. (I did some experiments with a non-existent "foo" instruction... it works fine.)

> 3. Special-case optimizations are performed:
>    A**0.5 -> SQRT(A)
>    A**1.0 -> A

Not done by the above. I can see how the A**0.5 is useful; I can't see how A**1.0 is - do programmers often write this?

> 4. The compiler knows the argument profile for the routines.
>    In Fortran, SQRT(2) gets a warning and is converted to SQRT(2.0).

Done by the above. The prototype forces conversion to the right type (although silently; there may be a warning flag you can turn on).

> 5. The calls generate inline code.

Of course, this is done by the above.

So, the only advantages your built-in transcendentals have are compile-time evaluation of constant expressions and some special cases. I don't deny their usefulness, but claim it's not tremendous, and the cost of a more complex compiler is not zero. 
If you support a cross-compiler, I hope you worked hard to ensure you're doing target arithmetic and not host arithmetic when evaluating transcendentals. Numerically unstable applications might die horribly otherwise. -- -Colin
peter@ficc.ferranti.com (Peter da Silva) (02/27/91)
In article <22605:Feb2608:04:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > In practice this isn't such a big problem. Joe just writes the code in > several different ways (he has to, if he wants a fighting chance) and > uses #ifdef or something similar to select the right solution for each > machine. But this is still a pain. Yes. It's even more of a pain than just #ifdefing a bunch of inline assembly. So why not just do that for all the obscure instructions and have done with it? -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
bs@linus.mitre.org (Robert D. Silverman) (02/27/91)
In article <6870@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes: :In article <10278@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes: :> [Bob Silverman asks for a way to compute (A * B + C) divrem D and :> claims that `Even on machines that support double length integer :> multiplies, one cannot put the above operations into HLL' (because :> the compiler cannot use 32x32=>64 bit operations) and that `one is :> FORCED to call assembler routines to do this'.] :> :> In article <10244@dog.ee.lbl.gov> I demonstrate that the last claim :> is false. :> :> >>This has its drawbacks: the syntax is distinctly un-pretty, and it :> >>requires gcc, and it is machine-dependent. It does, however, work. :> :> In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org :> (Robert D. Silverman) writes: :> >Ah yes. A universal solution. Machine dependent. Ugly syntax. Only :> >works with one compiler. :> :> Do you want a machine-independent solution for using the machine :> instruction(s) that compute(s) (A * B + C) divrem D? :> :> This is clearly impossible, since some machines have no such instructions. : :I agree partly with both. The LANGUAGE should probably contain some such :instruction, and it certainly should allow the writing : : E,F = (A * B + C) divrem D : :or even : : E,F = (A * B + C) / D : Actually I would settle for the following::: I am more than willing to write the assembler code for an operation ---- we will call FOO. FOO takes 3 arguments and returns 2 more. I would then like to see the following in a HLL: asm_routine(FOO(a,b,c,d,e)) The compiler will then fetch my assembler routine, and RESOLVE any register/stack/addressing conflicts that might arise because the assembler routine uses the same registers/addresses etc. that are produced by the calling routine. The compiler will straighten out any conflicts [renaming registers within the assembler routine as needed etc.] and then in-line the assembler routine. 
It is true that I must rewrite FOO for every new machine, but because FOO is time critical it is worth my trouble to do just that. FOO would be written as an ordinary subroutine; it would assume that the arguments to it had been placed on the stack the same as any other subroutine call. There are currently machines which have no multiply or divide instruction at all, e.g. SPARC. When the compiler sees an operation a*b it generates a subroutine call. Building a 'divrem' operator into a language could be done the same way. For machines that have the instruction there is no problem. For those that don't, a subroutine call is generated. The only difficulty in the above paragraph would be in deciding that a 'divrem' operator was worth adding to the language. If enough people can agree that it IS worthwhile, then actually doing it is a non-problem. As for all of you who claim that 'divrem' (or 'foo' or any other special operation) is not needed, let me remind you of one thing: the current SPARC does not have integer multiply or divide in hardware. The next release (Version 8) will have it. If, as all of you have been claiming, multiplication & division are so infrequent as not to be needed, then why did SUN add them in? Now I can't say of my own personal knowledge, of course, but ask yourself what user or set of users would have enough clout [i.e. a big enough market] to convince SUN to add such a thing. Hint: think 'cryptography'. Many of the instructions under discussion are essential for good performance in cryptographic applications. As more and more PC users hook into networks, computer security applications will become more important. As will the necessary instructions to support those applications. -- Bob Silverman #include <std.disclaimer> Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"
jbuck@galileo.berkeley.edu (Joe Buck) (02/27/91)
In article <6870@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes: [ Chris Torek's example of how to do optimal "divrem" in gcc ] |> On the other hand, I have to agree with Torek about the implementation on |> a given machine. This must depend on the hardware. But I can hope, with |> Silverman, that if the language allows writing things like this, hardware |> people might put them in. |> |> A question for Torek: Suppose I want to do the above for a variety of |> types, for example, have A, B, C, D, and F floating, but E integer. Can |> I use the same overloaded operator symbol in both cases with his approach? |> Can I do even more complicated overloaded situations, where the translation |> needs to take into account the types? I'm not Chris Torek, but the answer is: yes you can, with g++. g++ accepts exactly the same code Chris wrote for gcc, and, since it is a C++ compiler, allows functions and operators to be overloaded based on argument type.
jbuck@galileo.berkeley.edu (Joe Buck) (02/27/91)
|> Donald Lindsay writes: |> >- addressing modes which studies could not find a single use of. |> > (PDP-11 autodecrement deferred - omitted from the VAX.) Way back in 1979 I had a job writing code for an LSI-11 board with 4K words of core memory and no O/S (it was basically an interrupt-driven controller for interfacing a couple of array processors together), and I found use for several of the weird autodecrement-deferred instructions: including a "program" with one instruction word and one data word that would zero all of memory and then halt, returning control to the built-in ODT debugger. Can any PDP-11 programmers figure out what it was? -- Joe Buck jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
spot@CS.CMU.EDU (Scott Draves) (02/27/91)
In article <1991Feb26.155558.4215@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:

   > 3. Special-case optimizations are performed:
   >    A**0.5 -> SQRT(A)
   >    A**1.0 -> A

   Not done by the above. I can see how the A**0.5 is useful; I can't
   see how A**1.0 is - do programmers often write this?

probably just about never. but that doesn't mean the compiler shouldn't deal with it. A lot of code is generated by cpp (or lisp macros), and therefore contains all sorts of stupidity. this is an important point to keep in mind. as an example, suppose I wrote the cpp macro

#define GammaCorrect(intensity, gamma) \
	(pow((double)(intensity),1.0/(double)(gamma)))

A perfectly sane programmer might write GammaCorrect(i, 1.0).
--
			IBM
Scott Draves		Intel
spot@cs.cmu.edu		Microsoft
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/27/91)
bs@linus.mitre.org (Robert D. Silverman) wrote: >:> In article <1991Feb25.203629.5059@linus.mitre.org> bs@faron.mitre.org >:> (Robert D. Silverman) writes: >:>> Ah yes. A universal solution. Machine dependent. Ugly syntax. Only >:>> works with one compiler. I submit it's better than most, and allows mostly what you'd like. Also, gcc is very widely available, and might inspire others. >Actually I would settle for the following::: > >I am more than willing to write the assembler code for an operation > ---- >we will call FOO. FOO takes 3 arguments and returns 2 more. > >I would then like to see the following in a HLL: > > asm_routine(FOO(a,b,c,d,e)) > >The compiler will then fetch my assembler routine, and RESOLVE any >register/stack/addressing conflicts that might arise because the >assembler routine uses the same registers/addresses etc. that are >produced by the calling routine. > >FOO would be written as an ordinary subroutine; it would assume that >the arguments to it had been placed on the stack the same as any other >subroutine call. > >The compiler will straighten out any conflicts [renaming registers >within the assembler routine as needed etc.] and then in-line the >assembler routine. So the compiler has to be able to parse either object files or assembler source (I'm not sure which you mean), follow the data flow (with possible aliasing) in the procedure-calling convention, know the operand restrictions (does it need to be in registers or can it accept a memory operand?) and implicit operands (MQ on the MIPS chip, r0/r1/r2/... for VAX movec instructions, etc.) of every instruction, and in-line it? I think this is awfully impractical. The gcc solution, while syntactically messier, tells the compiler the things it needs to know for optimisation purposes in a form it can easily understand, and just supplies a string pattern to emit to the assembler, so the compiler doesn't have to know how to parse assembler input. 
It can also handle new instructions the compiler has not yet been extended to handle when a 32732/68050/80586/R5000/whatever comes out. I respectfully suggest that you wouldn't like the results of what you suggest either, like the people in Jon Bentley's story who wanted to type a list of ~200 districts into the computer and get a random sample (~5) out. He suggested perhaps having the computer emit a few random numbers and looking them up on the paper list would be a lot easier than typing in the paper list. I also suggest that you don't want it embedded in the source that the code is an assembler routine, so you can supply a high level version if it becomes necessary. -- -Colin
ram+@cs.cmu.edu (Rob MacLachlan) (02/28/91)
>From: hrubin@pop.stat.purdue.edu (Herman Rubin) >Newsgroups: comp.arch,comp.lang.misc >Subject: Re: bizarre instructions >Summary: Syntax changes would be needed >Date: 24 Feb 91 18:18:53 GMT > >> A sufficiently intelligent compiler will do the right thing. It may >> work to write >> k8=i4*j4 > >The first approach would require a very intelligent compiler, indeed. >The second approach is the "right" one. However, every standard I have >read would try the first. Since i8*j8 might not even exist on the machine, >at best an inlined function would have to be brought in instead of the >single hardware instruction for k8=i4*j4. > >I have never seen any language description in which the type of the result >in a replacement statement affected what operation is performed. Actually, this use of output type information to select operators is relatively common in Lisp compilers, especially Common Lisp. It is necessitated by the default assumption of indefinite-precision arithmetic. In order to do fixed precision, you must somehow know the result is a "small" integer. The CMU Common Lisp compiler (Python) is (so far as I know) unique among Lisp compilers in efficiently compiling full-word integer arithmetic (no tag bits). And our BIGNUM code is implemented with this facility combined with some additional primitives such as 32x32 -> 64. The 64 bit result is returned as two 32 bit values, using the Common Lisp multiple value mechanism. It would be relatively straightforward to add support for larger fixed-precision arithmetic. Robert A MacLachlan (ram@cs.cmu.edu)
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/28/91)
don@dgbt.doc.ca (Donald McLachlan) wrote: >Solution, requires that you write a correlator to count the number of bits in >error in the sync sequence, and only accept it as a valid sync sequence if >the distance is less than a predetermined limit. > >The distance is simply the number_of_ones(receieved_seq XOR sync_pattern). > >I must admit that this is a very specific problem, but one that must be >performed once every bit time, and it would benefit greatly from having >such an instruction in hardware. Just wondering: the usual hack to count bits is a table of, say, 256 entries. Look up each byte and add the values together. If you know the sync sequence ahead of time and can afford one table per byte of the sync sequence, you can get a matched bits count directly by pre-diddling the table. Another approach, since what you really want is to know whether there are fewer than N bits set in the word, is to repeat distance &= (distance-1); N times, and if the result is zero, you have a match. For small n, this is quite fast. -- -Colin
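Both hacks, sketched in C (the 256-entry table and the clear-lowest-bit threshold test are as described above; the function names are made up for illustration):

```c
/* 256-entry table: count bits per byte, sum four lookups per word */
static int popcount32(unsigned x)
{
    static unsigned char t[256];
    int i;

    if (t[1] == 0)                     /* build the table on first use */
        for (i = 1; i < 256; i++)
            t[i] = (unsigned char)(t[i >> 1] + (i & 1));

    return t[x & 0xff] + t[(x >> 8) & 0xff]
         + t[(x >> 16) & 0xff] + t[(x >> 24) & 0xff];
}

/* threshold test without a full count: clearing the lowest set bit
 * n times leaves zero exactly when at most n bits were set */
static int at_most_n_bits(unsigned x, int n)
{
    while (n-- > 0)
        x &= x - 1;
    return x == 0;
}
```

Pre-diddling the table for a known sync pattern, as suggested, just means storing the per-byte match count instead of the raw bit count.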
pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)
On 26 Feb 91 02:25:39 GMT, dik@cwi.nl (Dik T. Winter) said: [ ... mul-add-div-rem primitive should be defined in some header, rather than the compiler provide, like GCC, the primitives to define it efficiently ... ] dik> And this is exactly what Bob Silverman wants (if I read his mind dik> correctly that is). As it is now, every user has to write his own dik> for every platform he encounters (and I have written some 40 up to dik> now, though through the use of subroutine calls, not through dik> inlining in gcc). This ought to be encouraged. I think that the FSF should be much more about libraries than about commands. The emphasis on programs/commands instead of reusable libraries is one of the tragic historic legacies of Unix. The technical question is which language architecture best supports coding efficient libraries and efficient library use; then there is a practical (economics, politics) question of writing those libraries. dik> And that was also the complaint of Peter Montgomery who has dik> proposed such a thing for the Fortran standard, and it is not dik> there. The current state of affairs is that it is not in the dik> standard for most languages. But this is not something we ought to argue about in comp.arch; it is no longer about how to define a language architecture so that it may take advantage of specifics of the CPU architecture if possible. What you want is not a technical solution, it is legislation. "There ought to be a law..."; in your case it is one that makes compiler writers provide your favourite primitives on all the platforms you want to use. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/02/91)
In article <PCG.91Feb28182206@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: > What you want is not a technical solution, it is legislation. "There > ought to be a law..."; in your case it is one that makes compiler > writers provide your favourite primitives on all the platforms you want > to use. I don't think this is practical. I have my favorites, and you have your favorites, and when you take the overall favorites of programmers around the world you have way too many operations for any one language to support. To put it differently, there will always be *some* primitive that a chip designer implements or that a programmer wants, and that isn't in the language. If the chip designer provides X and the programmer wants X but the language doesn't understand X, the programmer is screwed. What I do think is practical is having each compiler writer at least provide a portable idiom for each instruction on his machine. (Portable doesn't mean everyone will recognize the same idiom the same way; it just means that the idiom is written portably in the language.) This is much more manageable, and it ensures that the programmer will never hit a wall between what the compiler lets him do and what he can accomplish in assembly. Note that some compilers go halfway, and provide unportable idioms (e.g., directives). I find these to be no better than assembly language, because I can't take a program with directives and use it without change on another system. ---Dan
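A concrete instance of the portable-idiom idea (my example, not Dan's): the rotate below is plain, portable C, yet a compiler that recognizes the pattern can emit a single rotate instruction, so the programmer never hits the wall between the language and the hardware.

```c
#include <stdint.h>

/* Portable idiom for a 32-bit left rotate: legal C on any machine,
   one instruction on a machine whose compiler knows the pattern. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;                               /* keep both shift counts defined */
    return (x << n) | (x >> ((32 - n) & 31));
}
```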
mash@mips.com (John Mashey) (03/03/91)
In article <7063:Mar202:29:0091@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >What I do think is practical is having each compiler writer at least >provide a portable idiom for each instruction on his machine. (Portable I'm NOT suggesting there is no use to the various kinds of inlining and other ways to get access to machine instructions that exist, as has been discussed. HOWEVER, I would observe that: a) People must be careful NOT to assume that function calls are inordinately expensive. This is simply NOT true these days. In particular, on most of the current RISCs, especially when aided and abetted by good register allocators, function calls only cost a few cycles; in particular, calls to leaf routines (i.e., those that do not call others) are usually very cheap, because they almost never save/restore any registers. b) Fast function calls are necessary for many purposes. I cannot think of any popular architecture designed in the last 5-8 years that hasn't paid at least some attention to making fast function calls, one way or another. c) They solve a LOT of the problem being discussed. d) They do not solve 100% of the problem being discussed. However, it is always worth doing the cycle count for a given language and architecture choice between: 1) The BEST that you can possibly do, with hand-coding 2) The BEST you can do by writing leaf-level assembly programs on the machine. I believe that in most cases, measured over the complete program, that case 2) does pretty well, especially because the instructions you're trying to get to are often high cycle-count operations, where the % overhead for getting them is low. The one obvious exception is if you have low-cycle count operations (like rotate, or population count, or byte-swapping, or string-primitives, etc) that you'd like to get inlined, and have no way to express directly. At least C is moving, albeit slowly, in the direction of better support for the possibility of these things. 
However, the bottom line still is: analyze the cycle count differences to see if mechanisms are worth it. Sometimes they are, sometimes they're not. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/03/91)
In article <8UQ9CL7@xds13.ferranti.com>, peter@ficc.ferranti.com (Peter da Silva) writes: > In article <22605:Feb2608:04:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: ...................... > Yes. It's even more of a pain than just #ifdefing a bunch of inline > assembly. So why not just do that for all the obscure instructions > and have done with it? ALL the obscure instructions? How about the ones which we haven't discovered yet? -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin(UUCP)
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/03/91)
In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes: > a) People must be careful NOT to assume that function calls > are inordinately expensive. This is simply NOT true these > days. In particular, on most of the current RISCs, Look, so-called RISC chips are hardly the mainstream. Function calls are relatively expensive on most of the chips I use; the SPARC and the MIPS are the only exceptions. On the other hand, I consider this whole function-call issue to be orthogonal to the issue of having the compiler provide portable access to all machine instructions. I assume that any good language or optimizer will handle inlining. > b) Fast function calls ae necessary for many purposes. > I cannot think of any popular architecture designed in the > last 5-8 years that hasn't paid at least some attention to > making fast function calls, one way or another. How about the Astronautics ZS supermini? It's a hellishly fast machine, and beats the Convex hands-down on non-vectorizable code. It has many registers, to be sure, and a nice optimizer, but a typical function call still takes several load-saves. As the machine has no built-in stack support, and as every memory reference takes two instructions, function calls are quite expensive. Inlining is a must for fast code on that machine. [ some opinions on whether the current situation works ] > The one obvious exception is if you have low-cycle count > operations (like rotate, or population count, or byte-swapping, > or string-primitives, etc) that you'd like to get inlined, > and have no way to express directly. That's a pretty big exception. As Peter points out, fast divrem would noticeably speed up integer formatting, not to mention multiple-precision arithmetic. Pop count is useful for algebraic coding (e.g., as in the Goppa public-key system). Byte reversal (not that I know any chips that have it) is a costly step in most applications of FFTs. 
Getting the high word of a multiply is practically necessary for fast multiple-precision arithmetic and other number-theoretic applications. And so on. ---Dan
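Two of the operations Dan lists can be sketched in C (my code; the 64-bit intermediate is an emulation of what is a single instruction on machines with a widening multiply, and well beyond what most 1991 compilers offered portably):

```c
#include <stdint.h>

/* "High word of a multiply": emulated with a 64-bit intermediate. */
uint32_t mulhi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Divrem in integer formatting: each division yields both the next
   digit (the remainder) and the reduced value (the quotient), so a
   divide-with-remainder instruction halves the work of the loop. */
int format_decimal(uint32_t v, char *buf)
{
    char tmp[11];
    int i = 0, n = 0;
    do {
        tmp[i++] = (char)('0' + v % 10);  /* remainder: next digit */
        v /= 10;                          /* quotient: rest of the number */
    } while (v);
    while (i)
        buf[n++] = tmp[--i];              /* digits come out reversed */
    buf[n] = '\0';
    return n;
}
```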
quiroz@cs.rochester.edu (Cesar Quiroz) (03/04/91)
In <23358:Mar310:45:5591@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) wrote: | ...On the other hand, I consider this whole function-call issue to | be orthogonal to the issue of having the compiler provide portable | access to all machine instructions. ... I have been following this discussion quietly (and probably I should not break that), and I have tried to understand this position, but it still doesn't make sense to me. It resists definition, and therefore it is easy to use it to make the discussion infinitely long. There is something to be said for recommending that the compilers pick up idioms and builtins and treat them especially---there is a large opportunity for improvement there. But it seems obvious that keeping the semantics straight and guaranteeing a specific mapping to machine code are incompatible if pursued to extremes. Let's see it this way: the other camp considers "portability" and "efficiency" desirable, but sees them as non-binary and as commodities to be traded. An occasional lack of portability to ensure large local benefit is OK to this camp. Say, no one expects GCC asm to be portable at all, but it gives a clear interface to most current---not imaginary or prospective---hardware. Similarly, a little efficiency can be lost for portability: your compiler can be out and running on more machines faster if it doesn't recognize the many special cases of pow() and just treats it as another function to be called. As a user, you get the tools to make your own tradeoffs. How to make the decision? Maximum payoff first. If pow appears often enough to warrant it, put some effort into it, etc. Don't get stuck with extremes: maximum portability, maximum efficiency and your pet syntax to boot. Get a lollipop. Blah, this has been said a bunch of times already. | As Peter points out, fast divrem would noticeably speed up integer | formatting, not to mention multiple-precision arithmetic. 
Pop count is | useful for algebraic coding (e.g., as in the Goppa public-key system). | Byte reversal (not that I know any chips that have it) is a costly step | in most applications of FFTs. Getting the high word of a multiply is | practically necessary for fast multiple-precision arithmetic and other | number-theoretic applications. And so on. And so on indeed. The list is infinite by rights; should compiler writers give the same ad-hoc treatment to each and every one of those algorithms? This departs from giving ad-hoc treatment to each and every machine instruction, but that confusion seems to lie at the root of Bernstein's position. -- Cesar Quiroz
dik@cwi.nl (Dik T. Winter) (03/04/91)
In article <658@spim.mips.COM> mash@mips.com (John Mashey) writes: > a) People must be careful NOT to assume that function calls > are inordinately expensive. This is simply NOT true these > days. Sometimes yes, sometimes no. But that is not much different from years ago. > In particular, on most of the current RISCs, especially > when aided and abetted by good register allocators, function > calls only cost a few cycles; in particular, calls to > leaf routines (i.e., those that do not call others) are > usually very cheap, because they almost never save/restore > any registers. Ah, that is sometimes the problem, but again sometimes not. Whether you are looking at CISC or RISC, it depends. > b) Fast function calls are necessary for many purposes. > I cannot think of any popular architecture designed in the > last 5-8 years that hasn't paid at least some attention to > making fast function calls, one way or another. I have extensive figures about subroutine call overhead (in this case about leaf routines). The background is a library of routines to do extended precision integer arithmetic (quite apt in this discussion). I did port it (both in C and in Fortran) to some 40 different platforms. For most, coding in assembly was necessary to get performance. The source is extremely flexible. It knows about different optimization levels, will compile for a different number of bits used in a word, and so on. There is a completely portable variant (optimization level 0) that only avoids all the known compiler bugs (C and Fortran). It is fairly simple to get an optimized version on a new platform (at least I think so). The problem is of course that all routines using the library need to be recompiled to use the more efficient version because the number of bits in a word can change. Optimization level 1 implies assembly routines for (integer) multiplication and division. It is however sometimes necessary to code these routines to use floating-point arithmetic. 
This only reduces the number of bits in a word that are used from 30 to 26 (on a 32 bit machine). This is needed on the i860, which has no integer multiply and no divide at all. It can be useful on others (e.g. MIPS). The routines vary in size from a few instructions to quite a lot. E.g. the Vax multiply is under 10 instructions while the SPARC divide is over 200. In this case the operations are performed on the available registers, i.e. no registers are saved. Optimization level 2 means that loops over the multiplications and divisions are coded in assembly. This is roughly equivalent to inlining the simple functions (although you would not do that if the simple function is fairly large). In this case sometimes registers are saved because they are needed for the loop. (Level 3 and level 4 optimization are also defined, but not relevant here.) Below follows a table for a number of machines that indicates the speedup when optimization level 2 is used. The last two columns indicate whether the base multiplication and division routines are small, medium size or large. I use small when a single instruction could be used, medium when the hardware provides step instructions (multiply step or divide step), and large when bit fiddling must be done. Medium is also used if FP operations are used.

    Gould NP1                         32.3%   small    small
    Gould PowerNode                   32.1%   small    small
    Alliant FX/4 (68020 look-alike)   27.5%   small    small
    Sony (68030)                      27.1%   small    small
    Vax 780                           26.0%   small    small
    Encore (32532)                    24.9%   small    small
    Sequent Balance (32032)           21.9%   small    small
    Sun3 (68020)                      20.4%   small    small
    IBM RT/PC                         18.6%   medium   medium
    SGI 4D/25 (MIPS) FP               14.5%   medium   medium
    Alliant FX2800 (i860)             13.8%   medium   medium
    SGI 4D/25 (MIPS) INT              10.0%   small    large
    Harris HCX7 (Tahoe)                3.8%   small    small
    Sun4 (SPARC)                       3.1%   medium   large

Two figures are given for the SGI: one for when complete integer arithmetic is used, one when floating-point is used in part. Floating point is faster by 12% on the SGI. 
The reason for the difference between Alliant FX/4 and Sun3 is that the Alliant uses a caller-saves calling sequence while the Sun3 uses a callee-saves sequence. We see indeed that the subroutine call overhead decreases when subroutines become larger. What is extremely surprising is the position of the Tahoe in this table! This is not a modern machine by any means, nor is it RISC (it is the smaller brother of the Vax). But while the base routines use only 5 resp. 4 instructions (4 only because of a bug? in the hardware, it would have been 3 otherwise), we see nearly no effect of the CALL. Also interesting is that Sony's 68030 has larger subroutine call overhead than Sun's 68020. Approximately the same code was run on both machines. The only differences are due to compiler differences. So this contradicts John Mashey's remark that modern machines have lower subroutine call overhead. That depends on your definition of modern. What does this teach us? Really nothing. Inlining may or may not help. On the Tahoe inlining will help a bit, but not much. But on the MIPS with integer arithmetic inlining helps (this means inlining some 20 instructions for the multiply and some 430 for the divide). In all, it would be good to look at how the Tahoe does subroutine calls. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
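For readers without Dik's library at hand, a hypothetical sketch (not his actual routines) of the kind of code in question: the inner loop of an extended-precision multiply is exactly where a widening multiply instruction, or its absence, dominates.

```c
#include <stdint.h>

/* Multiply an n-limb number (16-bit limbs, least significant first)
   by a single limb m; dst must have room for n+1 limbs.  Each step
   is one widening multiply plus a carry propagation. */
void mp_mul_limb(uint16_t *dst, const uint16_t *src, int n, uint16_t m)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint32_t t = (uint32_t)src[i] * m + carry;
        dst[i] = (uint16_t)t;     /* low half: this limb */
        carry = t >> 16;          /* high half: carry out */
    }
    dst[n] = (uint16_t)carry;
}
```

On a machine without an integer multiply (Dik's i860 example) each `src[i] * m` becomes a long sequence of step instructions or a floating-point detour, which is why the limb width drops from 30 to 26 bits there.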
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/04/91)
In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) The last time I had my hands on a B6700 (a *lovely* machine) was in 1979. However, I did write a compiler for it and assist with a couple of others, and I am quite sure that there was no instruction that yielded both quotient and remainder. There was an instruction with the effect of Algol 60's integer divide, and an instruction with the effect of Algol 60's 'mod'. Later models in the same architectural family have varied the instruction set somewhat (such as dropping VECTORMODE), perhaps this was a later model? Several years ago I posted a "<q,r> = (a*b+c) <div,mod> d" function for Sun compilers, using their ".il" files. I used to use a language (Pop-2) in which the integer division operator *always* returned both results; oddly enough this was almost always a pain. -- The purpose of advertising is to destroy the freedom of the market.
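The optimization being discussed, in C terms (my sketch): the two-statement pattern a compiler can fuse into one divide-with-remainder instruction, and the explicit route C's library already offers via ldiv().

```c
#include <stdlib.h>

/* The source pattern: a smart compiler notices the same operands
   and emits a single divide that yields both results. */
void quotrem(long a, long b, long *q, long *r)
{
    *q = a / b;
    *r = a % b;       /* same divide; a good compiler reuses it */
}

/* The explicit route: ldiv() returns quotient and remainder
   together in one call. */
long rem_via_ldiv(long a, long b)
{
    ldiv_t qr = ldiv(a, b);
    return qr.rem;
}
```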
mash@mips.com (John Mashey) (03/05/91)
In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >Look, so-called RISC chips are hardly the mainstream.... >> b) Fast function calls are necessary for many purposes. >> I cannot think of any popular architecture designed in the >> last 5-8 years that hasn't paid at least some attention to >> making fast function calls, one way or another. > >How about the Astronautics ZS supermini? It's a hellishly fast >machine, and beats the Convex hands-down on non-vectorizable code. ... I guess I must be confused. For some weird reason, I thought RISCs were at least part of the mainstream of current computing, based on the almost-nonexistence of new CISC designs. (Please note, that in all talks that I give, that doesn't mean I claim CISCs are going away, especially the X86; just that new architectures look more RISCy than CISCy). Clearly, I have somehow missed the idea that the ZS might be the mainstream... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
jpk@ingres.com (Jon Krueger) (03/05/91)
From article <3062@charon.cwi.nl>, by dik@cwi.nl (Dik T. Winter): > > a) People must be careful NOT to assume that function calls > > are inordinately expensive. This is simply NOT true these > > days. > [table of machines and speedups; in no case > does the entire speedup exceed 33%; 20% is more typical.] > So this is > contradictionary to John Mashey's remark that modern machines have lower > subroutine call overhead. That depends on your definition of modern. And your definition of "lower", apparently. Since your speedups are achieved both from inlining and nonportable assembler, it's impossible to tell how much function call overhead contributes. Even if it's entirely responsible, it doesn't look expensive to me. Since it's probably not entirely responsible, it looks even cheaper. Finally, even if you could guarantee me a portable 30% overall speedup, your code isn't a real program, it's a set of support routines. What speedups have you measured in real programs? For instance, if you like, in real floating point intensive programs calling your routines? -- Jon -- Jon Krueger, jpk@ingres.com
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/06/91)
In article <677@spim.mips.COM> mash@mips.com (John Mashey) writes: > In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > >Look, so-called RISC chips are hardly the mainstream.... Okay, I agree that this is arguable, but only because ``RISC'' is as meaningful as ``object-oriented.'' Why is the CDC 6600 not a RISC architecture? > >> I cannot think of any popular architecture designed in the > >> last 5-8 years that hasn't paid at least some attention to > >> making fast function calls, one way or another. > >How about the Astronautics ZS supermini? It's a hellishly fast > >machine, and beats the Convex hands-down on non-vectorizable code. > I guess I must be confused. For some weird reason, I thought RISCs > were at least part of the mainstream of current computer, The issue of fast function calls has nothing to do with whether RISC is in the mainstream. You are confusing the issues. The ZS has 64 registers. It is highly pipelined. It has no particularly complex instructions. But I wouldn't say it pays much attention to fast function calls. ---Dan
mash@mips.com (John Mashey) (03/06/91)
In article <3304:Mar601:52:2891@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >In article <677@spim.mips.COM> mash@mips.com (John Mashey) writes: >> In article <23358:Mar310:45:5591@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >> >Look, so-called RISC chips are hardly the mainstream.... > >Okay, I agree that this is arguable, but only because ``RISC'' is as >meaningful as ``object-oriented.'' Why is the CDC 6600 not a RISC >architecture? I would point out most RISC-genealogy charts start with a Cray bubble; mine certainly do, and thousands of people have seen them; as well, there have been numerous postings over the past years in this newsgroup acknowledging the inspiration of Cray designs, especially the CDC 6600, etc, etc. What I don't understand is what of the immediately preceding discussion would lead anyone to bring up CDC 6600, as though a RISC person WOULDN'T be aware of it? >> >> I cannot think of any popular architecture designed in the >> >> last 5-8 years that hasn't paid at least some attention to >> >> making fast function calls, one way or another. >> >How about the Astronautics ZS supermini? It's a hellishly fast >> >machine, and beats the Convex hands-down on non-vectorizable code. >> I guess I must be confused. For some weird reason, I thought RISCs >> were at least part of the mainstream of current computer, > >The issue of fast function calls has nothing to do with whether RISC is >in the mainstream. You are confusing the issues. > >The ZS has 64 registers. It is highly pipelined. It has no particularly >complex instructions. But I wouldn't say it pays much attention to fast >function calls. Everybody please re-read what was posted, quoted above. If somebody wants to challenge what was said in some useful way, we'd all be pleased to hear of counter-examples. 
In this particular case, I'd like to hear why the ZS is 'a popular architecture' while "so-called RISC chips are hardly the mainstream." Perhaps I am just ignorant, but I've not come across the ZS very often ... well, ever.... and I'm not at all sure what the issues are that I'm confusing, but am happy to become educated ... -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
ph@ama-1.ama.caltech.edu (Paul Hardy) (03/07/91)
In article <14710@sunquest.UUCP> terry@venus.sunquest.com (Terry R. Friedrichsen) writes: > >In fact, the old literature (and old machines!) are full of "bizarre" > >instructions that proved ill-advised. A few examples: > > > > [ some good examples removed] > > > >- addressing modes which studies could not find a single use of. > > (PDP-11 autodecrement deferred - omitted from the VAX.) > > This example differs somewhat from his others, which were basically > deliberate design decisions that are not now generally seen as the > wise way to go. Actually, I've used this addressing mode to scan backwards in a linked list. Extremely handy. I've seen other uses for it. It was not, however, used by any *DEC* compiler. That's what did it in. Apart from that, there was something about the symmetry of the machine (lacking in the VAX) that was useful in itself. For one thing, it allowed compiler writers to very quickly take advantage of the PDP-11's orthogonal architecture: get something working with one instruction in one addressing mode, then extend it from there with a flick of the wrist. > DEC's pdp-10 instruction set had the same oddities. The machine had > a plethora of no-op equivalent instructions My favorite will always be the "JUMP" instruction, which of course does not jump since there's no condition associated with it. --Paul -- This is my address: ph@ama.caltech.edu This is UUCP: ...!{decwrl,uunet}! This is my address on UUCP: ...!{decwrl,uunet}!caltech.edu!ama!ph Any questions? "Does Emacs have the Buddha nature?" --Paul Hardy "Yow!" --Zippy
dik@cwi.nl (Dik T. Winter) (03/07/91)
(Later tonight I will respond to some questions from John Mashey, now this) In article <1991Mar5.080911.12759@ingres.Ingres.COM> jpk@ingres.com (Jon Krueger) writes: > From article <3062@charon.cwi.nl>, by dik@cwi.nl (Dik T. Winter): > > So this is > > contradictionary to John Mashey's remark that modern machines have lower > > subroutine call overhead. That depends on your definition of modern. > > And your definition of "lower", apparently. Since your speedups are > achieved both from inlining and nonportable assembler, it's impossible > to tell how much function call overhead contributes. Even if it's > entirely responsible, it doesn't look expensive to me. It is entirely responsible. > Since it's > probably not entirely responsible, it looks even cheaper. Finally, > even if you could guarantee me a portable 30% overall speedup, your > code isn't a real program, it's a set of support routines. What > speedups have you measured in real programs? This was a real program that was using those support routines. The speedup was *not* for the routines themselves but for a program running from 1.5 to 155 seconds, depending on the architecture. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
jsp@glia.biostr.washington.edu. (Jeff Prothero) (03/08/91)
In article <4882@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes: In article <1991Feb25.134714.23523@linus.mitre.org> bs@gauss.mitre.org (Robert D. Silverman) writes: >> iq=ia/ib >> ir=mod(ia,ib) >>A reasonable compiler will only generate 1 divide with remainder. >Only ONCE have I ever seen a compiler bright enough to do the above >optimization. (Burroughs 6700; a stack machine; Algol 68) I've seen MetaWare's High C do something very close to this on the '386. I found this depressing at the time, as I couldn't find a nice way to make *my* code generator do it. Also a little surprising, since High C (at least the version I was playing with) was not a strongly optimising compiler, just had lots of local tweaks. jsp@milton.u.washington.edu -- Jeff Prothero (jsp@u.washington.edu) <std disclaimer> Biological Structure Graphics Lab, University of Washington