[comp.arch] Compilers and efficiency

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/08/91)

In article <1991Apr5.172533.6717@agate.berkeley.edu>, doug@eris.berkeley.edu (Doug Merritt) writes:
> In article <1991Apr4.125122.1@capd.jhuapl.edu> waltrip@capd.jhuapl.edu writes:
> >	He observes that "Few compilers use any of the nifty instructions that
> >	a CISC has."  I believe I read recently that RISC was based on the
> >	observation that, in fact, only about 30% of the instructions in CISC
> >	computers were used by compilers.  The rest of the instructions, for
> >	all practical purposes, were just excess baggage.

			.........................

> Anyway, the point is that even if compilers *do* use every possible feature
> of a CISC, it still usually won't make the CISC s/w competitive with RISC
> s/w. CISC cpus almost never put sufficient h/w optimization into the support
> of the fancy instructions. (There are exceptions to this, of course, and I
> haven't been watching the 030 and 040 to see how they do in this regard.
> CISC may yet rise again, but not until after superscalar RISC has been
> completely exploited.)

If the languages do not allow the user to communicate  the use of those
features of the CISC architecture which can make the object code considerably
faster, the compiler cannot take advantage of them.  This IS the situation.
For example, suppose there was an instruction, which could be effectively
used, which would divide a float by a float, obtaining an integer quotient
and a float remainder.  Now just how is the compiler to recognize that this
is in fact wanted?  
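
For concreteness, a sketch in C of what must be written today -- two
separate operations, with nothing to tell the compiler that they belong
to one hardware instruction:

	#include <math.h>

	/* the two results the proposed instruction would deliver,
	   computed as two independent steps                       */
	void divrem(double f1, double f2, int *iq, double *fr)
	{
	    *iq = (int)(f1 / f2);   /* integer quotient, truncated */
	    *fr = fmod(f1, f2);     /* float remainder, sign of f1 */
	}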

There is, at present, no language designed to take into account the collection
of "natural" useful hardware which the present users can find uses for, and
which are far more expensive in software than in hardware.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

ram+@cs.cmu.edu (Rob MacLachlan) (04/09/91)

>From: hrubin@pop.stat.purdue.edu (Herman Rubin)
>Subject: Compilers and efficiency
>Date: 7 Apr 91 19:04:12 GMT
>
>For example, suppose there was an instruction, which could be effectively
>used, which would divide a float by a float, obtaining an integer quotient
>and a float remainder.  Now just how is the compiler to recognize that this
>is in fact wanted?  
>
>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

Well, in Common Lisp, (truncate 3.5 .5) returns 7 (an integer) and 0.0
(a float).

  Rob

ram+@cs.cmu.edu (Rob MacLachlan) (04/09/91)

>From: hrubin@pop.stat.purdue.edu (Herman Rubin)
>Subject: Compilers and efficiency
>Date: 7 Apr 91 19:04:12 GMT
>
>For example, suppose there was an instruction, which could be effectively
>used, which would divide a float by a float, obtaining an integer quotient
>and a float remainder.  Now just how is the compiler to recognize that this
>is in fact wanted?  
>
>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

Well, in Common Lisp, (truncate 4.5 2.0) returns 2 (an integer) and .5 (a
float).  And you can use CEILING, FLOOR or ROUND to get rounding to +inf,
-inf or nearest.  The Common Lisp numeric support was designed by someone
with a pretty good understanding of numerical analysis, and of the IEEE
float standard.  [Guy Steele, I believe...]  Unfortunately, most Common Lisp
implementations don't implement numeric stuff well enough to be worth using.
In my work on CMU Common Lisp, I have made some major improvements in basic
efficiency, but nobody has yet realized the full potential of numbers in
Common Lisp.

  Rob

guy@auspex.auspex.com (Guy Harris) (04/11/91)

>If the languages do not allow the user to communicate  the use of those
>features of the CISC architecture which can make the object code considerably
>faster, the compiler cannot take advantage of them.

What about those features of the CISC architecture that don't make the
object code *any* faster?  Methinks that's what a lot of people are
complaining about here....

chip@tct.com (Chip Salzenberg) (04/11/91)

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>There is, at present, no language designed to take into account the collection
>of "natural" useful hardware which the present users can find uses for, and
>which are far more expensive in software than in hardware.

If Herman would be so kind as to write a spec for such a
language, we could try to implement it.

We're waiting, Herman...
-- 
Brand X Industries Custodial, Refurbishing and Containment Service:
         When You Never, Ever Want To See It Again [tm]
     Chip Salzenberg   <chip@tct.com>, <uunet!pdn!tct!chip>

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/11/91)

In article <7117@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes:
> >If the languages do not allow the user to communicate  the use of those
> >features of the CISC architecture which can make the object code considerably
> >faster, the compiler cannot take advantage of them.
> 
> What about those features of the CISC architecture that don't make the
> object code *any* faster?  Methinks that's what a lot of people are
> complaining about here....

In some cases this is true.  A badly implemented procedure is bad.  

If a hardware polynomial evaluation takes longer than an explicit loop,
it is not the fault of the instruction, but of the implementation.  Also,
it is important to compare not the object code produced by compilers, but
that produced by intelligent human beings, who can reason out how to use
the features not supported by the languages.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/11/91)

In article <2803849F.483A@tct.com>, chip@tct.com (Chip Salzenberg) writes:
> According to hrubin@pop.stat.purdue.edu (Herman Rubin):
> >There is, at present, no language designed to take into account the collection
> >of "natural" useful hardware which the present users can find uses for, and
> >which are far more expensive in software than in hardware.
> 
> If Herman would be so kind as to write a spec for such a
> language, we could try to implement it.
> 
> We're waiting, Herman...

1.  I am not quite sure what you mean by a spec.  If it is what I think it
is, it is far too early to do this.

2.  I am willing to provide a list of such instructions and their descriptions,
by no means intended to be complete, written in such a way that a reasonable
undergraduate (or even a high school student who can understand the idea of
mathematical symbolism and operations) with an understanding of binary
representation can understand them.  Many of these have been posted to this
group by others as well as myself.  Such a list will include only those
operations with which I am acquainted, and others will come up with additional
suggestions.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

turner@lance.tis.llnl.gov (Michael Turner) (04/12/91)

In article <7117@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>>If the languages do not allow the user to communicate  the use of those
>>features of the CISC architecture which can make the object code considerably
>>faster, the compiler cannot take advantage of them.
>
>What about those features of the CISC architecture that don't make the
>object code *any* faster?  Methinks that's what a lot of people are
>complaining about here....

Guy Harris's example (if I remember correctly) was that no language he
knew of had semantics for retrieving both the integer quotient and the
remainder from a floating divide, the point being that both values are
usually available from any implementation of floating point divide, but
that the language stands in the way of getting them at the same time.
I think a fairly trivial ad hoc common-subexpression analysis could handle
this particular case.  I also don't see how this is a CISC/RISC issue--
if anything, one instruction that yielded both quotient and remainder
is more "RISC" than having two that did div and rem separately.

As for CISC-architecture features that don't appear to be faster than
using combinations of other native-code features, this gets slippery. 
Yes, those added 68020 addressing modes don't make it faster, but they
do make the corresponding code denser, which might reduce I-cache misses.
If implementing them used up space on the chip that would have been
wasted otherwise, then there has been some good done, and little harm
but for a possible slight decrease in yield.  I don't know that that's
the case with the 68020.  From what I'm hearing of the 68040, it even
seems doubtful.  But that shouldn't get in the way of my point: that
when you're designing an architecture for single-chip implementation,
at some point you've got a little left-over real estate to play around
with.  Maybe you can even (*gasp*) add some instructions.  If this be
treason, make the most of it, I say.

Just my arrogant, ignorant opinion, as usual.
---
Michael Turner
turner@tis.llnl.gov

cs450a03@uc780.umd.edu (04/12/91)

Michael Turner writes:

>Guy Harris's example (if I remember correctly) was that no language he
>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide ...

You mean like
      0 3.141592654 #: 20
6 1.150444076

???

That's been around for ages.  It's just in a language that not many
people write compilers for (APL).  

Raul Rockwell

msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (04/12/91)

In <1406@ncis.tis.llnl.gov> turner@lance.tis.llnl.gov (Michael Turner) writes:

>If implementing them used up space on the chip that would have been
>wasted otherwise, then there has been some good done, and little harm
>but for a possible slight decrease in yield.  I don't know that that's
>the case with the 68020.  From what I'm hearing of the 68040, it even
>seems doubtful.  But that shouldn't get in the way of my point: that
>when you're designing an architecture for single-chip implementation,
>at some point you've got a little left-over real estate to play around
>with.  Maybe you can even (*gasp*) add some instructions.  If this be
>treason, make the most of it, I say.

The problem is that you will have to include those instructions in the
next version of your processor in order to remain binary compatible,
and if things work out differently, as they probably will, they may be
hard to implement.  That happens in microcoded machines: we have some
extra microcode store space, let's add some instructions.  Next
version, you have to work like hell to cram all those instructions in
because things worked out differently, and probably almost no one uses
those instructions anyway.

--


Michael Pereckas               * InterNet: m-pereckas@uiuc.edu *
just another student...          (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/12/91)

In article <11APR91.21114644@uc780.umd.edu>, cs450a03@uc780.umd.edu writes:
> Michael Turner writes:
> 
> >Guy Harris's example (if I remember correctly) was that no language he
> >knew of had semantics for retrieving both the integer quotient and the
> >remainder from a floating divide ...
 
> You mean like
>       0 3.141592654 #: 20
> 6 1.150444076

Translate, please.  Seeing the results, I still do not have any idea what
the instruction does or how to put the results in a useful place, probably
registers.

The problem with APL and other such weirdos (like 99+% of the assembler
languages) is their highly artificial syntax.  I can, even at my age,
learn the machine instructions quickly, but the stupid "mnemonics" of
assembler language are another matter.  This is what computer language
people do not realize; they seem to think that memorizing even highly
obscure notation is easy, and understanding slightly unusual operations
is difficult.

APL does have one feature lacking in at least most other languages; it
attempts to have a reasonably compact notation.  But there is no 
adequate macro processor available.  An adequate macro processor would
include the capability of easily translating APL into other languages,
for example, and even having the translation take into account types.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (04/12/91)

In article <10095@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

| If a hardware polynomial evaluation takes longer than an explicit loop,
| it is not the fault of the instruction, but of the implementation.  Also,
| it is important to compare not the object code produced by compilers, but
| that produced by intelligent human beings, who can reason out how to use
| the features not supported by the languages.

  Obviously a bad algorithm is slow, however you implement it. A good
implementation can be faster than the best code, due to overlap of
instructions. We had someone do an FFT instruction for VAX loadable
control store (master's thesis) and he got about 15-20% over the hand
coded assembler.

  You can get somewhat the same effect on a RISC machine if you feed it
good enough code and it has register scoreboarding or other techniques
which allow overlap. If I wanted maximum speed for some operation I
would still hardcode an instruction, but you need a certain dollar
volume to justify building a special instruction into a CPU instead of
using the real estate for something else. This is why we have
coprocessors, to allow the user to buy the instructions s/he needs.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

chip@tct.com (Chip Salzenberg) (04/13/91)

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>In article <2803849F.483A@tct.com>, chip@tct.com (Chip Salzenberg) writes:
>> If Herman would be so kind as to write a spec for such a
>> language, we could try to implement it.
>
>1.  I am not quite sure what you mean by a spec.

Okay.  Let's suppose that holy grail of programming languages, HROL
(Herman Rubin Oriented Language), has been designed, implemented, and
tested, and it's in a box on your desk.

In the box is a tape, a programmer's reference manual, and an
architecture-specific reference manual.

Here's your job, Herman: Write that programmer's reference manual.

If you feel that you can't write it because you're not sure what
should be in it, then obviously you don't know what you want, and I
for one would be happy if you would stop lamenting its absence.

>2.  I am willing to provide a list of such instructions ...

Not an instruction reference, Herman, a *language* specification.
The instruction set is way down the line from there.
-- 
Brand X Industries Custodial, Refurbishing and Containment Service:
         When You Never, Ever Want To See It Again [tm]
     Chip Salzenberg   <chip@tct.com>, <uunet!pdn!tct!chip>

cs450a03@uc780.umd.edu (04/13/91)

Herman Rubin   >
Me             >|
Michael Turner >** 
>** Guy Harris's example (if I remember correctly) was that no
>** language he knew of had semantics for retrieving both the integer
>** quotient and the remainder from a floating divide ...

>| You mean like
>|       0 3.141592654 #: 20
>| 6 1.150444076

>Translate, please.  Seeing the results, I still do not have any idea what
>the instruction does or how to put the results in a useful place, probably
>registers.

Well,    (x,y) #: z  where x, y, and z are all single numbers returns
two results.  The "number on the right" is the remainder from z
divided by y.  The "number on the left" is the quotient of that
division modulo x.  "N modulo 0" is defined to be N.

Therefore,  (0,y) #: z   computes the quotient and remainder of z
divided by y.

Finally, note that for constants, the notation (x, y) may be
abbreviated by just writing the two numbers separated by a space.
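
Concretely, with the numbers from the earlier example:

	20 = 6 * 3.141592654 + 1.150444076

so  0 3.141592654 #: 20  yields the quotient 6 (taken modulo 0, hence
unchanged) and the remainder 1.150444076.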

>The problem with APL and other such weirdos (like 99+% of the
>assembler languages) is their highly artificial syntax.

Anything you understand is simple.  Anything you do not understand is
not simple.  I suppose I should also point out that in infix notation
a function takes arguments on both the right side and the left side.
I don't know if it is relevant to describe the details of this
mechanism.

Also note that   #:   is a single symbol.

>APL does have one feature lacking in at least most other languages;
>it attempts to have a reasonably compact notation.  But there is no
>adequate macro processor available.  An adequate macro processor
>would include the capability of easily translating APL into other
>languages, for example, and even having the translation take into
>account types.

He he...

APL is just now getting around to using ASCII... give it a little time
before it translates to every language under the sun.  :-)

Other than that, it looks like all you are asking for is constant
folding.

Raul

guido@cwi.nl (Guido van Rossum) (04/13/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>[...]  An adequate macro processor would
>include the capability of easily translating APL into other languages,
>for example, and even having the translation take into account types.

Can you give some more clues as to why this statement is true?  Without any
more context it sounds like what you call a macro processor is
something completely different from what is usually called a macro
processor; for instance, normally macro processors have no notion
whatsoever of the types they are working on.

herrickd@iccgcc.decnet.ab.com (04/16/91)

In article <2803849F.483A@tct.com>, chip@tct.com (Chip Salzenberg) writes:
> According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>>There is, at present, no language designed to take into account the collection
>>of "natural" useful hardware which the present users can find uses for, and
>>which are far more expensive in software than in hardware.
> 
> If Herman would be so kind as to write a spec for such a
> language, we could try to implement it.
> 
> We're waiting, Herman...

Herman, and some others, have already published a requirement spec for
the language, mostly in the "bizarre instructions" thread here quite
recently.  Of course, it would help if someone would gather it together
in a form that could be reviewed and resolve some of the contradictions.

dan herrick
herrickd@iccgcc.decnet.ab.com

guy@auspex.auspex.com (Guy Harris) (04/16/91)

>Guy Harris's example (if I remember correctly) was that no language he
>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide, the point being that both values are
>usually available from any implementation of floating point divide, but
>that the language stands in the way of getting them at the same time.

I don't think you remember correctly; if *I* remember correctly, I
didn't give any particular example, and I certainly didn't give *that*
example.  *Herman Rubin* made the claim that no language he knew of had
those semantics, and at least two other people jumped in to claim that
Common LISP *did* have those semantics.

I was mainly thinking of, indeed, such things as the added addressing
modes; they may increase code density, but do they complicate instruction
decoding and slow the machine down there?  And what about some of the
more elaborate procedure-calling, procedure-entry, or procedure-exit
instructions?

Admittedly, to some extent, the problems with the more elaborate
features may come both from 1) sloppy implementations of them and
2) using the elaborate features even when inappropriate, e.g. not
making use of simplifications that can be done at code-generation time,
such as better management of registers.  Both seem to come, at least in
part, from the notion that "well, it's *one instruction*, that means it
*has* to be fast!" - i.e., since it's a single instruction, they didn't
worry about making it fast, or making the common case fast, and/or
assumed it was *always* the right thing to do to use that instruction. 

As an example of 1), the CCI Power 6/32, as I remember, did a lot better
job at implementing a VAX-like CALLS instruction than did the
VAX-11/780; one trick I remember them doing was to generate the fields
of the stack frame in order, so that the stores that built the stack
frame would work well with interleaved memory.  They also stored
*decoded* instructions, rather than code bytes, in the instruction
cache, as a way of avoiding the overhead of decoding VAX-style
instructions.

As an example of 2), consider, say, treating leaf procedures differently
- can you get away with doing less than a full procedure entry?

guy@auspex.auspex.com (Guy Harris) (04/17/91)

>If a hardware polynomial evaluation takes longer than an explicit loop,
>it is not the fault of the instruction, but of the implementation.

What if a hardware polynomial evaluation takes longer than an *unrolled*
loop, with the zero-coefficient terms taken out?  Or do you have an
implementation in mind that takes zero cycles to skip the terms with
zero coefficients?
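
For instance, evaluating c4*x^4 + c2*x^2 + c0 with the zero terms
removed is three multiplies and two adds of straight-line code -- a
sketch in C (the coefficient names are made up):

	double eval(double x, double c4, double c2, double c0)
	{
	    double x2 = x * x;
	    /* Horner in x^2: the zero terms c3, c1 cost nothing */
	    return (c4 * x2 + c2) * x2 + c0;
	}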

miklg@sono.uucp (Michael Goldman ) (04/17/91)

msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:

>In <1406@ncis.tis.llnl.gov> turner@lance.tis.llnl.gov (Michael Turner) writes:

>>If implementing them used up space on the chip that would have been
>>wasted otherwise, then there has been some good done, and little harm
>>but for a possible slight decrease in yield.  I don't know that that's
>>the case with the 68020.  From what I'm hearing of the 68040, it even
>>seems doubtful.  But that shouldn't get in the way of my point: that
>>when you're designing an architecture for single-chip implementation,
>>at some point you've got a little left-over real estate to play around
>>with.  Maybe you can even (*gasp*) add some instructions.  If this be
>>treason, make the most of it, I say.

>The problem is that you will have to include those instructions in the
>next version of your processor in order to remain binary compatible,
>and if things work out differently, as they probably will, they may be
>hard to implement.  That happens in microcoded machines: we have some
>extra microcode store space, let's add some instructions.  Next
>version, you have to work like hell to cram all those instructions in
>because things worked out differently, and probably almost no one uses
>those instructions anyway.

Actually, as I follow the 68020 to the 68030 to the 68040 I find a number
of deleted instructions as well as added instructions at each move.  The
added instructions are either essential (MMU support) or not (BKPT).
The deleted instructions are of the same nature.  That this requires
rewriting code is just one of those things you're supposed to put up
with to maintain application compatibility.  The 680x0 doesn't have a
lot of really complex instructions like the 80x86 family.  The string
handling instructions of the 80x86 come to mind as something best
implemented as library functions.  The context switching instructions
of the 80286 are sufficiently involved as to be part of a large function.

I think one of the problems with the incredibly complex instructions
for OS stuff in the 80286 and beyond is that you may not WANT to
implement your OS the way they designed the instruction.  Another
problem is that it makes learning about the instruction set that
much harder and slower.  This results in delays in implementation and
bugs.  Then, also, as pointed out above, you've got compatibility issues
if you actually wrote your OS around their specific hardware concept
of how context switching should be done.

john@iastate.edu (Hascall John Paul) (04/17/91)

In article <7207@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
}>If a hardware polynomial evaluation takes longer than an explicit loop,
}>it is not the fault of the instruction, but of the implementation.

}What if a hardware polynomial evaluation takes longer than an *unrolled*
}loop, with the zero-coefficient terms taken out?  Or do you have an
}implementation in mind that takes zero cycles to skip the terms with
}zero coefficients?

Doesn't seem so impossible:

           POLY   CMASK,CTABLE,X,RESULT

where the CMASK operand is a bitmask which indicates which elements
of the CTABLE array are non-zero.  [probably this instruction would
still be a waste of real estate for most]
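
In C terms, the semantics might come out something like this sketch
(the exact semantics and the names are guesses, not a spec):

	/* mask bit i set => coefficient of x^i is non-zero; the mask
	   lets the loop skip the zero terms, as the hardware would   */
	double poly(unsigned cmask, const double *ctable, double x, int deg)
	{
	    double r = 0.0;
	    int i;
	    for (i = deg; i >= 0; i--) {  /* Horner; assumes deg < 32 */
	        r *= x;
	        if (cmask & (1u << i))
	            r += ctable[i];
	    }
	    return r;
	}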

John

--
John Hascall                        An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center                       john@iastate.edu
Ames, IA  50011                                                  (515) 294-9551

mjs@hpfcso.FC.HP.COM (Marc Sabatella) (04/17/91)

>no language he [ Herman Rubin, actually ]
>knew of had semantics for retrieving both the integer quotient and the
>remainder from a floating divide, the point being that both values are
>usually available from any implementation of floating point divide, but
>that the language stands in the way of getting them at the same time.

In addition to Common LISP, Forth also has this ability.  In fact, it comes
about as close to Herman's idealized "user definable syntax macro processor"
as any language I can think of.

robertsw@gtephx.UUCP (Wild Rider) (04/18/91)

	In article <1991Apr12.001324.5@ux1.cso.uiuc.edu>
	msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:

>The problem is that you will have to include those instructions in the
>next version of your processor in order to remain binary compatible,
>and if things work out differently, as they probably will, they may be
>hard to implement.  That happens in microcoded machines: we have some
>extra microcode store space, let's add some instructions.  Next
>version, you have to work like hell to cram all those instructions in
>because things worked out differently, and probably almost no one uses
>those instructions anyway.

	or, you could do like motorola did when they designed the 030:
	dump a couple of unused 020 instructions, i.e., the "callm"
	("call module") and the corresponding "rtm" ("return from
	module") instructions.  yes, that's correct, you can have an
	executable which runs _only_ on the 020, if it uses those
	instructions.  i guess motorola called up all their major 020
	customers and asked them: "do you use the callm/rtm
	instructions?"  apparently, the answer was "no" from every
	customer so motorola dumped the 2 unused instructions.  i just
	dug out my "mc68020 user's manual" and looked up the "callm"
	and "rtm" instructions... yuk.  no wonder they dumped them.
	talk about "complex instructions" ... anyway, i'll bet moto's
	designers freed up quite a bit of microcode real estate when
	they released those 2 losers.
-- 
Wallace Roberts, AG (formerly GTE) Communication Systems, Phoenix, AZ
UUCP: ...!{ncar!noao!asuvax | uunet!zardoz!hrc | att}!gtephx!robertsw
Internet: gtephx!robertsw@asuvax.eas.asu.edu    Bike: '82 GS1100L Suz
voice: (602)581-4555    fax: (602)582-7624      Cage: '89 Mustang  GT

shankar@hpcupt3.cup.hp.com (Shankar Unni) (04/20/91)

In comp.arch, robertsw@gtephx.UUCP (Wild Rider) writes:

> 	or, you could do like motorola did when they designed the 030:
> 	dump a couple of unused 020 instructions, i.e., the "callm"
> 	("call module") and the corresponding "rtm" ("return from
> 	module") instructions.  yes, that's correct, you can have an

That's surely an oversimplification. What usually happens is that those
instructions will raise some sort of "unimplemented instruction" (or
so-called "assist") exception, which is different from the run-of-the-mill
"illegal instruction", moving the responsibility for simulating
those instructions to the kernel (or the user program, if you are running
stand-alone embedded programs).  This sort of thing is commonly done with
certain types of floating-point instructions.
-----
Shankar Unni                                   E-Mail:
HP India Software Operation, Bangalore       Internet: shankar@cup.hp.com
Phone : +91-812-261254 x417                      UUCP: ...!hplabs!hpda!shankar

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/22/91)

In article <7184@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>And what about some of the
>more elaborate procedure-calling, procedure-entry, or procedure-exit
>instructions?
>
>Admittedly, to some extent, the problems with the more elaborate
>features may come both from 1) sloppy implementations of them and
>2) using the elaborate features even when inappropriate

The problem with an elaborate procedure calling convention is that
you *have* to do it that way - by emulation if not by use of the
fancy instructions. To do otherwise is to confuse debuggers,
exception/longjmp packages, and smart linkers, not to mention
breaking separate compilation.

Actually, the worst calling convention I ever dealt with had none of
those problems. IBM 360 PL/I-F didn't *have* a debugger, smart
linkers hadn't been invented, and the exception package was easily
tacked on. The reason it was easy was because the machine code for a
procedure prologue, contained (get this) a *call* to the runtime
package...
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

meissner@osf.org (Michael Meissner) (04/22/91)

In article <12740@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald
Lindsay) writes:

| Actually, the worst calling convention I ever dealt with had none of
| those problems. IBM 360 PL/I-F didn't *have* a debugger, smart
| linkers hadn't been invented, and the exception package was easily
| tacked on. The reason it was easy was because the machine code for a
| procedure prologue, contained (get this) a *call* to the runtime
| package...

If I remember correctly, the Unix V6 (Ritchie) compiler did this
too....
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

peter@ficc.ferranti.com (peter da silva) (04/22/91)

In article <12740@pt.cs.cmu.edu>, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
> The reason it was easy was because the machine code for a
> procedure prologue, contained (get this) a *call* to the runtime
> package...

The K&R C compiler for the PDP-11 did this, too. And it did returns with
a "jmp cret".

You want a weird calling convention, try the 1802. A subroutine call is made
by changing which register is acting as the PC: all subroutines were
coroutines. To get stack calling you set up a pair of call-return subroutines,
burned a couple registers as their start address, and called them with SEP 4
and SEP 5. The only machine I know for which Forth (particularly token
indirect threaded code) was faster than regular subroutine calls using the
Standard Call/Return technique.

(I know, you're all tired of me flaming about the 1802. But it was a really
 CUTE micro. Sort of like a boutique car.)
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

henry@zoo.toronto.edu (Henry Spencer) (04/23/91)

In article <MEISSNER.91Apr21153714@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>| ... The reason it was easy was because the machine code for a
>| procedure prologue, contained (get this) a *call* to the runtime
>| package...
>
>If I remember correctly, the Unix V6 (Ritchie) compiler did this too...

All instantiations of the Ritchie compiler did it, in fact.  On the whole,
it made sense:  it saved code space, which was usually more of an issue on
the 11 than execution time.  And it wasn't actually that much slower than
inlining the relevant code, given that we're talking about machines that
had little or no pipelining.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

halkoD@batman.moravian.EDU (David Halko) (04/23/91)

In article <9782@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <1991Apr5.172533.6717@agate.berkeley.edu>, doug@eris.berkeley.edu (Doug Merritt) writes:
> > In article <1991Apr4.125122.1@capd.jhuapl.edu> waltrip@capd.jhuapl.edu writes:
> > >	He observes that "Few compilers use any of the nifty instructions that
> > >	a CISC has."  I believe I read recently that RISC was based on the
> > >	observation that, in fact, only about 30% of the instructions in CISC
> > >	computers were used by compilers.  The rest of the instructions, for
> > >	all practical purposes, were just excess baggage.
> 			.........................
> > Anyway, the point is that even if compilers *do* use every possible feature
> > of a CISC, it still usually won't make the CISC s/w competitive with RISC
> > s/w. CISC cpus almost never put sufficient h/w optimization into the support
> > of the fancy instructions. (There are exceptions to this, of course, and I
> > haven't been watching the 030 and 040 to see how they do in this regard.
> > CISC may yet rise again, but not until after superscalar RISC has been
> > completely exploited.)
> 			.........................
> If the languages do not allow the user to communicate  the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them.  This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder.  Now just how is the compiler to recognize that this
> is in fact wanted?  
> 			.........................
What do you mean how is a compiler supposed to recognize that this is wanted?

    AT&T came out with a DSP board which utilized neato CPU features in its
high level languages... since floating point multiply was so much faster
than integer multiply, it would take integers, convert them to float, do
the floating point multiplication, and convert back to integer again. Man,
talk about fast and nifty. This was all handled internally, inside the C
Compiler. It seems it would be quite useful to do the same for integer
division here. There were, however, some minor flaws in the Compiler which
took away from some of its efficiency. I used to use global search and
replace with specific patterns to optimize the code I produced...
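
In C terms the trick amounts to something like this sketch (my
reconstruction, and exact only while the product fits in the float
format's significand):

	/* integer multiply routed through the faster FPU */
	int imul(int a, int b)
	{
	    return (int)((double)a * (double)b);
	}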

> There is, at present, no language designed to take into account the collection
> of "natural" useful hardware which the present users can find uses for, and
> which are far more expensive in software than in hardware.

I think it may be more of a question of "Are CISC features being fully realized by
compiler designers?"

I wonder if RISC architecture is all that it is cracked up to be. I work on
a SUN SS1, and it is not all that much faster than a 3/60. Quite often, I
use the 3/60's we have instead. 
-- 
     _______________________________________________________________________
    /                                                                       \
   / "The use of COBOL cripples the mind; its teaching should, therefore, be \
  /  regarded as a criminal offence."           E.W.Dijkstra, 18th June 1975. \
 +-----------------------------------------------------------------------------+
  \  "Have you purchased a multi-   halkoD@moravian.edu  Have you booted your /
   \ media machine from IMS yet?"   David J. Halko       OS-9 computer today?/
    \_______________________________________________________________________/

barmar@think.com (Barry Margolin) (04/23/91)

In article <1991Apr23.151932.2125@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <MEISSNER.91Apr21153714@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>>| ... The reason it was easy was because the machine code for a
>>| procedure prologue, contained (get this) a *call* to the runtime
>>| package...
>>If I remember correctly, the Unix V6 (Ritchie) compiler did this too...
>All instantiations of the Ritchie compiler did it, in fact.  On the whole,
>it made sense:  it saved code space, which was usually more of an issue on
>the 11 than execution time.  And it wasn't actually that much slower than
>inlining the relevant code, given that we're talking about machines that
>had little or no pipelining.

On Multics, external procedures also make a call to the runtime system
(called "operators", for some reason -- I guess to distinguish the library
routines that the compiler "knows" about from the ones that it doesn't)
during the standard entry sequence.

Besides reducing code space, it is a very useful feature of the software
development environment.  The operator procedures are addressed through a
transfer vector.  Thus, alternate versions of the operators could be
dynamically linked into the process (similar to the way personal computer
customizations patch trap vectors).  The most notable uses of this are the
procedure call tracing mechanism and the metering facility.  On most
systems, one must compile a program with a special option in order to make
it maintain metering statistics; on Multics, one just issues a command that
switches operators, runs the program, and then switches back to the regular
operators (there's a program that combines all this).  For call tracing,
one switches to versions of the entry and exit operators that display a
message.
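
The flavor of the transfer-vector trick, sketched in C (all names are
mine -- Multics does this at a much lower level):

	typedef void (*op_fn)(void);
	extern void entry_op(void), exit_op(void);        /* regular */
	extern void trace_entry(void), trace_exit(void);  /* tracing */

	op_fn operators[2] = { entry_op, exit_op };

	/* compiled code calls (*operators[0])() in each prologue, so
	   swapping the table entries retargets every procedure entry */
	void enable_tracing(void)
	{
	    operators[0] = trace_entry;
	    operators[1] = trace_exit;
	}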
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (04/25/91)

In article <1991Apr23.162814.20093@Think.COM> barmar@think.com 
	(Barry Margolin) writes:
| ... The reason it was easy was because the machine code for a
| procedure prologue, contained (get this) a *call* to the runtime
| package...

>...alternate versions of the operators could be
>dynamically linked into the process (similar to the way personal computer
>customizations patch trap vectors).  The most notable uses of this are the
>procedure call tracing mechanism and the metering facility.

Another debugging feature available via this trick is break-when-
expression-true.  The expression is of course only evaluated on every
procedure entry, not on every cycle. However, my personal experience
(on a DEC 20) was very postive. I was usually trying to find out
which routine was using a wild pointer, so per-call was the perfect
granularity.

Of course, there are other ways to achieve the same effect on vanilla
hardware. Aside from page protection cleverness, one can run a snoop
process on one of the other CPUs. (Multiprocessors *are* vanilla,
right?)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

meissner@osf.org (Michael Meissner) (04/26/91)

In article <4082@batman.moravian.EDU> halkoD@batman.moravian.EDU
(David Halko) writes:

| I wonder if RISC architecture is all that it is cracked up to be. I work on
| a SUN SS1, and it is not all that much faster than a 3/60. Quite often, I
| use the 3/60's we have instead. 

From your posting, I would guess that you use a lot of integer
multiplies (non-constant in particular).  The initial Sparc hardware
does not have an integer multiply or divide instruction.  It has a
multiply step instruction (not sure about divide) to give a partial
answer.
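
In effect each multiply step performs one iteration of a shift-and-add
loop like this C sketch (unsigned, and ignoring the condition-code
details), so a general multiply costs on the order of one step per
multiplier bit:

	unsigned mul(unsigned a, unsigned b)
	{
	    unsigned r = 0;
	    while (b != 0) {
	        if (b & 1)        /* add multiplicand for each 1 bit */
	            r += a;
	        a <<= 1;
	        b >>= 1;
	    }
	    return r;
	}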
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/27/91)

In article <4082@batman.moravian.EDU>, halkoD@batman.moravian.EDU (David Halko) writes:
> In article <9782@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > In article <1991Apr5.172533.6717@agate.berkeley.edu>, doug@eris.berkeley.edu (Doug Merritt) writes:
> > > In article <1991Apr4.125122.1@capd.jhuapl.edu> waltrip@capd.jhuapl.edu writes:
> > > >	He observes that "Few compilers use any of the nifty instructions that
> > > >	a CISC has."  I believe I read recently that RISC was based on the
> > > >	observation that, in fact, only about 30% of the instructions in CISC
> > > >	computers were used by compilers.  The rest of the instructions, for
> > > >	all practical purposes, were just excess baggage.
> > 			.........................
> > > Anyway, the point is that even if compilers *do* use every possible feature
> > > of a CISC, it still usually won't make the CISC s/w competitive with RISC
> > > s/w. CISC cpus almost never put sufficient h/w optimization into the support
> > > of the fancy instructions. (There are exceptions to this, of course, and I
> > > haven't been watching the 030 and 040 to see how they do in this regard.
> > > CISC may yet rise again, but not until after superscalar RISC has been
> > > completely exploited.)
> > 			.........................
> > If the languages do not allow the user to communicate  the use of those
> > features of the CISC architecture which can make the object code considerably
> > faster, the compiler cannot take advantage of them.  This IS the situation.
> > For example, suppose there was an instruction, which could be effectively
> > used, which would divide a float by a float, obtaining an integer quotient
> > and a float remainder.  Now just how is the compiler to recognize that this
> > is in fact wanted?  
> > 			.........................
> What do you mean how is a compiler supposed to recognize that this is wanted?

I mean exactly that.  I am assuming that the operations are done by means of
a sequence of instructions, and not a call to some (possibly inlined) function.
I know of NO hardware on which I would consider calls to be reasonable where
short sequences of code will do the job.

It is not the case that the algorithm to be used is the same on different
machines.  Hardware differences can cause major problems, as was seen when
cache/paging first came in; an algorithm for matrix multiplication which
would seem to be better from the timing standpoint (same operations, fewer
loads, far fewer stores) caused page faults on the great bulk of its loads,
and thus became non-competitive for large matrices.

Consider the following two operations, which comprise more than 25% of the 
operations in some highly efficient, from the "picocode" standpoint, methods
for generating non-uniform random numbers.

1.  Examine a bit, and take some action depending on it.  Also, the next time
this is to be done, this bit effectively no longer exists.  It is necessary
to make provision at least somewhere for replenishing the supply of bits.

2.  Find the distance to the next one in a bit stream.  The same considerations
as above apply.

Try coding this in any present language, including assembler, in such a way
that a compiler can be expected to recognize one of these without being 
explicitly instructed.
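
For reference, the obvious C rendering -- several instructions per
bit, with an explicit refill test that the proposed hardware would
make implicit (the buffer scheme and all names are invented):

	static unsigned long buf;    /* current stock of bits   */
	static int nbits;            /* how many remain in buf  */
	extern void refill(void);    /* replenish buf and nbits */

	int next_bit(void)           /* operation 1 */
	{
	    int b;
	    if (nbits == 0)
	        refill();
	    nbits--;
	    b = (int)(buf & 1);      /* examine the bit...      */
	    buf >>= 1;               /* ...and consume it       */
	    return b;
	}

	int dist_to_next_one(void)   /* operation 2 */
	{
	    int d = 0;
	    while (next_bit() == 0)  /* naive: one test per bit */
	        d++;
	    return d;
	}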

I am not saying that this should not be the case.  But it is the user, not
the language designer, who has to do this.  Whatever language, present or
future, is used, human beings are going to come up with extremely simple
operations which the language designers have not thought of, and for which
doing those operations using the language primitives will be clumsy.

The user must be able to instruct the compiler about these operations
in a manner similar to the way it now looks at addition and multiplication.
If there are competing ways to do something, the compiler must know the
available variety.  This is even the case for adding vectors in C.

>     AT&T came out with a DSP board which utilized neato CPU features in its
> high level languages... since floating point multiply was so much faster
> than integer multiply, it would take integers, convert them to float, do
> the floating point multiplication, and convert back to integer again. Man,
> talk about fast and nifty. This was all handled internally, inside the C
> Compiler. It seems it would be quite useful to do the same for integer
> division here. There were, however, some minor flaws in the Compiler which
> took away from some of its efficiency. I used to use global search and
> replace with specific patterns to optimize the code I produced...

If this was the case, the hardware was deficient.  To a mathematician, floating
point arithmetic is a complicated version of fixed point arithmetic.  This is a
point which the RISC designers do not see; the chip real estate for doing good
integer arithmetic is already there in the floating point stuff, and the design
should be such as to use it.  Floating point arithmetic uses fixed point 
internally, so this would not be a problem AT THE TIME OF THE DESIGN OF THE
ARCHITECTURE.  It is a problem afterward.

> > There is, at present, no language designed to take into account the collection
> > of "natural" useful hardware which the present users can find uses for, and
> > which are far more expensive in software than in hardware.
> 
> I think it may be more of a question of "Are CISC features being fully realized by
> compiler designers?"

Have any language designers or hardware designers ever asked for input on the
variety of operations which are readily understood by people like Silverman,
Lehmer, etc.?  Have language designers considered that vastly different 
algorithms should be used with different architectures?  Most optimization
procedures which I have seen assume not.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

bengtl@maths.lth.se (Bengt Larsson) (04/28/91)

In article <11411@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu 
(Herman Rubin) writes:
>In article <4082@batman.moravian.EDU>, halkoD@batman.moravian.EDU (David Halko) writes:
>> What do you mean how is a compiler supposed to recognize that this is wanted?
>
>I mean exactly that.  I am assuming that the operations are done by means of
>a sequence of instructions, and not a call to some (possibly inlined) function.
>I know of NO hardware on which I would consider calls to be reasonable where
>short sequences of code will do the job.

What is the difference between an _inlined_ function and a "short sequence
of code"? Explanation, please.

>It is not the case that the algorithm to be used is the same on different
>machines.  Hardware differences can cause major problems, as was seen when
>cache/paging first came in; ...

This is no news.

>1.  Examine a bit, and take some action depending on it.  Also, the next time
>this is to be done, this bit effectively no longer exists.  It is necessary
>to make provision at least somewhere for replenishing the supply of bits.

You are being obscure. What is this if not a shift instruction?

>2.  Find the distance to the next one in a bit stream.  The same considerations
>as above apply.

True. And some hardware has instructions for "find first set bit". The
Vax has, for example. That may be a useful addition, though it's
somewhat specialized. Useful for cryptography, I hear.

>Try coding this in any present language, including assembler, in such a way
>that a compiler can be expected to recognize one of these without being 
>explicitly instructed.

It isn't doable to make a compiler which can recognize _anything_ without
being instructed. If you want an extensible language, you'll have to show how it
can be made extensible. I can assure you that it isn't quite as easy as it
looks.

>I am not saying that this should not be the case.  But it is the user, not
>the language designer, who has to do this.  Whatever language, present or
>future, is used, human beings are going to come up with extremely simple
>operations which the language designers have not thought of, and for which
>doing those operations using the language primitives will be clumsy.

This is utterly obvious.

>The user must be able to instruct the compiler about these operations
>in a manner similar to the way it now looks at addition and multiplication.

OK, you like infix notation.

>If there are competing ways to do something, the compiler must know the
>available variety.  This is even the case for adding vectors in C.

You don't seem to know anything about the problems in what you are
talking about. 

>If this was the case, the hardware was deficient.  To a mathematician, floating
>point arithmetic is a complicated version of fixed point arithmetic.  

Hardware integer multiply should probably be provided for, yes. 

>Have any language designers or hardware designers ever asked for input on the
>variety of operations which are readily understood by people like Silverman,
>Lehmer, etc.?  

Why do you assume that people should come and ask you (a mathematician)
about computer language development?  What makes you an expert in the
field? 

Bengt L.
-- 
Bengt Larsson - Dep. of Math. Statistics, Lund University, Sweden
Internet: bengtl@maths.lth.se             SUNET:    TYCHE::BENGT_L

preston@ariel.rice.edu (Preston Briggs) (04/28/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

> If the languages do not allow the user to communicate  the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them.  This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder.  Now just how is the compiler to recognize that this
> is in fact wanted?  

In Fortran, you might write

	iq = f1 / f2

and a while later (during which f1 and f2 are not changed)

	fr = amod(f1, f2)

The compiler could recognize, during value numbering, that a
quotient and remainder of the same operands were being computed.
During instruction selection (or perhaps peephole optimization),
it could recognize that it could avoid the real2int conversion
of the quotient by using the "hrubin" instruction.

Not everyone does these things the way I've described them, but many
compilers can achieve the same effect, one way or another.

Preston Briggs

preston@ariel.rice.edu (Preston Briggs) (04/28/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>Consider the following two operations, which comprise more than 25% of the 
>operations in some highly efficient, from the "picocode" standpoint, methods
>for generating non-uniform random numbers.
>
>1.  Examine a bit, and take some action depending on it.
>    Also, the next time this is to be done, this bit effectively
>    no longer exists.  It is necessary to make provision at least
>    somewhere for replenishing the supply of bits.
>
>2.  Find the distance to the next one in a bit stream.
>    The same considerations as above apply.

These sorts of things are certainly CISCish and would be difficult to
recognize in a compiler.  However, I would argue that they are
possibly misrepresented.

How about a linked-list (oh no!) of integers.
An element in the list represents a "1" bit.
The integer is the number of "zero" bits preceding the 1.
The stream starting with 001010000001...
would look like

	(2, 1, 6, ...)

So, a single memory access per 1 bit.  The "distance to next 1 bit" 
questions can be answered immediately.  The "examine a bit and take
an action" can be done with some of those fancy "decrement, test, and branch"
instructions people use for closing loops.
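
In C, say (the names are mine):

	struct run {               /* one node per 1 bit  */
	    int zeros;             /* 0s preceding that 1 */
	    struct run *next;
	};

	int dist_to_next_one(struct run **s)
	{
	    int d = (*s)->zeros;   /* the answer is just a field */
	    *s = (*s)->next;       /* consume through the 1 bit  */
	    return d;
	}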

Alternatively, ...
It seems like there's something producing the bits and something consuming
them and making decisions based on them.  Why not have the producer
simply call the alternate choices in the consumer directly?


Generally though, my criticism is that you're thinking very low-level.
You decided how you want these operations represented and you seem
unwilling to talk about them at a level above "bit stream".
If you must think this way, I expect you're going to have to
code in the same fashion; that is, in assembly.

Preston Briggs

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/29/91)

In article <1991Apr28.154603.8003@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> 
> > If the languages do not allow the user to communicate  the use of those
> > features of the CISC architecture which can make the object code considerably
> > faster, the compiler cannot take advantage of them.  This IS the situation.
> > For example, suppose there was an instruction, which could be effectively
> > used, which would divide a float by a float, obtaining an integer quotient
> > and a float remainder.  Now just how is the compiler to recognize that this
> > is in fact wanted?  
> 
> In Fortran, you might write
> 
> 	iq = f1 / f2
> 
> and a while later (during which f1 and f2 are not changed)
> 
> 	fr = amod(f1, f2)
> 
> The compiler could recognize, during value numbering, that a
> quotient and remainder of the same operands were being computed.
> During instruction selection (or perhaps peephole optimization),
> it could recognize that it could avoid the real2int conversion
> of the quotient by using the "hrubin" instruction.
> 
> Not everyone does these things the way I've described them, but many
> compilers can achieve the same effect, one way or another.

Any specific instance of this can be added to the compiler's ken, but
until it is, the compiler cannot take advantage of it.  The point is
that the compiler must be instructed to look for them.  As to what it
can do with such instructions, this depends very strongly on what the
hardware provides.

The real2int conversion still has to be done on most machines, and also
an integerize of real.  But suppose that the machine has integer hardware
comparable to floating hardware, and can also handle unnormalized floats.
Then one could put a shifted significand of f1 in a double register,
divide by the significand of f2, setting iq to the quotient; the
remainder, given the exponent of f2 and then normalized, becomes fr.
Now I wonder how many of the readers would think of doing it this way?
And how would the hardware designers recognize that putting the operation
in hardware might be a good idea?

The individual compiler might or might not see it.  The community definitely
would not.

Now, how would you write the code for doing a conditional transfer efficiently
on the next bit of a bit stream, which can be extended (usually by a subroutine
call) when necessary?  It is unlikely that one will know, without querying, just
where one is in the stream.  If you wrote the code, would someone else so
recognize it?
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

jesup@cbmvax.commodore.com (Randell Jesup) (04/29/91)

In article <11411@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>If this was the case, the hardware was deficient.  To a mathematician, floating
>point arithmetic is a complicated version of fixed point arithmetic.  This is a
>point which the RISC designers do not see; the chip real estate for doing good
>integer arithmetic is already there in the floating point stuff, and the design
>should be such as to use it.  Floating point arithmetic uses fixed point 
>internally, so this would not be a problem AT THE TIME OF THE DESIGN OF THE
>ARCHITECTURE.  It is a problem afterward.

	The RPM-40 used the floating point HW for integer multiply and divide,
and I'm certain at least a few others have also.  However, don't do this if
you have only a couple FP registers (a problem the rpm40 had).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
Disclaimer: Nothing I say is anything other than my personal opinion.
Thus spake the Master Ninjei: "To program a million-line operating system
is easy, to change a man's temperament is more difficult."
(From "The Zen of Programming")  ;-)

preston@ariel.rice.edu (Preston Briggs) (04/29/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) wrote

> If the languages do not allow the user to communicate  the use of those
> features of the CISC architecture which can make the object code considerably
> faster, the compiler cannot take advantage of them.  This IS the situation.
> For example, suppose there was an instruction, which could be effectively
> used, which would divide a float by a float, obtaining an integer quotient
> and a float remainder.  Now just how is the compiler to recognize that this
> is in fact wanted?  

And I described how it could be done, using Fortran 77 and pretty
standard optimization (value numbering and peephole optimization).

hrubin@pop.stat.purdue.edu (Herman Rubin) replies:

>Any specific instance of this can be added to the compiler's ken, but
>until it is, the compiler cannot take advantage of it.  The point is
>that the compiler must be instructed to look for them.

Yes indeed.  That's the job of the compiler writer.
If he does a bad job in using the instruction set of the machine,
you should complain to him.

>The real2int conversion still has to be done on most machines, and also
>an integerize of real.  But suppose that the machine has integer hardware
>comparable to floating hardware, and can also handle unnormalized floats.

Well, we can suppose all sorts of things.  Nevertheless, my solution
handles your first set of suppositions.  If you change the machine,
you need to change compilers.

I thought your original question was posed in an attempt to show that
many languages don't have the facilities you consider necessary
to control the machine.  I just tried to show that the effect you
wanted could be achieved.

Preston Briggs

bob@tera.com (Bob Alverson) (04/29/91)

In article <11489@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>[stuff deleted]
>The real2int conversion still has to be done on most machines, and also
>an integerize of real.  But suppose that the machine has integer hardware
>comparable to floating hardware, and can also handle unnormalized floats.
>Then one could put a shifted significand of f1 in a double-length register,
>divide by the significand of f2, set iq to the quotient, and form fr from
>the remainder, attaching the exponent of f2 and then normalizing.
>Now I wonder how many of the readers would think of doing it this way?
>And how would the hardware designers recognize that putting the operation
>in hardware might be a good idea?
>

Gee, and what does the hardware do if the division is 1e300 / 1e-300?
It's got to extract about 2000 bits of quotient before it gets all the
integer bits.  I take it you really want

	iq = (f1 / f2) % some-power-of-two

Since continuable instructions don't mesh well with pipelined systems,
most hardware would only calculate 53 or maybe 64 quotient bits at a
time.  For example, look at the remainder *sequence* used on the 80x87.
Nobody wants to wait for the full calculation to finish before taking
an interrupt or context switching.

Strange as it may seem, the dividers in today's computers can compute
the correct quotient without having the correct remainder around to
return to the user.  Fortunately, some can compute the correct remainder
in one operation with a fused multiply add.
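
[An illustration of that last remark, using the C99 fma() purely as
notation for the hardware operation: given the correctly rounded
quotient q, one fused multiply-add recovers the residual f1 - q*f2
exactly, because the product suffers no intermediate rounding.]

	#include <math.h>

	double residual(double f1, double f2)
	{
	    double q = f1 / f2;          /* correctly rounded quotient   */
	    return fma(-q, f2, f1);      /* f1 - q*f2, with one rounding */
	}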

Bob Alverson, Tera Computer
Resident arithmetic lunatic

glew@pdx007.intel.com (Andy Glew) (05/01/91)

Just thought that I would juxtapose two posts to this newsgroup:

    Strange as it may seem, the dividers in today's computers can compute
    the correct quotient without having the correct remainder around to
    return to the user.  Fortunately, some can compute the correct remainder
    in one operation with a fused multiply add.

    Bob Alverson, Tera Computer
    Resident arithmetic lunatic


And:

    From the Program for the IEEE Symposium on Computer Arithmetic:

    8.1 ``Integer Division Using Reciprocals''
	Robert Alverson, Tera Computer Company

Nuff said?

(NB. I'm on Bob Alverson's side.  This sort of thing is good.)
--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

herrickd@iccgcc.decnet.ab.com (05/04/91)

In article <1991Apr28.151831.2768@lth.se>, bengtl@maths.lth.se (Bengt Larsson) writes:
> In article <11411@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu 
> (Herman Rubin) writes:
>>Have any language designers or hardware designers ever asked for input on the
>>variety of operations which are readily understood by people like Silverman,
>>Lehmer, etc.?  
> 
> Why do you assume that people should come and ask you (a mathematician)
> about computer language development?  What makes you an expert in the
> field? 
> 
He is a customer.  Unfortunately, he does not represent a large enough
subset of their customers.

Unfortunately so, because he is better informed than many other customers,
and the whole customer base might be better served if Herman's ideas
could penetrate the language design community.

Not talking to the mathematicians let the FORTRAN design team at IBM
put the label REAL on a finite subset of the rationals.

We'll never displace floating point numbers now, but what might we
have if the game theorist had spent more time talking with number
theorists back around 1940-1945?

How many engineering juniors believe that

      DO I = 1, 50000
        SUM = SUM + DATA (I)
      END DO

computes a good approximation of the sum of 50000 numbers they measured
somehow?
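
[For anyone who has not tried it, a minimal C demonstration, with
made-up values, of why the order of that loop matters: floating point
addition is not associative.]

	#include <stdio.h>

	int main(void)
	{
	    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
	    /* a + b is exactly 0, so the first sum prints 1;
	       b + c rounds back to -1.0e8f (c is absorbed),
	       so the second sum prints 0. */
	    printf("(a + b) + c = %g\n", (a + b) + c);
	    printf("a + (b + c) = %g\n", a + (b + c));
	    return 0;
	}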

Could we have realized in hardware a number system that does give us
a commutative addition?

I think there is good reason for the hardware and software architects 
to listen to the number theorists.

dan herrick
herrickd@iccgcc.decnet.ab.com

herrickd@iccgcc.decnet.ab.com (05/04/91)

In article <4474.28219270@iccgcc.decnet.ab.com>, I said:
> 
> Could we have realized in hardware a number system that does give us
> a commutative addition?
> 
after giving the example that shows floating point arithmetic is
not associative.  But the fix for the algorithm is to sort the
input data - doesn't that suggest the word "commutative" to you, too?

I'm still
> dan herrick
> herrickd@iccgcc.decnet.ab.com

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (05/08/91)

In article <11411@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <4082@batman.moravian.EDU>, halkoD@batman.moravian.EDU (David Halko) writes:
> > In article <9782@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > > In article <1991Apr5.172533.6717@agate.berkeley.edu>, doug@eris.berkeley.edu (Doug Merritt) writes:
> > > > In article <1991Apr4.125122.1@capd.jhuapl.edu> waltrip@capd.jhuapl.edu writes:

I had to leave the above 4 lines in, just cause it looks so awesome to see
a discussion going so many levels deep :^)
...
> Consider the following two operations, which comprise more than 25% of the 
> operations in some highly efficient, from the "picocode" standpoint, methods
> for generating non-uniform random numbers.
> 
> 1.  Examine a bit, and take some action depending on it.  Also, the next time
> this is to be done, this bit effectively no longer exists.  It is necessary
> to make provision at least somewhere for replenishing the supply of bits.
> 
> 2.  Find the distance to the next one in a bit stream.  The same considerations
> as above apply.

I have got to ask.  Why is it so generally important that the distance between
bits can be determined efficiently?  Note that I want to know, `why is it
important to ME, and to the general computing base?'.  This is the question
that hardware designers and compiler writers ask themselves.

For the hardware/software designer to incorporate the features you want, it
is not enough to know that they are important to YOU.  From
a business case, the only reason that they care if you are happy is if you
give them money.  If the desired features are generally useful, then the
probability is high that inclusion of such features will generate lots of
money.

It seems to me that the only way your crusade for bit distance will ever be
successful is to show the general utility of such functions.  Otherwise, no
matter how interesting (to you), you will not get these features.

Anyway, please tell us.  What is so interesting/useful about non-uniform
random numbers?

-- 
  Terry Ingoldsby                ingoldsb%ctycal@cpsc.ucalgary.ca
  Land Information Services                 or
  The City of Calgary       ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

stephen@pesto.uchicago.edu (Stephen P Spackman) (05/08/91)

In article <653@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:

   It seems to me that the only way your crusade for bit distance will ever be
   successful is to show the general utility of such functions.  Otherwise, no
   matter how interesting (to you), you will not get these features.

Are we talking about "find first one" here?

Find first one was a pretty useful instruction in our lisp compiler.
In fact, we figured we could speed the thing up a LOT by moving to the
Amiga and using the blitter to do bitmask stuff (which we wanted (a)
for closure algorithms and (b) for resource allocation).

A lot of CPU seems to get expended on variable length data encoding
around here. I haven't examined the code myself but it sounds like it
would come up there.

It's a notoriously useful thing for resource allocation in operating
systems, too.

I've surely wanted this more often than, say, floating point addition.

Of course, there are always other data representations, and so long as
the hardware people take the attitude that computers are for fluid
dynamics and to hell with compilers, operating systems and databases
(who uses computers for THAT stuff anyway? :-)....

Lotsa smilies.
----------------------------------------------------------------------
stephen p spackman         Center for Information and Language Studies
systems analyst                                  University of Chicago
----------------------------------------------------------------------

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/08/91)

In article <653@ctycal.UUCP>, ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
> In article <11411@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > In article <4082@batman.moravian.EDU>, halkoD@batman.moravian.EDU (David Halko) writes:
> > > In article <9782@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > > > In article <1991Apr5.172533.6717@agate.berkeley.edu>, doug@eris.berkeley.edu (Doug Merritt) writes:
> > > > > In article <1991Apr4.125122.1@capd.jhuapl.edu> waltrip@capd.jhuapl.edu writes:
 
> I had to leave the above 4 lines in, just cause it looks so awesome to see
> a discussion going so many levels deep :^)
 ...
> > Consider the following two operations, which comprise more than 25% of the 
> > operations in some highly efficient, from the "picocode" standpoint, methods
> > for generating non-uniform random numbers.
 
> > 1.  Examine a bit, and take some action depending on it.  Also, the next time
> > this is to be done, this bit effectively no longer exists.  It is necessary
> > to make provision at least somewhere for replenishing the supply of bits.
 
> > 2.  Find the distance to the next one in a bit stream.  The same considerations
> > as above apply.
 
> I have got to ask.  Why is it so generally important that the distance between
> bits can be determined efficiently.  Note that I want to know, `why is it
> important to ME, and to the general computing base?'.  This is the question
> that hardware designers and compiler writers ask themselves.

I believe most people are aware of the existence of simulation, including
Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
intractable problems.  In these, it is almost always the case that 
considerable use must be made of non-uniform random variables; the 
quantities of interest are usually exponential, normal, Cauchy, binomial,
Poisson, etc., and not uniform.

In any particular problem, it is quite likely that thousands or millions or
even billions of these random numbers will be used.  If this is not a situation
which calls for efficiency, it will do until a real one comes along. :-)
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

wright@Stardent.COM (David Wright) (05/09/91)

In article <653@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
>In article <11411@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>> In article <4082@batman.moravian.EDU>, halkoD@batman.moravian.EDU (David Halko) writes:
>> > In article <9782@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>> Consider the following two operations, which comprise more than 25% of the 
>> operations in some highly efficient, from the "picocode" standpoint, methods
>> for generating non-uniform random numbers.
>> 
>> 1.  Examine a bit, and take some action depending on it.  Also, the next time
>> this is to be done, this bit effectively no longer exists.  It is necessary
>> to make provision at least somewhere for replenishing the supply of bits.
>> 
>> 2.  Find the distance to the next one in a bit stream.  The same
>> considerations as above apply.
>
>I have got to ask.  Why is it so generally important that the distance between
>bits can be determined efficiently.  Note that I want to know, `why is it
>important to ME, and to the general computing base?'.  This is the question
>that hardware designers and compiler writers ask themselves.

Hear hear.  And this leads to a second question of HR, which has been
posed once before but not answered (or I missed it):  Why does your
application have to work on a bit stream?  What is wrong with
instead computing the data about the bit stream, and passing the
results along as, say, pairs (a,b), where a is the number of
consecutive 0's and b the number of consecutive 1's in the current
part of the stream?

>Anyway, please tell us.  What is so interesting/useful about non-uniform
>random numbers?

Well, if he's generating numbers that conform to some other
distribution, the results certainly wouldn't be uniform random numbers
(though I think this is usually done by *starting* with a uniform
distribution and then manipulating that into, say, a Gaussian).
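
[The usual manipulation alluded to here is something like the
Box-Muller transform; a minimal C sketch, with rand() standing in for
a serious uniform generator:]

	#include <math.h>
	#include <stdlib.h>

	/* Turn two independent uniforms on (0,1) into one standard
	   normal (Gaussian) deviate. */
	double gaussian(void)
	{
	    const double pi = 3.14159265358979323846;
	    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
	    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
	    return sqrt(-2.0 * log(u1)) * cos(2.0 * pi * u2);
	}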

  -- David Wright, not officially representing Stardent Computer Inc
     wright@stardent.com  or  uunet!stardent!wright

Groucho:  America is waiting to hear him sing!
Chico:    Well, he can sing pretty loud, but not that loud.
                          -- A Night at the Opera

chip@tct.com (Chip Salzenberg) (05/09/91)

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>In article <653@ctycal.UUCP>, ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
>> I have got to ask.  Why is it so generally important that the distance
>> between bits can be determined efficiently.  Note that I want to know,
>> `why is it important to ME, and to the general computing base?'.
>
>I believe most people are aware of the existence of simulation, including
>Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
>intractable problems.

That kind of problem is not "generally important" in that it does not
come up in the "general computing base".

Care to try again?
-- 
Brand X Industries Sentient and Semi-Sentient Being Resources Department:
        Because Sometimes, "Human" Just Isn't Good Enough [tm]
     Chip Salzenberg         <chip@tct.com>, <uunet!pdn!tct!chip>

gideony@microsoft.UUCP (Gideon YUVAL) (05/10/91)

In article <STEPHEN.91May8001742@pesto.uchicago.edu> stephen@pesto.uchicago.edu (Stephen P Spackman) writes:
>Are we talking about "find first one" here?
>
>Find first one was a pretty useful instruction in our lisp compiler.

If you start from the low-order bit, you can find the first "one"
bit by:

table[ (x  XOR (x-1)) MOD 37]

On a 386, this is 4X as fast (worst-case) as the "hardware" BSF.
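
[Spelled out in C, assuming the goal is the 0-based index of the
lowest set bit in a 32-bit word: for nonzero x, x XOR (x-1) is a mask
of k+1 one bits, where k is that index, and since 2 is a primitive
root mod 37 the 32 possible masks are all distinct mod 37, so a
37-entry table recovers k.]

	static int ffs_table[37];

	static void ffs_init(void)
	{
	    int k;
	    for (k = 0; k < 32; k++) {
	        unsigned bit = 1u << k;
	        ffs_table[(bit ^ (bit - 1)) % 37] = k;
	    }
	}

	static int find_first_one(unsigned x)    /* x must be nonzero */
	{
	    return ffs_table[(x ^ (x - 1)) % 37];
	}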
-- 
Gideon Yuval, gideony@microsof.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/10/91)

In article <28297C23.6984@tct.com>, chip@tct.com (Chip Salzenberg) writes:
> According to hrubin@pop.stat.purdue.edu (Herman Rubin):
> >In article <653@ctycal.UUCP>, ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
> >> I have got to ask.  Why is it so generally important that the distance
> >> between bits can be determined efficiently.  Note that I want to know,
> >> `why is it important to ME, and to the general computing base?'.

> >I believe most people are aware of the existence of simulation, including
> >Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
> >intractable problems.
 
> That kind of problem is not "generally important" in that it does not
> come up in the "general computing base".

That something is not important to everyone is no reason not to make it
available to those who can see a use for it.

How much of the driving public uses the low gears of automatic transmissions?
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

gsh7w@astsun.astro.Virginia.EDU (Greg Hennessy) (05/10/91)

Chip Salzenberg writes:
#> That kind of problem is not "generally important" in that it does not
#> come up in the "general computing base".

Herman Rubin writes:
#That something is not important to everyone is no reason not to make it
#available to those who can see a use for it.

Chip didn't say something had to be important for everyone before
making it available, Chip said it had to be important to the general
base. Those aren't the same thing.

Herman, you seem to have quite specialized computer needs, and more
knowledge about computers than the average person, but the types of
things you do are not useful to J. Random User, so people who write
compilers for general computer users don't find it cost effective to
add your optimizations into their code.

As a radio-astronomer, I have specialized needs in computing also. I
don't complain that there is no commodity language to reduce my data,
instead I (and other radio-astronomers) write all our code ourselves.
Our market is not large enough to expect off the shelf solutions to
our problems.

Why don't you apply for a grant, and hire someone to write a language
for you?

--
-Greg Hennessy, University of Virginia
 USPS Mail:     Astronomy Department, Charlottesville, VA 22903-2475 USA
 Internet:      gsh7w@virginia.edu  
 UUCP:		...!uunet!virginia!gsh7w

rwa@cs.athabascau.ca (Ross Alexander) (05/11/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
	[much chat about whether Herman's apps are typical elided]
>How much of the driving public uses the low gears of automatic transmissions?

All of them, Herman.  Whenever they accelerate from a stop.  That's
what an automatic transmission is for.

-- 
Ross Alexander    rwa@cs.athabascau.ca    (403) 675 6311    ve6pdq
		`You were s'posed to laugh!' -- Zippy

guy@auspex.auspex.com (Guy Harris) (05/11/91)

>> I have got to ask.  Why is it so generally important that the distance between
>> bits can be determined efficiently.  Note that I want to know, `why is it
>> important to ME, and to the general computing base?'.  This is the question
>> that hardware designers and compiler writers ask themselves.
>
>I believe most people are aware of the existence of simulation, including
>Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
>intractable problems.

Yes, but unless the person who posed the original question does those
kinds of simulations, it's not important to him, and unless a given
member of the general computing base does them, it's not important to
them, either.

I.e., you answered an interesting question, namely "tell me a use for
finding the distance to the next 1 in a bit stream", but you didn't
answer the question that was asked.

Now, there are other places where finding the distance to the next 1 in
a bit stream is useful.  It may well be that it's worth providing some
kind of hardware assist for it; I'm curious what forms of "hardware
assist" of that sort exist (other than doing it using the obvious
simple-minded loop, but in microcode).

Some questions then are "what other useful hardware would you have to
sacrifice, in some given implementation, to add the hardware assist for
'find next 1'?" and "will we, at some point, not have to sacrifice
anything really useful to get it?"  If the latter, you might want to put
such an operation into the instruction set anyway, and have a non-assisted
implementation early on.

(Is 'find next 1' similar enough to floating-point normalization that
the same hardware assistance can be used for both?)

BTW, I assume that by "most people" in "I believe most people are aware
of the existence of simulation..." you mean "most readers of this
group", which may well be true.  I don't know how "find first 1" comes
in handy for those simulations, though.

jpk@ingres.com (Jon Krueger) (05/11/91)

In article <12054@mentor.cc.purdue.edu>, Herman Rubin does not address
 the question:
>> Why is it so generally important that the distance between
>> bits can be determined efficiently ... why is it
>> important to ME, and to the general computing base?

Instead he addresses the question "why is it so specifically
important to Herman Rubin":

> I believe most people are aware of the existence of simulation, including
> Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
> intractable problems.

No one asked that question.  We believe it's important to you, Herman.
Can you answer the question that was asked?

Why is it important to the general computing base that the distance
between bits can be determined efficiently?

-- Jon
--

Jon Krueger, jpk@ingres.com 

rockwell@socrates.umd.edu (Raul Rockwell) (05/11/91)

Herman Rubin:
   >> Consider the following two operations, ...
   >> 1.  Examine a bit, ...
   >> 2.  Find the distance to the next one in a bit stream.

Terry Ingoldsby:
   >I have got to ask.  Why is it so generally important that the
   >distance between bits can be determined efficiently.

David Wright:
   Hear hear.  And this leads to a second question of HR, which has been
   posed once before but not answered (or I missed it):  Why does your
   application have to work on a bit stream? 

Chip Salzenberg:
   That kind of problem is not "generally important" in that it does not
   come up in the "general computing base".

harumph.

At work we have a hand-optimized assembly routine to convert from a
bit array to a list of indices.  It gets used a lot.
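
[The straightforward C version of such a routine -- the hand-optimized
assembly is presumably cleverer -- might look like this, assuming
32-bit words:]

	/* Append the index of every set bit in bits[0..nwords) to out[];
	   return how many indices were written. */
	static int bits_to_indices(const unsigned *bits, int nwords, int *out)
	{
	    int n = 0, w, b;
	    for (w = 0; w < nwords; w++) {
	        unsigned x = bits[w];
	        if (x == 0)
	            continue;                /* skip empty words quickly */
	        for (b = 0; b < 32; b++)
	            if (x & (1u << b))
	                out[n++] = 32 * w + b;
	    }
	    return n;
	}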

Please note that a bit list is very handy for dealing with selection
problems.  They are very compact, and have very predictable size
requirements.  Selection problems include highlighting text for
emphasis, intermediate results during a database query, as well as Herman
Rubin's probabilistic models.

They're also handy in some domains for converting from some arbitrary
list of integers to a unique, sorted list (but otherwise the same
integers), in O(n).

We use "bit lists as integers" for other things too (once you have a
generic feature you tend to use it whenever it's handy -- particularly
if it's fast).

As for the problem of "distance between bits" vs "absolute position of
bits", note that if you have a _lot_ of bits you start talking about
very long lists of integers (and possibly even exceeding maxint, or
whatever, for the representation you're using).  This is particularly
painful if your bitlist is dense.  Both have their uses (as does
a list of ordered pairs of the form (position, offset)), and we happen to
use them a lot.

Raul Rockwell

zalman@mips.com (Zalman Stern) (05/11/91)

In article <7738@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
[...]
>Now, there are other places where finding the distance to the next 1 in
>a bit stream is useful.  It may well be that it's worth providing some
>kind of hardware assist for it; I'm curious what forms of "hardware
>assist" of that sort exist (other than doing it using the obvious
>simple-minded loop, but in microcode).

The IBM RS/6000 has a single cycle count leading zeros instruction. The
hardware is also used by the multiplier to short circuit multiplies by
smaller constants. The operation takes from 3 to 5 cycles depending on the
size of the multiplier. The clz hardware is used to decide how much work it
has to do.

>
>Some questions then are "what other useful hardware would you have to
>sacrifice, in some given implementation, to add the hardware assist for
>'find next 1'?" and "will we, at some point, not have to sacrifice
>anything really useful to get it?" so that you might want to put such an
>operation into the instruction set anyway, and have a non-assisted
>implementation early on.

A count leading zeros (or ffs if you prefer) instruction is not a big deal
other than that you have to design the hardware to do it.  (I.e., it doesn't
place unreasonable demands on the register file, and it doesn't eat lots of
opcode space.)  However, it's not clear that this single instruction does the
job.  You can either look for a 1 or a 0 bit, and you can look from most
significant to least significant or vice versa.  Herman probably wants all
of these options...
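
[Two of the four variants come cheaply from the other two, since
finding the first 0 bit is finding the first 1 bit of the complement;
only the two scan directions need real hardware.  A sketch, where
find_first_one_msb() and find_first_one_lsb() are assumed primitives
(the RS/6000 count leading zeros gives the MSB-first one):]

	extern int find_first_one_msb(unsigned x);   /* assumed primitive */
	extern int find_first_one_lsb(unsigned x);   /* assumed primitive */

	static int find_first_zero_msb(unsigned x) { return find_first_one_msb(~x); }
	static int find_first_zero_lsb(unsigned x) { return find_first_one_lsb(~x); }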

>
>(Is 'find next 1' similar enough to floating-point normalization that
>the same hardware assistance can be used for both?"

Putting it in the FP unit makes it harder to use the result for shifts and
such on machines with separate FP and integer units.  (Hooking the integer
register file up to the FP normalization hardware is going to be very
painful.)

My guess is that special purpose hardware would be useful for this sort of
operation. DMA it in at bus speed and write the values into scratch RAM.
Interrupt the processor when the scratch RAM gets full or the buffer gets
empty.  (Or DMA the scratch RAM back out too...)  Go ahead and toss a Xilinx
or something into the hardware to make it programmable.
-- 
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman     (408) 524 8395
  "Never rub another man's rhubarb" -- the Joker via Pop Will Eat Itself

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/11/91)

In article <rwa.673895513@aupair.cs.athabascau.ca>, rwa@cs.athabascau.ca (Ross Alexander) writes:
> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> 	[much chat about whether Herman's apps are typical elided]
> >How much of the driving public uses the low gears of automatic transmissions?
> 
> All of them, Herman.  Whenever they accelerate from a stop.  That's
> what an automatic transmission is for.

To clarify, I meant the specific ability to place the transmission in low gears.

Current computer hardware can do anything, assuming sufficient external storage,
but not efficiently.  Similarly with software.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

gsh7w@astsun.astro.Virginia.EDU (Greg Hennessy) (05/12/91)

Herman Rubin writes:
#To clarify, I meant the specific ability to place the transmission in low gears.

I do this every time it snows. I would have thought that Lafayette had
its fair share of snow.

--
-Greg Hennessy, University of Virginia
 USPS Mail:     Astronomy Department, Charlottesville, VA 22903-2475 USA
 Internet:      gsh7w@virginia.edu  
 UUCP:		...!uunet!virginia!gsh7w

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/12/91)

In article <1991May10.204426.24828@ingres.Ingres.COM>, jpk@ingres.com (Jon Krueger) writes:
> In article <12054@mentor.cc.purdue.edu>, Herman Rubin does not address
>  the question:
> >> Why is it so generally important that the distance between
> >> bits can be determined efficiently ... why is it
> >> important to ME, and to the general computing base?
> 
> Instead he addresses the question "why is it so specifically
> ipmortant to Herman Rubin":
> 
> > I believe most people are aware of the existence of simulation, including
> > Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise
> > intractable problems.
> 
> No one asked that question.  We believe it's important to you, Herman.
> Can you answer the question that was asked?
> 
> Why is it important to the general computing base that the distance
> between bits can be determined efficiently?

The summary points out the problems.  Does one have to go to great lengths
to get a car which has features not in the "general driving base"?

Do film companies only make films for the "general public"?

Should one even consider that universities only teach courses for the
"general student"?

In addition, there is much use of instructions in systems programs and
libraries which the individual programmer will not use.  It is unlikely
that the person using the subroutines which would make use of efficiently
obtaining the distance between 1's in a bit stream would be aware of
this, any more than the person who computes a trigonometric or exponential
function is aware that the algorithm has an integer quotient and floating
remainder in it.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

dmocsny@minerva.che.uc.edu (Daniel Mocsny) (05/12/91)

In article <12235@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <1991May10.204426.24828@ingres.Ingres.COM>, jpk@ingres.com (Jon Krueger) writes:
>> Why is it important to the general computing base that the distance
>> between bits can be determined efficiently?
>
>The summary points out the problems.  Does one have to go to great lengths
>to get a car which has features not in the "general driving base"?

I believe that Herman Rubin could solve his problem with these steps:

1. Develop a computer code which solves a reasonable set of problems,
and which relies extensively on a computer's ability to determine the
distance between bits.

2. Submit the code to SPEC and get it included in the standard
benchmark suite.

3. Watch panic-stricken marketroids scurry to protect their SPEC
ratings by further partitioning SPEC into an additional category:
SPECrubins.


--
Dan Mocsny				
Internet: dmocsny@minerva.che.uc.edu

jpk@ingres.com (Jon Krueger) (05/13/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) whines:
> That something is not important to everyone is no reason not to make it
> available to those who can see a use for it.

But of course it is available.  What exactly is preventing you
from using whatever feature you like?

-- Jon
--

Jon Krueger, jpk@ingres.com 

peter@ficc.ferranti.com (peter da silva) (05/13/91)

In article <12235@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> The summary points out the problems.  Does one have to go to great lengths
> to get a car which has features not in the "general driving base"?

Yes. For example, it is getting harder and harder to find a car with
competently designed seatbelts instead of Rube Goldberg "passive restraint"
systems (whether or not accompanied by an airbag). That's because the
general public is lazy, so that's where all the money is.

> Do film companies only make films for the "general public"?

In general, yes. At least that's where all the money is.

> Should one even consider that universities only teach courses for the
> "general student"?

That's where all the money is.

> In addition, there is much use of instructions for systems programs and
> libraries  which the individual programmer will not use.

So how about some *examples*?

The "general computer base" is where the money is, and building a chip is
not cheap. Nor is the return on investment that high.
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

rwa@cs.athabascau.ca (Ross Alexander) (05/13/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>In article <rwa.673895513@aupair.cs.athabascau.ca>, rwa@cs.athabascau.ca (Ross Alexander) writes:
>> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>> 	[much chat about whether Herman's apps are typical elided]
>> >How much of the driving public uses the low gears of automatic transmissions?
>> 
>> All of them, Herman.  Whenever they accelerate from a stop.  That's
>> what an automatic transmission is for.

>To clarify, I meant the specific ability to place the transmission in low gears.

But the auto transmission knows when to do that for you.  Some poor
underpaid mechanical engineer spent years designing an analogue
(fluid) computer that monitors engine speed, accelerator position,
drive wheel speed, and torques, and then selects a gear ratio with a
view to optimizing fuel economy, engine wear, and ghods know what
else.  They are remarkably robust (mine currently has 275,000 km
without any failures) and maintenance-free.  Using one allows the
operator to concentrate on an appropriate problem domain - keeping the
vehicle out of the ditch.  Why are you so d*mned eager to second guess
this mechanism, and divert your attention from the thing that,
compared to the machinery, you do well (guidance)?  BTW, I don't
accept the `snow hypothesis'; with proper tires, second-gear starts to
prevent slippage are redundant (and besides, they're hard on the
engine - low speed/high torque (lugging) strains the crankshaft and
its bearings).

All these arguments can be moved back into the computer architecture
discussion with remarkably little rephrasing, I might add.

>Current computer hardware can do anything,assuming sufficient external storage,
>but not efficiently.  Similarly with software.

I can only wonder, if they're inefficient, why they are so much
faster and cheaper than they used to be.  What the heck do you mean by
`efficient'?  Obviously some new use of the word I was previously
unacquainted with (apologies to Arthur Dent).

-- 
Ross Alexander    rwa@cs.athabascau.ca    (403) 675 6311    ve6pdq
		`You were s'posed to laugh!' -- Zippy

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/14/91)

In article <rwa.674151323@aupair.cs.athabascau.ca>, rwa@cs.athabascau.ca (Ross Alexander) writes:
> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
			
			.................

> >To clarify, I meant the specific ability to place the transmission in low gears.
> 
> But the auto transmission knows when to do that for you.  Some poor
> underpaid mechanical engineer spent years designing an analogue
> (fluid) computer that monitors engine speed, accelerator position,
> drive wheel speed, and torques, and then selects a gear ratio with a
> view to optimizing fuel economy, engine wear, and ghods know what
> else.

According to those who know, they do not do a very good job.  And there
are situations in which the overrides are needed; why do you think that
the manufacturers still keep them?  There are places I would not drive
without using these overrides.

			...................

> >Current computer hardware can do anything,assuming sufficient external storage,
> >but not efficiently.  Similarly with software.
> 
> I can only wonder that, if they're inefficient, why are they so much
> faster and cheaper that they used to be?  What the heck do you mean by
> `efficient'?  Obviously some new use of the word I was previously
> unaquainted with (apologies to Arthur Dent).

If you make something 10^6 times as fast, and with 10^5 times the capacity,
and get 10^4 times as much done, it is not as efficient.  Now these are not
precise figures.  But the CRAYs, for example, take about 20 instructions
to perform a double precision multiplication of two single precision
numbers, while the CYBER 205 takes exactly 2.  The earliest computers
multiplied two integers of more than 32 bits and obtained a double length
result in one instruction.  I suggest you program this on your box, and
see how many instructions it takes, if the instruction is not there.
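
[Here is roughly what it costs in portable C when the double-length
multiply instruction is not exposed: the standard schoolbook split
into 16-bit halves, assuming 32-bit unsigned.]

	/* Compute the 64-bit product of a and b in two 32-bit halves. */
	static void mul32x32(unsigned a, unsigned b, unsigned *hi, unsigned *lo)
	{
	    unsigned al = a & 0xFFFF, ah = a >> 16;
	    unsigned bl = b & 0xFFFF, bh = b >> 16;
	    unsigned p0 = al * bl;           /* low  x low  */
	    unsigned p1 = al * bh;           /* cross terms */
	    unsigned p2 = ah * bl;
	    unsigned p3 = ah * bh;           /* high x high */
	    unsigned mid = (p0 >> 16) + (p1 & 0xFFFF) + (p2 & 0xFFFF);

	    *lo = (p0 & 0xFFFF) | (mid << 16);
	    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
	}

[Four multiplies plus a dozen shifts, masks, and adds, versus one
instruction on a machine that returns the double-length result.]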

If some hardware does not have communication between the integer and floating
registers, conversion is a real horror.  I do not think that any kludge is
likely to be as bad as the complicated code needed, using memory, no less.
Now the early computers were fast on memory and transfer, but the newer
ones are very definitely not so, unless trickery is used.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

preston@ariel.rice.edu (Preston Briggs) (05/14/91)

hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

>But the CRAYs, for example, take about 20 instructions
>to perform a double precision multiplication of two single precision
>numbers, while the CYBER 205 takes exactly 2.  The earliest computers
>multiplied two integers of more than 32 bits and obtained a double length
>result in one instruction.

Sure, but each instruction was very slow, in modern terms.

>Now the early computers were fast on memory and transfer, but the newer
>ones are very definitely not so, unless trickery is used.

New computers have much faster memory access.
They also have much, much faster FP and integer instructions.
So the balance has changed.  Everyone knows this.

The idea is that the overall throughput goes up.
Certainly my programs run faster.  I expect yours do too.

If you want to take advantage of new architectures and implementations,
you're probably going to have to rethink some of your old assumptions.
For example, in the old days, FP was much more expensive than integer
arithmetic.  Now they are about the same.  In the old days, FP was
more expensive than memory accesses.  Now FP is generally cheaper.

It's easy to get old instructions (say int x int -> long int) in 1 cycle,
just slow the cycle time down to old rates.

jpk@ingres.com (Jon Krueger) (05/14/91)

hrubin@pop.stat.purdue.edu (Herman Rubin):
> Does one have to go to great lengths
> to get a car which has features not in the "general driving base"?

Yes.  "Great lengths" usually means cost.  TVRs are available
for those who are willing to pay.

> Do film companies only make films for the "general public"?

Yes.  Custom emulsions are available for those who are willing
to pay.  The rest of us benefit from commodity hardware.

> Should one even consider that universities only teach courses for the
> "general student"?

Of course.  Universities regularly cancel courses if too few people
sign up.  Courses are not seminars.  Seminars are not tutoring.

> [a] person using the subroutines which would make use of efficient
> obtaining of the distance between 1's in a bit stream [might be
> unaware of it]

Where's your DATA?  Regardless of whether he's aware of it, just how
many microseconds has Joe User saved this year because of efficient bit
distance routines working on his behalf?  Where's your DATA?  Skip
the analogies, they're not compelling and even if they were they're
without substance.  Where's your DATA?

-- Jon
--

Jon Krueger, jpk@ingres.com 

kers@hplb.hpl.hp.com (Chris Dollin) (05/14/91)

Ross Alexander says:

   [quoting Herman Rubin]
   >To clarify, I meant the specific ability to place the transmission in low gears.

   But the auto transmission knows when to do that for you.  

My experience with manual gears [in Britain] vs automatic gears [America, much
less experience] is that ``knows when to do that for you'' is an overstatement.

   Using one allows the operator to concentrate on an appropriate problem
   domain - keeping the vehicle out of the ditch.  

Is this a matter of fact, or of plausible opinion?  I'm less aware of changing
gear [in my car] than I am of an automatic mishandling gear changes
- that is to say, it [the former] happens, but not very often.

   Why are you so d*mned eager to second guess
   this mechanism, and divert your attention from the thing that,
   compared to the machinery, you do well (guidance)?  

He then adds

   All these arguements can be moved back into the computer architecture
   discussion with remarkably little rephrasing, I might add.

Well, I'd agree, but perhaps with the opposite sign! I usually disagree with
Herman on architecture; but it seems to me that the car gearing analogy is all
in his favour. Perhaps, if we wish to use this analogy, we should find a part
of the car more suited to our purposes ...

One of the Rules for System Design [the author's name I recall with enough
uncertainty to be too embarrassed to attempt to write here] was ``don't hide
power''. Isn't this applicable here? [The Rule doesn't say ``go out of your way
to provide power in the form the user would most like'', of course.]
--

Regards, Kers.      | "You're better off  not dreaming of  the things to come;
Caravan:            | Dreams  are always ending  far too soon."

hinton@gca.UUCP (Edward Hinton) (05/14/91)

In article <rwa.674151323@aupair.cs.athabascau.ca> rwa@cs.athabascau.ca (Ross Alexander) writes:
>hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>
>>To clarify, I meant the specific ability to place the transmission in low gears.
>
>Why are you so d*mned eager to second guess
>this mechanism, and divert your attention from the thing that,
>compared to the machinery, you do well (guidance)?  BTW, I don't
>accept the `snow hypothesis'; with proper tires, second gear starts to
>prevent slippage are redundant (and besides, they're hard on the
>engine - low speed/high torque (lugging) strains the crankshaft and
>its bearings).
>

I'm new to this thread, but I have a bone to pick with this reasoning,
as well as whether Ross has ever lived through a winter driving an
automatic transmission car in New England or a similar snow area.
Snow here is WET.  It slides.  It becomes something akin to sawdust and
motor oil mixed together during the early portions of a storm, which
happens to be when MOST accidents due to skidding occur up here.
I much prefer the control of a manual (translate: RISC) transmission,
although when driving an automatic (translate: CISC), I have many times
found that lower gears are the ONLY way to go up hill unless you care
to invest in chains which contribute to road damage unless the road is covered
with ice.  Low gears are also necessary when you have certain engine problems
miles from the nearest service station.  CISC might be claimed to do things
better under the conditions designed for, but I recently had a problem
which required me to use the two low gears to get home in order to
prevent stalling every few feet.

In computer architecture terms, the principles are the same.  CISC can do
wonderful things that the designers thought of.  But those are usually
the easy cases to predict.  'When things go wrong' and 'under adverse
conditions' are the ones where no amount of hardware foresight will
suffice, and exceptions are where software folks really make their
money.  RISC makes the hard cases easier.  Of course, one could claim
that just as automatic transmissions also allow manual use of lower
gears, CISC machines don't need to forbid simple instructions also.
But we're dealing with extra baggage then, and just as the lower gears
in my truck don't work nearly as well as my standard transmission car,
so too, extra baggage in a CISC is unlikely to be as efficient as
the regular RISC instructions.

Please don't simply flame my next remark, but my philosophy is simple:
"Hardware designers shouldn't keep trying to guess what unexpected
ways I will want to use their hardware."  I am just as creative as a
hardware guy, just in a different line of work.  I want fast, simple,
general tools that let me be creative.  RISC gives me that.

abaum (Allen Baum) (05/14/91)

In article <1991May13.211555.28824@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:

>It's easy to get old instructions (say int x int -> long int) in 1 cycle,
>just slow the cycle time down to old rates.

It's easy to do now, with the same cycle times. Cycle times are
not the issue in this case. The issue is one of returning two
results, and what to do with them.  There are various solutions; almost
all are disliked by either compiler writers, architects, or chip
designers.

These solutions include specifying two destination registers in the
instruction, overwriting one or both of the source registers, writing
back to dest reg/dest reg+1 pair, keeping the upper half in a special
register..... you can probably think of more, and come up with
the reasons that each of the groups above dislikes it.  The reasons
for disliking them aren't terribly frivolous either - they just
outweigh the advantages of returning both (in the eyes of those
particular groups, AT the particular time - times change!)

chip@tct.com (Chip Salzenberg) (05/15/91)

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>That something is not important to everyone is no reason not to make it
>available to those who can see a use for it.

I agree... unless the inclusion of a feature slows down other, more
widely used features.

The addition of an instruction can make a CPU slower, harder to build
and harder to test.  "Neat" instructions are not free.
-- 
Brand X Industries Custodial, Refurbishing and Containment Service:
         When You Never, Ever Want To See It Again [tm]
     Chip Salzenberg   <chip@tct.com>, <uunet!pdn!tct!chip>

paulh@cimage.com (Paul Haas) (05/16/91)

In article <ROCKWELL.91May10233855@socrates.umd.edu> rockwell@socrates.umd.edu (Raul Rockwell) writes:
>
>At work we have a hand-optimized assembly routine to convert from a
>bit array to a list of indices.  It gets used a lot.
>
>Please note that a bit list is very handy for dealing with selection
>problems.  They are very compact, and have very predictable size
>requirements.  
> ...
>Raul Rockwell

Our product also uses bit lists to store selection lists.  According to
gprof my rather naive find_next_bit() function (coded in C) took an
insignificant amount of run time.  In our application we have to render
each selected entity, and that takes far longer than the bit
twiddling.

To see if a new instruction would be useful, it helps to have some
numbers to show how much it would speed up a real application.  For
our application we will probably get a measurable improvement (with a
profiler you can measure lots of insignificant things :-)) when we can
use 64 bit words.  Sometimes the selection lists are rather sparse,
so we spend some time scanning for the next non-zero word.  64 bit
words will speed up lots of applications.
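
[A naive find_next_bit() along the lines described -- the name is from
the post, the body is a guess at it.  Returns the index of the first
set bit at or after 'from', or -1 if there is none; 32-bit words
assumed.]

	static int find_next_bit(const unsigned *bits, int nbits, int from)
	{
	    int i = from;
	    while (i < nbits && ((bits[i / 32] >> (i % 32)) & 1) == 0) {
	        if (i % 32 == 0 && bits[i / 32] == 0)
	            i += 32;                 /* skip a whole zero word */
	        else
	            i++;
	    }
	    return (i < nbits) ? i : -1;
	}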
--
Paul Haas paulh@cimage.com (313) 761-7478
I am not now, nor have I ever been an official spokesman for Cimage
Corporation.

ward@vlsi.waterloo.edu (Paul Ward) (05/16/91)

In article <2831299D.53F7@tct.com> chip@tct.com (Chip Salzenberg) writes:
>According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>>That something is not important to everyone is no reason not to make it
>>available to those who can see a use for it.
>
>I agree... unless the inclusion of a feature slows down other, more
>widely used features.
>
>The addition of an instruction can make a CPU slower, harder to build
>and harder to test.  "Neat" instructions are not free.

Even if the "Neat" instruction is free, it should still not be added unless
clearly warranted.  What is free today has to be supported tomorrow.

Paul Ward
University of Waterloo

-- 
They will say, "As surely as the LORD lives, who brought the Israelites up out
of the land of the north and out of all the countries where he had banished
them."  For I will restore them to the land I gave to their forefathers.
                                                                Jeremiah 16:15

mlord@bwdls58.bnr.ca (Mark Lord) (05/16/91)

<Ross Alexander says:
<
<   [quoting Herman Rubin]
<   >To clarify, I meant the specific ability to place the transmission in low gears.
<
<   But the auto transmission knows when to do that for you.  

This is probably a lousy way to try to shoot down this analogy.
It is fairly common knowledge that manual transmissions get significantly
better fuel economy than automatic transmissions.  Therefore, when fuel economy
is important, manual control is also very important.   :)

dik@cwi.nl (Dik T. Winter) (05/17/91)

In article <6812@bwdls58.bnr.ca> mlord@bwdls58.bnr.ca (Mark Lord) writes:
 > This is probably a lousy way to try to shoot down this analogy.
Extremely lousy.
 > It is fairly common knowledge that manual transmissions get significantly
 > better fuel economy than automatic transmissions.
Right on the dot.  It only depends on the driver.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

chip@tct.com (Chip Salzenberg) (05/17/91)

According to rockwell@socrates.umd.edu (Raul Rockwell):
>Chip Salzenberg:
>   That kind of problem is not "generally important" in that it does not
>   come up in the "general computing base".
>
>harumph.
>
>At work we have a hand-optimized assembly routine to convert from a
>bit array to a list of indices.  It gets used a lot.

Mr. Rockwell's point is presumably that it would be more widely used
were it more widely implemented.  I cannot help but agree.

Unfortunately, it is impossible to do cost/benefit analyses on
possible universes of computation in which given tools are more or
less available or efficient.  A pity.

I've been taken to task by another reader for my use of the term
"general computing base", since I can't define it.  That point is also
well taken.

Things are not as simple as I had believed...
-- 
Brand X Industries Custodial, Refurbishing and Containment Service:
         When You Never, Ever Want To See It Again [tm]
     Chip Salzenberg   <chip@tct.com>, <uunet!pdn!tct!chip>