[comp.arch] Compilers taking advantage of architectural enhancements

aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)

>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.

Ah, a potentially interesting and useful topic.  Perhaps we can start
a discussion that will lead to a list of possible hardware
architectural enhancements that a compiler can/cannot take advantage
of?  Maybe the real compiler guys (preston, David C.) will help us
out?

By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any
existing compiler by a fairly good CS grad, and optimizations that
require so much background that only the "experts" in the field can
reasonably be expected to do this.  For example, many of the
parallelizing FORTRAN loop optimizations can reasonably only be
expected to be done by people from Rice or UIUC CSRD (note that I'm
UIUC, but CRHC, not CSRD), so unless you are willing to pay for these
guys (or their assorted front companies) you aren't likely to get them
onto your machine.
    Cost of compiler development can be significant.  Sometimes a
company might put a hardware feature in even though they know a
compiler approach would be better, because they can't afford the
compiler guys (or the compiler guys already have exclusive contracts
with a competitor).


Let me list a few things to start off the discussion.  I hope and
expect to be shot down on a few of them.  It's easy to list a few of
the hardware enhancements that we already know compilers can take
advantage of.


Branch Delay Slots - small number
    One or two branch delay slots can be filled quite easily.
    DIFFICULTY: a reasonably good CS grad can do it.

Branch Delay slots - large number
    Most groups report difficulty filling a large number of branch delay
    slots, although my group's compiler guy (UIUC IMPACT, Prof. Wen-mei Hwu)
    has been able to get good results for up to 8 delay slots, on slightly
    different hardware (forward semantic, with profiling feedback).
    DIFFICULTY: given the slightly different hardware, and exposure
    	to the necessary ideas, a reasonably good CS grad can do it.

Register file - moderate sized (up to 32 registers)
    Most compilers can use these efficiently. 
    DIFFICULTY:	a reasonably good CS grad can do it.
    	Avoiding IBM's patents is a slight problem.

Register file - large (around 128 registers, or more)
    Most compilers do not get enough benefit from these to justify
    the extra hardware, or the slowed down register access.

Multiple, hierarchical, registers sets
    Like Cray's fast S registers and slower T registers.
    It took a very long time for compilers to take advantage 	
    of these multiple register sets.
    DIFFICULTY: complex.
    

Separate floating point register file
    Most compilers can take advantage of this.
    DIFFICULTY: easy

Heterogenous register file
    Few compilers have been developed that can take advantage of a truly heterogenous
    register file, one in which, for example,  the divide unit writes to registers
    D1..D2, the add unit writes to registers A1..A16, the shift unit writes to registers
    S1..S4 -- even though such hardware conceivably has a cycle time advantage over
    homogenous registers, even on VLIW machines where data can easily be moved
    to generic registers when necessary.
    DIFFICULTY: hard. 

Instruction cache
    Many groups have reported compilers that can take advantage of small instruction caches
    by rearranging code to minimize I-cache misses.  Eg. HP, UIUC IMPACT.
    DIFFICULTY: moderately complex - a good CS grad can do it, but it requires tools
    	and background that take a while to develop, and are not widely available.
    One of the complicating factors is that this sort of optimization is most
    useful when it is done in cooperation with the linker/loader, which is
    something that your average compiler hacker doesn't have background in
    (although it isn't too hard).

Data cache - software managed consistency
    This reportedly has been done, but certainly isn't run-of-the-mill.
    DIFFICULTY: needs a skilled compiler expert.

Micro-scheduling parallelism (like CONVEX's ASAP)
    Apparently isn't too hard to develop compiler support for this.
    Hardware is rather expensive, though.
    DIFFICULTY: moderately complex

Vectorizing
    Done in lots of places, but high performance vectorizing compilers
    still seem to be a problem for a lot of companies.
    DIFFICULTY: complex.  The easy things are easy, but milking the
    last few FLOPs out of your vector unit requires compiler experts.

Parallelizing - fine grain, small numbers of processors
    (like Alliant's concurrent hardware)
    Done, eg. by Alliant and Cray
    DIFFICULTY: to be competitive with this sort of hardware, you
    	need real compiler experts on your team.  Harder than vectorizing.

Parallelizing, fine grain, large numbers of processors.
    What Tera is trying to do.
    DIFFICULTY: very complex. Leading edge.
    	Good luck, Tera.

Multiple functional units - heterogenous - VLIW or superscalar
    DIFFICULTY: complex.
    	Complexity of the packing algorithms grows rapidly as
    	you try to go beyond basic block parallelism

Multiple functional units - homogenous - VLIW or superscalar
    DIFFICULTY: moderately complex
    	Easier than the heterogenous case, and the packing algorithms
    	are considerably easier.

Special hardware instructions - scalar
    Taking advantage of simple instructions like abs(), conditional exchange, etc.
    DIFFICULTY:
    	(1) When treated not as a compiler problem, but as a problem of simply
    	    writing libraries to inline optimized machine code, EASY
    	    Requires inlining support.
    	(2) When treated as a compiler problem, eg. compiler should recognize
    	    IF a < 0 THEN a = -a
    	    and automatically convert it to a = ABS(a)
    	    MODERATELY COMPLEX.
    I expect Herman Rubin to comment on this.
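To make the two cases concrete, here is a minimal sketch in C (my own toy code, not from any particular compiler): the first function is the source pattern from case (2), and the second is the kind of branch-free code a pattern-recognizing compiler might emit for it, assuming 32-bit two's-complement ints.

```c
#include <assert.h>

/* Case (2): the source pattern the compiler must recognize,
   IF a < 0 THEN a = -a */
int abs_source(int a)
{
    if (a < 0)
        a = -a;
    return a;
}

/* What a compiler that recognizes the pattern might emit: a
   branch-free absolute value.  Assumes 32-bit int and arithmetic
   right shift of negative values (true on essentially all targets). */
int abs_lowered(int a)
{
    int m = a >> 31;        /* m == -1 when a < 0, else 0 */
    return (a ^ m) - m;     /* complement-and-increment when negative */
}
```

Case (1) skips the recognition problem entirely: the library author writes abs_lowered by hand and the compiler merely inlines the call.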

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

preston@titan.rice.edu (Preston Briggs) (10/12/90)

In article <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:

>Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage of?

I'll comment on a couple of features from the list

>Register file - large (around 128 registers, or more)
>    Most compilers do not get enough benefit from these to justify
>    the extra hardware, or the slowed down register access.

In the proceedings of Sigplan 90, there's a paper about how to chew
lots of registers.

	Improving Register Allocation for Subscripted Variables
	Callahan, Carr, Kennedy

I suggested the subtitle "How to use all of those FP registers"
but nobody was impressed.  Also, there's a limit to how many registers
you need, at least for scientific fortran.  It depends on the speed
of memory and cache, speed of the FPU, and the actual applications.
The idea is that once the FPU is running at full speed,
more registers are wasted.
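A hedged back-of-envelope for that limit (my numbers, not Preston's): a pipelined FPU stays saturated when roughly latency times issue width results are in flight, and each in-flight result occupies a register until it is consumed.

```c
#include <assert.h>

/* Back-of-envelope only, with made-up numbers: to keep a pipelined
   FPU fully busy you need about (pipeline latency) * (FP ops issued
   per cycle) results in flight, each holding a register until used. */
int regs_to_saturate(int fpu_latency_cycles, int fp_issue_per_cycle)
{
    return fpu_latency_cycles * fp_issue_per_cycle;
}
```

With a hypothetical 6-cycle FPU issuing 2 ops per cycle, that is only 12 live values; even allowing generous slack for reuse and address arithmetic, it is hard to justify hundreds of registers on this argument alone.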

>Heterogenous register file
>    Few compilers have been developed that can take advantage of a
>    truly heterogenous register file, one in which, for example, the
>    divide unit writes to registers D1..D2, the add unit writes to
>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>    even though such hardware conceivably has a cycle time advantage
>    over homogenous registers, even on VLIW machines where data can
>    easily be moved to generic registers when necessary.
>    DIFFICULTY: hard. 

At first glance, the problem seems susceptible to coloring.
Perhaps I'm missing something.
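For what coloring means here, a minimal sketch in C (toy code with a made-up interference graph, not a production allocator): values that are simultaneously live interfere, and a greedy pass hands each value the lowest machine register that no interfering neighbor already holds.

```c
#include <assert.h>

#define NVALS 4   /* virtual registers (values) to allocate */
#define NREGS 2   /* machine registers available */

/* Made-up interference graph: inter[i][j] != 0 when values i and j
   are live at the same time and so need different registers. */
static const int inter[NVALS][NVALS] = {
    {0, 1, 0, 0},
    {1, 0, 1, 0},
    {0, 1, 0, 1},
    {0, 0, 1, 0},
};

/* Greedy coloring: assign each value the lowest register not used by
   an already-colored neighbor; -1 means "spill to memory". */
void color(int reg[NVALS])
{
    for (int i = 0; i < NVALS; i++) {
        int used[NREGS] = {0};
        for (int j = 0; j < i; j++)
            if (inter[i][j] && reg[j] >= 0)
                used[reg[j]] = 1;
        reg[i] = -1;
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { reg[i] = r; break; }
    }
}
```

For a heterogeneous file, the inner loop would scan only the registers in the value's class (e.g. D1..D2 for results of the divide unit), which is why the first-glance reaction seems right.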


>Data cache - software managed consistency
>    This reportedly has been done, but certainly isn't run-of-the-mill.
>    DIFFICULTY: needs a skilled compiler expert.

At Rice (and other places), people are considering the perhaps
related problems of trying to manage cache usage for a single
processor.  I'm personally turned on by the topic because of
big performance gains possible and the possible impact on 
architecture.  Questions like: Can we get away with no D-cache?
Perhaps we don't need cache for FP only?
Can we get away with only direct mapped cache?  What does a compiler
do with set associativity?  How can we do prefetches to cache?

Porterfield did a thesis here that talks some about these questions.
Additionally, Callahan and Porterfield (both at Tera) have a paper
in Supercomputing 90 on (perhaps) similar topics.


>Multiple functional units - heterogenous - VLIW or superscalar
>    DIFFICULTY: complex.
>Multiple functional units - homogenous - VLIW or superscalar
>    DIFFICULTY: moderately complex
>    	Easier than the heterogenous case, and the packing algorithms
>    	are considerably easier.

I had never thought to distinguish the two cases and
I'm not sure why the scheduling algorithms should be much different.


>Special hardware instructions - scalar
>    Taking advantage of simple instructions like abs(), conditional 
>    exchange, etc.
>    DIFFICULTY:
>    	(1) When treated not as a compiler problem, but as a problem of simply
>    	    writing libraries to inline optimized machine code, EASY
>    	    Requires inlining support.

For intrinsics, I follow the PL.8 example.
That is, have intermediate language instructions
for ABS etc. so the optimizer can try and hoist them or perhaps strength
reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
and so forth), and let the optimizer get at the guts of each operation.
Some like ABS might be available as basic instructions and so need not
be expanded to a lower level form.  This seems to require that the 
front-end recognize certain calls as intrinsics.  Naturally, this
works fine with Fortran, but compilers for other languages could
easily adopt the same approach; they probably have for C.

This isn't wonderfully extensible, but people have worked on
variations that might be worth exploring.  In particular,
the Experimental Compiler System (ECS) project at IBM hoped to
achieve the same effect in a more extensible fashion.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)

>>Perhaps we can start
>>a discussion that will lead to a list of possible hardware
>>architectural enhancements that a compiler can/cannot take advantage of?
>
>I'll comment on a couple of features from the list

Thanks. Do you have any features of your own to add?


>>Register file - large (around 128 registers, or more)
>>    Most compilers do not get enough benefit from these to justify
>>    the extra hardware, or the slowed down register access.
>
>In the proceedings of Sigplan 90, there's a paper about how to chew
>lots of registers.
>
>	Improving Register Allocation for Subscripted Variables
>	Callahan, Carr, Kennedy
>
>I suggested the subtitle "How to use all of those FP registers"
>but nobody was impressed.  Also, there's a limit to how many registers
>you need, at least for scientific fortran.  It depends on the speed
>of memory and cache, speed of the FPU, and the actual applications.
>The idea is that once the FPU is running at full speed,
>more registers are wasted.

I'm pretty sure that being able to put elements of aggregates into
registers is the next big step - subscripted variables and structure
fields.

However, if it's only just now being published, that probably means
it's leading edge, so a company should not design for this unless it
has guys in its compiler group actually working on it.
    Any comment on this way of looking at it?


>>Heterogenous register file
>>    Few compilers have been developed that can take advantage of a
>>    truly heterogenous register file, one in which, for example, the
>>    divide unit writes to registers D1..D2, the add unit writes to
>>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>>    even though such hardware conceivably has a cycle time advantage
>>    over homogenous registers, even on VLIW machines where data can
>>    easily be moved to generic registers when necessary.
>>    DIFFICULTY: hard. 
>
>At first glance, the problem seems susceptible to coloring.
>Perhaps I'm missing something.

I agree with you --- I really don't understand why heterogenous
register files are so hard to handle.  But homogenous register files
are one of the things that compiler people have gone into rhapsodies
about with respect to RISC.

Here's one example: the Intel 80x86 is basically a heterogenous
register file machine. Specific registers were tied to the outputs and
inputs of specific functional units in the original hardware.
Compiler people hated targeting this architecture, and there are very
few compilers that can produce machine code comparable to hand-coded
assembly on this architecture.

But heterogenous register files are much easier to make fast.  I think
that two things can be done to make them easier for the compilers to
use:
    (1) Provide more than one register at the output of any specific 
    	functional unit. This avoids the need to immediately move 
    	the result away.
    	(if my spelling is bad it's because my keyboard just went flakey)
    (2) Make the register file heterogenous for writes, but homogenous
    	for reads (it's easier to provide read multiporting than
    	write multiporting).




>>Data cache - software managed consistency
>>    This reportedly has been done, but certainly isn't run-of-the-mill.
>>    DIFFICULTY: needs a skilled compiler expert.
>
>At Rice (and other places), people are considering the perhaps
>related problems of trying to manage cache usage for a single
>processor.  I'm personally turned on by the topic because of
>big performance gains possible and the possible impact on 
>architecture.  Questions like: Can we get away with no D-cache?
>Perhaps we don't need cache for FP only?
>Can we get away with only direct mapped cache?  What does a compiler
>do with set associativity?  How can we do prefetches to cache?
>
>Porterfield did a thesis here that talks some about these questions.
>Additionally, Callahan and Porterfield (both at Tera) have a paper
>in Supercomputing 90 on (perhaps) similar topics.

Again - it's current research, so a company should not "bet the farm"
on this sort of thing, unless it has people actively working in the field.
A company should not assume that it can just go out and buy this sort
of compiler technology, or that it can hire a generic CS grad to 
implement this sort of optimization within a year.


>>Multiple functional units - heterogenous - VLIW or superscalar
>>    DIFFICULTY: complex.
>>Multiple functional units - homogenous - VLIW or superscalar
>>    DIFFICULTY: moderately complex
>>    	Easier than the heterogenous case, and the packing algorithms
>>    	are considerably easier.
>
>I had never thought to distinguish the two cases and
>I'm not sure why the scheduling algorithms should be much different.

Neither am I. I just put this up because it seemed to be what Gillies was 
referring to.



>>Special hardware instructions - scalar
>>    Taking advantage of simple instructions like abs(), conditional 
>>    exchange, etc.
>>    DIFFICULTY:
>>    	(1) When treated not as a compiler problem, but as a problem of simply
>>    	    writing libraries to inline optimized machine code, EASY
>>    	    Requires inlining support.
>
>For intrinsics, I follow the PL.8 example.
>That is, have intermediate language instructions
>for ABS etc. so the optimizer can try and hoist them or perhaps strength
>reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
>and so forth), and let the optimizer get at the guts of each operation.
>Some like ABS might be available as basic instructions and so need not
>be expanded to a lower level form.  This seems to require that the 
>front-end recognize certain calls as intrinsics.  Naturally, this
>works fine with Fortran, but compilers for other languages could
>easily adopt the same approach.  Probably have for C.
>
>This isn't wonderfully extensible, but people have worked on
>variations that might be worth exploring.  In particular,
>the Experimental Compiler System (ECS) project at IBM hoped to
>achieve the same effect in a more extensible fashion.

I'm a fan of GNU EMACS style inlining.  I don't think that your
intermediate language should even attempt to represent all of the
special hardware instructions that are likely to be useful - and it is
very frustrating when you have the operation that nobody thought of.
PS. comp.compilers has been talking about this recently.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

spot@TR4.GP.CS.CMU.EDU (Scott Draves) (10/12/90)

|> >>Register file - large (around 128 registers, or more)
|> >>    Most compilers do not get enough benefit from these to justify
|> >>    the extra hardware, or the slowed down register access.
|> >
|> >In the proceedings of Sigplan 90, there's a paper about how to chew
|> >lots of registers.
|> >
|> >	Improving Register Allocation for Subscripted Variables
|> >	Callahan, Carr, Kennedy
|> >
|> >I suggested the subtitle "How to use all of those FP registers"
|> >but nobody was impressed.  Also, there's a limit to how many registers
|> >you need, at least for scientific fortran.  It depends on the speed
|> >of memory and cache, speed of the FPU, and the actual applications.
|> >The idea is that once the FPU is running at full speed,
|> >more registers are wasted.
|> 

shouldn't loop unrolling burn lots of registers also?  when unrolling,
which ceiling will you hit first, the number of registers, or the size
of the I-cache?


			Consume
Scott Draves		Be Silent
spot@cs.cmu.edu		Die

preston@titan.rice.edu (Preston Briggs) (10/12/90)

In article <1990Oct12.042251.18884@cs.cmu.edu> spot@TR4.GP.CS.CMU.EDU (Scott Draves) writes:

>|> >>Register file - large (around 128 registers, or more)
>|> >>    Most compilers do not get enough benefit from these to justify
>|> >>    the extra hardware, or the slowed down register access.
>|> >
>|> >In the proceedings of Sigplan 90, there's a paper about how to chew
>|> >lots of registers.
>|> >
>|> >	Improving Register Allocation for Subscripted Variables
>|> >	Callahan, Carr, Kennedy

>shouldn't loop unrolling burn lots of registers also?  when unrolling,
>which ceiling will you hit first, the number of registers, or the size
>of the I-cache?

I shouldn't have been so cavalier when I said "chew lots of registers".
I meant "use profitably".

Simple unrolling of the inner loop offers little advantage to
a fabulous optimizing compiler.  If the optimizer can't do software
pipelining, then unrolling (if performed correctly) can provide larger
chucks of code to schedule across.  However, if the loop contains
recurrences, then unrolling can't help much.

oh yeah.  it can save some conditional branches.  whoopee

The paper I mentioned above is more aggressive (and more profitable).
They advocate using dependence analysis to detect reuse of array elements.
Where there's consistent reuse, they can replace memory references
with register references.  They can also detect opportunities to
unroll *outer* loops and jam the multiple inner loop bodies.
This creates more opportunities for holding reused values in registers
and also helps solve the problem of scheduling loops with recurrences.

On machines like the MIPS and Sparc and 860, they can get factors
of 3 improvement using source-source transformations and the stock
compiler.
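The transformation can be sketched in C on a made-up kernel (my example, not one from the paper): dependence analysis shows that x[j] is reused across outer iterations and each running sum is reused across inner iterations, so unrolling the *outer* loop by 2, jamming the bodies, and scalar-replacing the sums turns memory references into register references.

```c
#include <assert.h>

#define N 8   /* assumed even, so the unrolled version needs no cleanup */

/* Before: y[i] and x[j] are memory references in the inner loop. */
void mv_naive(double y[N], double A[N][N], double x[N])
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += A[i][j] * x[j];
    }
}

/* After unroll-and-jam: two rows share one pass over x, and the
   running sums live in the scalars s0/s1, i.e. in registers. */
void mv_unroll_jam(double y[N], double A[N][N], double x[N])
{
    for (int i = 0; i < N; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < N; j++) {
            double xj = x[j];          /* loaded once, used twice */
            s0 += A[i][j] * xj;
            s1 += A[i + 1][j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
}
```

Per row the additions happen in the same order, so the two versions agree exactly; the win is halving the traffic to x and keeping the sums out of memory.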

These same ideas provide the basis for managing the D-cache.

Scott also asked about blowing out the I-cache.

It's possible; massive unroll-and-jamming can consume lots of code
space.  However, the usual limit is the number of registers
or the speed of the FPU.  All these transformations are intended
to avoid computations being memory-bound.  Once you're compute-bound
(floating-point unit is 100% busy), there's nothing else you can do.
This is why I question the need for 100's of registers.

READ this paper.  It's not optional.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

preston@titan.rice.edu (Preston Briggs) (10/12/90)

In article <AGLEW.90Oct11222801@treflan.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>>>Perhaps we can start
>>>a discussion that will lead to a list of possible hardware
>>>architectural enhancements that a compiler can/cannot take advantage of?

>However, if it's only just now being published, that probably means
>it's leading edge, so a company should not design for this unless it
>has guys in its compiler group actually working on it.
>    Any comment on this way of looking at it?

Tough call.  It's hard to get people to look seriously at
imaginary machines, especially when there are so many real
machines (with real problems:-).

>>>Heterogenous register file
>>>    Few compilers have been developed that can take advantage of a
>>>    truly heterogenous register file, one in which, for example, the
>>>    divide unit writes to registers D1..D2, the add unit writes to
>>>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>>>    even though such hardware conceivably has a cycle time advantage
>>>    over homogenous registers, even on VLIW machines where data can
>>>    easily be moved to generic registers when necessary.
>>>    DIFFICULTY: hard. 
>>
>>At first glance, the problem seems susceptible to coloring.

>Here's one example: the Intel 80x86 is basically a heterogenous
>register file machine. Specific registers were tied to the outputs and
>inputs of specific functional units in the original hardware.
>Compiler people hated targetting this architecture, and there are very
>few compilers that can produce machine code comparable to hand-coded
>assembly on this architecture.

>
>But heterogenous register files are much easier to make fast.  I think
>that two things can be done to make them easier for the compilers to
>use:
>    (1) Provide more than one register at the output of any specific 
>    	functional unit. This avoids the need to immediately move 
>    	the result away.
>    	(if my spelling is bad it's because my keyboard just went flakey)
>    (2) Make the register file heterogenous for writes, but homogenous
>    	for reads (it's easier to provide read multiporting than
>    	write multiporting).

I'm not sure how to "fix" the x86 family.  Your ideas seem good.
Additionally, a larger total register file helps avoid ugly spill
code.

I expect, on the x86, humans do a better job than coloring allocators
because humans can rearrange the code and a coloring allocator
normally sticks with the schedule it's handed.  Careful intermediate
code generation in the front end can help, but it seems a difficult problem.


>>>Special hardware instructions - scalar
>>>    Taking advantage of simple instructions like abs(), conditional 
>>>    exchange, etc.
>>>    DIFFICULTY:
>>>    	(1) When treated not as a compiler problem, but as a problem of simply
>>>    	    writing libraries to inline optimized machine code, EASY
>>>    	    Requires inlining support.

>>That is, have intermediate language instructions
>>for ABS etc. so the optimizer can try and hoist them or perhaps strength
>>reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
>>and so forth), and let the optimizer get at the guts of each operation.
>>Some like ABS might be available as basic instructions and so need not
>>be expanded to a lower level form.  This seems to require that the 
>>front-end recognize certain calls as intrinsics.

>I'm a fan of GNU EMACS style inlining.  I don't think that your
>intermediate language should even attempt to represent all of the
>special hardware instructions that are likely to be useful

I was really advocating having (roughly) an IL instruction for every
source language intrinsic.  That is, I didn't have an answer for your
question, so I made up my own.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

golds@fjcnet.GOV (Rich Goldschmidt) (10/13/90)

Maybe this is naive or too futuristic, but is anyone working towards
methods for automatically generating a compiler based on the architecture
design?  It would seem that even before there is silicon, there is
enough info about a new CPU in the CAD system used for layout that the
features of special interest for generating a compiler are known.  To
the extent that generating a compiler is rule based (those tasks a
good CS grad can do?), why hasn't it been automated?  Or are there people
working on this now?  Chip designers might even take compilers into 
consideration in their designs :-).



-- 
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
Commercialization of space is the best way to escape the zero-sum game.

moss@cs.umass.edu (Eliot Moss) (10/14/90)

With respect to large register files, David Wall at DECWRL did studies that
indicated that inter-procedural (really global) register allocation at link
time can profitably use a large number of registers and win over register
window and similar architectures. He considered register files up to 64 or so
registers. It did appear that more than that was not likely to yield
significant additional improvement (as I recall). Here are the refs:

David W. Wall, "Global Register Allocation at Link Time", WRL Rsch Rpt 86/3,
(see also Proc SIGPLAN '86 Symp on Compiler Construction, SIGPLAN Notices 21,
7, 264-275).

David W. Wall, Michael L. Powell, "The Mahler Experience: Using an
Intermediate Language as the Machine Description", WRL Rsch Rpt 87/1, (see
also Proc Second Symp on Arch Support for Prog Lang and Operating Systems,
1987).

David W. Wall, "Register Windows vs. Register Allocation", WRL Rsch Rpt 87/5,
(see also Proc SIGPLAN '88 Conf on Prog Lang Design and Impl).

David W. Wall, "Link-Time Code Modification", WRL Rsch Rpt 89/17.

Note that DEC WRL Research Reports can be obtained by sending mail to an
automatic response system; to get an index and instructions, I believe the
protocol is to send a piece of mail to wrl-techreports@wrl.dec.com with the
body being:

	send index

Enjoy!							Eliot
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

pcg@cs.aber.ac.uk (Piercarlo Grandi) (10/15/90)

On 12 Oct 90 03:28:01 GMT, aglew@crhc.uiuc.edu (Andy Glew) said:

	[ ... some comments on large numbers of registers being useful
	... ]

Bah, as usual. If you use them as static cache, yes. But isn't a dynamic
cache as good and less trouble? Yes, if you don't use a load-store
architecture.

In a load-store architecture to address a line in the cache takes an
extra load or store instruction, and potentially a delay slot;
addressing a line in the register bank takes just a wide register number
field in the current instruction.  Note that some compilers are indeed
starting to treat cache lines as registers, by scheduling code to have
optimal cache reference patterns.

So, given that one wants an intermediate cache between the input and
output ports of the CPU functional units, and the main memory, we could
have three alternatives:

1) a dynamic cache addressed with aliases of main memory addresses.

2) a static cache in a separate, much smaller, address space.

3) a cache with multiple stacks, only top of stacks have addresses.

If your architecture can address "efficiently" the main memory address
space, 1) is better than 2); if it cannot, 2) is better than 1); in all
cases IMNHO 3) is better than either 1) or 2), because it is dynamic
just like 1) and does not require long addresses just like 2).

aglew> I agree with you --- I really don't understand why heterogenous
aglew> register files are so hard to handle.  But homogenous register
aglew> files are one thing that compiler people have gone into
aglew> rhapsodies wrt. RISC about.

That's actually not difficult to comprehend, IMNHO, as soon as you
realize that registers, as currently (mis)understood, actually perform
two completely different functions (at least -- there are others):

1) inputs and outputs to functional units ("accumulators")

2) statically managed cache ("temporaries").

The former function of registers means that they are essentially entry
and exit ports of a queueing network. In order to generate efficient
code for a queueing network you must analyze flows into it, or something
equivalent. This seems harder than just considering problem 2), which is
already hard enough.

aglew> Here's one example: the Intel 80x86 is basically a heterogenous
aglew> register file machine. Specific registers were tied to the
aglew> outputs and inputs of specific functional units in the original
aglew> hardware.  Compiler people hated targetting this architecture,
aglew> and there are very few compilers that can produce machine code
aglew> comparable to hand-coded assembly on this architecture.

Oh yes. But this is simply because current compiler technology is mostly
based on believing that registers are there only to be a statically
managed cache. Thus ridiculous things like graph coloring, which
minimizes the *static* costs, e.g. code size, not run time, unless there
are so many registers that essentially all values, including those that
are dynamically important, have a chance of ending up in a register.
There are plenty of research papers that show that

1) the number of dynamically important values is very small, for a
single functional unit.

2) large numbers of registers are useful under graph coloring and
on machines that have a huge gap between register file and cache.

The two sets of results can only be reconciled by observing that:

* vector/superscalar etc... have in effect multiple functional units

* graph coloring wastes a large amount of registers to dynamically
unimportant values, and load/store architectures have a huge gap between
register file and cache.

Not surprising, eh?

I reckon that fully specialized registers (e.g. having input-only and
output-only registers that map directly onto functional unit ports) are
best, and that caching temporaries ought not to be done with registers.

I would like a more data-flow like architecture, in which the input and
output ports of the functional units (and the relative delays maybe) are
directly exposed, and separate.

Caching, IMNHO, ought to be performed using multiple cached stacks, or
anyhow using dynamic caching (e.g. like in the i386/i486, where the
onchip cache is almost a large associative register bank).

Naturally exposing the functional units and their input and output ports
(hints of VLIW here) means that the number of architecturally visible
ports varies with the number of functional units in different
implementations. This is a problem anyhow; one can solve it in several
ways, e.g.:

0) recompiling for different implementations
1) lengthening of the instruction word (VLIW)
2) register renaming (RS/6000)
3) dynamic queuing (MU dataflow)

You may argue that 0) is not a solution; but consider: it is probably
the best way to take advantage of the specificities of a particular
implementation. 1) is a slightly easier way of doing 0). 2) ensures
binary portability, but I don't see how it could work over a large range
of functional unit numbers. 3) is guaranteed to exploit any number of
functional units nearly optimally, but requires sophisticated hardware.

aglew> But heterogenous register files are much easier to make fast.

Because you do not have to put logic in that does the mapping from the
register file as static cache to the input-output ports of the
functional units, if you choose one of solutions 0-2) above. Arguably
solution 3) is so flexible that its potential complexity/speed
disadvantage can be offset by adding extra functional units, even if
there are hints that the inherent parallelism in many applications does
not require a lot of functional units (my rule of thumb is '4').
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

des@frogland.inmos.co.uk (David Shepherd) (10/16/90)

In article <336@fjcp60.GOV> golds@fjcnet.GOV (Rich Goldschmidt) writes:
>Chip designers might even take compilers into 
>consideration in their designs :-).

they already do! 4 years ago when the IMS T800 was being designed, out of
the 8 people involved in the fpu design 1 was writing standard function
libraries and another was writing the compiler. This enabled the
instructions to be tuned to take account of the code sequences that the
compiler produced/needed. high performance now depends on both hardware
and software technology so you shouldn't design one without considering
the other!


--------------------------------------------------------------------------
david shepherd: des@inmos.co.uk or des@inmos.com    tel: 0454-616616 x 529
                inmos ltd, 1000 aztec west, almondsbury, bristol, bs12 4sq
                phevbfvgl xvyyrq gur png

cik@l.cc.purdue.edu (Herman Rubin) (10/17/90)

In article <11922@ganymede.inmos.co.uk>, des@frogland.inmos.co.uk (David Shepherd) writes:
> In article <336@fjcp60.GOV> golds@fjcnet.GOV (Rich Goldschmidt) writes:

	[I have included the whole of Goldschmidt's article, so I
	can comment on that as well in this posting.]

|   Maybe this is naive or too futuristic, but is anyone working towards
|   methods for automatically generating a compiler based on the architecture
|   design?  It would seem that even before there is silicon, there is
|   enough info about a new CPU in the CAD system used for layout that the
|   features of special interest for generating a compiler are known.  To
|   the extent that generating a compiler is rule based (those tasks a
|   good CS grad can do?), why hasn't it been automated?  Or are there people
|   working on this now?  Chip designers might even take compilers into 
|   consideration in their designs :-).

> they already do! 4 years ago when the IMS T800 was being designed, out of
> the 8 people involved in the fpu design 1 was writing standard function
> libraries and another was writing the compiler. This enabled the
> instructions to be tuned to take account of the code sequences that the
> compiler produced/needed. high performance now depends on both hardware
> and software technology so you shouldn't design one without considering
> the other!

If we limit languages and compilers to what is taught to CS graduate
students now, we can only freeze them at the present sorry state.  There
is so much more that a thinking human can do, and do better, faster, and
more clearly, than is provided for in the HLLs.  Much of this was even on
older computers.

The same holds for hardware design.  We are already in the vicious cycle
where constructs are not used because they are not in the language, and
they are not put in the hardware because they are not used, and so it goes.

It seems that people who can see the deficiencies in design even without
using it are too rare.  As has been stated,

	The reasonable man adjusts himself to his environment.  It is
	the unreasonable man who insists on adjusting the environment
	to himself.  Therefore, all progress depends on the 
	unreasonable man.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

larus@primost.cs.wisc.edu (James Larus) (10/17/90)

Having implemented a few `state of the art' compiler algorithms (and a few
beyond the state of the art), I can say that most competent graduate
students could code them, but most companies wouldn't implement them.  The
key problem is that most papers present enough details, but don't present
enough evidence that a technique is effective.  A rational project manager
would not commit the resources to try out these algorithms.  There are
exceptions, of course, but compiler papers frequently use towers of hanoi
or puzzle-sized benchmarks.

/Jim
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

scl@unislc.uucp (Sean Landis) (10/17/90)

In article <336@fjcp60.GOV> golds@fjcnet.GOV (Rich Goldschmidt) writes:
>Maybe this is naive or too futuristic, but is anyone working towards
>methods for automatically generating a compiler based on the architecture
>design?
> ...some text deleted...
>-- 
>Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov

There is a group at the University of Utah that was able (to some extent) to
generate compiler back-ends from an architectural description language (ADL).
Really, if you use a well-defined intermediate form, this is all you should
require to take advantage of a particular architecture.  I don't know the
details, but it all seemed pretty slick to me.  It seems to me (I'm not a
compiler expert) that a major problem would be the definition of the
ADL.  It must be robust enough to describe the optimizable intricacies of
all the architectures you might be interested in, while also being stable
enough to be usable over time.

Sean

golds@fjcnet.GOV (Rich Goldschmidt) (10/18/90)

In article <2649@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> If we limit languages and compilers to what is taught to CS graduate
> students now, we can only freeze them at the present sorry state.  There
> Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
> Phone: (317)494-6054
> hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

I was not trying to suggest that compiler technology should be frozen.  I
thought I was suggesting an area of research for compiler designers.
The point of automating compiler generation is not necessarily to produce
a final product, but to couple compiler development more tightly with
chip development, and avoid the kind of situation we see with the i860
where there is no compiler which can really take advantage of the chip
long after its introduction.  By using an automatically generated compiler
early in the chip design process, it might become clear very quickly that
certain features will require substantial effort to take advantage of, and
designers could either plan for that effort far in advance or decide that it
might be more productive to consider alternate designs.  I would think this
would be a
very fertile area for research by chip manufacturers, since the sooner they
can get a compiler available the better off they are in selling their chips.
They will still need experts to enhance automatically generated compilers,
but at least they have a first cut more quickly, and get better feedback
early in the design process.  I am surprised by the lack of interest in this 
concept among this group, but maybe I don't know enough to see the flaws in 
my suggestions.

-- 
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
Commercialization of space is the best way to escape the zero-sum game.

alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (10/19/90)

I hesitate to get in the middle of this since I'm not an authority on either
architecture or compilers (I've written a couple very small compilers, but
note the .signature), but...

In article <339@fjcp60.GOV>, golds@fjcnet.GOV (Rich Goldschmidt) writes:
> I was not trying to suggest that compiler technology should be frozen.  I
> thought I was suggesting an area of research for compiler designers.

And it seems like an interesting area.  Some compiler development (a la lex
and yacc) is already fairly automated.  It seems that you are simply suggesting
more of the same.  And I actually find it hard to believe that such doesn't
exist already, although it's probably proprietary.

> The point of automating compiler generation is not necessarily to produce
> a final product, but to couple compiler development more tightly with
> chip development, and avoid the kind of situation we see with the i860
> where there is no compiler which can really take advantage of the chip
> long after its introduction.

But this is a double-edged sword.  It seems that what one wants to do is get
the best (fastest, cheapest, ...) hardware-software combination.  It's not
good to make chips for which there almost certainly can be no compilers in the
near future, but I would hate to constrain chip design totally to compiler
technology.  It's chips like the i860 that force advances in compilers.
-- 

-- Alan			       # My aptitude test in high school suggested that
   ..!ames!elroy!alan	       # I should become a forest ranger.  Take my
   alan@elroy.jpl.nasa.gov     # opinions on computers with a grain of salt.

cik@l.cc.purdue.edu (Herman Rubin) (10/21/90)

In article <339@fjcp60.GOV>, golds@fjcnet.GOV (Rich Goldschmidt) writes:
> In article <2649@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> > If we limit languages and compilers to what is taught to CS graduate
> > students now, we can only freeze them at the present sorry state.  There
  
> I was not trying to suggest that compiler technology should be frozen.  I
> thought I was suggesting an area of research for compiler designers.
> The point of automating compiler generation is not necessarily to produce
> a final product, but to couple compiler development more tightly with
> chip development, and avoid the kind of situation we see with the i860
> where there is no compiler which can really take advantage of the chip
> long after its introduction.  By using an automatically generated compiler
> early in the chip design process, it might be clear very quickly that certain 
> features will require substantial effort to take advantage of, and either
> plan for that effort far in advance, or decide that it might be more 
> productive to consider alternate designs.  I would think this would be a
> very fertile area for research by chip manufacturers, since the sooner they
> can get a compiler available the better off they are in selling their chips.
> They will still need experts to enhance automaticaly generated compilers,
> but at least they have a first cut more quickly, and get better feedback
> early in the design process.  I am surprised by the lack of interest in this 
> concept among this group, but maybe I don't know enough to see the flaws in 
> my suggestions.

This is still letting compilers drive the architecture.  There are many useful
constructs not in the present languages or architectures.  Instead of trying to
limit the architecture to what an automaton can use, we should be trying to see
what those who think differently can find ways to use.

The emphasis in this group should be on architectures which can do things
efficiently, whether or not there are languages or compilers which can make
use of the power of those architectures.  I doubt if we will have in the
foreseeable future a language or compiler which can consider what human
ingenuity is capable of.  This does not mean that the constructs in the
languages (NOT the compilers) may not suggest architecture enhancements,
but that should not be the only source of such suggestions.  But to deliberately
restruct an architecture to what a language provides is at best an insult to
all thinking people.  To limit it to what can be handled by an automated
compiler is, IMHO, criminal.

Remember that a computer program is a fast sub-imbecile.  This includes 
compilers.  An interactive compiler MAY allow the use of human ingenuity,
but even this may be difficult.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

chris@mimsy.umd.edu (Chris Torek) (10/22/90)

In article <2661@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin)
hyperbolizes (is that a word? :-) ):
>... to deliberately restruct
(`restrict', I think, not `restructure', not that it matters too much
to my particular reply)
>an architecture to what a language provides is at best an insult to
>all thinking people.

Hardly.  If someone builds a machine designed to run COBOL programs fast,
I may not want to *use* it, but it is in no way an *insult*.  If company X
thinks there is a market for such machines, and can find investors, why
should I be insulted?

Now, if Herman Rubin thinks a machine with instructions Y and Z can be
built and would be such a great leap forward, I suggest he found his
own company.  There are plenty of investors.  Instead of griping about
the ones investing in MIPS and SPARC, why not find your own?
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/22/90)

In article <27089@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:

| Hardly.  If someone builds a machine designed to run COBOL programs fast,
| I may not want to *use* it, but it is in no way an *insult*.  If company X
| thinks there is a market for such machines, and can find investors, why
| should I be insulted?

  Honeywell offered the extended instruction set (EIS box) for the
6000/DPS line, and much of the new instruction set was intended to allow
fast implementation of COBOL, such as direct hardware BCD arithmetic,
string copy and compare, etc.

  Because the address space was limited and instruction processing was
costly, this hyper-CISC was a good decision at the time. It allowed
(from memory) up to 3x speedup of COBOL programs, with about 5-10% cost
increment.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

don@zl2tnm.gp.govt.nz (Don Stokes) (10/24/90)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:

>   Honeywell offered the extended instruction set (EIS box) for the
> 6000/DPS line, and much of the new instruction set was intended to allow
> fast implementation of COBOL, such as direct hardware BCD arithmetic,
> string copy and compare, etc.
> 
>   Because the address space was limited and instruction processing was
> costly, this hyper-CISC was a good decision at the time. It allowed
> (from memory) up to 3x speedup of COBOL programs, with about 5-10% cost
> increment.

DEC offered the Commercial Instruction Set for several of the later PDP-11
models (e.g. the 11/44) for much the same reasons.



Don Stokes, ZL2TNM  /  /                            Home: don@zl2tnm.gp.govt.nz
Systems Programmer /GP/ Government Printing Office  Work:        don@gp.govt.nz
__________________/  /__Wellington, New Zealand_____or:_PSI%(5301)47000028::DON

seanf@sco.COM (Sean Fagan) (10/26/90)

In article <2661@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>This is still letting compilers drive the architecture.  

For most, if not all, people, this is fine.

Most of the con side of this argument takes an existing architecture, and
says, 'Well, if it had <x>, we could do <y> so much faster!'  And that is a
valid point, and, possibly, <x> can be added to a later revision of the
architecture.

On the other hand, if you put a complex operation into the architecture in
the first place, one which ends up slowing down the machine as a whole, you
are *stuck* with that!

Consider, as always, a VAX versus a MIPS-based system.  To get comparable
performance (CPU-wise, that is) on the VAX system, you have to go to the
high end of the series, because of the baggage the architecture has, which
ends up slowing down the entire system.

When designing a processor, it is not necessarily foolish to let the
compiler influence your decisions.  I will, however, say that just sticking
to the here-and-now, and not considering the future, *is* a foolish thing to
do.

>There are many useful
>constructs not in the present languages or architectures.  Instead of trying to
>limit the architecture to what an automaton can use, we should be trying to see
>what those who think differently can find ways to use.

Which is going to be faster:  a machine which has your whiz-bang
instruction, but a clock cycle of 10MHz, and takes 32 cycles to execute the
instruction, or a system which has a clock cycle of 60MHz, and takes a total
of 75 cycles to execute an instruction stream to do the equivalent
operation(s)?

-- 
-----------------+
Sean Eric Fagan  | "*Never* knock on Death's door:  ring the bell and 
seanf@sco.COM    |   run away!  Death hates that!"
uunet!sco!seanf  |     -- Dr. Mike Stratford (Matt Frewer, "Doctor, Doctor")
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

cik@l.cc.purdue.edu (Herman Rubin) (10/31/90)

In article <8424@scolex.sco.COM>, seanf@sco.COM (Sean Fagan) writes:
> 
> In article <2661@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
> >This is still letting compilers drive the architecture.  
> 
> For most, if not all, people, this is fine.

This is like the argument that we should not teach advanced courses.  Why 
can one not get a terminal with a character set which is easily user 
programmable and downloadable?

Most people are content to drive cars which they cannot control well, also.

> Most of the con side of this argument takes an existing architecture, and
> says, 'Well, if it had <x>, we could do <y> so much faster!'  And that is a
> valid point, and, possibly, <x> can be added to a later revision of the
> architecture.
> 
> On the other hand, if you put a complex operation into the architecture in
> the first place, one which ends up slowing down the machine as a whole, you
> are *stuck* with that!

In many cases, the operation is very simple, not complex.  Do not assume that
the current operations on the computers are even an attempt at the logically
basic ones.  This was the case in the early computers, when those designing
them had a good understanding of mathematics.  It is not the case now.

Also, it may very well require more than a revision of the architecture to
add a feature.  If the architecture is designed with the flexibility to 
include that feature, yes.  If not, it may be well-nigh impossible.

> Consider, as always, a VAX versus a MIPS-based system.  To get comparable
> performance (CPU-wise, that is) on the VAX system, you have to go to the
> high end of the series, because of the baggage the architecture has, which
> ends up slowing down the entire system.

Since it seems impossible to get enough information to take into account the
hardware of the VAX, this may be the case.  But the problem is not with the
VAX design baggage, but with the lack of anticipation of the problems.

> When designing a processor, it is not necessarily foolish to let the
> compiler influence your decisions.  I will, however, say that just sticking
> to the here-and-now, and not considering the future, *is* a foolish thing to
> do.
> 
> >There are many useful
> >constructs not in the present languages or architectures.  Instead of trying to
> >limit the architecture to what an automaton can use, we should be trying to see
> >what those who think differently can find ways to use.
> 
> Which is going to be faster:  a machine which has your whiz-bang
> instruction, but a clock cycle of 10MHz, and takes 32 cycles to execute the
> instruction, or a system which has a clock cycle of 60MHz, and takes a total
> of 75 cycles to execute an instruction stream to do the equivalent
> operation(s)?

I doubt that the difference is of that type.  Many machines have a separate
chip for floating-point calculations; if necessary, put complicated calcu-
lations on that chip, or even another one.

But my complaint is about operations which are very simple conceptually,
and even simple at the nanocode or picocode level, but which are quite
difficult to do in software.  Also, operations which would be in the 
hardware if it were not for the lack of provision in software.  How 
much harder would it be to have simultaneous quotient and remainder in
the division hardware?  How hard would it be to have a buffer-empty
user-processed exception?  These exceptions occur on cache or other
memory faults, not subject to user control.

In some cases, the difference would not be between 32 and 75 cycles, but
between 2 and 75.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) (11/01/90)

cik@l.cc.purdue.edu (Herman Rubin) writes
|> In many cases, the operation is very simple, not complex.  Do not
|> assume that
|> the current operations on the computers are even an attempt at the logically
|> basic ones.  This was the case in the early computers, when those designing
|> them had a good understanding of mathematics.  It is not the case now.

what a completely bigoted and infantile statement.  your true colors are
showing.
so tell us, what are the "logically basic operations"?  probably those
you wish you had the last time you wrote some code, right?

|> 
|> [ a few things herman wishes were in hardware ]
|> 
|> In some cases, the difference would not be between 32 and 75 cycles, but
|> between 2 and 75.

name it.  and then show how it can be implemented on a modern processor.
and then show that those transistors couldn't have been better spent on
something else.  keep in mind the application mix that uProcs are
actually used for.



			Consume
Scott Draves		Be Silent
spot@cs.cmu.edu		Die

rminnich@super.org (Ronald G Minnich) (11/03/90)

In article <1990Oct31.203932.26325@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:

|> name it.  and then show how it can be implemented on a modern processor.
|> and then show that those transistors couldn't have been better spent on
|> something else.  keep in mind the application mix that uProcs are
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> actually used for.
^^^^^^^^^^^^^^^^^^^^^

I understand your point, and this is an old argument, but it is beginning to 
sound parlously close to circular logic: 
A) I won't implement x because no one currently uses it
U) I can't use this uP, since it can't do x, better go use something else
   (two years later)
A) See what i mean? nobody uses x!

ron

-- 
"Socialism is the road from capitalism to communism, but we never promised to 
                 feed you on the way!"-- old Russian saying
"Socialism is the torturous road from capitalism to 
                  capitalism" -- new Russian saying (Wash. Post 9/16)

cik@l.cc.purdue.edu (Herman Rubin) (11/06/90)

In article <1990Oct31.203932.26325@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
> cik@l.cc.purdue.edu (Herman Rubin) writes
> |> In many cases, the operation is very simple, not complex.  Do not
> |> assume that
> |> the current operations on the computers are even an attempt at the logically
> |> basic ones.  This was the case in the early computers, when those designing
> |> them had a good understanding of mathematics.  It is not the case now.
> 
> what a completely bigoted and infantile statement.  your true colors are
> showing.
> so tell us, what are the "logically basic operations"?  probably those
> you wish you had the last time you wrote some code, right?
> 
> |> 
> |> [ a few things herman wishes were in hardware ]
> |> 
> |> In some cases, the difference would not be between 32 and 75 cycles, but
> |> between 2 and 75.
> 
> name it.  and then show how it can be implemented on a modern processor.
> and then show that those transistors couldn't have been better spent on
> something else.  keep in mind the application mix that uProcs are
> actually used for.

Where did I confine myself to microprocessors?  And what are microprocessors
not used for?  Bob Silverman was complaining about the lack of 32x32 -> 64
multiplication; he was using microprocessors as part of a coordinated effort
to factor large numbers.  Today's microprocessors can outperform the mainframes
of not too many years ago.

A modern processor is what the brightest designers and engineers of today can
put together.  Our students and faculty are expected to do most of their
computer work on our departmental machines as far as possible; except for
an old mini, now being replaced, these machines are all microprocessor-based.
Some of these involve computer simulations which could not be afforded as
recently as 20 years ago.

As to whether the transistors could not be used better, another chip would
not be all that expensive.  In fact, it would be cheap.

Now for some instructions.  There are algorithms, doing all their
computations on a few bits (usually < 10), which use these instructions.
As these are algorithms for generating non-uniform random variables, they
are likely to be used thousands or millions of times.

	Find the distance to the next 1 in a bit stream.  Place the stream
	pointer past that 1.  It is necessary to take into account that the
	stream could be exhausted before a 1 is found.

	Current solution, and even this is generally not available:

		Keep as many bits as possible for the subsequent operations
		in registers.

		Keep needed counts and pointers in registers.

		Find next 1, if there is one.
		Do appropriate shifts and count updates.
		Put count in desired register

		If there is not one, store number of 0's
reload:		Check if there are items left in the memory buffer.
		If not, refill and update pointers
		Get information into registers and update pointers.
		Find next 1, if there is one
		If not, augment 0 count by number of 0's
			and go to reload
		Do appropriate shifts and count updates
		Put count plus number of 0's in desired register

Now I can do better on some vector machines.  But even there the time 
needed to produce a vector of these is 3 times the "hardware" time.

A problem which may even be worse in the same context is to do a conditional
branch on a bit, and move the pointer.

Another wanted instruction on vector machines is to replace the 0's in a 
bit stream with bits from another stream, i.e., the k-th 0 is to be replaced
by the k-th bit.  The estimated vector time on a CYBER 205 of the best method
I have come up with is about 40 times what the string operation would take
if it were in hardware, and would take 4 startups instead of 1.

Other things, such as reading buffers, etc., have already been mentioned.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

njk@diku.dk (Niels J|rgen Kruse) (11/07/90)

cik@l.cc.purdue.edu (Herman Rubin) writes:
>Now for some instructions.  There are algorithms, doing all their computations
>on a few bits (usually < 10) which use these instructions.  As these are
>algorithms for generating non-uniform random variables, they are likely to
>be used thousands or millions of times.
>       Find the distance to the next 1 in a bit stream.  Place the stream
>       pointer past that 1.  It is necessary to take into account that the
>       stream could be exhausted before a 1 is found.

It seems that you are generating multiple precision variables.
Does that make a large difference?  When generating 31 bit
exponentially distributed random numbers at least, i don't
see much need for this wonder instruction.

Some data.  I have measured the cost of the various generators
of my homegrown random number generation package on a SS1:

Cost / random number =                    1.59 uSec, Val0 = 0xe29e9cba
Cost / Exp. dist. random number =         6.77 uSec, Val1 = 0xe97afa5
Cost / random float =                     3.78 uSec, Val2 = -67.4
Cost / exp. dist. random float =          10.1 uSec, Val3 = 122.6
Cost / random double =                    6.64 uSec, Val4 = -49.4384703630
Cost / random remainder by         7 =    8.73 uSec, Val5 = 0x0
Cost / random remainder by      7777 =    5.58 uSec, Val6 = 0x35f
Cost / random remainder by     77777 =    12.1 uSec, Val7 = 0x18e6f
Cost / random remainder by  77777777 =    7.96 uSec, Val8 = 0x37d8808
Cost / random remainder by 777777777 =    7.83 uSec, Val9 = 0x2d2d4019

Timing error is on the order of 5%.
Each time is measured over a loop unrolled 20 times, which
generates a total of 1000000 numbers.  The hexadecimal values
printed are the xor of the numbers produced.  The fp values
printed are the sum of the numbers produced, with 20 * the
mean subtracted for every 20 numbers.  These values are printed
to avoid overeager optimization and so that it can be checked
that they remain unchanged, when running the code on different
architectures (gives a warm fuzzy feeling).
It shouldn't affect timing very much.

The double generator is a macro, which take 2 equidistributed
32 bit numbers out of a buffer, mask off the high bit, convert
to double, scale and add.  Simple, no?  The generator of
exponentially distributed 31 bit numbers with mean 2^26 is a
macro which load a number with the appropriate distribution from
a buffer with 10 numbers.  The refill algorithm uses random
minimization, which your proposed instruction is aimed at, and
consumes an average of 2 equidistributed 32 bit numbers per
output number, just like the double generator.  The algorithm
is quite a lot more complex, however.  Yet, it is only slightly
slower.

I think integer <-> fp conversion or fp scaling by powers of 2
is *much* more in need of speeding up on this machine.

The random remainder generator is a macro which tests how wide
the divisor is and selects a routine to call accordingly.  If
the divisor fits in 13 bits, integer division is used after
truncation to 15 bits, else if it fits in 29 bits, long
division is used after truncation to 31 bits, otherwise
unsigned long division is used.  Using gcc (like here), the
13 and 29 bit routines are inlined.

This split was chosen to speed up division on 16 bit machines
and machines that don't support unsigned division efficiently.
(It seems to help the sparc a lot too, which i hadn't expected).
With a constant divisor (like here), the testing and branching
optimize away and only the division remains. (There is a cheap
test after the division to guarantee exact equidistribution,
but this doesn't cost much).

Seeing that getting random remainders is more expensive than
getting exponentially distributed numbers in most cases, i
think that speeding up division should have a higher priority.

Conclusion: i think that random minimization is fast enough as
it is.

BTW, has anybody else noticed the bug in the algorithm for
random minimization (algorithm S), p. 128 in The Art of
Computer Programming, Vol 2, ed. 2?  The bit count should start
from 0, not 1, otherwise the generated numbers will have a mean
that is too large by log 2 times the desired mean.
-- 
Niels J|rgen Kruse 	DIKU Graduate 	njk@diku.dk

cprice@mips.COM (Charlie Price) (11/09/90)

In article <36493@super.ORG> rminnich@super.org (Ronald G Minnich) writes:
>In article <1990Oct31.203932.26325@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
>
>|> ... keep in mind the application mix that uProcs are
>       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>|> actually used for.
>^^^^^^^^^^^^^^^^^^^^^
>
>I understand your point, and this is an old argument, but it is beginning to 
>sound parlously close to circular logic: 
>A) I won't implement x because no one currently uses it
>U) I can't use this uP, since it can't do x, better go use something else
>   (two years later)
>A) See what i mean? nobody uses x!

Your dialogue is about a *product* from a *commercial* computer maker,
and it is pretty much the right one considering that commercial enterprises
exist mostly to generate return on investment for the people who own them.

Exploring the use of "x" is what "research" is supposed to do, isn't it?
At a more pragmatic level, exploring the use of "x" in an end-user
environment is frequently "research financed by a venture capitalist".
Hopefully industrial research folks, academic research folks,
and entrepreneurs are all out spending late nights creating a market
for "x" which can be satisfied by the commercial types and enjoyed
by the masses.

It is probably good to keep in mind that there are several distinct
perspectives among those of us who meet in this particular netnews byway.
One is the perspective of people who would like to make and sell a bunch of
machines today or tomorrow and show a reasonable profit while having a
(reasonably) good time.
Another is the perspective of people who are thinking about how to build
machines to solve problem "xyzzy" -- in which they are deeply interested.
Another is the perspective of someone thinking about good ways to use
all this incredible machine-building technology that we are developing.
Another...
Some people are interested in architecture at multiple levels...

Keeping this diversity in mind might have a salutary influence
on the "discussions" that we create.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650