[comp.arch] Memory latency / cacheing / scientific programs

jmd@granite.dec.com (John Danskin) (06/22/88)

I am interested in running a class of programs that process large
(bigger than cache but smaller than memory) arrays of data repeatedly.

The inner loop of the program is well behaved, so I can expect that
all of my instructions will fit in cache, and all of my intermediates will
fit in either registers or cache depending on how many registers there are.

The amount of work done per byte of data is such that the total throughput
required from the bus (on a well balanced/very high performance system)
is not a limiting factor.

However, if I have to accept the latency for a cache miss every line,
the performance of my program is halved.

I get 50% utilization of the CPU and 25% utilization of the bus.
If I could hide the latency, then I could get 100% of the CPU and 50% of
the bus.

This problem seems to be common to all of the new RISC machines (and
all of our old vaxen).  A few manufacturers (Amdahl) have tried to fix
the problem by prefetching into the cache. Others (Cray) hide memory
latency with (pipelined) vector instructions or (Cydrome) by exposing
the memory latency to the compiler.

Now, vectors are expensive, and exposing the memory latency directly to
the compiler seems to be a very short-term solution (I like to think
that binary compatibility across two models of the same machine is a
reasonable goal). So I would like to look elsewhere for my mips/flops.

How has prefetching worked out? Alan Jay Smith seems to recommend
prefetching given that it is implemented well. Has anybody been able to
do it?

How about multiple pending scoreboarded loads? The compiler emits the
load instruction as early as possible and only references the target
register when it is needed.  A trace scheduling compiler could space
loads out through a loop so that most/all of the bus latency is
hidden.  The compiler still has to know what memory latency is for full
efficiency, but if it doesn't (numbers change) the code still works.
This scheme is also register intensive in that several registers may be
waiting for relatively distant events.  Besides, most references are to
the cache (it's just that the misses are so much more important...).
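
In concrete terms, the effect I want is the sort of hand-scheduled
load-early/use-late code sketched below in C (the workload and names are
made up).  A scoreboarded load, or a compiler that knows the latency, would
start fetching the next element while the current one is still being chewed
on:

    #include <stddef.h>

    /* Rough sketch only: request element i+1 before using element i, so
     * a pending (scoreboarded) load can overlap its cache-miss latency
     * with the arithmetic. */
    double sum_scaled(const double *a, size_t n, double scale)
    {
        double sum = 0.0;
        if (n == 0)
            return sum;

        double cur = a[0];               /* load issued early            */
        for (size_t i = 0; i + 1 < n; i++) {
            double next = a[i + 1];      /* next load starts now...      */
            sum += cur * scale;          /* ...this work hides latency   */
            cur = next;
        }
        return sum + cur * scale;
    }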

Does anybody do this with scalar machines? Is it too hard?

Software caching: people talk about it, but who has done it?
If I could say
	load	Cache7 with Page 12
and compute using other cache lines (while the load occurred) until I said
	csynch	Cache7
things would work just fine. Too hard for compilers to do? Unimplementable?
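
In C the structure I have in mind looks roughly like the double-buffered
sketch below.  There is no real asynchronous load primitive here -- memcpy
stands in for "load CacheN with Page P" and a comment marks where "csynch"
would go -- so only the control flow is meaningful:

    #include <string.h>

    #define PAGE_WORDS 1024                  /* made-up "page" size        */

    static double buf[2][PAGE_WORDS];        /* two software cache buffers */

    static double crunch(const double *page) /* stand-in for real work     */
    {
        double s = 0.0;
        for (int i = 0; i < PAGE_WORDS; i++)
            s += page[i];
        return s;
    }

    double sweep(const double *mem, int npages)
    {
        double total = 0.0;
        int cur = 0;

        memcpy(buf[cur], mem, sizeof buf[cur]);   /* "load Cache0 with Page 0"   */
        for (int p = 0; p < npages; p++) {
            int nxt = cur ^ 1;
            if (p + 1 < npages)                   /* "load CacheN with Page p+1" */
                memcpy(buf[nxt], mem + (long)(p + 1) * PAGE_WORDS, sizeof buf[nxt]);
            /* a real "csynch" on buf[cur] would go here */
            total += crunch(buf[cur]);
            cur = nxt;
        }
        return total;
    }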


Is my class of problem interesting? It is my understanding that many
large scientific programs have similar behavior, but that the standard
UNIX timesharing load (whatever that is) has significantly different behavior.

A lot of research seems to have gone into making UNIX run fast, which is a
laudable goal. But I don't want a workstation that runs UNIX fast. I want a
workstation that runs my application fast.


I am eagerly awaiting a full list of the errors in my thinking.
-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.

colwell@mfci.UUCP (Robert Colwell) (06/22/88)

In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>
>I am interested in running a class of programs that process large
>(bigger than cache but smaller than memory) arrays of data repeatedly.
>
>How about multiple pending scoreboarded loads? The compiler emits the
>load instruction as early as possible and only references the target
>register when it is needed.  A trace scheduling compiler could space
>loads out through a loop so that most/all of the bus latency is
>hidden.  The compiler still has to know what memory latency is for full
>efficiency, but if it doesn't (numbers change) the code still works.
>This scheme is also register intensive in that several registers may be
>waiting for relatively distant events.  Besides, most references are to
>the cache (it's just that the misses are so much more important...).
>
>Does anybody do this with scalar machines? Is it too hard?

I consider our VLIW a scalar machine, and our compiler does what you
said.  I'm not sure what you meant by "if compiler doesn't know
memory latency, the code still works" though.  If the latency becomes
shorter, it'll still work, but if it gets longer, the code stops
working (the compiler will try to use the operand that it assumes has
now made it into the assigned register before the data actually
arrives).

One could, I suppose, put in the scoreboarding logic that would make
the code keep working even when the memory pipes are longer, by
watching which registers are targeted, and making sure that they've
been loaded before they are next used.  That could be a lot of
hardware in a VLIW like ours, though; any cluster (of which there are
1, 2, or 4 in the TRACE) can do a load and target some other
cluster's registers.  Since the CPU is implemented as 4 separate
cluster-pairs, I think the cross-instruction-stream watching hardware
would be pretty difficult.  Not to mention that you'd have to change
the register files, which are already pretty crammed with gates in
order to achieve the 4-read/4-write ports per instr that we need.
A high price in performance to pay to accommodate potentially slower
memories without re-compiling.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (06/24/88)

In article <779@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>The scalar CPU of Cyber 205 can issue up to 3 loads/stores to the memory unit.
>It continues running until the loaded register is referenced.


According to

  "Lockup-Free Instruction Fetch/Prefetch Cache Organization" by David Kroft
  SIGARCH Vol 9 #3 p.81 (May 1981)
  (8th Annual Symposium on Computer Architecture)

Control Data prototyped a cache which could continue satisfying cache hits
at full pipeline speed, plus  keep track of N unsatisfied cache misses.
The article reported that N=4 seemed optimal in their case. (Kroft did not,
however, say what machine the prototype was for.)
-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

"Imitation is not the sincerest form of flattery. Payments are."
- a British artist who died penniless before copyright law.

freuden@mfci.UUCP (Stefan Freudenberger) (06/24/88)

In article <443@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
|In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
|>                        ...  A trace scheduling compiler could space
|>loads out through a loop so that most/all of the bus latency is
|>hidden.  The compiler still has to know what memory latency is for full
|>efficiency, but if it doesn't (numbers change) the code still works.
|I consider our VLIW a scalar machine, and our compiler does what you
|said.  I'm not sure what you meant by "if compiler doesn't know
|memory latency, the code still works" though.  If the latency becomes
|shorter, it'll still work, ...
|                       ...  Not to mention that you'd have to change
|the register files, which are already pretty crammed with gates in
|order to achieve the 4-read/4-write ports per instr that we need.

Even if the memory latency decreases, the code will *only* work if you
don't run into register write port and/or other resource conflicts.
If resource constraints are maintained by the compiler, then you are
likely :-) to have to recompile your programs when these constraints
change.  If your original code still runs, though, then it wasn't very
compact in the first place.
 
Stefan Freudenberger   mfci!freuden@uunet.uucp
Multiflow Computer     freudenberger@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

smryan@garth.UUCP (Steven Ryan) (06/25/88)

>                                                            (Kroft did not,
>however, say what machine the prototype was for.)

Actually, I don't know if a 205 even has a cache. If it does, it is well
hidden from the CPU. I think the main memory is supposed to be as fast as
cache memory on all these weeny machines (ha-ha).

Don't shoot me--I'm just the programmer.

jmd@granite.dec.com (John Danskin) (06/28/88)

In article <448@m3.mfci.UUCP> you write:
>In article <443@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>|In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>|>                        ...  A trace scheduling compiler could space
>|>loads out through a loop so that most/all of the bus latency is
>|>hidden.  The compiler still has to know what memory latency is for full
>|>efficiency, but if it doesn't (numbers change) the code still works.
>|I consider our VLIW a scalar machine, and our compiler does what you
>|said.  I'm not sure what you meant by "if compiler doesn't know
>|memory latency, the code still works" though.  If the latency becomes
>|shorter, it'll still work, ...
>|                       ...  Not to mention that you'd have to change
>|the register files, which are already pretty crammed with gates in
>|order to achieve the 4-read/4-write ports per instr that we need.
>
>Even if the memory latency decreases, the code will *only* work if you
>don't run into register write port and/or other resource conflicts.
>If resource constraints are maintained by the compiler, then you are
>likely :-) to have to recompile your programs when these constraints
>change.  If your original code still runs, though, then it wasn't very
>compact in the first place.
> 
>Stefan Freudenberger   mfci!freuden@uunet.uucp


Somewhere up in my original posting I said 'scoreboarding'. With
scoreboarding, the code still runs. It may not run fast anymore, but
that is another problem.

You tell the compiler what the latencies are so that it can produce
good code.  You implement scoreboards so that binaries port from your
mark 1 machine to your mark 2 machine (they don't run as fast as they
should, but they probably run a little faster).

The real problem seems to be that scoreboards are expensive. You guys
may be on the right track (blow off compatibility, make the machine
soooo fast and relatively cheap that people will use it despite the
headache).

Tradeoffs are shifting in that direction, but I still see a lot of
value in binary compatibility at least for a few models of a design.

Have you guys thought about keeping an intermediate language copy of
each executable IN the executable with the 'cached' binary? Have the
loader check the binary to see if it has the right tag for the current
machine; if it does, run the code; if it doesn't, regenerate code
from the intermediate language spec and then run. You would want to
provide a utility for executable conversion, as this is not a real
performance answer. If the intermediate language was sophisticated
enough you might even be able to do the code generation reasonably
quickly...
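
Very roughly, the loader check might look like the sketch below.  The header
layout, the tag, and the regenerate/run helpers are all hypothetical stubs;
the only point is the tag test and the fallback to the intermediate
language copy:

    #include <stdio.h>
    #include <string.h>

    #define THIS_MACHINE_TAG "mark-2"     /* hypothetical model tag */

    struct exe_header {
        char tag[16];        /* model the cached native code was built for  */
        long native_offset;  /* where the cached native code lives          */
        long il_offset;      /* where the intermediate language copy lives  */
    };

    /* Stubs: a real loader would invoke the code generator here and then
     * map and jump to the native section. */
    static void regen_from_il(FILE *f, long off) { (void)f; (void)off; }
    static void run_native(FILE *f, long off)    { (void)f; (void)off; }

    int load_and_run(const char *path)
    {
        FILE *f = fopen(path, "rb");
        struct exe_header h;

        if (f == NULL || fread(&h, sizeof h, 1, f) != 1) {
            if (f) fclose(f);
            return -1;
        }
        if (strncmp(h.tag, THIS_MACHINE_TAG, sizeof h.tag) != 0)
            regen_from_il(f, h.il_offset);   /* wrong model: rebuild from IL */
        run_native(f, h.native_offset);
        fclose(f);
        return 0;
    }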

You would pay a disk space penalty now, but avoid being bitten
by the inevitable backwards compatibility problems that seem so
irrelevant when you build your first machine...
-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.

colwell@mfci.UUCP (Bob Colwell) (06/28/88)

In article <244@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
>
>Somewhere up in my original posting I said 'scoreboarding'. With
>scoreboarding, the code still runs. It may not run fast anymore, but
>that is another problem.
>
>You tell the compiler what the latencies are so that it can produce
>good code.  You implement scoreboards so that binaries port from your
>mark 1 machine to your mark 2 machine (they don't run as fast as they
>should, but they probably run a little faster).
>
>The real problem seems to be that scoreboards are expensive. You guys
>may be on the right track (blow off compatibility, make the machine
>soooo fast and relatively cheap that people will use it despite the
>headache).

I sure hope so.  We considered a lot of alternatives, including
putting in a hardware mode to make newer hardware correctly mimic the
pipelines of older, but we all suffered allergic rashes for a week.
Visions of something analogous to dragging the 8086 register set
around for the rest of our lives danced through our heads.  It's just
really yucky (scientifically speaking) to make your current hardware
slower so that old executables run unmodified.

Your point about the reason for scoreboarding was on target.
What I was trying to say was that however expensive you think
they are for normal machines, I believe they'd turn out to be lots
more so on a VLIW like ours, because of the sheer number of
instruction stream bits you'd have to simultaneously monitor (1024
all told, across 8 different boards).  I'd rather put all that
extra hardware to use as more functional unit horsepower, or more
I/O, caches, whatever.  Something that would contribute to my usable
compute horsepower.

>Tradeoffs are shifting in that direction, but I still see a lot of
>value in binary compatibility at least for a few models of a design.
>
>Have you guys thought about keeping an intermediate language copy of
>each executable IN the executable with the 'cached' binary? Have the
>loader check the binary to see if it has the right tag for the current
>machine; if it does, run the code; if it doesn't, regenerate code
>from the intermediate language spec and then run. You would want to
>provide a utility for executable conversion, as this is not a real
>performance answer. If the intermediate language was sophisticated
>enough you might even be able to do the code generation reasonably
>quickly...
>
>You would pay a disk space penalty now, but avoid being bitten
>by the inevitable backwards compatibility problems that seem so
>irrelevant when you build your first machine...

Dunno about expanding the sizes of our executables; they're already
something like 3X what a VAX would use.  We considered the utility
idea, but felt that the manpower required to implement it could be
better spent making the new machine run as fast as possible (which
would convince the customer that recompiling was the right move).

Personally, I think it is one of the hallmarks of RISC design in
general that one trades whatever one can for performance, and as long
as it is performance that customers buy, our design choices are quite
constrained.  It seems to me that big companies (the 'D' word and 
the 'I' word) require object compatibility, because that's a primary
reason for their sales.  Companies who hope to compete with them
will not, however, get their sales for that same reason.  We have to 
provide enough performance to get attention, and hope that the set of
tradeoffs represented thereby still has critical mass.  (I know
that's a horrible circumlocution, but I want to keep my job here!)


Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

eugene@pioneer.arpa (Eugene N. Miya) (06/29/88)

In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>
>I am interested in running a class of programs that process large
>(bigger than cache but smaller than memory) arrays of data repeatedly.
> . . .
>How has prefetching worked out? Alan Jay Smith seems to recommend
>prefetching given that it is implemented well. Has anybody been able to
>do it?
> . . .
>Is my class of problem interesting? It is my understanding that many
>large scientific programs have similar behavior, but that the standard
>UNIX timesharing load (whatever that is) has significantly different behavior.

John, enjoyed lunch with you.
Yes, your problem is interesting, but few are willing to do anything
about it.  Obviously, significance is in the eye of the beholder;
personally, I don't think they differ much (gross generalization, right?).

To wit: Alan stopped reading comp.arch about a year ago, but I did mail
the note to him.  He didn't have any special comment.  In the interest
of discussion, however, Alan and I have had a running argument about
how to deal with memory hierarchies.

Users currently write codes which run optimally on vector machines.
This contorts the code with compiler directives, tight vector loops
(rather than looser ones), and so forth.  It's still portable, but
looks funny at times.

Alan's argument is that if a machine has a cache, codes can be written
to assume a cache, doing as much work as possible in the fast cache with a
minimum of faulting.  The problem is that people don't have a feel for what
it costs to place something in there, how much work is break-even, or what
fraction of a full cache they can use (if I have a 16K cache, how much work
can I do?).  There are no guideposts, unlike in the vector world.
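
For concreteness, the kind of cache-conscious restructuring at issue is
something like the blocked (tiled) matrix multiply below.  The block size B
is exactly the number nobody has a feel for -- it has to be picked so that a
few BxB tiles fit in the cache.  Sizes here are purely illustrative, and c
is assumed to start out zeroed:

    #define N 512
    #define B  64    /* pick so roughly 3*B*B*sizeof(double) fits in cache */

    /* Blocked matrix multiply: each BxB tile is reused many times while it
     * is cache-resident, instead of streaming whole rows and columns past
     * the cache.  Assumes N is a multiple of B. */
    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
    {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int k = kk; k < kk + B; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jj + B; j++)
                                c[i][j] += aik * b[k][j];
                        }
    }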

Perhaps people should write a few articles (grad students?)
on writing codes specifically for performance.  I would not;
it's too transitory a topic.  Just what are the tradeoffs?

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/29/88)

In article <243@granite.dec.com> jmd@granite.dec.com (John Danskin) writes:
>
>I am interested in running a class of programs that process large
>(bigger than cache but smaller than memory) arrays of data repeatedly.


In some cases it is possible to reorganize array accesses so that, for
example, columns of the array are reused.  The generic example of how
to do this is Dongarra and Eisenstat's "Squeezing the most out of an
algorithm in Cray Fortran", ANL/MCS-TM-9 (Argonne Nat. Lab. - May 83).
This is the algorithm used in the "Matrix Vector" version of the Linpack
benchmark.  This type of reorganization can help cache, TLB translation,
and paging efficiency on some scalar machines (if they have fast
enough floating point to notice), and vector performance on both
memory to memory (e.g. Cyber 205) and vector register (e.g. Cray)
vector machines.
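
In C terms (the cited report is in Cray Fortran), the column-oriented idea
looks like the sketch below: y = y + A*x done a column at a time, so the big
array streams through memory exactly once while the short vectors x and y
are the operands that get reused.  The layout and dimensions are
illustrative only:

    /* Column-sweep matrix-vector update, y = y + A*x, with A stored
     * column-major as in Fortran.  A is touched once; x and y are reused
     * on every pass, so they are what is worth keeping in cache or
     * registers. */
    void matvec_colwise(int n, const double *a, const double *x, double *y)
    {
        for (int j = 0; j < n; j++) {
            const double *col = a + (long)j * n;   /* j-th column, contiguous */
            double xj = x[j];
            for (int i = 0; i < n; i++)
                y[i] += col[i] * xj;
        }
    }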

If your algorithm requires a complete pass through the array for a small
amount of computation done per step, clever coding can only eliminate
unnecessary memory references.  It will not replace the full sweep through
memory that you have to do.  So the key to seeing whether some improvement
is possible is to see whether the elements of the array are being used more
than once in each iteration.


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/29/88)

In article <803@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>Actually, I don't know if a 205 even has a cache. If it does, it is well
>hidden from the CPU. I think the main memory is supposed to be as fast as
>cache memory on all these weeny machines (ha-ha).

Neither the 205 nor the Cray has a cache.  The philosophy is to put in
enough registers that a cache is unnecessary.  The 256 registers on
the 205 were plenty for any module that I saw.  The place where this
approach hurts is in scalar codes that have very frequent procedure
calls (typical C system and utilities code) since data has to be
saved and restored between procedure calls even if it is being reused.
So, don't run code like that on these machines more than necessary...

-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

firth@sei.cmu.edu (Robert Firth) (06/30/88)

In article <11023@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>Neither the 205 nor the Cray has a cache.  The philosophy is to put in
>enough registers that a cache is unnecessary.  The 256 registers on
>the 205 were plenty for any module that I saw.  The place where this
>approach hurts is in scalar codes that have very frequent procedure
>calls (typical C system and utilities code) since data has to be
>saved and restored between procedure calls even if it is being reused.

This is a defensible philosophy, but, as with any hardware design, it
can work only with sympathetic software design.  If the target machine
has a large number of registers, then the codegenerator really must be
prepared to perform global register allocation, and must also address
the problem of register partitioning between user code and shared
library code.  Anything less, I would regard as an indication that the
implementor hadn't bothered to understand the machine.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/30/88)

In article <6070@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:

>This is a defensible philosophy, but, as with any hardware design, it
>can work only with sympathetic software design.  If the target machine
>has a large number of registers, then the codegenerator really must be
>prepared to perform global register allocation, and must also address

If by "global" you mean, at the module level, then the answer is yes.  The
optimizers go far beyond peephole type optimizations.  But if you mean
global for an entire process, no.  Which is why all registers that are
used in a module have to be restored.

>the problem of register partitioning between user code and shared
>library code.  Anything less, I would regard as an indication that the
>implementor hadn't bothered to understand the machine.

This, I don't understand.  Shared library modules are called just like
any other modules, so they do saves and restores.  But, even if you
do global global optimization for an entire process, you wouldn't
normally assume anything about a shared library - after all, the code
changes from release to release.  You wouldn't want to assume that
shared library code never changed...  P.S.  Crays don't have memory
mapping, so they don't do shared libraries.  Cyber 205's do both...


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

hankd@pur-ee.UUCP (Hank Dietz) (07/01/88)

In article <11023@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> Neither the 205 nor the Cray has a cache.  The philosophy is to put in
> enough registers that a cache is unnecessary.  The 256 registers on
> the 205 were plenty for any module that I saw.  The place where this
> approach hurts is in scalar codes that have very frequent procedure
> calls (typical C system and utilities code) since data has to be
> saved and restored between procedure calls even if it is being reused.
> So, don't run code like that on these machines more than necessary...

Registers are not a valid substitute for cache:  they are fundamentally more
restricted in function (although they are efficiently managed at
compile-time).  For example, both a[i] and a[j] can be kept in cache ad
infinitum, however, if we (the compiler) don't know if i==j, we can't put
either one in a register without having to flush both from registers every
time either one is stored into.  It's the classic ambiguous alias problem.
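
A two-line C illustration of the ambiguity (names are arbitrary):

    /* Unless the compiler can prove i != j, it cannot keep a[i] and a[j]
     * in registers across the first store, because the store into a[i]
     * might change a[j].  A cache has no such problem: both locations are
     * simply cached by address. */
    void update(double *a, int i, int j)
    {
        a[i] += 1.0;     /* store: might hit a[j]                  */
        a[j] *= 2.0;     /* must be (re)loaded if i could equal j  */
    }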

For vector registers, incidentally, you don't hit this problem because the
same alias problem which prevents keeping things in registers would also
prevent vectorization.  State save/restore on procedure calls is not so bad
a problem for registers, because the state variables typically have no
aliases, hence they can be maintained in registers (if you have enough of
them, as in some RISC designs).  As for the number of registers, we've
recently found that a perfect (or just really good) global register
allocator should rarely want more than about 10 registers per processor --
it's a pity there are so few good register allocators out in the real
world....  By the way, if you ignore loops, the cache size also can be a
very small number.

Chi-Hung Chi and I will have a paper discussing the aliasing problem and how
it relates to registers vs. cache at SUPERCOMPUTING '88:  "CRegs: A New Kind
of Memory for Referencing Arrays and Pointers."  The CReg is a new structure
which is managed like registers but treats aliased objects more like a cache
does...  patent perhaps-soon-to-be-pending?  Anyway, I'm willing to make the
paper available on a limited basis before the conference.

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   Hank Dietz, Ass't Prof. of EE

smryan@garth.UUCP (07/01/88)

>>Actually, I don't know if a 205 even has a cache. If it does, it is well
>>hidden from the CPU. I think the main memory is supposed to be as fast as
>>cache memory on all these weeny machines (ha-ha).

I was reminded the 205 does use an instruction stack (aka cache), hence the
existence of the VSB (void stack and branch) instruction. (The only VSB I
know of lives in DFBM.)
 
>Neither the 205 nor the Cray has a cache.  The philosophy is to put in
>enough registers that a cache is unnecessary.  The 256 registers on
>the 205 were plenty for any module that I saw.

Actually, for the code I ran, even 256 was too small. The usual case of a
program expanding to consume all available memory, I guess.

>                                                The place where this
>approach hurts is in scalar codes that have very frequent procedure
>calls

An interesting idea for the far future is to analyze subprogram calling
sequences to minimise swaps. This was done for the FTN200 runtime library.
Typically, one swap is necessary when going from user code to the library,
and then library routines share the register file, avoiding swap outs
as much as possible.

chris@mimsy.UUCP (Chris Torek) (07/01/88)

>In article <6070@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>>... If the target machine has a large number of registers, then the
>>codegenerator really must be prepared to perform global register
>>allocation ....

In article <11052@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster)
writes:
>If by "global" you mean, at the module level, then the answer is yes.  The
>optimizers go far beyond peephole type optimizations.  But if you mean
>global for an entire process, no.  Which is why all registers that are
>used in a module have to be restored.

I think by `global' he means what I mean by `global': at the process
level.  (It bothers me that people talk about per-function optimisation,
or even per-module, as `global optimisation'.  Global means global,
not function-level, not module-level.  One would not call the
variable `t' below global:

	f(x: real) {
		t: int;

		t := floor(x);
		...
	}

so why is optimisation that puts `t' in a register for the duration
of `...' called `global'?)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

smryan@garth.UUCP (07/01/88)

>                                 then the codegenerator really must be
>prepared to perform global register allocation, and must also address
>the problem of register partitioning between user code and shared
>library code.  Anything less, I would regard as an indication that the
>implementor hadn't bothered to understand the machine.

We understood the machine, but the problem isn't so simple. Current languages,
C and Fortran in particular, emphasise separate compilation but do not
provide a standard library mechanism to permit good interprocedural
optimisation. While work is being done to overcome this, it is in spite of,
rather than because of, most languages.

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (07/01/88)

In article <8429@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Registers are not a valid substitute for cache:  they are fundamentally more
>restricted in function (although they are efficiently managed at
>compile-time).  For example, both a[i] and a[j] can be kept in cache ad
>infinitum, however, if we (the compiler) don't know if i==j, we can't put
>either one in a register without having to flush both from registers every
>time either one is stored into.  It's the classic ambiguous alias problem.

Only scalars are allocated in registers using these compilers so there
isn't an aliasing problem.  What these compilers do, and what
the architecture supports, is the use of registers for local scalars,
and the use of memory for everything else: arrays and global variables
of all kinds.  While this is patently not the best arrangement for scalar
oriented C, it works very well for Fortran because:
1) Fortran modules tend to be "bigger" and have more local variables,
   compiler generated temporaries from expression evaluations, etc.,
   to store.
2) Arrays tend to have vector operations on them anyway, so a direct
   memory read/write is often in order.
3) Fortran does not have the same aliasing problems that "Algol-like"
   languages have.

It is true that all other things being equal, cache is better.  In fact,
a really clever machine architecture might be able to dispense with
registers altogether.  On an array oriented vector machine, however,
there is a serious problem.  It is not unusual at all for a large
fraction of memory to be accessed with each iteration of a problem,
with each element accessed only once or a few times.  If the entire
cache is emptied before an element is reused, there isn't much point
in having a cache.  This is one point where array oriented machines and
scalar oriented machines have mutually exclusive requirements.  Without
a register file, it is difficult to do the setup that the
vector unit needs fast enough.  Typically, code is reorganized on these
machines so that some of the setup necessary for the NEXT vector
instruction is started before beginning execution on the CURRENT instruction.
Then, while the scalar setup instructions are completing the vector unit is
operating (everything pipelined, scoreboards, etc.).  So, on a vector
machine, you will still need sufficient registers, even with a cache,
and then it takes a lot of work to make sure that the cache doesn't hurt
your vector performance (it gets even more interesting when you throw
multiprocessor multitasking into the picture ;-)

>them, as in some RISC designs).  As for the number of registers, we've
>recently found that a perfect (or just really good) global register
>allocator should rarely want more than about 10 registers per processor --

I have seen the number "32" bandied about previously as the ideal number
of registers for C.  10 looks pretty small to me.  I can remember writing
assembly code on a CDC 7600 and running out of registers very easily
doing expression evaluations with array elements. It doesn't take many
multi-dimensional arrays to use up your index registers (but, of course,
multiplying the elements of a two dimensional array times a column of a
third is an extremely unusual thing to do :-)    What sort of programs
is "10" based on?  (Heavily recursive "small" programs are not a fair
basis of comparison.  How about troff or cc?)  Again, the assumption
that a load/store costs about the same as another instruction is not true
on a fast pipelined machine with no cache.  If a single load takes a lot
of cycles, you need more registers.  Many of these architectural choices
are difficult to separate from each other.


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

stevew@nsc.nsc.com (Steve Wilson) (07/01/88)

In article <8429@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>
>Registers are not a valid substitute for cache:  they are fundamentally more
>restricted in function (although they are efficiently managed at
>compile-time).  For example, both a[i] and a[j] can be kept in cache ad
>infinitum, however, if we (the compiler) don't know if i==j, we can't put
>either one in a register without having to flush both from registers every
>time either one is stored into.  It's the classic ambiguous alias problem.
>
I was always under the impression that a data cache wasn't a wonderful
idea for scientific applications.  Typical explanation being that if
the data set is larger than the cache, the data set will never reside
in the cache entirely, i.e. thrashing.  Previous discussions about
VM and scientific applications centered on this same type of problem
but from a different angle( the hardware structures are the same, just the 
names have been changed to protect the software ;-)

Given the alias problem and cache behaviour as well, is there a happy
medium?

Steve Wilson
National Semiconductor

[The above opinions are mine, so don't blame my employer.]

smryan@garth.UUCP (Steven Ryan) (07/02/88)

>        (It bothers me that people talk about per-function optimisation,
>or even per-module, as `global optimisation'.  Global means global,
>not function-level, not module-level.

I think it started out as local referring to basic block optimisation and
global as everything else (anything with a jump or label).

hankd@pur-ee.UUCP (Hank Dietz) (07/05/88)

In article <11106@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> In article <8429@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >Registers are not a valid substitute for cache:  they are fundamentally more
> >restricted in function (although they are efficiently managed at
> >compile-time).  For example, both a[i] and a[j] can be kept in cache ad
> >infinitum, however, if we (the compiler) don't know if i==j, we can't put
> >either one in a register without having to flush both from registers every
> >time either one is stored into.  It's the classic ambiguous alias problem.
> 
> Only scalars are allocated in registers using these compilers so there
> isn't an aliasing problem.  What these compilers do, and what
> the architecture supports, is the use of registers for local scalars,
> and the use of memory for everything else: arrays and global variables
> of all kinds.  While this is patently not the best arrangement for scalar
> oriented C, it works very well for Fortran because:

Actually, it "works" because FORTRAN doesn't have pointers -- scalars in C
are just like array references alias-wise because any pointer could point at
just about any datum (and nobody knows which :-), scalars included.
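
A small C illustration (made up, but typical):

    int total;           /* an ordinary global scalar */

    /* Because p could legally point at total, the compiler must assume the
     * store through p may change total, so total cannot simply sit in a
     * register across the loop body. */
    void accumulate(int *p, int n)
    {
        for (int i = 0; i < n; i++) {
            *p += i;     /* might alias total                     */
            total += 1;  /* so total is re-read after every store */
        }
    }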

> >...  As for the number of registers, we've
> >recently found that a perfect (or just really good) global register
> >allocator should rarely want more than about 10 registers per processor --
> 
> I have seen the number "32" bandied about previously as the ideal number
> of registers for C.  10 looks pretty small to me....
> What sort of programs is "10" based on?

We used the standard MIPS benchmarks, which are admittedly small programs,
but they are the same ones "everybody" uses (so I guess we did cheat :-).

We also constructed random interference graphs and obtained very similar
results even for graph-coloring register allocation using very large graphs
which are very strongly connected (up to thousands of nodes, with each node
connected to as many as 90% of all other nodes) -- remember that all planar
graphs can be 4-colored and the graph complexity which can be N-colored
grows MUCH (incredible understatement here :-) faster than N.  By the way,
although fewer-than-10 colorings nearly always existed and our algorithm
usually found them, the standard node-removal technique (see Chaitin,
"Register Allocation and Spilling via Graph Coloring," SIGPLAN Symp. on
Compiler Construction, June 1982, pp.  201-207) often does not find them.
The technique we used was a modified graph walk -- details and sample C code
available on request.
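
For anybody who hasn't seen coloring-based allocation, here is a
deliberately plain greedy colorer over a tiny made-up interference graph --
this is neither Chaitin's simplification scheme nor the modified graph walk
mentioned above, just an illustration of what a "coloring" buys you (each
color stands for a register):

    #include <stdio.h>

    #define NNODES 6

    /* adj[i][j] != 0 means live values i and j interfere (are live at the
     * same time) and so cannot share a register. */
    static const int adj[NNODES][NNODES] = {
        {0,1,1,0,0,0},
        {1,0,1,1,0,0},
        {1,1,0,1,1,0},
        {0,1,1,0,1,1},
        {0,0,1,1,0,1},
        {0,0,0,1,1,0},
    };

    int main(void)
    {
        int color[NNODES];

        for (int i = 0; i < NNODES; i++) {
            int used[NNODES] = {0};
            for (int j = 0; j < i; j++)     /* colors of colored neighbours */
                if (adj[i][j])
                    used[color[j]] = 1;
            int c = 0;
            while (used[c])                 /* smallest free color          */
                c++;
            color[i] = c;
            printf("value %d -> register r%d\n", i, c);
        }
        return 0;
    }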

> Again, the assumption that a load/store costs about the same as another
> instruction is not true on a fast pipelined machine with no cache.
> If a single load takes a lot of cycles, you need more registers.

For the most part, no assumption was needed about load/store cost versus
anything -- each datum was in a register while it was live (i.e., while its
value might be referenced again) and not involved with an alias.  Further
note that we didn't put "variables" in registers; we put "live values" in
registers and that makes quite a difference.

Of course, your mileage may vary...  but 16 registers seems to be plenty.

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   Prof. Hank Dietz, (317) 494 3357

pardo@june.cs.washington.edu (David Keppel) (07/06/88)

In article <8444@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
[ You may need more... but 16 registers seems to be enough ]

See papers by David Wall (1986 ACM??, ??) on global register
allocation at link time.  (Talk about global optimization!)
He's had very good results with a machine using about (read
"more than") 50 registers.

	;-D on  ( ZERO tolerance on drugs: impeach Bush )  Pardo