[comp.arch] Memory hierarchies

preston@ariel.rice.edu (Preston Briggs) (05/07/91)

>lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

>I have only limited experience with the new, fast-only-in-cache, machines,
>but I have to say that the code you need to get optimum performance is
>even more non-intuitive than that for the older vector architecture machines.
>Even worse, code which was previously optimal for vector machines, and which
>was OK on a wide variety of other machines, is now pessimal for these machines.

The reality of big systems is that they are implemented with a
memory hierarchy.  Typically

	registers
	cache
	tlb
	ram
	disk

On a Cray, running vector stuff, it might look more like

	ram
	disk

but the hierarchy still exists.  A fair amount of the money spent
on a super is dedicated toward flattening the hierarchy.

For best results, you (or your compiler) should be conscious
of the implementation (not just the architecture) of the target
machine.  Everyone knows about strip mining for cache.  Well,
you can also block for registers, tlb, and ram.
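
For concreteness, here is a toy C sketch of blocking a matrix
multiply at the cache level; B is a made-up tile size that you
would tune so that three BxB tiles stay resident in the cache.
The same transformation, with different block sizes, applies at
the register, tlb, and ram levels.

/* Toy sketch of cache blocking ("tiling") a matrix multiply.
   B is a made-up tile size, chosen so three BxB tiles fit in
   the data cache; N is assumed to be a multiple of B to keep
   the example short.  Results accumulate into c. */
#define N 512
#define B 64

void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    int ii, jj, kk, i, j, k;
    double sum;

    for (ii = 0; ii < N; ii += B)
      for (jj = 0; jj < N; jj += B)
        for (kk = 0; kk < N; kk += B)
          /* multiply one pair of BxB tiles; all three tiles
             stay cache resident for the duration */
          for (i = ii; i < ii + B; i++)
            for (j = jj; j < jj + B; j++) {
              sum = c[i][j];
              for (k = kk; k < kk + B; k++)
                sum += a[i][k] * b[k][j];
              c[i][j] = sum;
            }
}

The win is that each tile of a and b gets reused B times out of
the cache, instead of being refetched from memory on every pass.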

Ken Kennedy would like to see programmers coding in a "blockable" style,
with compilers doing the actual blocking.  He makes an analogy
with vectorization.  When vectorizing compilers became available,
programmers learned to write somewhat stylized loops that they
expected the vectorizer to recognize and handle efficiently;
they learned to write vectorizable code.  In many respects,
the code was portable, in that it could be transformed, by the
compilers, to run efficiently on a variety of (vector) machines.
Currently, we see programmers blocking their code by hand for
each machine they have to use.  Kennedy (and others) hope
to develop adequate techniques to allow programmers to write
more portable code, trusting the compilers to find efficient
blockings.
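
A toy illustration of the hoped-for style (my own example, with
made-up names): the programmer writes the obvious stride-1 loop
and the plain triple loop, and trusts the compiler to vectorize
the one and to block the other for whatever cache, tlb, and
register set the target actually has.

/* The stylized, dependence-free loop vectorizers learned to expect. */
void daxpy(int n, double alpha, double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)     /* stride-1, no loop-carried dependence */
        y[i] = y[i] + alpha * x[i];
}

/* The "blockable" analogue: no hand tiling; the compiler is
   expected to choose tile sizes for the target machine. */
void matmul(int n, double *a, double *b, double *c)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}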

Some researchers at IBM think that the RAM model used by most
programmers for reasoning about their program's complexity is
fatally flawed because it has only a single level of memory.
They propose a couple of more sophisticated models that account
for the memory hierarchy in various ways.

	The Uniform Memory Hierarchy Model of Computation
	Bowen Alpern, Larry Carter, Ephraim Feig, Ted Selker
	FOCS 90 (Foundations of Computer Science)


Regarding the invention of blocking,
there's an old paper

	Matrix Algebra Programs for the UNIVAC
	J. D. Rutledge and H. Rubenstein
	Wayne Conference on Automatic Computing Machinery and Applications
	March 1951

that discusses blocking various matrix routines for a memory hierarchy,
presumably including tape (though I don't have a copy of the paper
to be sure).

elm@sprite.Berkeley.EDU (ethan miller) (05/08/91)

In article <1991May7.152224.3146@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
%The reality of big systems is that they are implemented with a
%memory hierarchy.  Typically
%
%	registers
%	cache
%	tlb
%	ram
%	disk

Actually, *really* big systems look something like this:

registers
(cache)
ram
solid-state disk
fast disk
(slower disk)
tape

The slower disk might be there just as a speed-matching buffer for the
tape.  Cache may or may not exist.  Instruction caches (or instruction
buffers) are everywhere.  For scientific computing, though, a data cache
often isn't much good.  Most Cray sites look like the above list,
without cache or "slower disk."
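
A toy illustration (made-up sizes) of why a data cache buys so
little here: a typical kernel sweeps once over arrays far larger
than any cache, so there is no reuse for the cache to capture --
nearly every line is fetched from memory once and then displaced.

#define N (1 << 22)      /* 4M doubles, 32 MB per array: >> any cache */

double a[N], b[N], c[N];

void triad(double s)
{
    int i;
    for (i = 0; i < N; i++)      /* each element touched exactly once */
        a[i] = b[i] + s * c[i];
}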

%but the hierarchy still exists.  A fair amount of the money spent
%on a super is dedicated toward flattening the hierarchy.

Most of the money spent on a super goes towards making a memory system
that can sustain gigabyte/second transfer rates.  We know how to compute
at gigaflop speed.  Just throw together 100 of your favorite math
coprocessors, each of which can do 10 MFLOPS.  This should cost you
at most $200K.  The rest of the cost goes into building a memory
system that can support the high data rates those 100 processors
can sustain.
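
The back-of-the-envelope arithmetic (rough, with made-up kernel
numbers) makes the point: a daxpy-like loop does 2 flops per
element but moves 3 doubles (two reads and a write), so a
sustained gigaflop wants something like 12 gigabytes/second
out of memory.

#include <stdio.h>

int main(void)
{
    double flops_per_elem = 2.0;         /* one multiply, one add   */
    double bytes_per_elem = 3.0 * 8.0;   /* read x, read y, write y */
    double target_gflops  = 1.0;         /* 100 x 10 MFLOPS         */

    double gbytes_per_sec =
        target_gflops * (bytes_per_elem / flops_per_elem);

    printf("%.0f GFLOPS sustained needs ~%.0f GB/s of memory bandwidth\n",
           target_gflops, gbytes_per_sec);
    return 0;
}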

ethan
-- 
=====================================
ethan miller--cs grad student   elm@cs.berkeley.edu
#include <std/disclaimer.h>     {...}!ucbvax!cs.berkeley.edu!elm
This signature will self-destruct in 5 seconds....

edwardm@hpcuhe.cup.hp.com (Edward McClanahan) (05/14/91)

> %The reality of big systems is that they are implemented with a
> %memory hierarchy.  Typically
> %
> %	registers
> %	cache
> %	tlb
> %	ram
> %	disk

> Actually, *really* big systems look something like this:

> registers
> (cache)
> ram
> solid-state disk
> fast disk
> (slower disk)
> tape

Or, how about the following:

  registers
  register-window
  on-chip cache                 on-chip tlb
  off-chip cache                off-chip tlb
  ...