preston@ariel.rice.edu (Preston Briggs) (05/07/91)
>lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>I have only limited experience with the new, fast-only-in-cache, machines,
>but I have to say that the code you need to get optimum performance is
>even more non-intuitive than that for the older vector architecture machines.
>Even worse, code which was previously optimal for vector machines, and which
>was OK on a wide variety of other machines, is now pessimal for these machines.

The reality of big systems is that they are implemented with a
memory hierarchy. Typically

        registers
        cache
        tlb
        ram
        disk

On a Cray running vector stuff, it might look more like

        ram
        disk

but the hierarchy still exists. A fair amount of the money spent
on a super is dedicated toward flattening the hierarchy.

For best results, you (or your compiler) should be conscious of the
implementation (not just the architecture) of the target machine.
Everyone knows about blocking for cache. Well, you can also block
for registers, tlb, and ram.

Ken Kennedy would like to see programmers coding in a "blockable"
style, with compilers doing the actual blocking. He makes an analogy
with vectorization. When vectorizing compilers became available,
programmers learned to write somewhat stylized loops that they
expected the vectorizer to recognize and handle efficiently; they
learned to write vectorizable code. In many respects, the code was
portable, in that it could be transformed, by the compilers, to run
efficiently on a variety of (vector) machines.

Currently, we see programmers blocking their code by hand for each
machine they have to use. Kennedy (and others) hope to develop
adequate techniques to allow programmers to write more portable code,
trusting the compilers to find efficient blockings.

Some researchers at IBM think that the RAM model used by most
programmers for reasoning about their program's complexity is fatally
flawed because it has only a single level of memory. They propose a
couple of more sophisticated models that account for the memory
hierarchy in various ways.

        The Uniform Memory Hierarchy Model of Computation
        Bowen Alpern, Larry Carter, Ephraim Feig, Ted Selker
        FOCS 90 (Foundations of Computer Science)

Regarding the invention of blocking, there's an old paper

        Matrix Algebra Programs for the UNIVAC
        J. D. Rutledge and H. Rubenstein
        Wayne Conference on Automatic Computing Machinery and Applications
        March 1951

that discusses blocking various matrix routines for a memory
hierarchy, presumably including tape (though I don't have a copy of
the paper to be sure).
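To make the idea of blocking concrete, here is a minimal sketch of a
cache-blocked (tiled) matrix multiply in C. It is not taken from
Preston's post; the routine name, the C += A*B convention, and the
tile size BS are assumptions made for illustration, and a real code
would tune BS to the cache of the target machine.

        /* Blocked (tiled) matrix multiply: C += A*B, all n x n, row-major.
           BS is a hypothetical tile size, chosen so that a few BS x BS
           tiles fit comfortably in the data cache. */
        #define BS 32

        void matmul_blocked(int n, const double *A, const double *B, double *C)
        {
            for (int ii = 0; ii < n; ii += BS)
                for (int kk = 0; kk < n; kk += BS)
                    for (int jj = 0; jj < n; jj += BS)
                        /* work on one tile of C at a time, so each element
                           of A and B brought into cache is reused roughly
                           BS times before it is evicted */
                        for (int i = ii; i < n && i < ii + BS; i++)
                            for (int k = kk; k < n && k < kk + BS; k++) {
                                double a = A[i*n + k];
                                for (int j = jj; j < n && j < jj + BS; j++)
                                    C[i*n + j] += a * B[k*n + j];
                            }
        }

Kennedy's point, as described above, is that the programmer would
write only the naive triply-nested loop and leave the machine-specific
choice of tile sizes and loop order to the compiler.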
elm@sprite.Berkeley.EDU (ethan miller) (05/08/91)
In article <1991May7.152224.3146@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
%The reality of big systems is that they are implemented with a
%memory hierarchy. Typically
%
% registers
% cache
% tlb
% ram
% disk
Actually, *really* big systems look something like this:
registers
(cache)
ram
solid-state disk
fast disk
(slower disk)
tape
The slower disk might be there just as a speed-matching buffer for the
tape. Cache may or may not exist. Instruction caches (or instruction
buffers) are everywhere. For scientific computing, though, a data cache
often isn't much good. Most Cray sites look like the above list,
without cache or "slower disk."
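To see why a data cache often buys little here, consider a typical
streaming kernel (a hypothetical example, not from ethan's post):

        /* DAXPY sweep over vectors far larger than any data cache.
           Each element of x and y is touched exactly once, so there is
           no reuse for the cache to exploit; the loop runs at the speed
           of main memory, not of the cache. */
        void daxpy(long n, double a, const double *x, double *y)
        {
            for (long i = 0; i < n; i++)
                y[i] += a * x[i];
        }

Kernels with substantial reuse, like the blocked matrix multiply
sketched earlier in the thread, are the cases where a cache does pay off.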
%but the hierarchy still exists. A fair amount of the money spent
%on a super is dedicated toward flattening the hierarchy.
Most of the money spent on a super goes towards making a memory system
that can sustain gigabyte/second transfer rates. We know how to compute
at gigaflop speed. Just throw together 100 of your favorite math
coprocessors, each of which can do 10 MFLOPS. This should cost you at
most $200K.
The rest of the cost is building a memory system which can support
the high data rates these 100 processors can sustain.
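As a back-of-envelope check on those numbers (the bytes-per-flop
figure below is an assumption based on a DAXPY-style kernel, not
something from ethan's post):

        /* Rough estimate of the memory bandwidth the 100-coprocessor,
           1 GFLOPS machine above would need to sustain.  Assumes a
           DAXPY-like kernel: 2 flops per element and 24 bytes of traffic
           per element (load x, load y, store y), i.e. 12 bytes per flop. */
        #include <stdio.h>

        int main(void)
        {
            double gflops = 100 * 10e6 / 1e9;   /* 100 CPUs at 10 MFLOPS each */
            double bytes_per_flop = 24.0 / 2.0; /* 24 bytes of traffic per 2 flops */
            printf("sustained bandwidth needed: about %.0f GB/s\n",
                   gflops * bytes_per_flop);
            return 0;
        }

Even allowing for kernels with far more reuse, the sustained rate lands
in the gigabyte/second range ethan describes, which is why the memory
system dominates the cost.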
ethan
--
=====================================
ethan miller--cs grad student elm@cs.berkeley.edu
#include <std/disclaimer.h> {...}!ucbvax!cs.berkeley.edu!elm
This signature will self-destruct in 5 seconds....
edwardm@hpcuhe.cup.hp.com (Edward McClanahan) (05/14/91)
> %The reality of big systems is that they are implemented with a
> %memory hierarchy. Typically
> %
> %        registers
> %        cache
> %        tlb
> %        ram
> %        disk
> Actually, *really* big systems look something like this:
> registers
> (cache)
> ram
> solid-state disk
> fast disk
> (slower disk)
> tape

Or, how about the following:

        registers
        register-window
        on-chip cache
        on-chip tlb
        off-chip cache
        off-chip tlb
        ...