[comp.arch] Was: RISC is a nasty no-no! More to the point: Supercomputer addresses

lackey@Alliant.COM (Stan Lackey) (03/04/88)

In article <17415@think.UUCP> barmar@fafnir.think.com.UUCP (Barry Margolin) writes:
>>...when you start programming on supers you will soon
>>learn that power-of-two array sizes are the worst choice you can make.
>Could you explain why this is so?

Many supercomputers, and mini-supercomputers, to increase memory bandwidth,
set up independent banks of slow memory and interleave them on low address
bits.  Vector references then cause several banks to cycle at roughly
the same time, streaming elements to the CPU(s) at very high rates.

So: suppose the memory is 8-way interleaved, and the array dimension is a
power of 2 greater than or equal to 8.  All references will then go to the
same bank, reducing bandwidth to the cycle time of the memory RAM.  DRAM
cycle times are typically 200 ns, so you can see what that does to
performance.
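
For a back-of-the-envelope feel, here is a C sketch (the constants are
illustrative, not tied to any particular machine):

    #include <stdio.h>

    /* Banks kept busy by a fixed-stride reference stream:
       nbanks / gcd(stride, nbanks).  With 8 banks, stride 1 keeps
       all 8 busy; any stride that is a multiple of 8 keeps just 1. */
    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    int main(void)
    {
        int nbanks = 8;
        for (int stride = 1; stride <= 8; stride++)
            printf("stride %d -> %d bank(s) busy\n",
                   stride, nbanks / gcd(stride, nbanks));
        return 0;
    }

Stride 8 keeps one bank busy, which is exactly the power-of-2 trap above.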
-Stan

eugene@eos.UUCP (Eugene Miya) (03/05/88)

In article <1334@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey) writes:
>Many supercomputers, and mini-supercomputers, to increase memory bandwidth,
>set up independent banks of slow memory and interleave them on low address
>bits.

Actually, I would not say "many."  Crays are the only machines I have
measured that demonstrably have this problem.  These machines are no
slouches on speed; they are only 'relatively' slow ;-).  The designers of
other machines have noted this problem and made other architectural
compromises (e.g., the Cyber 205 and IBM 3090, which work around it in
different ways).  I've not run this particular test on the Alliant because
it has a relatively low-resolution clock, but maybe I'll get some time and
stress (torture) our Alliant ;-).

From the Rock of Ages Home for Retired Hackers:

--eugene miya, NASA Ames Research Center, eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene

dfk@duke.cs.duke.edu (David Kotz) (03/05/88)

In article <1334@alliant.Alliant.COM>, lackey@Alliant.COM (Stan Lackey) writes:
> 
> So: suppose the memory is 8-way interleaved, and the array dimension is a
> power of 2 greater than or equal to 8.  All references will then go to the
> same bank, reducing bandwidth to the cycle time of the memory RAM.  DRAM
> cycle times are typically 200 ns, so you can see what that does to
> performance.
> -Stan

This is true if your array is stored in row-major order (i.e. elements
of a row spread across modules, elements of a column all in one
module), and you are trying to access a given column. If you are
accessing successive elements of a row it doesn't matter what the
dimension of your array is.  A good discussion of all this can be found
in Stone, "High-Performance Computer Architecture", Addison-Wesley, 1987,
chapter 5, section 3.
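
Concretely (a C sketch; 8 word-interleaved banks and a 16-wide array are
assumed for illustration):

    #include <stdio.h>

    #define NBANKS 8
    #define N      16   /* row length, a multiple of NBANKS */

    /* Bank holding element a[i][j] of a row-major N x N array
       interleaved on the low address bits. */
    static int bank(int i, int j) { return (i * N + j) % NBANKS; }

    int main(void)
    {
        printf("row 0:    ");
        for (int j = 0; j < 8; j++) printf("%d ", bank(0, j)); /* 0..7 */
        printf("\ncolumn 0: ");
        for (int i = 0; i < 8; i++) printf("%d ", bank(i, 0)); /* all 0 */
        printf("\n");
        return 0;
    }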

David Kotz
-- 
Department of Computer Science, Duke University, Durham, NC 27706
ARPA:	dfk@cs.duke.edu
CSNET:	dfk@duke        
UUCP:	{ihnp4!decvax}!duke!dfk

lisper@yale.UUCP (Bjorn Lisper) (03/06/88)

In article <11233@duke.cs.duke.edu> dfk@duke.cs.duke.edu (David Kotz) writes:
>In article <1334@alliant.Alliant.COM>, lackey@Alliant.COM (Stan Lackey)
>writes:
>> So: suppose the memory is 8-way interleaved, and the array dimension is a
>> power of 2 greater than or equal to 8.  All references will then go to the
>> same bank,
....
>This is true if your array is stored in row-major order (i.e. elements
>of a row spread across modules, elements of a column all in one
>module), and you are trying to access a given column. If you are
>accessing successive elements of a row it doesn't matter what the
>dimension of your array is.

It seems that a smart compiler should be able to detect certain cases where
array elements are accessed in an order that causes slowdown due to memory
bank conflicts, and change the storage order to speed up the accesses.  Or
perhaps simply use a hash function to randomize the access pattern?  This
is of course not possible in a language like FORTRAN, where the storage
order is strictly specified.
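
As a toy illustration of what such a transformation amounts to (hand-written
here in C; a real compiler would do this invisibly, and the macro name is
made up):

    #include <stdio.h>

    #define N 16
    static double store[N * N];

    /* Flipping TRANSPOSED swaps the storage order, turning the
       bank-hostile column sweep below into a stride-1 sweep,
       without touching the loops themselves. */
    #ifdef TRANSPOSED
    #define A(i, j) store[(j) * N + (i)]   /* column-major */
    #else
    #define A(i, j) store[(i) * N + (j)]   /* row-major */
    #endif

    int main(void)
    {
        double sum = 0.0;
        for (int j = 0; j < N; j++)        /* column sweep */
            for (int i = 0; i < N; i++)
                sum += A(i, j);
        printf("%g\n", sum);
        return 0;
    }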

Does anyone have any comments on this?

Bjorn Lisper

marc@apollo.uucp (Marc Gibian) (03/08/88)

Many of the CDC Cyber machines at the high end of the 170 series used banked
memories, along with multiple functional units.  Once in a while you would run
into an algorithm that would get into phase with itself, try to make all its
memory accesses from the same bank, and be mysteriously slow.  These
were -real FUN- to fix.

email:  marc@apollo.UUCP

hankd@pur-ee.UUCP (Hank Dietz) (03/08/88)

In article <24605@yale-celray.yale.UUCP>, lisper@yale.UUCP (Bjorn Lisper) writes:
> It seems that a smart compiler should be able to detect certain cases where
> array elements are accessed in an order that causes slowdown due to memory
> bank conflicts, and change the storage order to speed up the accesses.  Or
> perhaps simply use a hash function to randomize the access pattern?  This
> is of course not possible in a language like FORTRAN, where the storage
> order is strictly specified.

Bjorn is quite correct; indeed, this is one of the main problems that
we supercomputer compiler folk are looking at.  It turns out to be
quite a nasty problem to re-arrange C data structures without changing
the function of the program (see all the noise about just changing
member order within a struct in comp.lang.c), but it is SOMETIMES
do-able with a clever compiler.  FORTRAN is not really much easier
than C, incidentally, because although it doesn't have multi-level
indirection, it does have things like COMMONs and EQUIVALENCE, which
link the layout of multiple data structures in complex ways.  I'm
currently doing a lot of work on automatic parallelization for
non-shared-memory machines... and this is a major concern.
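
A tiny example of the kind of C code that pins down member order (so no
compiler can silently rearrange the struct):

    #include <stdio.h>
    #include <string.h>

    struct point { double x, y; };   /* y assumed to sit right after x */

    int main(void)
    {
        struct point p;
        double xy[2] = { 1.0, 2.0 };
        memcpy(&p, xy, sizeof xy);   /* layout-dependent, but common */
        printf("%g %g\n", p.x, p.y); /* 1 2 only under this ordering */
        return 0;
    }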

Incidentally, the hash function idea works:  for a couple of years I've
had a 95% completed paper sitting around on the use of nearly perfect
hash functions to implement software address skewing...  maybe I ought
to complete it and send it for publication somewhere?
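
To give the flavor (a throwaway C sketch of the simplest diagonal skewing;
the hash-function schemes generalize this idea):

    #include <stdio.h>

    #define NBANKS 8
    #define N      16

    /* Skewed storage: element (i,j) lives in bank (i + j) mod NBANKS
       rather than (i*N + j) mod NBANKS, so rows and columns both
       spread across all the banks. */
    static int skewed_bank(int i, int j) { return (i + j) % NBANKS; }

    int main(void)
    {
        printf("column 0 under skewing: ");
        for (int i = 0; i < 8; i++)
            printf("%d ", skewed_bank(i, 0));   /* 0 1 2 3 4 5 6 7 */
        printf("\n");
        return 0;
    }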

						-hankd

lisper@yale.UUCP (Bjorn Lisper) (03/09/88)

In article <7690@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>In article <24605@yale-celray.yale.UUCP>, lisper@yale.UUCP (Bjorn Lisper)
>writes:
>> It seems that a smart compiler should be able to detect certain cases where
>> array elements are accessed in an order that causes slowdown due to memory
>> bank conflicts, and change the storage order to speed up the accesses.  Or
>> perhaps simply use a hash function to randomize the access pattern?  This
>> is of course not possible in a language like FORTRAN, where the storage
>> order is strictly specified.
>
>Bjorn is quite correct; indeed, this is one of the main problems that
>we supercomputer compiler folk are looking at.  It turns out to be
>quite a nasty problem to re-arrange C data structures without changing
>the function of the program (see all the noise about just changing
>member order within a struct in comp.lang.c), but it is SOMETIMES
>do-able with a clever compiler.  FORTRAN is not really much easier
>than C, incidentally, because although it doesn't have multi-level
>indirection, it does have things like COMMONs and EQUIVALENCE, which
>link the layout of multiple data structures in complex ways.  I'm
>currently doing a lot of work on automatic parallelization for
>non-shared-memory machines... and this is a major concern.

One of the implications of my first posting that I didn't spell out is that
languages like C and FORTRAN may not be that suitable for programming
supercomputers, for the reasons you mention; there are FORTRAN programs out
there whose correctness relies on matrix elements being stored in a certain
order, for instance.  If we had a language that specifies nothing but the
functionality of accessing an array element, that is: the only thing we know
about a(i,j) is that it returns the current value of a(i,j) when referred
to, then an optimizing compiler for that language would have much more room
to play around with different storage orderings in order to minimize access
time.  Isn't this a nice argument against using a language like FORTRAN in
scientific computing?
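
A C mock-up of how little the client code would need to know under such a
language (the names a_get/a_set and the particular layout function are
invented for illustration):

    #include <stdio.h>
    #include <stddef.h>

    #define N 16
    static double store[N * N];

    /* The layout lives in exactly one place; it could be row-major,
       column-major, skewed, or hashed, and no caller would know. */
    static size_t loc(int i, int j)
    {
        return (size_t)((i + j) % N) * N + (size_t)i;  /* a skewed layout */
    }

    double a_get(int i, int j)           { return store[loc(i, j)]; }
    void   a_set(int i, int j, double v) { store[loc(i, j)] = v; }

    int main(void)
    {
        a_set(3, 4, 42.0);
        printf("%g\n", a_get(3, 4));     /* 42, whatever the layout */
        return 0;
    }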

>Incidentally, the hash function idea works:  for a couple of years I've
>had a 95% completed paper sitting around on the use of nearly perfect
>hash functions to implement software address skewing...  maybe I ought
>to complete it and send it for publication somewhere?

Fun to hear that my idea wasn't totally out to lunch, it was just a flash I
got when writing my previous posting. I guess the trick is to find a hash
function that is cheap enough to evaluate while still not wasting too much
memory from the "imperfectness" of the hash function.

Bjorn Lisper

rober@weitek.UUCP (Don Rober) (03/10/88)

In article <24737@yale-celray.yale.UUCP> lisper@yale-celray.UUCP (Bjorn Lisper) writes:
>In article <7690@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>
>If we had a language that specifies nothing but the
>functionality of accessing an array element, that is: the only thing we know
>about a(i,j) is that it returns the current value of a(i,j) when referred
>to, then an optimizing compiler for that language would have much more room
>to play around with different storage orderings in order to minimize access
>time. Isn't this a nice argument against using a language like FORTRAN in
>scientific computing?
>
Interesting.  For all of its warts, APL displays these properties.
For years it was the highest-performing scientific language on the
Burroughs B6000/B7000 series of machines.
-- 
----------------------------------------------------------------------------
Don Rober				UUCP: {pyramid, cae780}!weitek!rober
Weitek Corporation	1060 East Arques		Sunnyvale, CA 94086

leichter@yale.UUCP (Jerry Leichter) (03/10/88)

In article <24737@yale-celray.yale.UUCP> lisper@yale-celray.UUCP (Bjorn Lisper) writes:
>... If we had a language that specifies nothing but the
>functionality of accessing an array element, that is: the only thing we know
>about a(i,j) is that it returns the current value of a(i,j) when referred
>to, then an optimizing compiler for that language would have much more room
>to play around with different storage orderings in order to minimize access
>time. Isn't this a nice argument against using a language like FORTRAN in
>scientific computing?

Nice theoretical argument - but it ignores the central issue:  Maybe the
REASON the scientific community likes FORTRAN is that it ALLOWS them to "play
around".  In fact, such "playing around" is an absolutely standard technique
in many large numerical codes; it's what makes them practical.  A common thing
to do is to allocate a large one-dimensional array and carve variable-sized
2-D arrays out of it.  The way this is done is ugly, because FORTRAN has no
inherent ability to deal with dynamic memory allocation, so it all has to be
faked.  If the code had been done in C, for example, the space would have been
allocated with malloc().  Of course, that doesn't really give a compiler
much more to go on than was available in the FORTRAN case!
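
For those who haven't seen the trick, it looks roughly like this
(transliterated into C; the sizes and names are invented):

    #include <stdio.h>

    /* One big 1-D workspace, with variable-sized 2-D arrays carved
       out of it by hand-computed offsets and leading dimensions. */
    #define WORKSIZE 10000
    static double work[WORKSIZE];

    #define A(i, j)  work[(i) * na + (j)]           /* ma x na */
    #define B(i, j)  work[bbase + (i) * nb + (j)]   /* mb x nb */

    int main(void)
    {
        int ma = 30, na = 40;        /* sizes known only at run time */
        int mb = 20, nb = 50;
        int bbase = ma * na;         /* B starts where A ends */
        if (bbase + mb * nb > WORKSIZE) return 1;
        A(2, 3) = 1.0;
        B(4, 5) = 2.0;
        printf("%g %g\n", A(2, 3), B(4, 5));
        return 0;
    }

The compiler sees nothing but index arithmetic into one flat array, which
is just as opaque as pointer arithmetic after malloc().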

Before proposing to remove something so central to large codes, you'd better
understand WHY it's central, and what you can provide in its place.

							-- Jerry

lisper%thor@CS.YALE.EDU (Bjorn Lisper) (03/11/88)

In article <24861@yale-celray.yale.UUCP> leichter@yale-celray.UUCP (Jerry
Leichter) writes:
>In article <24737@yale-celray.yale.UUCP> lisper@yale-celray.UUCP (Bjorn
>Lisper) writes:
>>... If we had a language that specifies nothing but the
>>functionality of accessing an array element, that is: the only thing we know
>>about a(i,j) is that it returns the current value of a(i,j) when referred
>>to, then an optimizing compiler for that language would have much more room
>>to play around with different storage orderings in order to minimize access
>>time. Isn't this a nice argument against using a language like FORTRAN in
>>scientific computing?
>
>Nice theoretical argument - but it ignores the central issue:  Maybe the
>REASON the scientific community likes FORTRAN is that it ALLOWS them to "play
>around".  In fact, such "playing around" is an absolutely standard technique
>in many large numerical codes; it's what makes them practical. ....
....
>Before proposing to remove something so central to large codes, you'd better
>understand WHY it's central, and what you can provide in its place.

And maybe the reason people want total control over memory management, as
in FORTRAN, is that there have been no compilers for other, less
memory-explicit languages that produce good code?  A great deal of the
development of parallelizing compilers for supercomputers has been the
development of FORTRAN compilers, for natural marketing reasons.
Unfortunately, for the reasons I pointed out, there is a limit to how smart
these compilers can be.  So as long as no real effort is put into
developing good parallelizing compilers for more high-level languages, the
art of parallelizing & optimizing compiling will not develop to its full
potential.

Another issue is of course portability; FORTRAN code hand-tailored for one
machine will perhaps not perform too well on another (since the compiler
cannot alter the programmer's detailed decisions), while for a higher-level
language more decisions are left to the compiler. But this is an old story...

Anyhow: I fully understand that people who need top performance in
supercomputing today use FORTRAN; they haven't got much else to choose
from.  But as long as people stick to this language, there is a limit to
how good the compilers can be.

Maybe this isn't a comp.arch discussion anymore?

Bjorn Lisper

edk@gryphon.CTS.COM (Ed Kaulakis) (03/14/88)

In article <24861@yale-celray.yale.UUCP>, leichter@yale.UUCP (Jerry Leichter) writes:
> In article <24737@yale-celray.yale.UUCP> lisper@yale-celray.UUCP (Bjorn Lisper) writes:
[]
> Nice theoretical argument - but it ignores the central issue:  Maybe the
> REASON the scientific community likes FORTRAN is that it ALLOWS them to "play
> around".  In fact, such "playing around" is an absolutely standard technique
> in many large numerical codes; it's what makes them practical.  A common thing
> to do is to allocate a large one-dimensional array and carve variable-sized
> 2-D arrays out of it.  The way this is done is ugly, because FORTRAN has no
> inherent ability to deal with dynamic memory allocation, so it all has to be
> faked.  If the code had been done in C, for example, the space would have been
> allocated with malloc().  Of course, that doesn't really give a compiler
> much more to go on than was available in the FORTRAN case!
> 
> Before proposing to remove something so central to large codes, you'd better
> understand WHY it's central, and what you can provide in its place.
> 
> 							-- Jerry


	Therefore, it's not worth discussing other approaches? 

	Look, I know how you feel. Read "Time, Progress, and the Older 
	Programmer". The only substitute for 30 years experience is 5
	years experience, 6 times.  It will be time to retire soon anyway.
	Take your pills.