jf@threel.co.uk (John Fisher) (12/07/90)
sarima@tdatirv.UUCP (Stanley Friesen) writes:
> Except for one thing. In scientific computation typical dimensions would be
> more like 1000 by 1000 by 1000 by 1000 by 50, which requires a great deal more
> than a mere thirty pointers.

I can't help but feel that someone who needs an array of 50 trillion
elements has one or two other problems to worry about.

--John Fisher
jlg@lanl.gov (Jim Giles) (12/08/90)
This discussion of pointer-to-pointer arrays vs. 'flat' arrays is
getting out of hand. Extravagant statements about the size of
'practical' computations vs. 'interesting' computations are not
relevant to the issue.
To begin with the original problem, then: suppose you have an array
dimensioned 5x5x3x2x12. This has 1800 data elements. The claim was
made that implementing this as a pointer-to-pointer array would
require 30 pointers and be somewhat faster. So, I'll analyze the
issue step by step.
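For concreteness, here is what the two representations look like in
C (a sketch with my own names, not anything from the earlier posts;
the pointer tables for ptp have to be built elsewhere):

    double flat[5][5][3][2][12];   /* 'flat': one 1800-element block */
    double *****ptp;               /* pointer-to-pointer: levels of
                                      pointer tables, built elsewhere */

    double get_flat(int i, int j, int k, int l, int m)
    {
        /* the compiler computes the address as
           base + ((((i*5 + j)*3 + k)*2 + l)*12 + m)  */
        return flat[i][j][k][l][m];
    }

    double get_ptp(int i, int j, int k, int l, int m)
    {
        /* four dependent pointer loads, then the data load */
        return ptp[i][j][k][l][m];
    }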
1) If your algorithm tends to do a lot of 'random' accessing of the
array, then the pointer-to-pointer version is faster than the 'flat'
version only if memory accesses are cheaper than multiplies. The
technologies of memory access and multiplication tend to leapfrog
each other - at present, only micros seem to have memory that is
faster than their multiplies.
Note: in this context, the _irrelevant_ claim was made that Cray
24-bit multiplies were somehow inadequate for the job at hand. But
24 bits is the size of the address registers, so these are exactly
the multiplies appropriate to the task under discussion. Also, 46-bit
multiplies in the S registers take only 7 clocks - still over twice
as fast as memory (which is 14 to 17 clocks depending on machine
model, assuming no bank conflicts). Further, the irrelevant
implication was made that 24 bits can only address 16 megabytes.
The Cray addresses _words_, so 24 bits address 128 Mbytes.
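If you want to measure point 1 on your own machine, a crude test is
easy to write (this assumes get_flat and get_ptp from the sketch
above, and that ptp has actually been built to alias the same data):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        double s = 0.0;
        long n, N = 10000000L;
        clock_t t;

        srand(1);
        t = clock();
        for (n = 0; n < N; n++)
            s += get_flat(rand()%5, rand()%5, rand()%3, rand()%2, rand()%12);
        printf("flat: %ld ticks (sum %g)\n", (long)(clock() - t), s);

        srand(1);                    /* replay the same index sequence */
        s = 0.0;
        t = clock();
        for (n = 0; n < N; n++)
            s += get_ptp(rand()%5, rand()%5, rand()%3, rand()%2, rand()%12);
        printf("ptp:  %ld ticks (sum %g)\n", (long)(clock() - t), s);
        return 0;
    }

The rand() overhead is the same in both loops, so only the
_difference_ between the two timings means anything: four
multiply-adds per access versus four dependent loads.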
2) If your algorithm tends to do most array references in nested loops
and indexes the array on the loop counters (_vastly_ the most common
case), then the pointer-to-pointer implementation _ties_ the 'flat'
array implementation - _PROVIDED_ that the loops are nested in the
same order as the pointer-to-pointer structure is laid out. If this
is not the case, the pointer loads cannot be migrated out of the loop
and must be done inside it. Further, since it _is_ a _pointer_ being
loaded, the compiler may have to assume it is aliased from one loop
trip to the next (inhibiting optimizations). Meanwhile, the 'flat'
array implementation _always_ strength-reduces the multiply out of
the loop - different loop nestings just require different strides.
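To see what strength reduction buys the flat array, here is the
inner loop of a traversal that runs _against_ the declared order -
varying the first subscript fastest (again a sketch with my names):

    /* sum over i of flat[i][j][k][l][m]: the multiplies are done
       once, outside the loop, and the loop itself just adds a
       constant stride of 5*3*2*12 = 360 elements per trip. */
    double sum_first_dim(double *base, int j, int k, int l, int m)
    {
        double *p = base + ((j*3 + k)*2 + l)*12 + m;
        double s = 0.0;
        int i;

        for (i = 0; i < 5; i++, p += 360)
            s += *p;
        return s;
    }

The pointer-to-pointer version of the same traversal has to evaluate
ptp[i][j][k][l][m] with i varying fastest: four loads per trip, none
of which can be hoisted out, and all of which the compiler may have
to treat as potentially aliased.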
Well, you could just make the rule that all loops _must_ be nested in
the declared order of the array. But you can't. The most common
operation on tensors, for example, is the tensor multiply. In this
operation, at least one of the tensors involved will be accessed in a
different order than the other two. And a given tensor may be used
either as an operand or as a result of each multiply. Other algorithms
in scientific computation make similar demands for flexibility in
loop nesting order (one finite differencing scheme converges in 4 times
fewer timesteps if the mesh is traversed in row order and column order
on alternate steps). The user of a pointer-to-pointer implementation
has only two solutions to this problem available:
a) Do nothing. Let the out-of-order loops run less efficiently.
But, since (because of aliasing) this may involve running _really_
slowly (lack of vectorization, etc.), this 'solution' seems not to
meet the needs of the user. After all, the reason for using
the pointer-to-pointer implementation was supposedly to gain
a speed advantage.
b) Set up pointer-to-pointer structures for the array in each of the
nesting orders that your program uses; then the loops can nest in
any order and still be efficient (a sketch follows this list). If,
as was claimed, the number of pointers required is insignificant,
you get identical speed in nested loops out of pointer-to-pointer
and faster speed on random uses (still assuming memory is faster
than multiplies).
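Here is option (b) for a 2-D case, which shows the idea without five
levels of tables (my names; the dimensions are arbitrary):

    #include <stdlib.h>

    #define R 12
    #define C 150

    double *data;    /* one R*C block, row-major                    */
    double **byrow;  /* byrow[i] points at the start of row i       */
    double **bycol;  /* bycol[j] points at the top of column j; the
                        inner loop must then stride by C - this is
                        the extra 'stride info' mentioned above     */

    void build_tables(void)
    {
        int i, j;

        data  = malloc(R * C * sizeof *data);
        byrow = malloc(R * sizeof *byrow);
        bycol = malloc(C * sizeof *bycol);
        for (i = 0; i < R; i++) byrow[i] = data + i*C;
        for (j = 0; j < C; j++) bycol[j] = data + j;
    }

    /* row-order loops use byrow[i][j];                          */
    /* column-order loops use bycol[j][i*C] - strided accesses   */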
3) Well, how many pointers do you need to implement an array with the
pointer-to-pointer scheme? Let's say that the dimensions of the array
are D1,D2,D3,...,Dn for an n-dimensional array. The number of data
elements is evidently the product of all the dimensions. The number of
pointers is the product of the first n-1 dimensions _plus_ the product
of the first n-2 _plus_ ... _plus_ the first dimension _plus_ one (this
last is the initial pointer to the whole structure - what gets passed
in a procedure call). So, for the original 5x5x3x2x12 array, we have
1800 data elements and 5*5*3*2 + 5*5*3 + 5*5 + 5 + 1 = 256 pointers to
implement the array in row-major order - not 30 as claimed. This is
14.2% of the total number of data elements.
Now, suppose you also need to nest your loops as 12x3x2x5x5 (again in
row-major order). There are still 1800 data elements, but now you
need 481 pointers - that's 26.7% of the number of data items. And,
if you need to access your array in both the first and second nesting
orders, you need 737 pointers total (plus some stride info on the
innermost pointer of the non-default nesting); this is 41% of the
total number of data elements. Clearly, the memory _space_ overhead
of the pointer-to-pointer implementation is not as trivial as has
been claimed.
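The counts are easy to check mechanically with a few throwaway lines
of C:

    #include <stdio.h>

    /* pointers needed for dimensions d[0..n-1]:
       1 (the root) + d0 + d0*d1 + ... + d0*...*d[n-2] */
    long ptr_count(const int *d, int n)
    {
        long p = 1, total = 1;
        int i;

        for (i = 0; i < n - 1; i++) {
            p *= d[i];
            total += p;
        }
        return total;
    }

    int main(void)
    {
        int a[] = {5, 5, 3, 2, 12};   /* declared nesting */
        int b[] = {12, 3, 2, 5, 5};   /* second nesting   */

        printf("%ld %ld %ld\n", ptr_count(a, 5), ptr_count(b, 5),
               ptr_count(a, 5) + ptr_count(b, 5));   /* 256 481 737 */
        return 0;
    }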
So, what's the conclusion? Well, pointer-to-pointer implementations
are faster on _some_ hardware _IF_ the array tends to be randomly
accessed. The pointer-to-pointer version is _at_best_ as fast as the
'flat' array in the more common case of sequential access through
nested loops. The pointer-to-pointer version _may_ be much slower
if accessing is done in some order that there isn't a pointer
structure for. In all these cases, the space required for the
pointers can be a significant overhead. So, most people would pick
'flat' arrays over the pointer-to-pointer version: they carry no
storage overhead for pointers, and they match or exceed the speed of
the pointer-to-pointer version in the vast majority of real-world
cases.
J. Giles