jf@threel.co.uk (John Fisher) (12/07/90)
sarima@tdatirv.UUCP (Stanley Friesen) writes:
> Except for one thing.  In scientific computation typical dimensions would be
> more like 1000 by 1000 by 1000 by 1000 by 50, which requires a great deal more
> than a mere thirty pointers.

I can't help but feel that someone who needs an array of 50 trillion
elements has one or two other problems to worry about.

--John Fisher
jlg@lanl.gov (Jim Giles) (12/08/90)
This discussion of pointer-to-pointer arrays vs. 'flat' arrays is getting
out of hand.  Extravagant statements about the size of 'practical'
computations vs. 'interesting' computations are not relevant to the issue.

To begin with the original problem then: suppose you have an array that is
dimensioned 5x5x3x2x12.  This will have 1800 data elements.  The claim was
made that implementing this as a pointer-to-pointer array would require 30
pointers and be some degree faster.  So, I'll analyze the issue step by
step.

1) If your algorithm tends to do a lot of 'random' accessing of the array,
then the pointer-to-pointer version is faster than the 'flat' version only
if memory accesses are cheaper than multiplies.  The technology of memory
accessing and of multiplication tend to bypass each other - at present,
only micros seem to have faster memory than multiplies.

Note: in this context, the _irrelevant_ claim was made that Cray 24-bit
multiplies were somehow inadequate to the job involved here.  But 24 bits
is the size of the address register, so these are the multiplies
appropriate to the task under discussion.  Also, 46-bit multiplies in the S
registers take only 7 clocks - which is still over twice as fast as memory
(which is 14 to 17 clocks depending on machine model, assuming no bank
conflicts).  Further, the irrelevant implication was made that 24 bits can
only address 16 megabytes.  The Cray addresses _words_, so 24 bits address
128 Mbytes.

2) If your algorithm tends to do most array references in nested loops and
indexes the array on the loop counters (the _vastly_ most common case),
then the pointer-to-pointer implementation _ties_ the 'flat' array
implementation - _PROVIDED_ that the loops are nested in the same order as
the pointer-to-pointer structure is laid out.  If this is not the case, the
pointer load cannot be migrated out of the loop, and must be done inside.
Further, since it _is_ a _pointer_ being loaded, the compiler must assume
that it may be aliased from one loop trip to another (inhibiting
optimizations).  Meanwhile, the 'flat' array implementation _always_
strength-reduces the multiply out of the loop - different loop nestings
just require different strides.

Well, you could just make the rule that all loops _must_ be nested in the
declared order of the array.  Well, you can't.  The most common operation
on tensors, for example, is the tensor multiply.  In this operation, at
least one of the tensors involved will be accessed in a different order
than the other two.  And a given tensor may be used either as an operand or
as a result of each multiply.  Other algorithms for scientific computation
make similar requirements for flexibility in loop nesting order (one finite
differencing scheme converges in 4 times fewer timesteps if the mesh is
traversed in row order and column order on alternate steps).

The user of a pointer-to-pointer implementation has only two solutions to
this problem available:

a) Do nothing.  Let the out-of-order loops run less efficiently.  But,
since (because of aliasing) this may involve running _really_ slow (lack of
vectorization, etc.), this 'solution' seems not to meet the needs of the
user.  After all, the reason for using the pointer-to-pointer
implementation was supposedly to gain a speed advantage.

b) Set up pointer-to-pointer structures for the array in each of the
nesting orders that your program uses; then the loops can nest in any order
and still be efficient.  If, as was claimed, the number of pointers
required was insignificant, then you get identical speed in nested loops
out of pointer-to-pointer and faster speed on random uses (still assuming
memory is faster than multiplies).

3) Well, how many pointers do you need to implement an array with the
pointer-to-pointer scheme?  Let's say that the dimensions of the array are
D1,D2,D3,...,Dn for an n-dimensional array.
The number of data elements is evidently the product of all the
dimensions.  The number of pointers is the product of the first n-1
dimensions _plus_ the product of the first n-2 _plus_ ... _plus_ the first
dimension _plus_ one (this last is the initial pointer to the whole
structure - what gets passed in a procedure call).

So, for the original 5x5x3x2x12 array, we have 1800 data elements and
5*5*3*2 + 5*5*3 + 5*5 + 5 + 1 = 256 pointers to implement the array in
row-major order - not 30 as claimed.  This is 14.2% of the total number of
data elements.  Now, suppose you also need to nest your loops as 12x3x2x5x5
(again in row-major order).  There are still 1800 data elements, but now
you need 12*3*2*5 + 12*3*2 + 12*3 + 12 + 1 = 481 pointers - that's 26.7% of
the number of data items.  And, if you need to access your loops in both
the first and second nesting orders, you need 737 pointers total (plus some
stride info on the innermost pointers of the non-default nesting) - this is
41% of the total number of data elements.  Clearly, the memory _space_
overhead for the pointer-to-pointer implementation is not as trivial as has
been claimed.

So, what's the conclusion?  Well, pointer-to-pointer implementations are
faster on _some_ hardware _IF_ the array tends to be randomly accessed.
The pointer-to-pointer version is _at_best_ as fast as the 'flat' array in
the more common case of sequentially accessing the array in nested loops.
The pointer-to-pointer version _may_ be much slower if accessing is done in
some order that there isn't a pointer structure for.  In all these cases,
the space required for these pointers can be a significant overhead.

So, most people would pick 'flat' arrays over the pointer-to-pointer
version: they have no storage overhead for pointers, and they match or
exceed the speed of the pointer-to-pointer version in the vast majority of
real-world cases.

J. Giles