[comp.lang.fortran] C vs Fortran for numerical work

jf@threel.co.uk (John Fisher) (12/07/90)

sarima@tdatirv.UUCP (Stanley Friesen) writes:

> Except for one thing. In scientific computation typical dimensions would be
> more like 1000 by 1000 by 1000 by 1000 by 50, which requires a great deal more
> than a mere thirty pointers.

I can't help but feel that someone who needs an array of 50 trillion
elements has one or two other problems to worry about.

--John Fisher

jlg@lanl.gov (Jim Giles) (12/08/90)

This discussion of pointer-to-pointer arrays vs. 'flat' arrays is
getting out of hand.  Extravagant statements about the size of
'practical' computations vs. 'interesting' computations are not
relevant to the issue.

To begin with the original problem then, suppose you have an array
that is dimensioned 5x5x3x2x12.  This will have 1800 data elements.
The claim was made that implementing this as a pointer-to-pointer
array would require 30 pointers and be some degree faster.  So, I'll
analyze the issue step by step.

1) If your algorithm tends to do a lot of 'random' accessing of the
   array, then the pointer-to-pointer version is faster than the 'flat'
   version only if memory accesses are cheaper than multiplies.  The
   technology of memory accessing and of multiplication tend to bypass
   each-other - at present, only micros seem to have faster memory than
   multiplies.

   Note: in this context, the _irrelevant_ claim that Cray 24-bit
   multiplies were somehow inadequate to the job involved here was
   made.  But, 24-bits is the size of the address register, so these
   are the multiplies appropriate to the task under discussion. Also,
   46-bit multiplies in the S registers are only 7 clocks - which is
   still over twice as fast as memory (which is 14 to 17 clocks depending
   on machine model and assuming no bank conflicts).  Further, the
   irrelevant implication that 24-bits can only address 16 mega bytes
   was made.  The Cray addresses _words_, so 24-bits address 128 Mbytes.

2) If your algorithm tends to do most array references in nested loops
   and indexes the array on the loop counters (the _vastly_ most common
   case), then the pointer-to-pointer implementation _ties_ the 'flat'
   array implementation - _PROVIDED_ that the loops are nested in the
   same order as the pointer-to-pointer structure is laid out.  If this
   is not the case, the pointer load cannot be migrated out of the loop,
   and must be done inside.  Further, since it _is_ a _pointer_ being
   loaded, the compiler may assume that they are aliased from one loop
   trip to another (inhibiting optimizations).  Meanwhile, the 'flat'
   array implementation _always_ strength reduces the multiply out of
   the loop - different loop nestings just require different strides.

   Well, you could just make the rule that all loops _must_ be nested in
   the declared order of the array.  Well, you can't.  The most common
   operation on tensors, for example, is the tensor multiply.  In this
   operation, at least one of the tensors invloved will be accessed in a
   different order than the other two.  And a given tensor may be used
   either as an operand of a result of each multiply.  Other algorithms
   for scientific computation make similar requirements for flexibility in
   loop nesting order (one finite differencing scheme converges in 4 times
   fewer timesteps if the mesh is traversed in row-order and column-order
   on alternate steps).  The user of a pointer-to-pointer implementation
   has only two solutions to this problem available:

   a) Do nothing.  Let the out of order loops run less efficiently.
      But, since (because of aliasing) this may involve running _really_
      slow (lack of vectorization, etc.), this 'solution' seems not to
      meet the needs of the user.  After all, the reason for using
      the pointer-to-pointer implementation was supposedly to gain
      a speed advantage.

   b) Set up pointer-to-pointer structures for the array in each of the
      nesting orders that your program uses, and then the loops can nest
      in any order and still be efficient.  If, as was claimed, the number
      of pointers required was insignificant, then you get identical speed
      in nested loops out of pointer-to-pointer and faster speed on random
      uses (still assuming memory is faster than multiplies).

3) Well, how many pointers do you need to implement an array with the
   pointer-to-pointer scheme?  Let's say that the dimensions of the array
   are D1,D2,D3,...,Dn for an n-dimensional array.  The number of data
   elements is evidently the product of all the dimensions.  The number of
   pointers is the product of the first n-1 dimensions _plus_ the product
   of the first n-2 _plus_ ... _plus_ the first dimension _plus_ one (this
   last is the initial pointer to the whole structure - what gets passed
   in a procedure call).  So, for the original 5x5x3x2x12 array, we have
   1800 data elements and 5*5*3*2 + 5*5*3 + 5*5 + 5 + 1 = 256 pointers to
   implement the array in row-major order - not 30 as claimed.  This is
   14.2% of the total number of data elements.

   Now, suppose you also need to nest your loops as 12x3x2x5x5 (again in
   row major order).  There are still 1800 data elements, but now you
   need 481 pointers - that's 24.7% of the number of data items.  And,
   if you need to access your loops in both the first and second nesting
   orders, you need 737 pointers total (plus some stride info on the
   innermost pointer of the non-default nesting) this is 41% of the
   total number of data elements.  Clearly, the memory _space_ overhead
   for the pointer-to-pointer implementation is not as trivial as has
   been claimed.

So, what's the conclusion?  Well, pointer-to-pointer implementations
are faster on _some_ hardware _IF_ the array tends to be randomly
accessed.  The pointer-to-pointer version is _at_best_ as fast as the
'flat' array in the more common case sequentially accessing the array
in nested loops.  The pointer-to-pointer version _may_ be much slower
if accessing is done in some order that there isn't a pointer structure
for.  In all these cases, the spece required for these pointers can
be a significant overhead.  So, most people would pick 'flat' arrays
over the pointer-to-pointer version because they don't have any storage
overhead for pointers and they match or exceed the speed of the pointer-
to-pointer version in the vast majority of real-world cases

J. Giles