[comp.sys.sgi] Array indexes vs. Pointers

ciemo@bananapc.wpd.sgi.com.SGI.COM (Dave Ciemiewicz) (05/20/89)

To use pointers or array indexes, that is the question.  The SGI (MIPS) 
compilers for the 4D are optimization monsters.  The optimizer is "smart"
enough to recognize special-case uses of array indexes and convert them to
pointers internally.  This removes the infamous multiply from inside your loop
when you use array indexes.  The fact of the matter is that there are cases
where use of indexes will generate more efficient code than the corresponding
code using pointers.  The following is a C source file which demonstrates
this.

----- optimizer.c -------------------------------------------------------------
/*  1 */extern v3f(float [3]);
/*  2 */
/*  3 */dovertices_indexed(float v[][3], unsigned int len) {
/*  4 */    register unsigned int	i;
/*  5 */
/*  6 */    for (i=0; i<len; i++) {
/*  7 */	v3f(v[i]);
/*  8 */    }
/*  9 */}
/* 10 */
/* 11 */dovertices_pointer(float v[][3], unsigned int len) {
/* 12 */    register float ** 		p;
/* 13 */
/* 14 */    for (p=v; p != v+len*3; p+=3) {
/* 15 */	v3f(*p);
/* 16 */    }
/* 17 */}
----- optimizer.c -------------------------------------------------------------

"cc -O2 -c optimizer.c" will generate an optimized object file "optimizer.o".
Use "dis optimizer.o" to disassemble the object code produced.
The array-indexed code generates the following inner loop body.

  [test.c:   7] 0x38:   0c000000        jal     v3f
* [test.c:   7] 0x3c:   02202021        move    a0,s1		
  [test.c:   8] 0x40:   2610000c        addiu   s0,s0,12
  [test.c:   8] 0x44:   0212082b        sltu    at,s0,s2
  [test.c:   8] 0x48:   1420fffb        bne     at,zero,0x38
* [test.c:   8] 0x4c:   2631000c        addiu   s1,s1,12

	a0 is the argument to the function call v3f.  Yes, it does get loaded
	*after* the jump to v3f is initiated.  A jump takes more than one
	cycle to complete, so there is a delay, and a delay slot is left open
	for an executable instruction.  There is enough time to load the
	argument after initiating the jump but before completing it.
	(Tricky, huh?)  s1 is the address v[i].  s0 is the byte offset i*12
	(each vertex is 3 floats of 4 bytes), and s2 is len*12.  Guess where
	the multiply went.  6 instructions were generated for the loop.  The
	time to execute is 6 cycles.
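To make "guess where the multiply went" concrete, here is a hand-written C sketch of roughly what the optimizer did to the indexed loop. It is a reconstruction from the disassembly above (which adds 12 to both s0 and s1 on every pass), not actual compiler output. The v3f() body is a hypothetical stand-in that just sums the x coordinates so the loops can be checked, and dovertices_strength_reduced is a made-up name.

```c
#include <stddef.h>

/* Hypothetical stand-in for the graphics call v3f(): it only accumulates
   the x coordinate so that the loops have an observable effect. */
static float sum_x;
static void v3f(float vtx[3]) { sum_x += vtx[0]; }

/* What the source says: index the array. */
static void dovertices_indexed(float v[][3], unsigned int len) {
    unsigned int i;
    for (i = 0; i < len; i++)
        v3f(v[i]);
}

/* Roughly what the optimizer produced: the per-iteration multiply
   i * 12 is replaced by adding one row size (12 bytes) to a running
   byte offset and to a running row address. */
static void dovertices_strength_reduced(float v[][3], unsigned int len) {
    char *row = (char *)v;                   /* s1: address of v[i]     */
    size_t off = 0;                          /* s0: i * sizeof v[0]     */
    size_t end = (size_t)len * sizeof v[0];  /* s2: len * sizeof v[0]   */

    while (off < end) {
        v3f((float *)row);                   /* move a0,s1 (delay slot) */
        off += sizeof v[0];                  /* addiu s0,s0,12          */
        row += sizeof v[0];                  /* addiu s1,s1,12          */
    }
}
```

Both functions call v3f() with the same addresses in the same order; the second one simply never multiplies.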

The pointer version generates the following inner loop body.

  [test.c:  15] 0x98:   8e040000        lw      a0,0(s0)
  [test.c:  15] 0x9c:   0c000000        jal     v3f
* [test.c:  15] 0xa0:   00000000        nop
  [test.c:  16] 0xa4:   2610000c        addiu   s0,s0,12
  [test.c:  16] 0xa8:   1630fffb        bne     s1,s0,0x98
* [test.c:  16] 0xac:   00000000        nop

	a0 is the argument to the function call v3f.  This time a0 is
	loaded from memory instead of being copied from a register.  This
	load has a delay slot associated with it, which the jump fills.  The
	jump delay slot is filled by a no-op.  s0 is p.  s1 is v+len*3.  The
	final instruction is a no-op to fill the branch delay slot.  Only 4
	"real" instructions were generated for the inner loop; in a different
	code example the 2 no-ops could be replaced by useful instructions.
	The time to execute for this example is the same as before, 6 cycles.
	However, the load instruction can also miss in the data cache, which
	would cost some number of extra cycles, at least on the first load,
	depending on the CPU type.

With the different setups involved in the two types of array traversal
-- indexes versus pointers -- and the potential for data cache misses,
it is difficult to say clearly that one style is superior to the other.
The advantages of either style depend on the instructions performed
in the loop and even on the CPU executing the loop.  From an aesthetic point
of view, I prefer array indexes.  Array indexes are easier to
understand than pointers.

My overall recommendation is to avoid language tricks when developing code.
After the code is working, use the profiler to locate the areas that might
benefit from the tricks.
--

Dave	   (commonplace)		"Boldly going where no one cares to go."
Ciemiewicz (incomprehensible)
ciemo 	   (infamous)

tarolli@dragon.wpd.sgi.com (Gary Tarolli) (05/22/89)

In article <33239@sgi.SGI.COM>, ciemo@bananapc.wpd.sgi.com.SGI.COM (Dave Ciemiewicz) writes:
> [...]
> /* 11 */dovertices_pointer(float v[][3], unsigned int len) {
> /* 12 */    register float ** 		p;
> /* 13 */
> /* 14 */    for (p=v; p != v+len*3; p+=3) {
> /* 15 */	v3f(*p);
> /* 16 */    }
> /* 17 */}
> ----- optimizer.c -------------------------------------------------------------

If I am not mistaken, the second example should be "v3f(p);" and not
"v3f(*p);".  This will remove the offending load word instruction and the
possibility of missing the cache, and so make the second loop more efficient.
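For "v3f(p);" to compile cleanly, p also has to be a float * rather than a float **, and the original loop bound v+len*3 is off as well: since v has type float (*)[3], v+len*3 points len*3 whole vertices past v, not len. A minimal corrected sketch follows. It declares p as a pointer to a row of 3 floats, so *p is just the row's address (the array decays to a pointer) and nothing is read from memory, and the end test becomes v+len. The v3f() body is again a hypothetical stand-in used only to check the loop.

```c
/* Hypothetical stand-in for v3f(): counts calls and sums x coordinates. */
static int ncalls;
static float sum_x;
static void v3f(float vtx[3]) { ncalls++; sum_x += vtx[0]; }

/* Pointer version with matching types: p walks whole rows of 3 floats. */
static void dovertices_pointer(float v[][3], unsigned int len) {
    float (*p)[3];                   /* pointer to one vertex (row)       */
    float (*end)[3] = v + len;       /* one past the last vertex          */

    for (p = v; p != end; p++)       /* p++ advances one row (12 bytes)   */
        v3f(*p);                     /* *p decays to float *: no load     */
}
```

The float * form (p = &v[0][0], p += 3, v3f(p)) does the same address arithmetic by hand; the row-pointer type just lets the compiler supply the *3.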

However, I agree with Dave: arrays are generally as efficient as pointers,
and they are easier to read.  It also makes sense to use arrays first and then
use prof to isolate your bottlenecks.  There is one case, though, where
arrays are much faster than pointers on the MIPS CPU, and that is where
you are unrolling an autoincremented pointer loop.  For example,

    for (i=0; i<end; i+=4) {
	to[i] = from[i];
	to[i+1] = from[i+1];
	to[i+2] = from[i+2];
	to[i+3] = from[i+3];
    }

is much better than

    while (to < end) {
	*to++ = *from++;
	*to++ = *from++;
	*to++ = *from++;
	*to++ = *from++;
    }

because MIPS has no auto-increment addressing mode, so every *to++ = *from++
line costs 2 extra adds, while the first case avoids them: the constant
offsets +1, +2 and +3 fit into the base+offset field of the load and store
instructions.  So sometimes it is better to use base+offset than *p++.
Sometimes ...  on certain machines ....  when the moon is full ....
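Here is a self-contained C version of the two copy loops above, for anyone who wants to compare them with dis on their own machine. The function names, the size_t count n, and the tail loops for counts that are not a multiple of 4 are additions for this sketch; whether the indexed form actually wins depends on the compiler and the CPU.

```c
#include <stddef.h>

/* Indexed unrolled copy: one index update per four elements, and the
   constant offsets +1, +2, +3 can go into base+offset addressing. */
static void copy_indexed(float *to, const float *from, size_t n) {
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        to[i]     = from[i];
        to[i + 1] = from[i + 1];
        to[i + 2] = from[i + 2];
        to[i + 3] = from[i + 3];
    }
    for (; i < n; i++)                       /* leftover 0-3 elements */
        to[i] = from[i];
}

/* Pointer-increment unrolled copy: as written, every line advances both
   pointers (an optimizer may or may not merge those increments). */
static void copy_pointer(float *to, const float *from, size_t n) {
    float *end4 = to + (n & ~(size_t)3);     /* end of the unrolled part */
    float *end = to + n;

    while (to < end4) {
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
        *to++ = *from++;
    }
    while (to < end)                         /* leftover 0-3 elements */
        *to++ = *from++;
}
```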