[net.arch] Time penalty for non-alignment on VAX/780

radford@calgary.UUCP (Radford Neal) (03/21/85)

RE: The discussion of whether C ought to pad structures to align data
    and the subsequent discussion of how much this gets you on a VAX.

There's nothing like actual data on a question like this. I ran the following
quick test program on a VAX 11/780 (Berkeley 4.2 C compiler):

#include <stdio.h>

main(argc,argv)                         /* usage: aligntime count offset r|w */
  int argc;
  char **argv;
{ int a[2]; register int *p;
  int n; int o; int rw; register int x;
  n = atoi(*++argv);                    /* number of loop iterations        */
  o = atoi(*++argv);                    /* byte offset added to &a[0]       */
  rw = **++argv;                        /* 'r' or 'w' selects the loop      */
  p = (int*)((int)&a[0]+o);             /* possibly misaligned int pointer  */
  if (rw=='r')
  { while (n>0)
    { *p = 0; *p = 0; *p = 0; *p = 0; *p = 0;   /* ten unrolled accesses */
      *p = 0; *p = 0; *p = 0; *p = 0; *p = 0;   /* through p per pass    */
      n -= 1;
    }
  }
  else
  { while (n>0)
    { x = *p; x = *p; x = *p; x = *p; x = *p;   /* ten unrolled accesses */
      x = *p; x = *p; x = *p; x = *p; x = *p;   /* through p per pass    */
      n -= 1;
    }
  }
}

The results are as follows:

% time aligntime 100000 0 r
        2.1 real         1.8 user         0.0 sys
% time aligntime 100000 1 r
        5.3 real         3.6 user         0.0 sys
% time aligntime 100000 2 r
        5.1 real         3.6 user         0.0 sys
% time aligntime 100000 3 r
        5.7 real         3.7 user         0.1 sys
% time aligntime 100000 4 r
        1.9 real         1.5 user         0.0 sys

% time aligntime 100000 0 w
        1.3 real         1.1 user         0.0 sys
% time aligntime 100000 1 w
        3.2 real         2.7 user         0.0 sys
% time aligntime 100000 2 w
        5.5 real         2.9 user         0.0 sys
% time aligntime 100000 3 w
        3.0 real         2.6 user         0.0 sys
% time aligntime 100000 4 w
        1.5 real         1.2 user         0.0 sys

Conclusion: Alignment of longwords on a mod 4 boundary gets you better
than a factor of two speed-up on both reads and writes. Alignment at 
a mod 8 boundary makes no measurable difference.
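
To tie this back to the structure-padding question that started the thread,
here is an illustrative sketch (not part of the measurement program above,
and the numbers printed depend on the compiler): a compiler that pads
structures can put the int member of the struct below at offset 4, keeping
it on a mod 4 boundary, while a packed layout would put it at offset 1,
which is exactly one of the slow cases measured above.

#include <stdio.h>

/* Sketch only: where does a padding compiler put the int member, and
   how big is the struct?  A compiler that aligns ints to mod 4
   boundaries would report offset 4 and size 8 here; a packed layout
   would report offset 1 and size 5. */
struct s { char c; int i; };

main()
{ struct s v;
  printf("offset of i = %d, sizeof(struct s) = %d\n",
         (int)((char *)&v.i - (char *)&v), (int)sizeof(struct s));
}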

The size of the penalty is a bit worse than I would expect. Does the 780's
microcode fetch non-aligned longwords a byte at a time from the cache? Note
that the 64-bit data path to main memory is not relevant for this test;
only cache accesses are involved.
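
Whatever the microcode does byte by byte, a 4-byte operand at offset 1, 2,
or 3 necessarily straddles a longword boundary, so at least two aligned
cache references are needed. A little arithmetic sketch (illustrative only,
not part of the test above):

#include <stdio.h>

/* Sketch only: count how many aligned longwords a 4-byte access at
   byte offset o overlaps.  Offsets 1..3 overlap two, so they cannot
   be satisfied by a single aligned reference no matter how the
   microcode assembles the bytes. */
main()
{ int o;
  for (o = 0; o < 5; o++)
    printf("offset %d: overlaps %d aligned longword(s)\n",
           o, ((o + 3) / 4) - (o / 4) + 1);
}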

      Radford Neal
      The University of Calgary

paul@unisoft.UUCP (Paul Campbell) (03/27/85)

No .... the VAX does not (necessarily) read data from its cache 64 bits at a
time. What it does is fetch 64 bits into the cache from memory when a cache
miss occurs (the SBI is 64 bits wide). This is a good idea for several
reasons: it helps instruction lookahead and data lookahead, and it makes the
cache design a little easier, since fewer address lines have to be compared.


			Paul Campbell ucbvax!unisoft!paul

afb3@hou2d.UUCP (A.BALDWIN) (04/02/85)

NO-NO-NO!!!!

The SBI is not 64 bits wide.  SBI transactions occur as follows:

	Address (30 bits on a 32-bit bus)
	data    (32 bits on a 32-bit bus)
	data    (32 bits on a 32-bit bus)

All other transactions are "special" (and slow things down!@#$%^&).
The 11/780 has memory organized in 64-bit chunks, which can cause
real problems on memory writes (the memory system has to read the
data, mask in the new stuff, then write the data... sorta like core,
remember??).  If you read the info on the UBAs and MBAs, the SBI
interaction is described in detail (hardware handbook).

The cache is a two-way set-associative (i.e., as you say, it reads 64 bits
on a miss), write-through cache.  The write-through feature really slows
things down on non-aligned transfers, for the reason above.  Also, it is
unclear from the documentation (VAX hardware handbook) whether the cache
organization is quadword oriented.  In fact most of the VAX CPU functions
are byte oriented, and I would suspect the cache to be also.
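
To make the read/mask/write point concrete, here is a small illustrative
sketch (not from any DEC documentation) of what a memory system organized
in 64-bit chunks has to do for a partial store; a longword store that
straddles a mod 8 boundary forces this cycle on two chunks instead of one:

#include <stdio.h>

/* Sketch only: one 64-bit memory chunk modelled as 8 bytes.  A
   longword store that covers only part of the chunk means reading
   all 8 bytes, merging in the new ones, and writing all 8 back --
   the "read, mask, write" cycle described above. */
main()
{ unsigned char chunk[8];        /* old contents of one memory chunk */
  unsigned char newdata[4];      /* the longword being stored        */
  int off, i;

  off = 5;                       /* byte offset of the store in the chunk */
  for (i = 0; i < 8; i++) chunk[i] = 0xff;
  for (i = 0; i < 4; i++) newdata[i] = i + 1;

  /* read-modify-write of this chunk: only the covered bytes change;
     with off = 5, one byte spills past the chunk and would cost a
     second read-modify-write cycle on the next chunk. */
  for (i = 0; i < 4 && off + i < 8; i++)
    chunk[off + i] = newdata[i];

  for (i = 0; i < 8; i++) printf("%02x ", chunk[i]);
  printf("\n");
}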


Al Baldwin
AT&T-Bell Labs
...!ihnp4!hou2d!afb3


[These opinions are my own....Who else would want them!!!]