[comp.arch] Explanation, please!

eric@snark.UUCP (Eric S. Raymond) (08/27/88)

(Code below reproduced so that comp.arch people seeing this followup only
won't get terminally frustrated. This is *really neat*, gang...)

In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
> 
> void send(int *to,int *from, int count) {
>    int n = (count + 7) / 8;
>    
>    switch(count % 8) {
>       case 0:  do { *to++ = *from++;
>       case 7:       *to++ = *from++;
>       case 6:       *to++ = *from++;
>       case 5:       *to++ = *from++;
>       case 4:       *to++ = *from++;
>       case 3:       *to++ = *from++;
>       case 2:       *to++ = *from++;
>       case 1:       *to++ = *from++;
>                } while (--n > 0);
>    }
> 
> }
> 
> Finally, Stroustrup asks the rhetorical question ``why would anyone
> want to write something like this.''  Any guesses?!

Yeah. That's the most hackish way of trying to write a portable optimized
copy routine I've ever seen. I gather the whole point of the shenanigans
is to get all the *from++ -> *to++ instructions in the generated code to be
adjacent.

This only makes if the author knows he's got a hardware instruction pipeline
or cache that's no less than 8 and no more than 9 byte-copy instruction widths
long, and stuff executing out of the pipeline is a lot faster than if the
copies are interleaved with control transfers. Dollars to doughnuts this code
was written on a RISC machine.

(Gawrsh. That sounded just like one of the big boys on comp.arch tawkin'. I
think I'll cross-post over there just to see if I get shot down in flames...)

-- 
      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
      UUCP: ...!{uunet,att,rutgers}!snark!eric = eric@snark.UUCP
      Post: 22 S. Warren Avenue, Malvern, PA 19355      Phone: (215)-296-5718

schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/27/88)

Hi,

   Since I posted my original question there has been a great deal of
abstract discussion about the relative merits of the loop unrolling
scheme.  The topic has piqued my curiousity, so I when ahead and
implemented a short test program, included below, to test Duff's
device against the ``ordinary for loop w/index variable'' technique.
See for yourself....   

After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
Sequent Balance was pretty heavily in favor of the regular (non-Duff)
loop.  Your mileage may vary.  I realize that there may be other
tests, and if anyone has a better version, I'd like to see it!


Doug Schmidt
----------------------------------------
#include <sys/time.h>
#include <sys/resource.h>

double Start_Timer();
double Return_Elapsed_Time();

#define MAX_NUM 100000
int array1[MAX_NUM ];
int array2[MAX_NUM ];
int *A = array1, *B = array2;

main(int argc, char *argv[]) {
   double Elapsed_Time;
   int    Count = argc > 1 ? atoi(argv[1]) : MAX_NUM;
   int i;
   
   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }
      
   printf("Starting Duff's device timing...\n");
   Start_Timer();

   {
      int n = (Count + 7) / 8;

      switch(Count % 8) {
         case 0:  do { *A++ = *B++;
         case 7:       *A++ = *B++;
         case 6:       *A++ = *B++;
         case 5:       *A++ = *B++;
         case 4:       *A++ = *B++;
         case 3:       *A++ = *B++;
         case 2:       *A++ = *B++;
         case 1:       *A++ = *B++;
                  } while (--n > 0);
      }
   }

   Elapsed_Time = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3f\n",Elapsed_Time);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();

   for (i = 0;i < Count ;i++) {
      array1[i] = array2[i];
   }

   Elapsed_Time = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3f\n",Elapsed_Time);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) { 
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

}      


/* no such thing as "negative time"! */
#define  ERROR_VALUE -1.0   

static struct rusage Old_Time;
static struct rusage New_Time;
static int    Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   getrusage(RUSAGE_SELF,&Old_Time);        /* set starting process time */
   return(Old_Time.ru_utime.tv_sec + (Old_Time.ru_utime.tv_usec / 1000000.0));
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time) 
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return(ERROR_VALUE);
   }   
   else {
    /* get process time */
      getrusage(RUSAGE_SELF,&New_Time);
      if (Last_Time == 0.0) {
         return((New_Time.ru_utime.tv_sec - Old_Time.ru_utime.tv_sec) + 
               ((New_Time.ru_utime.tv_usec - Old_Time.ru_utime.tv_usec) 
                / 1000000.0));
      }
      else {
         return((New_Time.ru_utime.tv_sec + 
                (New_Time.ru_utime.tv_usec / 1000000.0)) - Last_Time);
      }
   }
}



--
schmidt@bonnie.ics.uci.edu (ARPA)
"If our behavior is strict, we do not need fun." -Zippy th' Pinhead
"If our behavior is struct, we do not need defun." -Anon

henry@utzoo.uucp (Henry Spencer) (08/28/88)

In article <653@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>loop...

The odds are excellent that the compilers are applying optimizations to
the regular loop that they can't do for the Duff loop as you've written
it.  In particular, your pointers aren't even register variables (and they
are externs, so the compiler can't safely promote them quietly), whereas
a good optimizing compiler will certainly be using register pointers in
the regular loop.

My partner in crime, Geoff Collyer, applied Duff's Device in a number of
places in the C News rnews; the performance improvements were substantial.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

levy@ttrdc.UUCP (Daniel R. Levy) (08/28/88)

In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
> Hi,
> 
>    Since I posted my original question there has been a great deal of
> abstract discussion about the relative merits of the loop unrolling
> scheme.  The topic has piqued my curiousity, so I when ahead and
> implemented a short test program, included below, to test Duff's
> device against the ``ordinary for loop w/index variable'' technique.
> See for yourself....   
> 
> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
> loop.  Your mileage may vary.  I realize that there may be other
> tests, and if anyone has a better version, I'd like to see it!

[schmidt's program deleted, please refer to parent article if you want it]

I modified this program to run under System V, changed the arrays to be dynam-
ically allocated, and changed both the Duff and ordinary copies to use register
pointers instead of global pointers (for the Duff copy) and array indexing (for
the ordinary copy).  I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
Sun-4 both with and without -O optimization (using the standard pcc-type C
compiler on each system).  The result?  Duff wins by about 10%-20% on all
machines tested.

Here is the code thus modified:


#include <sys/types.h>
#include <sys/times.h>
#include <sys/param.h>

double Start_Timer();
double Return_Elapsed_Time();

#define DEFAULT_NUM 100000

main(argc,argv) char **argv; {
   double Elapsed_Time_Duff;
   double Elapsed_Time_Ordinary;
   int    Count = argc > 1 ? atoi(argv[1]) : DEFAULT_NUM;
   int i;
   register int *A, *B;
   register int n;
   register int *A_end;
   int *array1, *array2;
   char *malloc();

   if (Count <= 0) {
       printf("Bad Count\n");
       return 1;
   } else {
       printf("Count=%d\n", Count);
   }

   if (!(array1=(int *)malloc(sizeof(int) * Count))) {
      printf("Can't malloc %u bytes for array1\n", sizeof(int) * Count);
      return 1;
   } else if (!(array2=(int *)malloc(sizeof(int) * Count))) {
      printf ("Can't malloc %u bytes for array2\n", sizeof(int) * Count);
      return 1;
   }
   
   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }
      
   printf("Starting Duff's device timing...\n");
   Start_Timer();

   {
      n = (Count + 7) / 8;
      A = array1;
      B = array2;

      switch(Count & 0x7) {
         case 0:  do { *A++ = *B++;
         case 7:       *A++ = *B++;
         case 6:       *A++ = *B++;
         case 5:       *A++ = *B++;
         case 4:       *A++ = *B++;
         case 3:       *A++ = *B++;
         case 2:       *A++ = *B++;
         case 1:       *A++ = *B++;
                  } while (--n > 0);
      }
   }

   Elapsed_Time_Duff = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3lf\n",Elapsed_Time_Duff);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();

   {
      A = array1;
      B = array2;
      A_end = array1 + Count - 1;

      while (A <= A_end)
         *A++ = *B++;
   }

   Elapsed_Time_Ordinary = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3lf\n",Elapsed_Time_Ordinary);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) { 
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }
   printf("Duff/Ordinary copy time ratio = %.3lf\n",
      Elapsed_Time_Duff/Elapsed_Time_Ordinary);

}      


/* no such thing as "negative time"! */
#define  ERROR_VALUE -1.0   

static struct tms Old_Time;
static struct tms New_Time;
static int    Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   times(&Old_Time);		/* set starting process time */
   return ((double)Old_Time.tms_utime)/(double)HZ;
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time) 
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return(ERROR_VALUE);
   }   
   else {
    /* get process time */
      times(&New_Time);
      if (Last_Time == 0.0) {
	 return ((double)(New_Time.tms_utime - Old_Time.tms_utime))/(double) HZ;
      }
      else {
	 return (((double)New_Time.tms_utime)/(double)HZ) - Last_Time;
      }
   }
}


-- 
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)|  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        | 
|-----Path:  att!ttbcad!levy-----|

aglew@urbsdc.Urbana.Gould.COM (08/28/88)

>..> Duff's device
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to be
>adjacent.

Well, that's a start. Duff's device does better on some machines 
(like, mine) that don't have autoincrement addressing modes, and even
on some that do, if you use indexing instead of incrementing.
Unfortunately, that tends to make you run the copy backwards,
which louses up some caches.

>This only makes if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.
>
>      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)

No doughnuts. The canonical example of the Duff device originated on a VAX
- it's that stuff in the BSD source that produces diagnostics every time
you recompile a kernel. Or did it originate on a PDP-11, or early?
Gosh darn it, we don't have version 7 reference source on line any more.



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

gwyn@smoke.ARPA (Doug Gwyn ) (08/28/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>Dollars to doughnuts this code was written on a RISC machine.

Send me my doughnuts!  (Or dollars, if you prefer.)  RISC machines
were practically non-existent when this C code was first written down
(whether or not by Tom Duff, I'm not sure).  It probably appeared on
a PDP-11, perhaps a VAX.

I suspect the first use of Duff's device was as an "efficiency hack",
but as you observe it is not universally the most efficient method.

The modern VALID reason for such code is that it is a relatively
efficient way to do something portably that CANNOT BE DONE by
using the "preferred block copy routine" supplied by an implementation
(be it bcopy() or memcpy()).  Whenever the preferred routine will
work correctly (in the portable case, that means whenever the source
and destination fields do not overlap), then of course one should use
it, especially since if the C vendors are doing their job, the C
library routine (which could also be expanded in-line by the compiler)
will be near-optimally tuned for each specific architecture.

P.S.  Some implementations of bcopy() and memcpy() do guarantee
proper handling of overlapping fields.  But not all do, so one cannot
count on it if one's trying to write truly portable code.  ANSI C will
specify a new function memmove() with this property guaranteed.

casey@admin.cognet.ucla.edu (Casey Leedom) (08/29/88)

In article <28200192@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
| > > Duff's device
| > 
| > That's the most hackish way of trying to write a portable optimized
| > copy routine I've ever seen. I gather the whole point of the shenanigans
| > is to get all the *from++ -> *to++ instructions in the generated code to be
| > adjacent.
| 
| Well, that's a start. Duff's device does better on some machines (like,
| mine) that don't have autoincrement addressing modes, and even on some
| that do, if you use indexing instead of incrementing.  Unfortunately,
| that tends to make you run the copy backwards, which louses up some
| caches.
| 
| > This only makes if the author knows he's got a hardware instruction
| > pipeline or cache that's no less than 8 and no more than 9 byte-copy
| > instruction widths long, and stuff executing out of the pipeline is a lot
| > faster than if the copies are interleaved with control transfers. Dollars
| > to doughnuts this code was written on a RISC machine.

  I wrote the bcopy routine for 2.10BSD with some loop unrolling.  But
there the motivation wasn't pipelining, caching, or control/transfer
considerations.  Basically, it's very simple:

	1:
		mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 1b		; 0.60 usec

	sob (subtract 1 and branch if non-zero) is 28% of total loop time.

vs:

	2:
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 2b		; 0.60 usec


	sob is 10% of total loop time.

  The other big performance win was to optimize odd to odd, even to even
with odd length, etc. copies which previous versions of bcopy implemented
as byte copy loops.

Casey

henry@utzoo.uucp (Henry Spencer) (08/29/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>This only makes if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.

Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

The key point is not pipelining, but loop-control overhead.  There is in
fact a tradeoff here:  unrolling the loop further will reduce control
overhead further, but will increase code size.  That last is of some
significance when caching gets into the act:  cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

aglew@urbsdc.Urbana.Gould.COM (08/29/88)

..> Duff Device

My apologies. In an earlier post I said that the Duff device
was in VAX BSD source, and caused diagnostics when recompiling
the kernel.

Well, on being challenged by Chris Torek, I found it in 
Gould specific UTX source, specifically in sel/in_cksum.c.

But the point remains - the Duff device was found to be a performance
win on a Gould machine, which, while considerably simpler than a
VAX, is hardly a RISC.


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

rick@seismo.CSS.GOV (Rick Adams) (08/30/88)

From harpo!ihnp4!we13!burl!ulysses!allegra!alice!td Mon May  7 10:19:21 1984
From: td@alice.UUCP (Tom Duff)
Newsgroups: net.lang.c
Subject: Unwinding loops
Message-ID: <2748@alice.UUCP>
Date: 7 May 84 14:19:21 GMT
Status: RO

Consider the following routine, abstracted from code which copies an
array of shorts into the Programmed IO data register of an Evans &
Sutherland Picture System II:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		do
			*to = *from++;
		while(--count>0);
	}

(Obviously, this fails if the count is zero.)
The VAX C compiler compiles the loop into 2 instructions (a movw and
a sobleq, I think.)  As it turns out, this loop was the bottleneck in
a real-time animation playback program which ran too slowly by about 50%.
The standard way to get more speed out of something like this is to unwind
the loop a few times, decreasing the number of sobleqs.  When you do that,
you wind up with a leftover partial loop.  I usually handle this in C with
a switch that indexes a list of copies of the original loop body.  Of
course, if I were writing assembly language code, I'd just jump into the
middle of the unwound loop to deal with the leftovers.  Thinking about this
one day last October, the following implementation occurred to me:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

Disgusting, no?  But it compiles and runs just fine on all known C compilers.
Dennis Ritchie has endorsed it as legal C.
I feel a combination of pride and revulsion at this discovery;
I think I'll name it after myself -- ``Duff's Device'' has a nice ring to it.

It amazes me that after 10 years of writing C there are still little corners
that I haven't explored fully.  (Actually, I have another revolting way to
use switches to implement interrupt driven state machines but it's too
horrid to go into.)

Many people (even Brian Kernighan?) have said that the worst feature of C is that
switches don't break automatically before each case label.  This code forms
some sort of argument in that debate, but I'm not sure whether it's for or
against.

Tom Duff {ucbvax,decvax,ihnp4,...}!alice!td

Note:  Work reported herein was done while the author was an employee of
Lucasfilm Ltd., San Rafael, CA.

chuck@amdahl.uts.amdahl.com (Charles Simmons) (08/30/88)

In article <2877@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes:
>In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>>    Since I posted my original question there has been a great deal of
>> abstract discussion about the relative merits of the loop unrolling
>> scheme.  The topic has piqued my curiousity, so I when ahead and
>> implemented a short test program, included below, to test Duff's
>> device against the ``ordinary for loop w/index variable'' technique.
>> See for yourself....   
>> 
>> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>> loop.  Your mileage may vary.  I realize that there may be other
>> tests, and if anyone has a better version, I'd like to see it!
>
>I modified this program to run under System V, changed the arrays to be dynam-
>ically allocated, and changed both the Duff and ordinary copies to use register
>pointers instead of global pointers (for the Duff copy) and array indexing (for
>the ordinary copy).  I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
>Sun-4 both with and without -O optimization (using the standard pcc-type C
>compiler on each system).  The result?  Duff wins by about 10%-20% on all
>machines tested.

I then added a piece to the program to use 'memcpy'.  The results?
Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
Duff.  So why do people spend so much time avoiding standard subroutines?

-- Chuck

lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/30/88)

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
	[discussion of Duff copy deleted]
>I then added a piece to the program to use 'memcpy'.  The results?
>Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
>Duff.  So why do people spend so much time avoiding standard subroutines?

Sometimes the standard subroutines are implemented horribly.  I was
horrified when I saw that the machine dependent version of memcpy
on the AT&T 3Bs is nothing but a byte by byte transfer written in
assembly language.  It is tricky, but doable, to speed this up by a
roughly a factor of sizeof(long).  In fact it already is done in the
3B implementation of the UNIX(tm) operating system in the copyin (?)
routine.  Why wasn't it done in memcpy too?  Sigh.

-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM

nat@bales.UUCP (Nathaniel Stitt) (08/31/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>(Code below reproduced so that comp.arch people seeing this followup only
>won't get terminally frustrated. This is *really neat*, gang...)
>
>In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
>> 
>> void send(int *to,int *from, int count) {
>>    int n = (count + 7) / 8;
>>    
>>    switch(count % 8) {
>>       case 0:  do { *to++ = *from++;
>>       case 7:       *to++ = *from++;
>>       case 6:       *to++ = *from++;
>>       case 5:       *to++ = *from++;
>>       case 4:       *to++ = *from++;
>>       case 3:       *to++ = *from++;
>>       case 2:       *to++ = *from++;
>>       case 1:       *to++ = *from++;
>>                } while (--n > 0);
>>    }
>> 
>> }
>> 
>> Finally, Stroustrup asks the rhetorical question ``why would anyone
>> want to write something like this.''  Any guesses?!
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to be
>adjacent.
>

Here is my own personal version of the "Portable Optimized Copy" routine.
It certainly seems more clear than the above example, and I would expect
it to be at least as fast on virtually any machine.  (Note the use of division
and mod in the above routine.)

Code does NOT have to be ugly to be fast.


/* Copy 'count' ints from *from to *to. */
void nats_send(to, from, count)
register int *to;
register int *from;
register int count;
{
	/* Copy the bulk of the data */
	while(count >= 8)
	{
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		count -= 8;
	}

	/* Copy the rest. */

	if(count >= 4)
	{
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		count -= 4;
	}

	if(count >= 2)
	{
		*to++ = *from++;
		*to++ = *from++;

		count -= 2;
	}

	if(count)
		*to++ = *from++;
}

Hope this helps.

-- 
Nathaniel Stitt           | This life is a test.  It is only a test.  Had
Guidelines Software, Inc. | this been an actual life, you would have received
ucbvax!cogsci!z!nat       | further instructions as to what to do and where
(415) 376-1395            | to go.

bill@proxftl.UUCP (T. William Wells) (08/31/88)

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'.  The results?
: Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
: Duff.  So why do people spend so much time avoiding standard subroutines?

Try some history, bud; it's good for what ails you.

I doubt that memcpy even existed then; and it is *not* standard
now.  Perhaps it will be several years after the ANSI standard is
adopted, but not till then.

---
Bill
novavax!proxftl!bill

earl@mips.COM (Earl Killian) (09/01/88)

   In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
   : I then added a piece to the program to use 'memcpy'.  The results?
   : Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
   : Duff.  So why do people spend so much time avoiding standard subroutines?

   Try some history, bud; it's good for what ails you.

   I doubt that memcpy even existed then; and it is *not* standard
   now.  Perhaps it will be several years after the ANSI standard is
   adopted, but not till then.

Perhaps a better reason:

According to the author, that code was used for copying to a 16-bit IO
device.  It would have be illegal to use memcpy or bcopy because they
would make word references to the device.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (09/01/88)

In article <1002@cbnews.ATT.COM>, lvc@cbnews.ATT.COM (that's me) wrote:
> Sometimes the standard subroutines are implemented horribly.  I was
> horrified when I saw that the machine dependent version of memcpy
> on the AT&T 3Bs is nothing but a byte by byte transfer written in
> assembly language. ...

Oops.  I should have said this was on an early version (~1984) of
a 3B5.  New 3Bs don't have this deficiency.

-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM

hankd@pur-ee.UUCP (Hank Dietz) (09/01/88)

In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> Here is my own personal version of the "Portable Optimized Copy" routine.
.... then he gives a rather verbose, but structured, encoding....

As long as we're getting into structured, portable, hacks, let me suggest
the following two ways of doing block copy:

1.	If the number of items/bytes is known at compile time, then you can
	define a struct type of the appropriate size and use struct assign.
	with type casts to make it fly.  For example, suppose p and q are
	pointers to ints and I want to copy 601 ints from p to q.  Then I
	can write the fast and surprizingly portable:

	struct t601 { int t[601]; };
	*((struct t601 *) q) = *((struct t601 *) p);

	Of course, you do have to watch-out for alignment problems, but
	if your compiler doesn't generate very fast code for this....

2.	If the number of items/bytes is not known, then build a binary tree of
	such structs and copy half, then half of what remains, etc.  This is
	funny looking, but very fast also.  Suppose the number of ints (n) is
	not known at compile time, but can't be more than 601.  You can write:

	struct t512 { int t[512]; };
	struct t256 { int t[256]; };
	struct t128 { int t[128]; };
	struct t64 { int t[64]; };
	struct t32 { int t[32]; };
	struct t16 { int t[16]; };
	struct t8 { int t[8]; };
	struct t4 { int t[4]; };
	struct t2 { int t[2]; };
	if (n & 512) {
		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
	}
	if (n & 256) {
		*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
	}
	if (n & 128) {
		*((struct t128 *) q) = *((struct t128 *) p); q+=128; p+=128;
	}
	if (n & 64) {
		*((struct t64 *) q) = *((struct t64 *) p); q+=64; p+=64;
	}
	if (n & 32) {
		*((struct t32 *) q) = *((struct t32 *) p); q+=32; p+=32;
	}
	if (n & 16) {
		*((struct t16 *) q) = *((struct t16 *) p); q+=16; p+=16;
	}
	if (n & 8) {
		*((struct t8 *) q) = *((struct t8 *) p); q+=8; p+=8;
	}
	if (n & 4) {
		*((struct t4 *) q) = *((struct t4 *) p); q+=4; p+=4;
	}
	if (n & 2) {
		*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2;
	}
	if (n & 1) *q = *p;

	Notice that, in this case, n, p, and q should be declared as being
	register variables and that p and q are altered by this routine.  Of
	course, you can copy larger things by making larger power-of-2 sized
	structs.

	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
	the usual copy loop.  Unfortunately, the above code should have been
	written as:

	if (n & 512) {
		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
	}
	...

	but, for some unknown reason, the VAX C compiler didn't like that.


Enjoy.
					hankd@ee.ecn.purdue.edu

gwyn@smoke.ARPA (Doug Gwyn ) (09/01/88)

In article <2945@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>According to the author, that code was used for copying to a 16-bit IO
>device.  It would have be illegal to use memcpy or bcopy because they
>would make word references to the device.

Wrong reason.  memcpy() etc. don't have the right behavior for stashing
a sequence of values into the same destination location.

loren@pixar.UUCP (Loren Carpenter) (09/01/88)

The Duff Loop (as far as I know) was first cast into C by Tom Duff when
he was at Lucasfilm in the early 1980's.  We used it at Lucasfilm wherever
we needed reasonable speed without resorting to assembly language.  It
obviously generalizes to more than memory copy & clear.

I and others have used this control construct for many years, but always
in assembly language.  I learned it from Howard Schmeising at Boeing in 1969,
where we were writing optimal stack code for CDC 6600's (a $7M RISC machine).
We could get 5+ 60-bit Mflops if we worked at it.

			Loren Carpenter
			...{ucbvax,sun}!pixar!loren

rodc@unet.unet.pacbell.COM (Rod Creason) (09/01/88)

Newsgroups: comp.lang.c,comp.arch
Subject: Re: Explanation, please!
Summary: 
Expires: 
References: <638@paris.ics.uci.edu> <dpmuY#2EBC4R=eric@snark.UUCP> <189@bales.UUCP> <9064@pur-ee.UUCP>
Sender: 
Reply-To: rodc@unet.PacBell.COM (Rod Creason)
Followup-To: 
Distribution: 
Organization: Network Equipment Technologies, Redwood City, CA
Keywords: byte copy

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:

>2.	If the number of items/bytes is not known, then build a binary tree of
>	such structs and copy half, then half of what remains, etc.  This is
>	funny looking, but very fast also.  Suppose the number of ints (n) is
>	not known at compile time, but can't be more than 601.  You can write:
>
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
>	struct t128 { int t[128]; };
>	struct t64 { int t[64]; };
>	struct t32 { int t[32]; };
>	struct t16 { int t[16]; };
>	struct t8 { int t[8]; };
>	struct t4 { int t[4]; };
>	struct t2 { int t[2]; };
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
>	}
> [ rest of the example ]

A key point is that this code is *not* portable unless you can *guarantee*
that q and p are ALWAYS aligned at a four-byte boundary.  Any 3b2 (WE32000),
for example, will core dump if this is not the case.  Although 68000 based 
computers only require even alignment, they would still fail in most cases
where this routine would be used.

The key to a fast memory-to-memory copy is getting to these boundaries.
(and in the case where q or p is an odd address and the other is at an 
even address, none of this will work)

>					hankd@ee.ecn.purdue.edu

Rod Creason
...!{ames,amdahl,oliveb,pacbell}!unet!rodc

pf@csc.ti.com (Paul Fuqua) (09/01/88)

    Date: Wednesday, August 31, 1988  9:13pm (CDT)
    From: loren at pixar.UUCP (Loren Carpenter)
    Subject: Re: Explanation, please!
    Newsgroups: comp.lang.c,comp.arch

    The Duff Loop (as far as I know) was first cast into C by Tom Duff when
    he was at Lucasfilm in the early 1980's.  We used it at Lucasfilm wherever
    we needed reasonable speed without resorting to assembly language.  It
    obviously generalizes to more than memory copy & clear.

There was a discussion of the device in comp.protocols.tcp-ip not too long
ago, as a way to speed up the calculation of checksums.  The one point that I
haven't seen brought out here yet is that if the unrolling factor is a power
of two, the divide/remainder operations are simply a shift and a mask.

                              pf

Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet:  pf@csc.ti.com (ARPA too, sometimes)
UUCP:   {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf

klg@njsmu.UUCP (Kenneth Goodwin) (09/02/88)

In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> > Here is my own personal version of the "Portable Optimized Copy" routine.
> 2.	If the number of items/bytes is not known, then build a binary tree of
> 	such structs and copy half, then half of what remains, etc.  This is
> 	struct t512 { int t[512]; };
> 	struct t256 { int t[256]; };
> 	struct t128 { int t[128]; };
	.... etc .....
> 	if (n & 512) {
> 		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
> 	}
> 	if (n & 256) {
> 		*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
> 	}
	...  etc ...
> 	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> 	the usual copy loop.  Unfortunately, the above code should have been
> 	written as:
> 
> 	if (n & 512) {
> 		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> 	}
> 	...

	BUT This is where UNIONS come in handy, I used a similar although
	more brief technique for a faster version of a bmov() (byte move)
	subroutine on our PDP11-70 a while ago, and subsequently ported
	it to memcpy when we updated from V6 to System V.
	The basic idea that was used is to create a union of long, int,
	(short), and char pointers, use the character pointer to achieve
	the needed alignments and then use the largest available pointer
	to do the copy. There is no reason why a stucture copy could not be
	used, although I suspect on NON-VAX systems it may actually 
	be detremental (sp?) in some cases.
	The PDP11 C compiler used to stuff registers onto the stack
	and create a 16 bit word copy loop to do structure copies
	using the freed registers, restoring them when it was done.
	So a structure copy would be the same as a word copy on that style
	of a system (ie, ones without block move instructions)

	So In the case of your example, a modified brief version of it
	would be:

		union ptr_types {
			struct t512 { int t512[512] } *t512;
			....
			struct t32 { int t32[32] } *t32;
			long	*t_long;
			int	*t_int;
			short	*t_short;
			char	*t_char;
		} ;

		(probably could dispense with long and short pointers
		and related tests)

	memcpy(a, b, len)
	char *a; *b;
	{
		register union ptr_types a_ptr, b_ptr;

		a_ptr.t_char = a;
		b_ptr.t_char = b;

		while(NOT ON A WORD BOUNDARY AND CHARS LEFT) {
			*a_ptr.t_char++ = *b_ptr.t_char++;
			len--;
		}
		if(len >= sizeof(int) * 512) {
			/* if we can use a 512 int structure copy */
			*a_ptr.t512++ = *b_ptr.t512++;
			len -= (512 * sizeof(int));
		}
		/*M the biggest win is that the pointers increment correctly
		len -= (sizeof(*element pointer)) is the correct form over
		N INTS * sizeof int */

		.......
		I guess the rest is obvious, some GLUE may be needed
		that has not be shown.... :-)
		Boundaries should be checked on source and destination addresses
		to avoid memory faults....
		As you may be given incompatible source and destination address
		that may require a full char by char copy. The first
		test loop sort of does this, but all the other copies
		should also check for proper address alignments before
		proceeding.
Ken Goodwin
NJSMU.

andrew@frip.gwd.tek.com (Andrew Klossner) (09/02/88)

Nathaniel Stitt writes:

	"Here is my own personal version of the "Portable Optimized
	Copy" routine.  It certainly seems more clear than the above
	example, and I would expect it to be at least as fast on
	virtually any machine."

then goes on to present a routine which uses follow-on code to handle
the last few bytes after all octets have been copied.  It's cleaner
code, but it won't be quite as fast on many systems with instruction
caches because it has fifteen byte-move instructions, replacing eight
in the original, so more time is spent loading the loop into the
i-cache.  On systems with very small i-caches (my favorite example is
the IBM 360/91 with 16 bytes), the bigger loop may not all fit into
cache, and would be considerably slower.

Several contributors have suggested that unrolling a byte-copy loop is
a win.  On some architectures it is, but on a good pipelined system it
may not be.  As an example, the program fragment

	while (count--) {
		to[i] = from[i];
		++i;
	}

can be compiled to code on the M88k which copies memory as fast as a
DMA controller could; the instructions to decrement, increment, and
branch overlap with the data load/store requests.

[If everything's in registers, indexing in this case is actually faster
than keeping separate "to" and "from" pointers and incrementing both.]

This assumes that "to" and "from" are pointers-to-ints or
pointers-to-doubles.  Copying less than a word at a time is slower.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

jabir@quintus.uucp (Jabir Hussain) (09/02/88)

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
[...]
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
>	}
[...]
>	                      Unfortunately, the above code should have been
>	written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...
>
>	but, for some unknown reason, the VAX C compiler didn't like that.


one portable way around that is to do something like:

	typedef char *caddr_t;
	typedef union {
	    struct { int t[512]; } t512;
	    struct { int t[256]; } t256;
	    caddr_t               caddr;
	} ustruct_t;

	{
	    register ustruct_t src,dst;

	    src.caddr = (caddr_t) p;
	    dst.caddr = (caddr_t) q;

	    if (n & 512) *dst.t512++ = *src.t512++;
	    ...
	}

jh

rick@pcrat.UUCP (Rick Richardson) (09/02/88)

I did a quick comparison of the times for "memcpy(,,n*4)", Duff's
device, Nathaniel's "nat", and Hank's versions.  Here's the results:

TEST	MACHINE				ATT cc	VENIXcc	GNU1.26
memcpy	8Mhz 286, Venix SVR2		 3.35	11.15	-
duff	8Mhz 286, Venix SVR2		16.58	 4.56	-
nat	8Mhz 286, Venix SVR2		16.73	 5.11	-
hank	8Mhz 286, Venix SVR2		 1.83	 ccbarf	-
memcpy	16Mhz 386, 386/ix 1.0.4		  .82	  -	 .82
duff	16Mhz 386, 386/ix 1.0.4		 2.01	  -	2.26
nat	16Mhz 386, 386/ix 1.0.4		 1.95	  -	2.21
hank	16Mhz 386, 386/ix 1.0.4		  .86	  -	 .89

		* Everything was compiled with -O
		* test moved 2000,1999,...,2,1 int's

I learned this: Venix obviously doesn't have a good memcpy, and
Duff, or nat make a good replacement.  AT&T uses the block move
in memcpy, and it is a definite win on both machines. The AT&T
286 C compiler is terrible for these program on the 286.  AT&T
still has a slight edge over gcc *for these programs*. Hanks
version, while not portable, is a clear win on the 286 (disassemble
memcpy and hank's to see why), and virtually as good as memcpy on 386.

Conclusion: you can't make any.  It all depends on your computer
and compiler.
-- 
		Rick Richardson, PC Research, Inc.
		rick%pcrat.uucp@uunet.uu.net (INTERNET)
		   uunet!pcrat!rick (UUCP, Personal Mail)
..!pcrat!jetroff (JetRoff Info)		..!pcrat!dry2 (Dhrystone Submissions)

mouse@mcgill-vision.UUCP (der Mouse) (09/02/88)

In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
>> ["Portable Optimized Copy" routine]
> As long as we're getting into structured, portable, hacks, let me
> suggest the following two ways of doing block copy:

> 1.    If the number of items/bytes is known at compile time, then you
>       can define a struct type of the appropriate size and use struct
>       assignment with type casts to make it fly.  For example,

[ int *p; int *q; /* want to copy 601 ints from p to q */ ]
>       struct t601 { int t[601]; };
>       *((struct t601 *) q) = *((struct t601 *) p);

Surprise!  Your compiler has sizeof(int)=2, sizeof(long)=4, and aligns
structs on long boundaries.  You just copied 602 ints!

>       Unfortunately, the above code should have been written as:

>       if (n & 512) {
>               *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>       }

>       but, for some unknown reason, the VAX C compiler didn't like
>       that.

Perhaps the "unknown reason" is that it's illegal!  An expression
formed by casting another expression is not an lvalue and never has
been.  Any compiler that accepts that is broken.  (Yes, there exist
such broken compilers.)

What would you expect from

    double d;

    ((int)d) ++;

Perhaps you'd want d+=1?  Perhaps d=1+(int)d?  Perhaps
*(int*)&d=1+(int)d even?

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

chris@mimsy.UUCP (Chris Torek) (09/02/88)

-In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
->		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
->	but, for some unknown reason, the VAX C compiler didn't like that.

In article <339@quintus.UUCP> jabir@quintus.uucp (Jabir Hussain) writes:
-one portable way around that is to do something like:
-
-	typedef char *caddr_t;
-	typedef union {
-	    struct { int t[512]; } t512;
-	    struct { int t[256]; } t256;
-	    caddr_t               caddr;
-	} ustruct_t;
	...
-	    if (n & 512) *dst.t512++ = *src.t512++;

`Portable'?  I suggest you try it on a Data General MV/10000.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

alverson@decwrl.dec.com (Robert Alverson) (09/03/88)

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
>	the usual copy loop.  Unfortunately, the above code should have been
>	written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...
>
>	but, for some unknown reason, the VAX C compiler didn't like that.
>
>
>Enjoy.
>					hankd@ee.ecn.purdue.edu
Actually, the VAX C compiler was completely correct.  Attempting
to post-increment the result of a cast is not legal C.  One way
around this is to use a little extra indirection:

    char* p;

    /* was:   ((int *)p)++; */
    (*((int **) &p))++;		/* phew! */

The reasoning is that the result of a cast is an expression, not a
lvalue.  Quite often, compilers relax this rule for coercions which
produce no code.  In this case, it looks like the VAX C doesn't

Bob

hankd@pur-ee.UUCP (Hank Dietz) (09/05/88)

Relative to my article <9064@pur-ee.UUCP> giving a struct-based hack for
doing block copy and to all followups thereof....

Correction:

I hereby appologize for having ommitted a (:-) and hence having made it
sound like I didn't know why the VAX compiler didn't want to use a cast
pointer as the operand of ++.  I do know (and I knew even before a dozen of
you reminded me :-), I just think it should be allowed in such cases; i.e.,
be implementation dependent rather than illegal.  Anyway, the union-based fix
given by Jabir is certainly a good way to do it (once you fix his typedef to
make t512 and t256 pointers to structs, rather than structs).

Comments:

As I noted in the original posting, the use of a structure-hack block copy is
subject to alignment problems -- if the target machine has alignment
constraints.  There are three additional notes on this:

1.	Both the source AND destination pointers may need to be aligned,
	which implies that the difference between the pointers MUST BE A
	MULTIPLE OF THE ALIGNMENT FACTOR.  Arbitrary copies will not have
	this property, hence, on such machines one must always have a
	byte-oriented copy routine to fall back on.

	For a 4-byte alignment machine, the test is something like:

	if ((p - q) & 3) *byte copy* else *struct copy*

2.	If the difference between the pointers is a multiple of the machine
	alignment factor, then copying up until one is aligned will insure
	alignment of both.  Hence, we simply need to insert a few moves to
	achieve alignment BEFORE the struct copy code I gave earlier.

	For a 4-byte alignment machine, the additional code is:

	if ((n >= 1) && (p & 1)) {
		*q = *p; ++q; ++p; --n;
	}
	if ((n >= 2) && (p & 2)) {
		*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; n-=2;
	}

	Notice that, with the above code as a prefix, trailing alignment is
	also satisfied (proof left as an exercise to the reader).

3.	Although I know of no such C compiler, it is legal for a C compiler
	to make the size of a struct anything big enough to hold the things
	inside it.  Hence, for example, a struct containing an array of 4
	things might be the size of 5 things....  If this were done, the
	struct copy would nearly work; the only problem is that it could
	overshoot the end of the block.  There is no general fix for this.

						hankd@ee.ecn.purdue.edu

mark@bnr-rsc.UUCP (Mark MacLean) (09/06/88)

In article <10329@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>Several contributors have suggested that unrolling a byte-copy loop is
>a win.  On some architectures it is, but on a good pipelined system it
>may not be.  As an example, the program fragment
>
>	while (count--) {
>		to[i] = from[i];
>		++i;
>	}
>
>can be compiled to code on the M88k which copies memory as fast as a
>DMA controller could; the instructions to decrement, increment, and
>branch overlap with the data load/store requests.

If you have instructions to load, store, increment, decrement, and branch in
your loop then presumably this takes at least 5 clocks to perform only 2 
memory accesses.  Would'nt a DMA controller be able to perform a memory access
every clock cycle?  

Is it not possible to unroll the loop into an inline stream of instructions
(if the length is small and known at compile time) to produce an instruction
sequence which could perform a memory access every cycle?  If not, why not?
It would seem very un-RISCy if the 88000 was unable to do this.

Mark MacLean

pardo@june.cs.washington.edu (David Keppel) (09/07/88)

hankd@pur-ee.UUCP (Hank Dietz) writes:
>	if ((p - q) & 3) *byte copy* else *struct copy*

I believe that the VAX "movc" command takes arbitrary pointers and
does the following:

* If both are word-aligned, do a word copy (I mean a 4-byte word).
* If both are non-aligned and could be aligned with 1, 2, or 3 bytes
  of byte-copy at either end, then do a byte copy at either end and do
  a word copy down the middle.
* If niether aligned then ??

Unfortunately, my VAX hardware reference is out of town for a couple
of weeks, so I can't ask him about neither aligned.  Anybody know?

I can immagine that on some machines it is faster to copy words into
register and repack the words in the registers rather than do a byte
copy, since you could be taking advantage of some hardware gak.

Simple example:

    machine X has register W1 divided into B4, B5, B6, B7.  To do a
    copy, align the source pointer (doing byte copies) then read a
    wrod-at-a-time into the W1 register, write it back out by writing
    B4, B5, B6, B7  (little-endian).

This is beginning to look suspiciously like the kinds of optimizations
that get done for bit BLTs.  Anybody know if this ever really gets
done?

	;-D on  ( Ahh.  Architecture at its finest )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

rik@june.cs.washington.edu (Rik Littlefield) (09/07/88)

In article <5654@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> 
> I can immagine that on some machines it is faster to copy words into
> register and repack the words in the registers rather than do a byte
> copy, since you could be taking advantage of some hardware gak.
> 

On the old CDC 6000-series machines (early RISCs...) that was the *only*
practical way to do it, as well as being blazingly fast.  We had copies
that would handle arbitrary *bit* alignments at a cost of around 6 instructions
and 2 memory references per 60-bit word, in the middle of the string.  
The sequence was basically fetch, shift, mask, mask, OR, and store, 
appropriately rearranged to minimize memory delay and functional unit 
conflicts, of course.  I vaguely remember that this thing could even
be unrolled a couple of times and still fit in the instruction cache
("stack", in those days) for machines expensive enough to have one.

VAXen I don't know about for sure, but I'd be real surprised if their
microcode didn't do the same thing.

--Rik

news@amdcad.AMD.COM (Network News) (09/07/88)

In article <5654@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
| hankd@pur-ee.UUCP (Hank Dietz) writes:
| >	if ((p - q) & 3) *byte copy* else *struct copy*
| 
| I believe that the VAX "movc" command takes arbitrary pointers and
| does the following:
| 
| * If both are word-aligned, do a word copy (I mean a 4-byte word).
| * If both are non-aligned and could be aligned with 1, 2, or 3 bytes
|   of byte-copy at either end, then do a byte copy at either end and do
|   a word copy down the middle.
| * If niether aligned then ??
| 
| Unfortunately, my VAX hardware reference is out of town for a couple
| of weeks, so I can't ask him about neither aligned.  Anybody know?

I assume that "both are non-aligned" in the above list means that the
source and the destination have the same alignment, but are not aligned
with respect to a 4-byte boundary, and that "neither aligned" means that
the source and the destination are misaligned.

We use an interesting trick in the Am29000 memcpy routine for
source/destination misalignment.  In this case, we set up the alignment
difference in the funnel-count register, read in two source words, and
"extract" a destination word using the funnel-shifter's ability to
extract any 32-bit word from a 64-bit double-word in a single cycle. 
The inner loop then consists of shifting the low source word to the high
source word, reading a new low source word, extracting a destination and
storing it (well, there is also the overhead of counting down the
correct number of "word" moves, but you get the idea.)

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

aglew@urbsdc.Urbana.Gould.COM (09/07/88)

>I can immagine that on some machines it is faster to copy words into
>register and repack the words in the registers rather than do a byte
>copy, since you could be taking advantage of some hardware gak.
>
>Simple example:
>
>    machine X has register W1 divided into B4, B5, B6, B7.  To do a
>    copy, align the source pointer (doing byte copies) then read a
>    wrod-at-a-time into the W1 register, write it back out by writing
>    B4, B5, B6, B7  (little-endian).
>
>This is beginning to look suspiciously like the kinds of optimizations
>that get done for bit BLTs.  Anybody know if this ever really gets
>done?

Yep, it's done on Gould machines for halfwords to words in some copy 
routines. However, on the NP1 for "typical" distributions of operands, 
it turns out to be better to just copy at the greatest common denominator
of alignment using the appropriate vector moves.

rpw3@amdcad.AMD.COM (Rob Warnock) (09/10/88)

In article <22859@amdcad.AMD.COM> tim@delirun.amd.com (Tim Olson) writes:
+---------------
| We use an interesting trick in the Am29000 memcpy routine for
| source/destination misalignment.  In this case, we set up the alignment
| difference in the funnel-count register, read in two source words, and
| "extract" a destination word using the funnel-shifter's ability to
| extract any 32-bit word from a 64-bit double-word in a single cycle. 
| The inner loop then consists of shifting the low source word to the high
| source word, reading a new low source word, extracting a destination and
| storing it...
+---------------

It's even better than that, Tim. In the non-aligned case, the inner loop
of the "bcopy()" routine in the 4.3 port for the 29000 uses load-multiple
to grab a bunch of source words (making good use of "burst mode" in the memory,
if any), a straight-line series of extract's, then a store-multiple, with
a couple more instructions at either end of the inner block to propagate
the "left-over" bits to the next block.

Using this trick, an Am29000 can do large block transfers between arbitary
*bit* boundaries with an asymptotic overhead of only one more cycle per word
than word-aligned block-copy. Given single-cycle burst-mode memories (not
hard to achieve, it's starting the burst that's expense), that puts the
asymptotic performance of non-aligned transfers at 3 cycles/word, and
word-aligned transfers at 2 cycles/word.  Not too shabby...

An assembly-language version of Duff's device keeps the overhead for
small transfers reasonable, too.


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

mouse@mcgill-vision.UUCP (der Mouse) (09/11/88)

In article <339@quintus.UUCP>, jabir@quintus.uucp (Jabir Hussain) writes:
> In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>>	but, for some unknown reason, the VAX C compiler didn't like that.

> one portable way around that is to do something like:
[summarized]
> union { struct { int t[512]; } *t512; ...; char *caddr; }
[then use ++ on .t512]

This falls hopelessly flat if char * and struct {...} * have
significantly different representation.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

df@nud.UUCP (Dale Farnsworth) (09/14/88)

Mark MacLean (mark@bnr-rsc.UUCP) writes:

> Is it not possible to unroll the loop into an inline stream of instructions
> (if the length is small and known at compile time) to produce an instruction
> sequence which could perform a memory access every cycle?  If not, why not?
> It would seem very un-RISCy if the 88000 was unable to do this.

Of course, it is possible to do a memory access every cycle on the MC88100
(with a 2 cycle pipeline startup delay on the first memory access),
but at 25 MHz, your memory will have to have an effective cycle time of
40 nS.  If memory is slower, the processor can increment pointers and
check termination conditions while waiting for memory.

-Dale