[comp.lang.c] Explanation, please!

schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/25/88)

The following piece of wonderful obscurity comes from Stroustup's
C++ Programming Language book, page 100:

void send(int *to,int *from, int count) {
   int n = (count + 7) / 8;
   
   switch(count % 8) {
      case 0:  do { *to++ = *from++;
      case 7:       *to++ = *from++;
      case 6:       *to++ = *from++;
      case 5:       *to++ = *from++;
      case 4:       *to++ = *from++;
      case 3:       *to++ = *from++;
      case 2:       *to++ = *from++;
      case 1:       *to++ = *from++;
               } while (--n > 0);
   }

}

Now, much to my surprise, this is not only valid C++, it is also valid C!
Could some one please explain to me why this is so?  It seems like
the case 7-1 labels are actually nested inside the do {} while loop,
and thus not in the scope of the switch (should a break statement exit
both the switch and the loop, or just one?!?!).

Finally, Stroustrup asks the rhetorical question ``why would anyone
want to write something like this.''  Any guesses?!

thanks,

   Doug Schmidt
--
Douglas Schmidt
"If our behavior is strict, we do not need fun." -Zippy th' Pinhead
"If our behavior is struct, we do not need defun." -Anon

chris@mimsy.UUCP (Chris Torek) (08/25/88)

In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas
C. Schmidt) writes:
>The following piece of wonderful obscurity comes from Stroustup's
>C++ Programming Language book, page 100:

>   switch(count % 8) {
>      case 0:  do { *to++ = *from++;
>      case 7:       *to++ = *from++;
	...
>      case 1:       *to++ = *from++;
>               } while (--n > 0);
>   }

>Now, much to my surprise, this is not only valid C++, it is also valid C!
>Could some one please explain to me why this is so?  It seems like
>the case 7-1 labels are actually nested inside the do {} while loop,
>and thus not in the scope of the switch (should a break statement exit
>both the switch and the loop, or just one?!?!).

`break' exits the innermost switch or loop, hence a `break' in cases
7 through 1 exits the do-while, not the switch.

>Finally, Stroustrup asks the rhetorical question ``why would anyone
>want to write something like this.''  Any guesses?!

This has been called `Duff's device' (after Tom Duff, who probably did
not invent it first), and it looks exactly like what an optimising
compiler generates when it does loop unrolling.  It works because case
labels are just that---labels.  As long as there is a switch in
scope, a case label is legal; the case applies to the closest switch.
(And if C had used a separate keyword for `break switch', the whole
thing could be consistent :-) .)

Why?  To quote a certain infamous C language hack :-) , `It looks exactly
like what an optimising compiler generates when it does loop unrolling.'

Incidentally, there are some compilers that choke on that form.  I
would have to look hard at the dpANS to decide whether it is Officially
Legal.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gwyn@smoke.ARPA (Doug Gwyn ) (08/25/88)

In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
-   switch(count % 8) {
-      case 0:	do { *to++ = *from++;
-      case 7:	     *to++ = *from++;
-      case 6:	     *to++ = *from++;
-      case 5:	     *to++ = *from++;
-      case 4:	     *to++ = *from++;
-      case 3:	     *to++ = *from++;
-      case 2:	     *to++ = *from++;
-      case 1:	     *to++ = *from++;
-		} while (--n > 0);
-   }

-Now, much to my surprise, this is not only valid C++, it is also valid C!
-Could some one please explain to me why this is so?

Could you explain why it shouldn't be?  "case" labels are just a
special form of label.  You can stick a label on most statements.

-It seems like the case 7-1 labels are actually nested inside the
-do {} while loop, and thus not in the scope of the switch

?? "Thus"?  Why would the labels go out of scope?  They're
definitely within the body of the switch.

-(should a break statement exit both the switch and the loop, or just one?!?!)

A "break" applies to whatever the innermost container is.  A "break"
between two of the "*to++ = *from++;" statements would exit the
do-while loop.

-Finally, Stroustrup asks the rhetorical question ``why would anyone
-want to write something like this.''  Any guesses?!

It's about the fastest way to move arbitrarily-aligned data in
portable C with a guarantee as to what happens in the case that the
data overlaps.  memcpy() doesn't guarantee anything for overlaps.
memmove() does, but that's a recent X3J11 invention that probably
doesn't exist on your system yet.

levy@ttrdc.UUCP (Daniel R. Levy) (08/26/88)

< void send(int *to,int *from, int count) {
<    int n = (count + 7) / 8;
<    
<    switch(count % 8) {
<       case 0:  do { *to++ = *from++;
<       case 7:       *to++ = *from++;
<       case 6:       *to++ = *from++;
<       case 5:       *to++ = *from++;
<       case 4:       *to++ = *from++;
<       case 3:       *to++ = *from++;
<       case 2:       *to++ = *from++;
<       case 1:       *to++ = *from++;
<                } while (--n > 0);
<    }
< }

Question:  what if count==0?
-- 
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)|  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        | 
|-----Path:  att!ttbcad!levy-----|

jbs@fenchurch.MIT.EDU (Jeff Siegal) (08/26/88)

In article <2873@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes:
>>[Duff's device loop example from C++ book]
>
>Question:  what if count==0?

The program breaks.  I prefer (and also use, in highly-speed-sensitive
code):

#define duff16(counter, block) \
  switch (counter & 0x0f) { \
   do \
   { \
     counter -= 16; \
     { block; } \
     case 15:	{ block; } \
     case 14:	{ block; } \
     case 13:	{ block; } \
     case 12:	{ block; } \
     case 11:	{ block; } \
     case 10:	{ block; } \
     case 9:	{ block; } \
     case 8:	{ block; } \
     case 7:	{ block; } \
     case 6:	{ block; } \
     case 5:	{ block; } \
     case 4:	{ block; } \
     case 3:	{ block; } \
     case 2:	{ block; } \
     case 1:	{ block; } \
     case 0:    /* null statement */; \
   } while (counter >= 16); \
  }

  duff16(n, *to++ = *from++)

Jeff Siegal

eric@snark.UUCP (Eric S. Raymond) (08/27/88)

(Code below reproduced so that comp.arch people seeing this followup only
won't get terminally frustrated. This is *really neat*, gang...)

In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
> 
> void send(int *to,int *from, int count) {
>    int n = (count + 7) / 8;
>    
>    switch(count % 8) {
>       case 0:  do { *to++ = *from++;
>       case 7:       *to++ = *from++;
>       case 6:       *to++ = *from++;
>       case 5:       *to++ = *from++;
>       case 4:       *to++ = *from++;
>       case 3:       *to++ = *from++;
>       case 2:       *to++ = *from++;
>       case 1:       *to++ = *from++;
>                } while (--n > 0);
>    }
> 
> }
> 
> Finally, Stroustrup asks the rhetorical question ``why would anyone
> want to write something like this.''  Any guesses?!

Yeah. That's the most hackish way of trying to write a portable optimized
copy routine I've ever seen. I gather the whole point of the shenanigans
is to get all the *from++ -> *to++ instructions in the generated code to be
adjacent.

This only makes if the author knows he's got a hardware instruction pipeline
or cache that's no less than 8 and no more than 9 byte-copy instruction widths
long, and stuff executing out of the pipeline is a lot faster than if the
copies are interleaved with control transfers. Dollars to doughnuts this code
was written on a RISC machine.

(Gawrsh. That sounded just like one of the big boys on comp.arch tawkin'. I
think I'll cross-post over there just to see if I get shot down in flames...)

-- 
      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
      UUCP: ...!{uunet,att,rutgers}!snark!eric = eric@snark.UUCP
      Post: 22 S. Warren Avenue, Malvern, PA 19355      Phone: (215)-296-5718

henry@utzoo.uucp (Henry Spencer) (08/27/88)

In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>The following piece of wonderful obscurity comes from Stroustup's
>C++ Programming Language book, page 100...

Tom Duff, what hast thou wrought? :-)

>...
>Now, much to my surprise, this is not only valid C++, it is also valid C!
>Could some one please explain to me why this is so?  It seems like
>the case 7-1 labels are actually nested inside the do {} while loop,
>and thus not in the scope of the switch (should a break statement exit
>both the switch and the loop, or just one?!?!).

You're thinking of switch as if it were Pascal's case; the correct model
is closer to Fortran's computed goto.  The labels must appear within the
case body, but they can be (essentially) *anywhere* within the body,
including within nested blocks.  The sensible programmer will not exploit
this freedom except in truly unusual situations, but it is available.

Break breaks out of the innermost construct that it could apply to, i.e.
"just one".

>Finally, Stroustrup asks the rhetorical question ``why would anyone
>want to write something like this.''  Any guesses?!

This construct is called "Duff's Device", after its discoverer.  One can
often make a loop run faster by "unrolling" it, duplicating its body
N times and running one-Nth as many iterations -- this reduces the loop-
control overhead by a factor of N.  The annoying part is that the number
of times one needs to do the body is usually not an even multiple of N,
so one has to do a partial iteration at beginning or end.  The usual
assembly-language trick is to compute a start address in the middle
of the loop and start by branching there, not to the beginning, thus
doing a partial iteration first.  Most high-level languages have no way
to express this.  Tom Duff (of Bell Labs) realized that it could be done
in C.  Ugh.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (08/27/88)

In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
| The following piece of wonderful obscurity comes from Stroustup's
| C++ Programming Language book, page 100:

[ a do loop within a switch, with cases all through the do loop ]

This does not seem to be clearly covered in the latest dpANS (section
3.6). The issue is if it is legal *and defined* to jump into a loop.
While a case label behaves just like any other label, it's not clear
what a jump into a loop implies, and I don't see that the existing
standard is any clearer on the topic than the discussion was three years
ago.

Consider:
	i = 15;
	if (j < 10) goto BadMove;

	/* more code here */

	do {
	  int i = 7;
	  int myvect[400];
	BadMove: /* entry into block */
	  myvect[300] = i;
	  if (j < 5) return i;
	} while (j-- > 302);

  There was a lot of discussion of this and I don't see a clarification
in the standard as it stands. I think there are several possibilities:
  1) it's not allowed
  2) it's allowed, but the initializations at the start of the
	block don't take place. What does this imply about the
	block local variables?
  3) it's allowed and the block local variables are allocated and
	initialized as you would expect.
  4) it's allowed but the results are implementation dependent.

  In all other cases entry into a block causes allocation of the block
local (auto) variables. Initialization is performed at the time of
allocation. By allowing entry at any point the expected behavior is
unspecified.

Suggestion:
  Forget (1), it's there for completeness. Pick any of 2-4 and add it to
the standard NOW, as an editorial change. We talked about this three
years ago, it can hardly be considered a recent issue. I would not
suggest (1) because it breaks a lot of existing code, but I would be
happy to see a (5) "you can't jump into a block with local variables."

  This is one of the rare times when I would like to be wrong, to be
told "oh, it's there in a footnote, you can, or you can't, and this is
what it does." If the action is implementation dependent then I
(personally) wouldn't use the feature, since I value portability
highly. Please tell me X3J11 didn't let this go unresolved for three
years.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

gwyn@smoke.ARPA (Doug Gwyn ) (08/27/88)

In article <9957@eddie.MIT.EDU> jbs@fenchurch.MIT.EDU (Jeff Siegal) writes:
>...  I prefer (and also use, in highly-speed-sensitive code): ...

It is worth noting that byte-at-a-time is not optimal on most architectures.
Duff's device is useful for arbitrarily-aligned data, but if you can
arrange for (most of) the source and destination buffers to be relatively
word-aligned, then along with processing the head and tail pieces via
something like Duff's device, the bulk of the data can be transferred in
bigger chunks, gaining considerable increase in speed.  One should really
use memcpy() in all cases where the source and destination do not overlap,
because all the work done to optimize block-move for a specific architecture
are supposed to be centralized there.

A couple of people have complained about the tone of my previous response
on this topic.  I meant it to be informative and minimal, i.e. requiring
some thought on the reader's part.  Since some have missed the point, let
me summarize the C style issue:

	Since the constructs are not forbidden by the C language
	rules, they are perforce permitted.  One should have an
	explicit rule to point at as being violated in order to
	think that the code is not valid.  "Intuition" is a lousy
	guide, particularly when it was formed by experience in
	other contexts (e.g. Pascal).

	Now, because of the demands made on one's understanding
	of the language, the cited code would be considered bad
	programming style IF what it did could be effectively
	accomplished in some other way.  As I pointed out in my
	previous posting on this, Duff's device does indeed have
	use in cases where other approaches fail, namely moving
	overlapping unaligned data.  Eventually we can all use
	memmove() for this, but not today.

Incidentally, Duff's device is used to manipulate text strings in the
"sam" text editor.  I recently changed it to use the device for both
directions, only in cases of overlap, using memcpy() for all other
cases.  (Originally the device was used for all transfers in one
direction and memcpy() for all in the other, which assumed more about
the behavior of memcpy() than is officially guaranteed.)  So this is
not just an academic issue..

schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/27/88)

Hi,

   Since I posted my original question there has been a great deal of
abstract discussion about the relative merits of the loop unrolling
scheme.  The topic has piqued my curiousity, so I when ahead and
implemented a short test program, included below, to test Duff's
device against the ``ordinary for loop w/index variable'' technique.
See for yourself....   

After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
Sequent Balance was pretty heavily in favor of the regular (non-Duff)
loop.  Your mileage may vary.  I realize that there may be other
tests, and if anyone has a better version, I'd like to see it!


Doug Schmidt
----------------------------------------
#include <sys/time.h>
#include <sys/resource.h>

double Start_Timer();
double Return_Elapsed_Time();

#define MAX_NUM 100000
int array1[MAX_NUM ];
int array2[MAX_NUM ];
int *A = array1, *B = array2;

main(int argc, char *argv[]) {
   double Elapsed_Time;
   int    Count = argc > 1 ? atoi(argv[1]) : MAX_NUM;
   int i;
   
   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }
      
   printf("Starting Duff's device timing...\n");
   Start_Timer();

   {
      int n = (Count + 7) / 8;

      switch(Count % 8) {
         case 0:  do { *A++ = *B++;
         case 7:       *A++ = *B++;
         case 6:       *A++ = *B++;
         case 5:       *A++ = *B++;
         case 4:       *A++ = *B++;
         case 3:       *A++ = *B++;
         case 2:       *A++ = *B++;
         case 1:       *A++ = *B++;
                  } while (--n > 0);
      }
   }

   Elapsed_Time = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3f\n",Elapsed_Time);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();

   for (i = 0;i < Count ;i++) {
      array1[i] = array2[i];
   }

   Elapsed_Time = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3f\n",Elapsed_Time);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) { 
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

}      


/* no such thing as "negative time"! */
#define  ERROR_VALUE -1.0   

static struct rusage Old_Time;
static struct rusage New_Time;
static int    Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   getrusage(RUSAGE_SELF,&Old_Time);        /* set starting process time */
   return(Old_Time.ru_utime.tv_sec + (Old_Time.ru_utime.tv_usec / 1000000.0));
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time) 
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return(ERROR_VALUE);
   }   
   else {
    /* get process time */
      getrusage(RUSAGE_SELF,&New_Time);
      if (Last_Time == 0.0) {
         return((New_Time.ru_utime.tv_sec - Old_Time.ru_utime.tv_sec) + 
               ((New_Time.ru_utime.tv_usec - Old_Time.ru_utime.tv_usec) 
                / 1000000.0));
      }
      else {
         return((New_Time.ru_utime.tv_sec + 
                (New_Time.ru_utime.tv_usec / 1000000.0)) - Last_Time);
      }
   }
}



--
schmidt@bonnie.ics.uci.edu (ARPA)
"If our behavior is strict, we do not need fun." -Zippy th' Pinhead
"If our behavior is struct, we do not need defun." -Anon

bill@proxftl.UUCP (T. William Wells) (08/27/88)

In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
: The following piece of wonderful obscurity comes from Stroustup's
: C++ Programming Language book, page 100:
:
: void send(int *to,int *from, int count) {
:    int n = (count + 7) / 8;
:
:    switch(count % 8) {
:       case 0:  do { *to++ = *from++;
:       case 7:       *to++ = *from++;
:       case 6:       *to++ = *from++;
:       case 5:       *to++ = *from++;
:       case 4:       *to++ = *from++;
:       case 3:       *to++ = *from++;
:       case 2:       *to++ = *from++;
:       case 1:       *to++ = *from++;
:                } while (--n > 0);
:    }
:
: }
:
: Now, much to my surprise, this is not only valid C++, it is also valid C!
: Could some one please explain to me why this is so?  It seems like
: the case 7-1 labels are actually nested inside the do {} while loop,
: and thus not in the scope of the switch (should a break statement exit
: both the switch and the loop, or just one?!?!).

A switch statement has the syntax

	switch (<expression>) <statement>

While it is conventional to use a compound statement as the
embedded statement, there is nothing requiring it.  What a switch
does is permit the definition of a set of labels within the
statement, to which control is passed by the head of the switch.

The exclusive use of compound statements is not only conventional
coding practice, it is also good coding practice.

: Finally, Stroustrup asks the rhetorical question ``why would anyone
: want to write something like this.''  Any guesses?!

The false god of efficiency has reared it ugly head.  This
routine would be imagined to be more efficient than the almost
equivalent:

void
send(
int     *to,
int     *from,
int     count)
{
	if (count <= 0) {
		create progam bug
	}
	while (--count >= 0) {
		*to++ = *from++;
	}
}

However, it often (always?) is not.  Consider a machine which has
a fast means of moving small blocks of memory.  The following
routine should do the same thing (without the bug, of course),
might be more efficient, and is more understandable as well.  Of
course, it might generate a smidgin more code, but I think that,
in this case, this is a small price to pay.

void
send(
int     *to,
int     *from,
int     count)
{
	if (count <= 0) {
		return;
	}
	switch (count % 7) {
	case 7: *to++ = *from++;
	case 6: *to++ = *from++;
	case 5: *to++ = *from++;
	case 4: *to++ = *from++;
	case 3: *to++ = *from++;
	case 2: *to++ = *from++;
	case 1: *to++ = *from++;
	}
	count >>= 3;
	while (--count >= 0) {
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
	}
}

However, at least two IBM-PC compilers ought to generate better
code from the "unoptimized" version.  (Actually, they do it if
the copied items are characters; I presume that they'll do it for
integers.) I wouldn't be surprised if other compilers did as
well.

void
send(
int     *to,
int     *from,
int     count)
{
	/* If you are really paranoid, add the following code, it
	   catches the case where count is minus full scale on a
	   two's complement machine.  I tend to not do this,
	   since I consider minus full scale to be an illegal
	   value and take the minimal pains needed to never
	   generate it.

	if (count <= 0) {
		return;
	} */

	while (--count >= 0) {
		*to++ = *from++;
	}
}

---
Bill
novavax!proxftl!bill

bill@proxftl.UUCP (T. William Wells) (08/27/88)

In article <13198@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
: Incidentally, there are some compilers that choke on that form.  I
: would have to look hard at the dpANS to decide whether it is Officially
: Legal.

It is.

---
Bill
novavax!proxftl!bill

henry@utzoo.uucp (Henry Spencer) (08/28/88)

In article <653@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>loop...

The odds are excellent that the compilers are applying optimizations to
the regular loop that they can't do for the Duff loop as you've written
it.  In particular, your pointers aren't even register variables (and they
are externs, so the compiler can't safely promote them quietly), whereas
a good optimizing compiler will certainly be using register pointers in
the regular loop.

My partner in crime, Geoff Collyer, applied Duff's Device in a number of
places in the C News rnews; the performance improvements were substantial.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (08/28/88)

In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes:
>[Duff's Device]
>The false god of efficiency has reared it ugly head.  This
>routine would be imagined to be more efficient than the almost
>equivalent:	...
>	while (--count >= 0)
>		*to++ = *from++;
>
>However, it often (always?) is not...

On the contrary, it often (usually?) is.  This is from experience, not
theory.  (Specifically, experience in C News and related code.)  The
major exceptions are (a) the 68010, on which unrolling of simple loops
is a mistake because of "loop mode" [the hardware's very limited loop
cache], and (b) machines with fairly smart compilers and specialized
bulk-transfer instructions, as the compilers may be able to recognize
and optimize the simple loop but not the complex one.

In general, however, in the long run the correct way to implement bulk
data copying is to call "memcpy", which (in the long run) is likely
to be recognized and given special attention by most compilers.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

levy@ttrdc.UUCP (Daniel R. Levy) (08/28/88)

In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
> Hi,
> 
>    Since I posted my original question there has been a great deal of
> abstract discussion about the relative merits of the loop unrolling
> scheme.  The topic has piqued my curiousity, so I when ahead and
> implemented a short test program, included below, to test Duff's
> device against the ``ordinary for loop w/index variable'' technique.
> See for yourself....   
> 
> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
> loop.  Your mileage may vary.  I realize that there may be other
> tests, and if anyone has a better version, I'd like to see it!

[schmidt's program deleted, please refer to parent article if you want it]

I modified this program to run under System V, changed the arrays to be dynam-
ically allocated, and changed both the Duff and ordinary copies to use register
pointers instead of global pointers (for the Duff copy) and array indexing (for
the ordinary copy).  I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
Sun-4 both with and without -O optimization (using the standard pcc-type C
compiler on each system).  The result?  Duff wins by about 10%-20% on all
machines tested.

Here is the code thus modified:


#include <sys/types.h>
#include <sys/times.h>
#include <sys/param.h>

double Start_Timer();
double Return_Elapsed_Time();

#define DEFAULT_NUM 100000

main(argc,argv) char **argv; {
   double Elapsed_Time_Duff;
   double Elapsed_Time_Ordinary;
   int    Count = argc > 1 ? atoi(argv[1]) : DEFAULT_NUM;
   int i;
   register int *A, *B;
   register int n;
   register int *A_end;
   int *array1, *array2;
   char *malloc();

   if (Count <= 0) {
       printf("Bad Count\n");
       return 1;
   } else {
       printf("Count=%d\n", Count);
   }

   if (!(array1=(int *)malloc(sizeof(int) * Count))) {
      printf("Can't malloc %u bytes for array1\n", sizeof(int) * Count);
      return 1;
   } else if (!(array2=(int *)malloc(sizeof(int) * Count))) {
      printf ("Can't malloc %u bytes for array2\n", sizeof(int) * Count);
      return 1;
   }
   
   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }
      
   printf("Starting Duff's device timing...\n");
   Start_Timer();

   {
      n = (Count + 7) / 8;
      A = array1;
      B = array2;

      switch(Count & 0x7) {
         case 0:  do { *A++ = *B++;
         case 7:       *A++ = *B++;
         case 6:       *A++ = *B++;
         case 5:       *A++ = *B++;
         case 4:       *A++ = *B++;
         case 3:       *A++ = *B++;
         case 2:       *A++ = *B++;
         case 1:       *A++ = *B++;
                  } while (--n > 0);
      }
   }

   Elapsed_Time_Duff = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3lf\n",Elapsed_Time_Duff);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }

   for (i = 0;i < Count ;i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();

   {
      A = array1;
      B = array2;
      A_end = array1 + Count - 1;

      while (A <= A_end)
         *A++ = *B++;
   }

   Elapsed_Time_Ordinary = Return_Elapsed_Time(0.0 );
   printf("Elapsed time = %.3lf\n",Elapsed_Time_Ordinary);

   for (i = 0;i < Count ;i++) {
      if (array1[i] != array2[i]) { 
         printf("Yow, problems at location %d!\n",i);
         break;
      }
   }
   printf("Duff/Ordinary copy time ratio = %.3lf\n",
      Elapsed_Time_Duff/Elapsed_Time_Ordinary);

}      


/* no such thing as "negative time"! */
#define  ERROR_VALUE -1.0   

static struct tms Old_Time;
static struct tms New_Time;
static int    Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   times(&Old_Time);		/* set starting process time */
   return ((double)Old_Time.tms_utime)/(double)HZ;
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time) 
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return(ERROR_VALUE);
   }   
   else {
    /* get process time */
      times(&New_Time);
      if (Last_Time == 0.0) {
	 return ((double)(New_Time.tms_utime - Old_Time.tms_utime))/(double) HZ;
      }
      else {
	 return (((double)New_Time.tms_utime)/(double)HZ) - Last_Time;
      }
   }
}


-- 
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)|  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        | 
|-----Path:  att!ttbcad!levy-----|

gwyn@smoke.ARPA (Doug Gwyn ) (08/28/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>Dollars to doughnuts this code was written on a RISC machine.

Send me my doughnuts!  (Or dollars, if you prefer.)  RISC machines
were practically non-existent when this C code was first written down
(whether or not by Tom Duff, I'm not sure).  It probably appeared on
a PDP-11, perhaps a VAX.

I suspect the first use of Duff's device was as an "efficiency hack",
but as you observe it is not universally the most efficient method.

The modern VALID reason for such code is that it is a relatively
efficient way to do something portably that CANNOT BE DONE by
using the "preferred block copy routine" supplied by an implementation
(be it bcopy() or memcpy()).  Whenever the preferred routine will
work correctly (in the portable case, that means whenever the source
and destination fields do not overlap), then of course one should use
it, especially since if the C vendors are doing their job, the C
library routine (which could also be expanded in-line by the compiler)
will be near-optimally tuned for each specific architecture.

P.S.  Some implementations of bcopy() and memcpy() do guarantee
proper handling of overlapping fields.  But not all do, so one cannot
count on it if one's trying to write truly portable code.  ANSI C will
specify a new function memmove() with this property guaranteed.

henry@utzoo.uucp (Henry Spencer) (08/29/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>This only makes if the author knows he's got a hardware instruction pipeline
>or cache that's no less than 8 and no more than 9 byte-copy instruction widths
>long, and stuff executing out of the pipeline is a lot faster than if the
>copies are interleaved with control transfers. Dollars to doughnuts this code
>was written on a RISC machine.

Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

The key point is not pipelining, but loop-control overhead.  There is in
fact a tradeoff here:  unrolling the loop further will reduce control
overhead further, but will increase code size.  That last is of some
significance when caching gets into the act:  cache-loading overhead
favors short loops, and small cache sizes very strongly favor short ones.
In general there is an optimal point in there somewhere, and an unrolling
factor of 8 or 16 is a pretty good guess at it on the machines I've looked
at closely.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bill@proxftl.UUCP (T. William Wells) (08/29/88)

In article <11997@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
: [ a do loop within a switch, with cases all through the do loop ]
:
: This does not seem to be clearly covered in the latest dpANS (section
: 3.6). The issue is if it is legal *and defined* to jump into a loop.

It is defined, and the same as it always has been.  To quote from
section 3.1.2.4:

"Storage is guaranteed to be reserved for a new instance of such
an object [objects with *automatic storage duration*] on each
normal entry into the block in which it is declared, or on a jump
from outside the block to a label in the block or in an enclosed
block.  If an initialization is specified for the value stored in
the object, it is performed on each normal entry, but not if the
block is entered by a jump to a label."

: While a case label behaves just like any other label, it's not clear
: what a jump into a loop implies, and I don't see that the existing
: standard is any clearer on the topic than the discussion was three years
: ago.
:
:       i = 15;
:       if (j < 10) goto BadMove;
:
:       /* more code here */
:
:       do {
:         int i = 7;
:         int myvect[400];
:       BadMove: /* entry into block */
:         myvect[300] = i;
:         if (j < 5) return i;
:       } while (j-- > 302);

The above tells us that your example 1) allocates space for i and
myvect before or as the block is entered, and 2) does not
initialize `i' when the loop is entered via the goto, but does on
each subsequent iteration of the loop.

:   This is one of the rare times when I would like to be wrong, to be
: told "oh, it's there in a footnote, you can, or you can't, and this is
: what it does." If the action is implementation dependent then I
: (personally) wouldn't use the feature, since I value portability
: highly. Please tell me X3J11 didn't let this go unresolved for three
: years.

Nope.  All they did was move it from 3.6.2 (the equivalent of K&R
9.2) to 3.1.2.4.  I too have been bitten by not having the
standard say things in the expected place.  (Remember my
NULL!=NULL posting?  I got corrected by no less than dmr
himself!  Flaming red :-)

From K&R section 9.2:

"Any initializations of auto or register variables are performed
each time the block is entered at the top.  It is currently
possible (but a bad practice) to transfer into a block; in that
case the initializations are not performed."

Nothing could be clearer, eh?

---
Bill
novavax!proxftl!bill

bill@proxftl.UUCP (T. William Wells) (08/29/88)

In article <1988Aug28.031926.19222@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
: In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes:
: >[Duff's Device]
: >The false god of efficiency has reared it ugly head.  This
: >routine would be imagined to be more efficient than the almost
: >equivalent:  ...
: >     while (--count >= 0)
: >             *to++ = *from++;
: >
: >However, it often (always?) is not...
:
: On the contrary, it often (usually?) is.  This is from experience, not
: theory.  (Specifically, experience in C News and related code.)

Well, I'll bow to experimental evidence.  However, that does make
me wish that more compilers went to the effort to deal with this
kind (the "unoptimized" version) of loop, seeing as how it is so
commonly important.

: In general, however, in the long run the correct way to implement bulk
: data copying is to call "memcpy", which (in the long run) is likely
: to be recognized and given special attention by most compilers.

Agreed. That is how I would wish it to be done.

---
Bill
novavax!proxftl!bill

dave@onfcanim.UUCP (Dave Martindale) (08/30/88)

In article <1988Aug28.031926.19222@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>
>In general, however, in the long run the correct way to implement bulk
>data copying is to call "memcpy", which (in the long run) is likely
>to be recognized and given special attention by most compilers.

For the case of copying bytes, yes.  But there are other operations
done in tight loops that do a bit more work than a simple byte move,
so memcpy is not usable, yet little enough that unrolling the loop
is still worthwhile.  Thus Duff's Device will remain useful, until
all compilers are smart enough to do loop unrolling on their own.

For example: move while packing or unpacking data, table lookup, perhaps
CRC generation.

	Dave Martindale

td@alice.UUCP (Tom Duff) (08/30/88)

I normally do not read comp.lang.c, but Jim McKie told me
that ``Duff's device'' had come up in comp.lang.c again.  I
have lost the version that was sent to netnews in May 1984,
but I have reproduced below the note in which I originally
proposed the device.  (If anybody has a copy of the netnews
version, I would gratefully receive a copy at research!td or
td@research.att.com.)

To clear up a few points:
1)	The point of the device is to express general
	loop unrolling directly in C.  People who have
	posted saying `just use memcpy' have missed the
	point, as have those who have criticized it using
	various machine-dependent memcpy implementations
	as support.  In fact, the example in the message is
	not implementable as memcpy, nor is any computer
	likely to have an memcpy-like idiom that implements
	it.

2)	Somebody claimed that while the device was named
	for me, I probably didn't invent it.  I almost
	certainly did invent it.  I had definitely not
	seen or heard of it when I came upon it, and nobody
	has ever even claimed prior knowledge, let alone
	provided dates and times.  Note the headers on the
	message below:  apparently I invented the device
	on November 9, 1983, and was proud (or disgusted)
	enough to send mail to dmr.  Please note that I
	do not claim to have invented loop unrolling, merely
	this particular expression of it in C.

3)	The device is legal dpANS C.  I cannot quote chapter
	and verse, but Larry Rosler, who was chairman of the
	language subcommittee (I think), has assured me that X3J11
	considered it carefully and decided that it was legal.
	Somewhere I have a note from dmr certifying that all
	the compilers that he believes in accept it.  Of course,
	the device is also legal C++, since Bjarne uses it in
	his book.

4)	Somebody invoked (or more properly, banished) the
	`false god of efficiency.'  Careful reading of my
	original note will put this slur to rest.  The
	alternative to genuflecting before the god of
	code-bumming is finding a better algorithm.  It
	should be clear that none such was available.  If
	your code is too slow, you must make it faster.  If no
	better algorithm is available, you must trim cycles.

5)	The same person claimed that the device wouldn't exhibit
	the desired speed-up.  The argument was flawed in two
	regards:  first, it didn't address the performance of
	the device, but rather the performance of one of its
	few uses (implementing memcpy) for which many machines
	have a high-performance idiom.  Second, the poster
	made his claims in the absence of timing data, which
	renders his assertion suspect.  A second poster tried
	the test, but botched the implementation, proving
	only that with diligence it is possible to make anything
	run slowly.

6)	Even Henry Spencer, who hit every other nail square on
	the end with the flat round thing stuck to it, made a
	mistake (albeit a trivial one).  Here is Henry replying
	to bill@proxftl.UUCP (T. William Wells):
	>>... Dollars to doughnuts this code
	>>was written on a RISC machine.

	>Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

	I was at Lucasfilm when I invented the device.

7)	Transformations like this can only be justified by measuring the
	resulting code.  Be careful when you use this thing that you don't
	unwind the loop so much that you overflow your machine's instruction
	cache.  Don't try to be smarter than an over-clever C compiler that
	recognizes loops that implement block move or block clear and compiles
	them into machine idioms.

Here then, is the original document describing Duff's device:

From research!ucbvax!dagobah!td  Sun Nov 13 07:35:46 1983
Received: by ucbvax.ARPA (4.16/4.13)
	id AA18997; Sun, 13 Nov 83 07:35:46 pst
Received: by dagobah.LFL (4.6/4.6b)
	id AA01034; Thu, 10 Nov 83 17:57:56 PST
Date: Thu, 10 Nov 83 17:57:56 PST
From: ucbvax!dagobah!td (Tom Duff)
Message-Id: <8311110157.AA01034@dagobah.LFL>
To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr,
        ucbvax!research!rob

Consider the following routine, abstracted from code which copies an
array of shorts into the Programmed IO data register of an Evans &
Sutherland Picture System II:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		do
			*to = *from++;
		while(--count>0);
	}

(Obviously, this fails if the count is zero.)
The VAX C compiler compiles the loop into 2 instructions (a movw and
a sobleq, I think.)  As it turns out, this loop was the bottleneck in
a real-time animation playback program which ran too slowly by about 50%.
The standard way to get more speed out of something like this is to unwind
the loop a few times, decreasing the number of sobleqs.  When you do that,
you wind up with a leftover partial loop.  I usually handle this in C with
a switch that indexes a list of copies of the original loop body.  Of
course, if I were writing assembly language code, I'd just jump into the
middle of the unwound loop to deal with the leftovers.  Thinking about this
yesterday, the following implementation occurred to me:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

Disgusting, no?  But it compiles and runs just fine.  I feel a combination
of pride and revulsion at this discovery.  If no one's thought of it before,
I think I'll name it after myself.

It amazes me that after 10 years of writing C there are still little corners
that I haven't explored fully.  (Actually, I have another revolting way to
use switches to implement interrupt driven state machines but it's too
horrid to go into.)

Many people (even bwk?) have said that the worst feature of C is that
switches don't break automatically before each case label.  This code forms
some sort of argument in that debate, but I'm not sure whether it's for or
against.

			yrs trly
			Tom

chuck@amdahl.uts.amdahl.com (Charles Simmons) (08/30/88)

In article <2877@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes:
>In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>>    Since I posted my original question there has been a great deal of
>> abstract discussion about the relative merits of the loop unrolling
>> scheme.  The topic has piqued my curiousity, so I when ahead and
>> implemented a short test program, included below, to test Duff's
>> device against the ``ordinary for loop w/index variable'' technique.
>> See for yourself....   
>> 
>> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>> loop.  Your mileage may vary.  I realize that there may be other
>> tests, and if anyone has a better version, I'd like to see it!
>
>I modified this program to run under System V, changed the arrays to be dynam-
>ically allocated, and changed both the Duff and ordinary copies to use register
>pointers instead of global pointers (for the Duff copy) and array indexing (for
>the ordinary copy).  I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a
>Sun-4 both with and without -O optimization (using the standard pcc-type C
>compiler on each system).  The result?  Duff wins by about 10%-20% on all
>machines tested.

I then added a piece to the program to use 'memcpy'.  The results?
Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
Duff.  So why do people spend so much time avoiding standard subroutines?

-- Chuck

lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/30/88)

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
	[discussion of Duff copy deleted]
>I then added a piece to the program to use 'memcpy'.  The results?
>Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
>Duff.  So why do people spend so much time avoiding standard subroutines?

Sometimes the standard subroutines are implemented horribly.  I was
horrified when I saw that the machine dependent version of memcpy
on the AT&T 3Bs is nothing but a byte by byte transfer written in
assembly language.  It is tricky, but doable, to speed this up by a
roughly a factor of sizeof(long).  In fact it already is done in the
3B implementation of the UNIX(tm) operating system in the copyin (?)
routine.  Why wasn't it done in memcpy too?  Sigh.

-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM

nat@bales.UUCP (Nathaniel Stitt) (08/31/88)

In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>(Code below reproduced so that comp.arch people seeing this followup only
>won't get terminally frustrated. This is *really neat*, gang...)
>
>In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
>> 
>> void send(int *to,int *from, int count) {
>>    int n = (count + 7) / 8;
>>    
>>    switch(count % 8) {
>>       case 0:  do { *to++ = *from++;
>>       case 7:       *to++ = *from++;
>>       case 6:       *to++ = *from++;
>>       case 5:       *to++ = *from++;
>>       case 4:       *to++ = *from++;
>>       case 3:       *to++ = *from++;
>>       case 2:       *to++ = *from++;
>>       case 1:       *to++ = *from++;
>>                } while (--n > 0);
>>    }
>> 
>> }
>> 
>> Finally, Stroustrup asks the rhetorical question ``why would anyone
>> want to write something like this.''  Any guesses?!
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to be
>adjacent.
>

Here is my own personal version of the "Portable Optimized Copy" routine.
It certainly seems more clear than the above example, and I would expect
it to be at least as fast on virtually any machine.  (Note the use of division
and mod in the above routine.)

Code does NOT have to be ugly to be fast.


/* Copy 'count' ints from *from to *to. */
void nats_send(to, from, count)
register int *to;
register int *from;
register int count;
{
	/* Copy the bulk of the data */
	while(count >= 8)
	{
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		count -= 8;
	}

	/* Copy the rest. */

	if(count >= 4)
	{
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;

		count -= 4;
	}

	if(count >= 2)
	{
		*to++ = *from++;
		*to++ = *from++;

		count -= 2;
	}

	if(count)
		*to++ = *from++;
}

Hope this helps.

-- 
Nathaniel Stitt           | This life is a test.  It is only a test.  Had
Guidelines Software, Inc. | this been an actual life, you would have received
ucbvax!cogsci!z!nat       | further instructions as to what to do and where
(415) 376-1395            | to go.

henry@utzoo.uucp (Henry Spencer) (08/31/88)

In article <8144@alice.UUCP> td@alice.UUCP (Tom Duff) writes:
> [me]	>Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.
>
> [td]	I was at Lucasfilm when I invented the device.

Mea culpa.  I remembered that the article came from Bell Labs, but I
forgot the little note on the end saying that the invention happened
at Lucasfilm.  Hokay, ve correct it:

	Nope.  Lucasfilm used VAXen and 68Ks, mostly.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bill@proxftl.UUCP (T. William Wells) (08/31/88)

In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'.  The results?
: Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
: Duff.  So why do people spend so much time avoiding standard subroutines?

Try some history, bud; it's good for what ails you.

I doubt that memcpy even existed then; and it is *not* standard
now.  Perhaps it will be several years after the ANSI standard is
adopted, but not till then.

---
Bill
novavax!proxftl!bill

earl@mips.COM (Earl Killian) (09/01/88)

   In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
   : I then added a piece to the program to use 'memcpy'.  The results?
   : Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
   : Duff.  So why do people spend so much time avoiding standard subroutines?

   Try some history, bud; it's good for what ails you.

   I doubt that memcpy even existed then; and it is *not* standard
   now.  Perhaps it will be several years after the ANSI standard is
   adopted, but not till then.

Perhaps a better reason:

According to the author, that code was used for copying to a 16-bit IO
device.  It would have be illegal to use memcpy or bcopy because they
would make word references to the device.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (09/01/88)

In article <1002@cbnews.ATT.COM>, lvc@cbnews.ATT.COM (that's me) wrote:
> Sometimes the standard subroutines are implemented horribly.  I was
> horrified when I saw that the machine dependent version of memcpy
> on the AT&T 3Bs is nothing but a byte by byte transfer written in
> assembly language. ...

Oops.  I should have said this was on an early version (~1984) of
a 3B5.  New 3Bs don't have this deficiency.

-- 
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM

hankd@pur-ee.UUCP (Hank Dietz) (09/01/88)

In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> Here is my own personal version of the "Portable Optimized Copy" routine.
.... then he gives a rather verbose, but structured, encoding....

As long as we're getting into structured, portable, hacks, let me suggest
the following two ways of doing block copy:

1.	If the number of items/bytes is known at compile time, then you can
	define a struct type of the appropriate size and use struct assign.
	with type casts to make it fly.  For example, suppose p and q are
	pointers to ints and I want to copy 601 ints from p to q.  Then I
	can write the fast and surprizingly portable:

	struct t601 { int t[601]; };
	*((struct t601 *) q) = *((struct t601 *) p);

	Of course, you do have to watch-out for alignment problems, but
	if your compiler doesn't generate very fast code for this....

2.	If the number of items/bytes is not known, then build a binary tree of
	such structs and copy half, then half of what remains, etc.  This is
	funny looking, but very fast also.  Suppose the number of ints (n) is
	not known at compile time, but can't be more than 601.  You can write:

	struct t512 { int t[512]; };
	struct t256 { int t[256]; };
	struct t128 { int t[128]; };
	struct t64 { int t[64]; };
	struct t32 { int t[32]; };
	struct t16 { int t[16]; };
	struct t8 { int t[8]; };
	struct t4 { int t[4]; };
	struct t2 { int t[2]; };
	if (n & 512) {
		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
	}
	if (n & 256) {
		*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
	}
	if (n & 128) {
		*((struct t128 *) q) = *((struct t128 *) p); q+=128; p+=128;
	}
	if (n & 64) {
		*((struct t64 *) q) = *((struct t64 *) p); q+=64; p+=64;
	}
	if (n & 32) {
		*((struct t32 *) q) = *((struct t32 *) p); q+=32; p+=32;
	}
	if (n & 16) {
		*((struct t16 *) q) = *((struct t16 *) p); q+=16; p+=16;
	}
	if (n & 8) {
		*((struct t8 *) q) = *((struct t8 *) p); q+=8; p+=8;
	}
	if (n & 4) {
		*((struct t4 *) q) = *((struct t4 *) p); q+=4; p+=4;
	}
	if (n & 2) {
		*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2;
	}
	if (n & 1) *q = *p;

	Notice that, in this case, n, p, and q should be declared as being
	register variables and that p and q are altered by this routine.  Of
	course, you can copy larger things by making larger power-of-2 sized
	structs.

	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
	the usual copy loop.  Unfortunately, the above code should have been
	written as:

	if (n & 512) {
		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
	}
	...

	but, for some unknown reason, the VAX C compiler didn't like that.


Enjoy.
					hankd@ee.ecn.purdue.edu

gwyn@smoke.ARPA (Doug Gwyn ) (09/01/88)

In article <2945@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>According to the author, that code was used for copying to a 16-bit IO
>device.  It would have be illegal to use memcpy or bcopy because they
>would make word references to the device.

Wrong reason.  memcpy() etc. don't have the right behavior for stashing
a sequence of values into the same destination location.

loren@pixar.UUCP (Loren Carpenter) (09/01/88)

The Duff Loop (as far as I know) was first cast into C by Tom Duff when
he was at Lucasfilm in the early 1980's.  We used it at Lucasfilm wherever
we needed reasonable speed without resorting to assembly language.  It
obviously generalizes to more than memory copy & clear.

I and others have used this control construct for many years, but always
in assembly language.  I learned it from Howard Schmeising at Boeing in 1969,
where we were writing optimal stack code for CDC 6600's (a $7M RISC machine).
We could get 5+ 60-bit Mflops if we worked at it.

			Loren Carpenter
			...{ucbvax,sun}!pixar!loren

mouse@mcgill-vision.UUCP (der Mouse) (09/01/88)

In article <1988Aug28.031926.19222@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes:
>> [Duff's Device]
>> The false god of efficiency has reared it ugly head.
> In general, however, in the long run the correct way to implement
> bulk data copying is to call "memcpy", which (in the long run) is
> likely to be recognized and given special attention by most
> compilers.

Ultimately, Henry is correct (as usual).  However, for people who want
their code to run fast today, when memcpy() is a real routine call, it
is worthwhile to actually time things.  In particular, if you have a
small amount of data to move, the routine call overhead may well be
enough to completely swamp the speedup obtained by using a highly-tuned
copy routine, particularly on machines like the VAX with high call
overhead.

Moral: If you care, take the time to find out how it *really* is.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

jim.nutt@p11.f15.n114.z1.fidonet.org (jim nutt) (09/01/88)

 ->    In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> 
 -> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
 ->    : I then added a piece to the program to use 'memcpy'.  The results?
 ->    : Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
 ->    : Duff.  So why do people spend so much time avoiding standard 
 -> subroutines?
 -> 
 ->    Try some history, bud; it's good for what ails you.
 -> 
 ->    I doubt that memcpy even existed then; and it is *not* standard
 ->    now.  Perhaps it will be several years after the ANSI standard is
 ->    adopted, but not till then.

memcpy() IS in the ansi standard library, along with a similar function called memmove().  admittedly, the standard is not finalized yet but it is scheduled to be so soon.  i would think that it would behoove compiler vendors to begin tracking the standard now if they aren't already...

jim nutt
'the computer handyman'


--  
St. Joseph's Hospital/Medical Center - Usenet <=> FidoNet Gateway
Uucp: ...ncar!noao!asuvax!stjhmc!15.11!jim.nutt
Internet: jim.nutt@p11.f15.n114.z1.fidonet.org

rodc@unet.unet.pacbell.COM (Rod Creason) (09/01/88)

Newsgroups: comp.lang.c,comp.arch
Subject: Re: Explanation, please!
Summary: 
Expires: 
References: <638@paris.ics.uci.edu> <dpmuY#2EBC4R=eric@snark.UUCP> <189@bales.UUCP> <9064@pur-ee.UUCP>
Sender: 
Reply-To: rodc@unet.PacBell.COM (Rod Creason)
Followup-To: 
Distribution: 
Organization: Network Equipment Technologies, Redwood City, CA
Keywords: byte copy

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:

>2.	If the number of items/bytes is not known, then build a binary tree of
>	such structs and copy half, then half of what remains, etc.  This is
>	funny looking, but very fast also.  Suppose the number of ints (n) is
>	not known at compile time, but can't be more than 601.  You can write:
>
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
>	struct t128 { int t[128]; };
>	struct t64 { int t[64]; };
>	struct t32 { int t[32]; };
>	struct t16 { int t[16]; };
>	struct t8 { int t[8]; };
>	struct t4 { int t[4]; };
>	struct t2 { int t[2]; };
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
>	}
> [ rest of the example ]

A key point is that this code is *not* portable unless you can *guarantee*
that q and p are ALWAYS aligned at a four-byte boundary.  Any 3b2 (WE32000),
for example, will core dump if this is not the case.  Although 68000 based 
computers only require even alignment, they would still fail in most cases
where this routine would be used.

The key to a fast memory-to-memory copy is getting to these boundaries.
(and in the case where q or p is an odd address and the other is at an 
even address, none of this will work)

>					hankd@ee.ecn.purdue.edu

Rod Creason
...!{ames,amdahl,oliveb,pacbell}!unet!rodc

pf@csc.ti.com (Paul Fuqua) (09/01/88)

    Date: Wednesday, August 31, 1988  9:13pm (CDT)
    From: loren at pixar.UUCP (Loren Carpenter)
    Subject: Re: Explanation, please!
    Newsgroups: comp.lang.c,comp.arch

    The Duff Loop (as far as I know) was first cast into C by Tom Duff when
    he was at Lucasfilm in the early 1980's.  We used it at Lucasfilm wherever
    we needed reasonable speed without resorting to assembly language.  It
    obviously generalizes to more than memory copy & clear.

There was a discussion of the device in comp.protocols.tcp-ip not too long
ago, as a way to speed up the calculation of checksums.  The one point that I
haven't seen brought out here yet is that if the unrolling factor is a power
of two, the divide/remainder operations are simply a shift and a mask.

                              pf

Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet:  pf@csc.ti.com (ARPA too, sometimes)
UUCP:   {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf

klg@njsmu.UUCP (Kenneth Goodwin) (09/02/88)

In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> > Here is my own personal version of the "Portable Optimized Copy" routine.
> 2.	If the number of items/bytes is not known, then build a binary tree of
> 	such structs and copy half, then half of what remains, etc.  This is
> 	struct t512 { int t[512]; };
> 	struct t256 { int t[256]; };
> 	struct t128 { int t[128]; };
	.... etc .....
> 	if (n & 512) {
> 		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
> 	}
> 	if (n & 256) {
> 		*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
> 	}
	...  etc ...
> 	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> 	the usual copy loop.  Unfortunately, the above code should have been
> 	written as:
> 
> 	if (n & 512) {
> 		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
> 	}
> 	...

	BUT This is where UNIONS come in handy, I used a similar although
	more brief technique for a faster version of a bmov() (byte move)
	subroutine on our PDP11-70 a while ago, and subsequently ported
	it to memcpy when we updated from V6 to System V.
	The basic idea that was used is to create a union of long, int,
	(short), and char pointers, use the character pointer to achieve
	the needed alignments and then use the largest available pointer
	to do the copy. There is no reason why a stucture copy could not be
	used, although I suspect on NON-VAX systems it may actually 
	be detremental (sp?) in some cases.
	The PDP11 C compiler used to stuff registers onto the stack
	and create a 16 bit word copy loop to do structure copies
	using the freed registers, restoring them when it was done.
	So a structure copy would be the same as a word copy on that style
	of a system (ie, ones without block move instructions)

	So In the case of your example, a modified brief version of it
	would be:

		union ptr_types {
			struct t512 { int t512[512] } *t512;
			....
			struct t32 { int t32[32] } *t32;
			long	*t_long;
			int	*t_int;
			short	*t_short;
			char	*t_char;
		} ;

		(probably could dispense with long and short pointers
		and related tests)

	memcpy(a, b, len)
	char *a; *b;
	{
		register union ptr_types a_ptr, b_ptr;

		a_ptr.t_char = a;
		b_ptr.t_char = b;

		while(NOT ON A WORD BOUNDARY AND CHARS LEFT) {
			*a_ptr.t_char++ = *b_ptr.t_char++;
			len--;
		}
		if(len >= sizeof(int) * 512) {
			/* if we can use a 512 int structure copy */
			*a_ptr.t512++ = *b_ptr.t512++;
			len -= (512 * sizeof(int));
		}
		/*M the biggest win is that the pointers increment correctly
		len -= (sizeof(*element pointer)) is the correct form over
		N INTS * sizeof int */

		.......
		I guess the rest is obvious, some GLUE may be needed
		that has not be shown.... :-)
		Boundaries should be checked on source and destination addresses
		to avoid memory faults....
		As you may be given incompatible source and destination address
		that may require a full char by char copy. The first
		test loop sort of does this, but all the other copies
		should also check for proper address alignments before
		proceeding.
Ken Goodwin
NJSMU.

andrew@frip.gwd.tek.com (Andrew Klossner) (09/02/88)

Nathaniel Stitt writes:

	"Here is my own personal version of the "Portable Optimized
	Copy" routine.  It certainly seems more clear than the above
	example, and I would expect it to be at least as fast on
	virtually any machine."

then goes on to present a routine which uses follow-on code to handle
the last few bytes after all octets have been copied.  It's cleaner
code, but it won't be quite as fast on many systems with instruction
caches because it has fifteen byte-move instructions, replacing eight
in the original, so more time is spent loading the loop into the
i-cache.  On systems with very small i-caches (my favorite example is
the IBM 360/91 with 16 bytes), the bigger loop may not all fit into
cache, and would be considerably slower.

Several contributors have suggested that unrolling a byte-copy loop is
a win.  On some architectures it is, but on a good pipelined system it
may not be.  As an example, the program fragment

	while (count--) {
		to[i] = from[i];
		++i;
	}

can be compiled to code on the M88k which copies memory as fast as a
DMA controller could; the instructions to decrement, increment, and
branch overlap with the data load/store requests.

[If everything's in registers, indexing in this case is actually faster
than keeping separate "to" and "from" pointers and incrementing both.]

This assumes that "to" and "from" are pointers-to-ints or
pointers-to-doubles.  Copying less than a word at a time is slower.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

jabir@quintus.uucp (Jabir Hussain) (09/02/88)

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
[...]
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
>	}
[...]
>	                      Unfortunately, the above code should have been
>	written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...
>
>	but, for some unknown reason, the VAX C compiler didn't like that.


one portable way around that is to do something like:

	typedef char *caddr_t;
	typedef union {
	    struct { int t[512]; } t512;
	    struct { int t[256]; } t256;
	    caddr_t               caddr;
	} ustruct_t;

	{
	    register ustruct_t src,dst;

	    src.caddr = (caddr_t) p;
	    dst.caddr = (caddr_t) q;

	    if (n & 512) *dst.t512++ = *src.t512++;
	    ...
	}

jh

rick@pcrat.UUCP (Rick Richardson) (09/02/88)

I did a quick comparison of the times for "memcpy(,,n*4)", Duff's
device, Nathaniel's "nat", and Hank's versions.  Here's the results:

TEST	MACHINE				ATT cc	VENIXcc	GNU1.26
memcpy	8Mhz 286, Venix SVR2		 3.35	11.15	-
duff	8Mhz 286, Venix SVR2		16.58	 4.56	-
nat	8Mhz 286, Venix SVR2		16.73	 5.11	-
hank	8Mhz 286, Venix SVR2		 1.83	 ccbarf	-
memcpy	16Mhz 386, 386/ix 1.0.4		  .82	  -	 .82
duff	16Mhz 386, 386/ix 1.0.4		 2.01	  -	2.26
nat	16Mhz 386, 386/ix 1.0.4		 1.95	  -	2.21
hank	16Mhz 386, 386/ix 1.0.4		  .86	  -	 .89

		* Everything was compiled with -O
		* test moved 2000,1999,...,2,1 int's

I learned this: Venix obviously doesn't have a good memcpy, and
Duff, or nat make a good replacement.  AT&T uses the block move
in memcpy, and it is a definite win on both machines. The AT&T
286 C compiler is terrible for these program on the 286.  AT&T
still has a slight edge over gcc *for these programs*. Hanks
version, while not portable, is a clear win on the 286 (disassemble
memcpy and hank's to see why), and virtually as good as memcpy on 386.

Conclusion: you can't make any.  It all depends on your computer
and compiler.
-- 
		Rick Richardson, PC Research, Inc.
		rick%pcrat.uucp@uunet.uu.net (INTERNET)
		   uunet!pcrat!rick (UUCP, Personal Mail)
..!pcrat!jetroff (JetRoff Info)		..!pcrat!dry2 (Dhrystone Submissions)

mouse@mcgill-vision.UUCP (der Mouse) (09/02/88)

In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
>> ["Portable Optimized Copy" routine]
> As long as we're getting into structured, portable, hacks, let me
> suggest the following two ways of doing block copy:

> 1.    If the number of items/bytes is known at compile time, then you
>       can define a struct type of the appropriate size and use struct
>       assignment with type casts to make it fly.  For example,

[ int *p; int *q; /* want to copy 601 ints from p to q */ ]
>       struct t601 { int t[601]; };
>       *((struct t601 *) q) = *((struct t601 *) p);

Surprise!  Your compiler has sizeof(int)=2, sizeof(long)=4, and aligns
structs on long boundaries.  You just copied 602 ints!

>       Unfortunately, the above code should have been written as:

>       if (n & 512) {
>               *(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>       }

>       but, for some unknown reason, the VAX C compiler didn't like
>       that.

Perhaps the "unknown reason" is that it's illegal!  An expression
formed by casting another expression is not an lvalue and never has
been.  Any compiler that accepts that is broken.  (Yes, there exist
such broken compilers.)

What would you expect from

    double d;

    ((int)d) ++;

Perhaps you'd want d+=1?  Perhaps d=1+(int)d?  Perhaps
*(int*)&d=1+(int)d even?

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

chris@mimsy.UUCP (Chris Torek) (09/02/88)

-In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
->		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
->	but, for some unknown reason, the VAX C compiler didn't like that.

In article <339@quintus.UUCP> jabir@quintus.uucp (Jabir Hussain) writes:
-one portable way around that is to do something like:
-
-	typedef char *caddr_t;
-	typedef union {
-	    struct { int t[512]; } t512;
-	    struct { int t[256]; } t256;
-	    caddr_t               caddr;
-	} ustruct_t;
	...
-	    if (n & 512) *dst.t512++ = *src.t512++;

`Portable'?  I suggest you try it on a Data General MV/10000.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

alverson@decwrl.dec.com (Robert Alverson) (09/03/88)

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>	Incidentally, this ran about 8x faster (on a VAX 11/780) than using
>	the usual copy loop.  Unfortunately, the above code should have been
>	written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...
>
>	but, for some unknown reason, the VAX C compiler didn't like that.
>
>
>Enjoy.
>					hankd@ee.ecn.purdue.edu
Actually, the VAX C compiler was completely correct.  Attempting
to post-increment the result of a cast is not legal C.  One way
around this is to use a little extra indirection:

    char* p;

    /* was:   ((int *)p)++; */
    (*((int **) &p))++;		/* phew! */

The reasoning is that the result of a cast is an expression, not a
lvalue.  Quite often, compilers relax this rule for coercions which
produce no code.  In this case, it looks like the VAX C doesn't

Bob

hankd@pur-ee.UUCP (Hank Dietz) (09/05/88)

Relative to my article <9064@pur-ee.UUCP> giving a struct-based hack for
doing block copy and to all followups thereof....

Correction:

I hereby appologize for having ommitted a (:-) and hence having made it
sound like I didn't know why the VAX compiler didn't want to use a cast
pointer as the operand of ++.  I do know (and I knew even before a dozen of
you reminded me :-), I just think it should be allowed in such cases; i.e.,
be implementation dependent rather than illegal.  Anyway, the union-based fix
given by Jabir is certainly a good way to do it (once you fix his typedef to
make t512 and t256 pointers to structs, rather than structs).

Comments:

As I noted in the original posting, the use of a structure-hack block copy is
subject to alignment problems -- if the target machine has alignment
constraints.  There are three additional notes on this:

1.	Both the source AND destination pointers may need to be aligned,
	which implies that the difference between the pointers MUST BE A
	MULTIPLE OF THE ALIGNMENT FACTOR.  Arbitrary copies will not have
	this property, hence, on such machines one must always have a
	byte-oriented copy routine to fall back on.

	For a 4-byte alignment machine, the test is something like:

	if ((p - q) & 3) *byte copy* else *struct copy*

2.	If the difference between the pointers is a multiple of the machine
	alignment factor, then copying up until one is aligned will insure
	alignment of both.  Hence, we simply need to insert a few moves to
	achieve alignment BEFORE the struct copy code I gave earlier.

	For a 4-byte alignment machine, the additional code is:

	if ((n >= 1) && (p & 1)) {
		*q = *p; ++q; ++p; --n;
	}
	if ((n >= 2) && (p & 2)) {
		*((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; n-=2;
	}

	Notice that, with the above code as a prefix, trailing alignment is
	also satisfied (proof left as an exercise to the reader).

3.	Although I know of no such C compiler, it is legal for a C compiler
	to make the size of a struct anything big enough to hold the things
	inside it.  Hence, for example, a struct containing an array of 4
	things might be the size of 5 things....  If this were done, the
	struct copy would nearly work; the only problem is that it could
	overshoot the end of the block.  There is no general fix for this.

						hankd@ee.ecn.purdue.edu

pardo@june.cs.washington.edu (David Keppel) (09/07/88)

hankd@pur-ee.UUCP (Hank Dietz) writes:
>	if ((p - q) & 3) *byte copy* else *struct copy*

I believe that the VAX "movc" command takes arbitrary pointers and
does the following:

* If both are word-aligned, do a word copy (I mean a 4-byte word).
* If both are non-aligned and could be aligned with 1, 2, or 3 bytes
  of byte-copy at either end, then do a byte copy at either end and do
  a word copy down the middle.
* If niether aligned then ??

Unfortunately, my VAX hardware reference is out of town for a couple
of weeks, so I can't ask him about neither aligned.  Anybody know?

I can immagine that on some machines it is faster to copy words into
register and repack the words in the registers rather than do a byte
copy, since you could be taking advantage of some hardware gak.

Simple example:

    machine X has register W1 divided into B4, B5, B6, B7.  To do a
    copy, align the source pointer (doing byte copies) then read a
    wrod-at-a-time into the W1 register, write it back out by writing
    B4, B5, B6, B7  (little-endian).

This is beginning to look suspiciously like the kinds of optimizations
that get done for bit BLTs.  Anybody know if this ever really gets
done?

	;-D on  ( Ahh.  Architecture at its finest )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

rik@june.cs.washington.edu (Rik Littlefield) (09/07/88)

In article <5654@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> 
> I can immagine that on some machines it is faster to copy words into
> register and repack the words in the registers rather than do a byte
> copy, since you could be taking advantage of some hardware gak.
> 

On the old CDC 6000-series machines (early RISCs...) that was the *only*
practical way to do it, as well as being blazingly fast.  We had copies
that would handle arbitrary *bit* alignments at a cost of around 6 instructions
and 2 memory references per 60-bit word, in the middle of the string.  
The sequence was basically fetch, shift, mask, mask, OR, and store, 
appropriately rearranged to minimize memory delay and functional unit 
conflicts, of course.  I vaguely remember that this thing could even
be unrolled a couple of times and still fit in the instruction cache
("stack", in those days) for machines expensive enough to have one.

VAXen I don't know about for sure, but I'd be real surprised if their
microcode didn't do the same thing.

--Rik

news@amdcad.AMD.COM (Network News) (09/07/88)

In article <5654@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
| hankd@pur-ee.UUCP (Hank Dietz) writes:
| >	if ((p - q) & 3) *byte copy* else *struct copy*
| 
| I believe that the VAX "movc" command takes arbitrary pointers and
| does the following:
| 
| * If both are word-aligned, do a word copy (I mean a 4-byte word).
| * If both are non-aligned and could be aligned with 1, 2, or 3 bytes
|   of byte-copy at either end, then do a byte copy at either end and do
|   a word copy down the middle.
| * If niether aligned then ??
| 
| Unfortunately, my VAX hardware reference is out of town for a couple
| of weeks, so I can't ask him about neither aligned.  Anybody know?

I assume that "both are non-aligned" in the above list means that the
source and the destination have the same alignment, but are not aligned
with respect to a 4-byte boundary, and that "neither aligned" means that
the source and the destination are misaligned.

We use an interesting trick in the Am29000 memcpy routine for
source/destination misalignment.  In this case, we set up the alignment
difference in the funnel-count register, read in two source words, and
"extract" a destination word using the funnel-shifter's ability to
extract any 32-bit word from a 64-bit double-word in a single cycle. 
The inner loop then consists of shifting the low source word to the high
source word, reading a new low source word, extracting a destination and
storing it (well, there is also the overhead of counting down the
correct number of "word" moves, but you get the idea.)

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

bill@proxftl.UUCP (T. William Wells) (09/08/88)

In article <485.231DAAC3@stjhmc.fidonet.org> jim.nutt@p11.f15.n114.z1.fidonet.org (jim nutt) writes:
:  ->    I doubt that memcpy even existed then; and it is *not* standard
:  ->    now.  Perhaps it will be several years after the ANSI standard is
:  ->    adopted, but not till then.
:
: memcpy() IS in the ansi standard library, along with a similar
: function called memmove().

Yes, I know.  I said "it isn't standard", not "it isn't in the
standard".  The difference is that "it isn't standard" means that
you can't write portable code that assumes that it exists; the
other should be obvious.

---
Bill
novavax!proxftl!bill

mouse@mcgill-vision.UUCP (der Mouse) (09/11/88)

In article <339@quintus.UUCP>, jabir@quintus.uucp (Jabir Hussain) writes:
> In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>>	but, for some unknown reason, the VAX C compiler didn't like that.

> one portable way around that is to do something like:
[summarized]
> union { struct { int t[512]; } *t512; ...; char *caddr; }
[then use ++ on .t512]

This falls hopelessly flat if char * and struct {...} * have
significantly different representation.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu