[comp.lang.c] "conventional" copy fragment?

jones@ingr.UUCP (Mark Jones) (09/21/88)

In article <8809092109.AA06071@tycho.yerkes.uchicago.edu>, pearce@TYCHO.YERKES.UCHICAGO.EDU ("Eric C. Pearce") writes:
> 
> Personally, I prefer this "conventional" copy fragment:
> 
>       A = array1;
>       B = array2;
> 
>       n = Count / 8;
>       for (i =  Count%8; i > 0; i--)
> 	 *A++ = *B++;
>       while (--n >= 0) {
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
> 	 *A++ = *B++; 
>       } 
> 
> >From a performance standpoint, it falls only 2% behind duff device and
> is 102% more readable and not "kludgy".

How about this "conventional" copy fragment

	memcpy(array2,array1,sizeof(*A)*Count);	/* copy array1 to array2 */

Almost every system has this call (I have never seen one that didn't)
and if it was implemented well, it could be done in as few as 7
instructions, on a brain damaged microprocessor.

	mov	cx,{Count|
	shl	cx,1			/* for 2 byte objects */
	shl	cx,1			/* for 4 byte objects */
	mov	si,{address of array1}
	mov	di,{address of array2}
	repnz
	movsw				/* done */

or
	mov.l	a0,?(a7)
	mov.l	a1,?(a7)
	mov.?	d0,Count;
	/* shift left once or twice */
loop:
	mov.b	(a0)+,(a1)+
	dbnz	loop		/* i don't remember the exact syntax here */

Mark Jones

PS I don't guarantee syntax, but you should get the idea.  Use the
functions in the C library.  If you need more speed, recode the library
routines in assembly, and link them in ahead of the standard library. 
Don't write trashy C code to try to get better speed.  It just ain't
worth it, now or later.

PPS Yes, I know the 68000 code could be made even better, very easily,
(should check for alignment, use bigger moves, etc...) but this is only
an example.

mouse@mcgill-vision.UUCP (der Mouse) (10/01/88)

In article <2585@ingr.UUCP>, jones@ingr.UUCP (Mark Jones) writes:
> In article <8809092109.AA06071@tycho.yerkes.uchicago.edu>, pearce@TYCHO.YERKES.UCHICAGO.EDU ("Eric C. Pearce") writes:
>> Personally, I prefer this "conventional" copy fragment:
>> [code]
> How about this "conventional" copy fragment
> 	memcpy(array2,array1,sizeof(*A)*Count);

As with most of these arguments, it really depends on the specific
application.  On some machines (such as the VAX), function call
overhead is large enough that if you're copying small amounts of data
(like ten or twelve bytes), it's faster to inline it.

> Use the functions in the C library.  If you need more speed, recode
> the library routines in assembly, and link them in ahead of the
> standard library.

Not enough: even the function call overhead can swamp the time taken to
perform the copy.

> Don't write trashy C code to try to get better speed.  It just ain't
> worth it, now or later.

Welcome to the real world, where we need to get the response packet out
to the robot within 28 milliseconds or the whole thing comes to a
screeching halt.  If inlining a copy makes the difference between a
working system and a not-quite-working system, it's worth it.

This is not just my idea, either.  I wrote a program to find out what's
really happening.  This was run on an otherwise free MicroVAX-II under
4.3BSD with the standard 4.3 pcc-based cc, with no optimization options
enabled.  The times given are real times, but they stayed stable across
several runs, so they ought to be reasonably good.  Here's what the
various lines mean:

Overhead: The time taken when there's no copy being done (ie, loop
	overhead).
Fancy inline: The "conventional" copy recommended by "Eric C. Pearce",
	with all inner-loop variables declared register.
Simple inline: An inline loop whose core is
	for (nb=...;nb>0;nb--) *to++ = *from++;
	where all three variables are declared register.
asm inline: In-line asm() directives to implement a loop such as the
	one used for "Simple inline".
Fxncall: Calling a function containing the same loop used by "Simple
	inline".
Library: Calling memcpy().

All pointers are pointers to char, so copying copies one byte at a time.

Here are the times resulting from 100000 iterations, copying 12 bytes
each loop:

Overhead = 2.3 usec/loop
Fancy inline = 60.6 usec/loop
Simple inline = 49.1 usec/loop
asm inline = 40.1 usec/loop
Fxncall = 74.3 usec/loop
Library = 71.9 usec/loop

And here, 10000 iterations, copying 512 bytes each loop:

Overhead = 2 usec/loop
Fancy inline = 979 usec/loop
Simple inline = 1881 usec/loop
asm inline = 1654 usec/loop
Fxncall = 1877 usec/loop
Library = 1794 usec/loop

Moral: Nothing beats actually trying a few things and finding out
what's best.  Of course, "fastest" is not always "best", though if you
care enough to start unrolling your block copies, speed is presumably
important to you.  In particular, note the differences depending on the
amount of data being copied, so what you want to do depends on how much
stuff you're typically copying.

For the curious, the program I used appears below; to extract it, feed
this through atob and uncompress.

xbtoa Begin
+.\K<d^U14V@K"L+F),(6RL%s(MJ1E#WT'H_1Zn#b]uK%9q*mL,u-R97Nb#F9V+@t1'L&S(bfBW/5s
t+_C8'`'EM&UL("4m`h3V_#DAH]>C+HUZ6SS[U1.XYouRoH)*A#Y^r_5GN1;hD=O;?2_Msos$KRF3P
@e6kU-a]g:uN.*;mgGt_L4Ihm+o'dl+@0CO:DQ&:%p%9@HL9]'KaP8J.YpXM\]tlJCr?Fc&*5YHH`M
Gk&XJGI2&JX?ma+549F3'7"JAn#eY]n(C%_<NRjtb)aEbm[0K<h/9_Y\1Kh8,30uF>#,-Q'DS71P(8
\Y'2IPn*qf]_,\EIfhj1gp.m)H7K*8NKA#n6sTd1kCGMmb8@s'baP.-7(5n'U;&#VU].hg_@@=Fu.p
1$2SdOYu;kFFAR,+'kaX-K/HsiF!f=(ja8k$?Tr0)M]h2kTPg5Nn0nZR_]9`MIt('UQ$5TfHQ;4VKe
kp56AqON]Om&A<0:d)'b%lp&RKpW2U!N*$2r47N&JU<nh*El.-nocN3cKIENLpdRmFtVc@f]7CUoke
TmM5HJrjVB5W(<Kjo@PMWnH#oBga/IK2k!hgtY28Eb_g\R?Y97hXJqPuCkc5T"RLdA9RE<<oCIJs,p
K:hQMak&G4KaW+l3@Q#THRs'[@#]74_7i5]cnAS6Ua8po!psn*<Xh!lRl`W&H)[!?>EI'1=]!;><#t
b@O"16)))M8MB,hjgV_PXO=$fo<80"99c3l`1KS3]YBE\n_f(u/0&R]^S&;+;C5KPF_Yk)HpARDPZn
S=7O,\K%-caf1\FRBC_`5[gn5L;9Pg>-ME?URJEe$aZ;W_luT]lbn`j225.:!?2a]Nm='HC9i[>#>n
`S\L2R4KcD).X0-a=EtE+5nT5k\0ud8lYlJUbq?"'/7j"Qn#7>Y`70d#.TqFa(@=1dR@Z)f`7ic(VL
34BG=t9ebkbUEqgMV?@GK(bii<C03Sl#\4Dnf]b).@Jq_VAt<JZd^OKCH;jDqHmJ$8</uhq285E;!a
m'<\h'Lb%mP)3<\2[cDa5DRg+ACT\EnFi`MG$7MG4!ZcH3;9'YfN6q^e!q"kAn%kKO_FlW"UdFL!)u
g*2_@uBmo[Ke3I8d2S&V`OjZfecbE6(nD)7Me<8&,(e6!q?<Qr23MBV;R@)/n$pokF9-T(U\4P(ULL
VRc8gmoC_o!sjuIL)2Cr$Z6N,=C-LZr7+(A>=sIgVGKPL[0M+(0S4)[O9bL#`c@f)TDX.>^\d,2IuE
RKgU,Os&]2Ns-S.Ui%k*#4/s(AhHk$9*8-]K*&;*%!i!qtrX!qRn9-uL;q_T,(q6c;-0@ZL#ogV;"[
6[Gj_P'Pe8-]_&OF]^Iaq,q&_>GHRK1=nZBai'N&P+F0-j5).X_9pMl;*?1RGl_modLp,d=Z!uTE<O
9^^.79L`'1O+aI`(QY"BE.AYr1l.8E5_GG9(jPn`C2MOkfd>Wtd@O$_C6oRRjNi0N)'@NO0Sk\`3<F
fMcZcKmXr_HP$gmK/Q&Bc,l13dMd_]"Hc`XBMd@@A\K*b^p$mO=e/
xbtoa End N 1227 4cb E 6f S 24195 R d308a326

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu