jones@ingr.UUCP (Mark Jones) (09/21/88)
In article <8809092109.AA06071@tycho.yerkes.uchicago.edu>, pearce@TYCHO.YERKES.UCHICAGO.EDU ("Eric C. Pearce") writes: > > Personally, I prefer this "conventional" copy fragment: > > A = array1; > B = array2; > > n = Count / 8; > for (i = Count%8; i > 0; i--) > *A++ = *B++; > while (--n >= 0) { > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > *A++ = *B++; > } > > >From a performance standpoint, it falls only 2% behind duff device and > is 102% more readable and not "kludgy". How about this "conventional" copy fragment memcpy(array2,array1,sizeof(*A)*Count); /* copy array1 to array2 */ Almost every system has this call (I have never seen one that didn't) and if it was implemented well, it could be done in as few as 7 instructions, on a brain damaged microprocessor. mov cx,{Count| shl cx,1 /* for 2 byte objects */ shl cx,1 /* for 4 byte objects */ mov si,{address of array1} mov di,{address of array2} repnz movsw /* done */ or mov.l a0,?(a7) mov.l a1,?(a7) mov.? d0,Count; /* shift left once or twice */ loop: mov.b (a0)+,(a1)+ dbnz loop /* i don't remember the exact syntax here */ Mark Jones PS I don't guarantee syntax, but you should get the idea. Use the functions in the C library. If you need more speed, recode the library routines in assembly, and link them in ahead of the standard library. Don't write trashy C code to try to get better speed. It just ain't worth it, now or later. PPS Yes, I know the 68000 code could be made even better, very easily, (should check for alignment, use bigger moves, etc...) but this is only an example.
mouse@mcgill-vision.UUCP (der Mouse) (10/01/88)
In article <2585@ingr.UUCP>, jones@ingr.UUCP (Mark Jones) writes: > In article <8809092109.AA06071@tycho.yerkes.uchicago.edu>, pearce@TYCHO.YERKES.UCHICAGO.EDU ("Eric C. Pearce") writes: >> Personally, I prefer this "conventional" copy fragment: >> [code] > How about this "conventional" copy fragment > memcpy(array2,array1,sizeof(*A)*Count); As with most of these arguments, it really depends on the specific application. On some machines (such as the VAX), function call overhead is large enough that if you're copying small amounts of data (like ten or twelve bytes), it's faster to inline it. > Use the functions in the C library. If you need more speed, recode > the library routines in assembly, and link them in ahead of the > standard library. Not enough: even the function call overhead can swamp the time taken to perform the copy. > Don't write trashy C code to try to get better speed. It just ain't > worth it, now or later. Welcome to the real world, where we need to get the response packet out to the robot within 28 milliseconds or the whole thing comes to a screeching halt. If inlining a copy makes the difference between a working system and a not-quite-working system, it's worth it. This is not just my idea, either. I wrote a program to find out what's really happening. This was run on an otherwise free MicroVAX-II under 4.3BSD with the standard 4.3 pcc-based cc, with no optimization options enabled. The times given are real times, but they stayed stable across several runs, so they ought to be reasonably good. Here's what the various lines mean: Overhead: The time taken when there's no copy being done (ie, loop overhead). Fancy inline: The "conventional" copy recommended by "Eric C. Pearce", with all inner-loop variables declared register. Simple inline: An inline loop whose core is for (nb=...;nb>0;nb--) *to++ = *from++; where all three variables are declared register. asm inline: In-line asm() directives to implement a loop such as the one used for "Simple inline". Fxncall: Calling a function containing the same loop used by "Simple inline". Library: Calling memcpy(). All pointers are pointers to char, so copying copies one byte at a time. Here are the times resulting from 100000 iterations, copying 12 bytes each loop: Overhead = 2.3 usec/loop Fancy inline = 60.6 usec/loop Simple inline = 49.1 usec/loop asm inline = 40.1 usec/loop Fxncall = 74.3 usec/loop Library = 71.9 usec/loop And here, 10000 iterations, copying 512 bytes each loop: Overhead = 2 usec/loop Fancy inline = 979 usec/loop Simple inline = 1881 usec/loop asm inline = 1654 usec/loop Fxncall = 1877 usec/loop Library = 1794 usec/loop Moral: Nothing beats actually trying a few things and finding out what's best. Of course, "fastest" is not always "best", though if you care enough to start unrolling your block copies, speed is presumably important to you. In particular, note the differences depending on the amount of data being copied, so what you want to do depends on how much stuff you're typically copying. For the curious, the program I used appears below; to extract it, feed this through atob and uncompress. xbtoa Begin +.\K<d^U14V@K"L+F),(6RL%s(MJ1E#WT'H_1Zn#b]uK%9q*mL,u-R97Nb#F9V+@t1'L&S(bfBW/5s t+_C8'`'EM&UL("4m`h3V_#DAH]>C+HUZ6SS[U1.XYouRoH)*A#Y^r_5GN1;hD=O;?2_Msos$KRF3P @e6kU-a]g:uN.*;mgGt_L4Ihm+o'dl+@0CO:DQ&:%p%9@HL9]'KaP8J.YpXM\]tlJCr?Fc&*5YHH`M Gk&XJGI2&JX?ma+549F3'7"JAn#eY]n(C%_<NRjtb)aEbm[0K<h/9_Y\1Kh8,30uF>#,-Q'DS71P(8 \Y'2IPn*qf]_,\EIfhj1gp.m)H7K*8NKA#n6sTd1kCGMmb8@s'baP.-7(5n'U;&#VU].hg_@@=Fu.p 1$2SdOYu;kFFAR,+'kaX-K/HsiF!f=(ja8k$?Tr0)M]h2kTPg5Nn0nZR_]9`MIt('UQ$5TfHQ;4VKe kp56AqON]Om&A<0:d)'b%lp&RKpW2U!N*$2r47N&JU<nh*El.-nocN3cKIENLpdRmFtVc@f]7CUoke TmM5HJrjVB5W(<Kjo@PMWnH#oBga/IK2k!hgtY28Eb_g\R?Y97hXJqPuCkc5T"RLdA9RE<<oCIJs,p K:hQMak&G4KaW+l3@Q#THRs'[@#]74_7i5]cnAS6Ua8po!psn*<Xh!lRl`W&H)[!?>EI'1=]!;><#t b@O"16)))M8MB,hjgV_PXO=$fo<80"99c3l`1KS3]YBE\n_f(u/0&R]^S&;+;C5KPF_Yk)HpARDPZn S=7O,\K%-caf1\FRBC_`5[gn5L;9Pg>-ME?URJEe$aZ;W_luT]lbn`j225.:!?2a]Nm='HC9i[>#>n `S\L2R4KcD).X0-a=EtE+5nT5k\0ud8lYlJUbq?"'/7j"Qn#7>Y`70d#.TqFa(@=1dR@Z)f`7ic(VL 34BG=t9ebkbUEqgMV?@GK(bii<C03Sl#\4Dnf]b).@Jq_VAt<JZd^OKCH;jDqHmJ$8</uhq285E;!a m'<\h'Lb%mP)3<\2[cDa5DRg+ACT\EnFi`MG$7MG4!ZcH3;9'YfN6q^e!q"kAn%kKO_FlW"UdFL!)u g*2_@uBmo[Ke3I8d2S&V`OjZfecbE6(nD)7Me<8&,(e6!q?<Qr23MBV;R@)/n$pokF9-T(U\4P(ULL VRc8gmoC_o!sjuIL)2Cr$Z6N,=C-LZr7+(A>=sIgVGKPL[0M+(0S4)[O9bL#`c@f)TDX.>^\d,2IuE RKgU,Os&]2Ns-S.Ui%k*#4/s(AhHk$9*8-]K*&;*%!i!qtrX!qRn9-uL;q_T,(q6c;-0@ZL#ogV;"[ 6[Gj_P'Pe8-]_&OF]^Iaq,q&_>GHRK1=nZBai'N&P+F0-j5).X_9pMl;*?1RGl_modLp,d=Z!uTE<O 9^^.79L`'1O+aI`(QY"BE.AYr1l.8E5_GG9(jPn`C2MOkfd>Wtd@O$_C6oRRjNi0N)'@NO0Sk\`3<F fMcZcKmXr_HP$gmK/Q&Bc,l13dMd_]"Hc`XBMd@@A\K*b^p$mO=e/ xbtoa End N 1227 4cb E 6f S 24195 R d308a326 der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu