djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)
A recent article on language idioms gave the following C code
for a string copy:
    while (*s++ = *t++);
Serious C hackers should know that on VAX 4.2 BSD UNIX this
code produces a loop with 50% more assembler instructions
than the slightly clearer sequence:
    while ((*s = *t) != '\0')
    { s++;
      t++;
    }
This is true whether or not the object-code improver
is invoked, and may be true on other machines as well.
djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)
The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
to the less efficient sequence:
    while (*s++ = *t++);
Perhaps it should be changed.
In the meantime, if your application does a great deal of
string copying, and you want to minimize execution time,
you should write your own string copy routine.
djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)
> The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
> to the less efficient sequence:
>     while (*s++ = *t++);
> Perhaps it should be changed.

SORRY for this error. The idiom "while (*s++ = *t++);" generates
the fastest possible code if s and t are declared to be registers,
which they are in the system version of strcpy. But note that if
s and t are not in registers then the sequence:
    while (*s = *t) {s++; t++;}
is more efficient.
henry@utzoo.UUCP (Henry Spencer) (03/08/85)
> In the meantime, if your application does a great deal of
> string copying, and you want to minimize execution time,
> you should write your own string copy routine.

If you do so, please put a comment on your private routine to explain
why it's there, so the people maintaining it fifteen years from now
don't tear their hair out trying to discover why you re-implemented
a standard library routine. Also, please put a note about it in the
documentation for the application. Your routine may well be slower
on the next machine it gets ported to.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
ark@alice.UUCP (Andrew Koenig) (03/09/85)
If s and t are char pointers in registers,

    while (*s++ = *t++) ;

generates the best code I could possibly imagine.

    while ((*s = *t) != '\0') {s++; t++;}

is considerably worse. Try it with register variables on your compiler.
ggs@ulysses.UUCP (Griff Smith) (03/09/85)
1) I don't think the second idiom is any clearer.

2) On my VAX compiler, the code for the "easier to read" idiom is
worse than that for the compact one. The only advantage of the
long-winded idiom is that it doesn't change much if you neglect to
declare register variables; it's a bit slow either way.

The following test cases show the difference:

-----
test(osp, isp)
char *osp, *isp;
{
    register char *ra, *rb;
    char *ma, *mb;

    ra = osp;
    rb = isp;
    /* case 1, register pointers */
    while (*ra++ = *rb++);
/* assembly code
L16:
    movb    (r10)+,(r11)+
    jeql    L17
    jbr     L16
L17:
*/
    ma = osp;
    mb = isp;
    /* case 2, memory pointers */
    while (*ma++ = *mb++);
/* assembly code
L18:
    movl    -8(fp),r0
    incl    -8(fp)
    movl    -4(fp),r1
    incl    -4(fp)
    movb    (r0),(r1)
    jeql    L19
    jbr     L18
L19:
*/
    ra = osp;
    rb = isp;
    /* case 3, register pointers, "easy to read" loop */
    while ((*ra = *rb) != '\0') {
        ra++;
        rb++;
    }
/* assembly code
L20:
    movb    (r10),(r11)
    jeql    L21
    incl    r11
    incl    r10
    jbr     L20
L21:
*/
    ma = osp;
    mb = isp;
    /* case 4, memory pointers, "easy to read" loop */
    while ((*ma = *mb) != '\0') {
        ma++;
        mb++;
    }
/* assembly code
L22:
    movb    *-8(fp),*-4(fp)
    jeql    L23
    incl    -4(fp)
    incl    -8(fp)
    jbr     L22
L23:
*/
}
--
Griff Smith    AT&T Bell Laboratories, Murray Hill
Phone:         (201) 582-7736
Internet:      ggs@ulysses.uucp
UUCP:          ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )
chris@umcp-cs.UUCP (Chris Torek) (03/09/85)
>     while (*s++ = *t++);
> ... this code produces a loop with 50% more assembler instructions
> than the slightly clearer sequence:
>     while ((*s = *t) != '\0')
>     { s++;
>       t++;
>     }

Not necessarily. It depends on whether s and t are "register"
variables. (The casual reader should type 'n' at this point . . . .)

Proof:

    f(s, t) register char *s, *t; {
        while (*s++ = *t++);
    }

generates (optimized):

            .globl  _f
    _f:     .word   0xc00
            movl    4(ap),r11
            movl    8(ap),r10
    L16:    movb    (r10)+,(r11)+
            jneq    L16
            ret

while

    g(s,t) char *s, *t; {
        while (*s = *t) s++, t++;
    }

generates (also optimized):

            .globl  _g
    _g:     .word   0
            jbr     L16
    L2000001:
            incl    4(ap)
            incl    8(ap)
    L16:    movb    *8(ap),*4(ap)
            jneq    L2000001
            ret

Changing s and t above to "register char *" gives

            .globl  _g
    _g:     .word   0xc00
            movl    4(ap),r11
            movl    8(ap),r10
            jbr     L16
    L2000001:
            incl    r11
            incl    r10
    L16:    movb    (r10),(r11)
            jneq    L2000001
            ret

which is faster most of the time (for strings of length 0 and 1 it's
probably slower).

It is true, however, that using postincrement on non-register pointer
variables is generally less efficient than doing the same thing "by
hand", since the compiler has to put the original value in a scratch
register so that the increment doesn't clobber the condition codes.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet: chris@umcp-cs    ARPA: chris@maryland
joe@petsd.UUCP (Joe Orost) (03/09/85)
In article <3448@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>If s and t are char pointers in registers,
>
>    while (*s++ = *t++) ;
>
>generates the best code I could possibly imagine.
>
>    while ((*s = *t) != '\0') {s++; t++;}
>
>is considerably worse. Try it with register variables on your compiler.

Ok, I did. The second sequence generates less code than the first sequence
on our machine (Perkin-Elmer). This is due to the fact that our machine
doesn't support auto-increment in hardware. The C compiler has to "fake" it.

regards,
joe
--
Full-Name: Joseph M. Orost
UUCP: ..!{decvax,ucbvax,ihnp4}!vax135!petsd!joe
ARPA: vax135!petsd!joe@BERKELEY
US Mail: MS 313; Perkin-Elmer; 106 Apple St; Tinton Falls, NJ 07724
Phone: (201) 870-5844
Location: 40 19'49" N / 74 04'37" W
gwyn@Brl-Vld.ARPA (VLD/VMB) (03/09/85)
Whoopee do. For long strings, (void)strcpy( s, t ); is often a big win and is even clearer.
gam@amdahl.UUCP (G A Moffett) (03/10/85)
As long as we're comparing compilers, the UTS C compiler produces
basically the same code with either construction (with slightly
different register usage), except that the instruction ordering is
different. The code is the same size in either case.

This seems like a reasonable thing to expect, since the code IS doing
the same thing, only in slightly different ways.

(370 architectures do not have auto-increment like DEC and other
machines.)
--
Gordon A. Moffett    ...!{ihnp4,hplabs,sun}!amdahl!gam
robert@gitpyr.UUCP (Robert Viduya) (03/11/85)
><
> Posted from ark@alice.UUCP (Andrew Koenig)
> If s and t are char pointers in registers,
>
>    while (*s++ = *t++) ;
>
> generates the best code I could possibly imagine.
>
>    while ((*s = *t) != '\0') {s++; t++;}
>
> is considerably worse. Try it with register variables on your compiler.

Ok, here's the results. This was done on a Pyramid 90x ("-O" on the cc
command; disassembled the results).

---First Method---
Code:
    while (*s++ = *t++)
        ;

Assembly:
    00000058: 01000870  movw   lr1,tr0
    0000005c: 14000061  addw   $1,lr1
    00000060: 01000831  movw   lr0,tr1
    00000064: 14000060  addw   $1,lr0
    00000068: 81100c31  movb   (tr0),(tr1)
    0000006c: 32200c70  cvtbw  (tr1),tr0
    00000070: f024fffa  bfc    z,0x58

---Second Method---
Code:
    while ((*s = *t) != '\0') {
        s++;
        t++;
    }

Assembly:
    00000054: f0200003  br     0x60
    00000058: 14000060  addw   $1,lr0
    0000005c: 14000061  addw   $1,lr1
    00000060: 81100860  movb   (lr1),(lr0)
    00000064: 32200830  cvtbw  (lr0),tr0
    00000068: f024fffc  bfc    z,0x58

------
Seems that the second method (the longer C version) actually takes one less
instruction and also uses one less register (the first one used tr0 & tr1; the
second only needed tr0). [For the unfamiliar, at any given time, the Pyramid
has available 16 global registers (gr0-gr15), 16 parameter registers
(pr0-pr15), 16 local registers (lr0-lr15) and 16 temporary registers
(tr0-tr15).]

I think what matters here is whether your machine has an auto-increment/
decrement type of instruction. I'm not sure if the Pyramid does or not,
but obviously, the C compiler doesn't use it, so I think I can safely assume
that it does not. Anyone want to try this on a 68000?

robert
--
Robert Viduya
Georgia Institute of Technology

...!{akgua,allegra,amd,hplabs,ihnp4,masscomp,ut-ngp}!gatech!gitpyr!robert
...!{rlgvax,sb1,uf-cgrl,unmvax,ut-sally}!gatech!gitpyr!robert
shannon@sun.uucp (Bill Shannon) (03/11/85)
> In article <3448@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >If s and t are char pointers in registers,
> >
> >    while (*s++ = *t++) ;
> >
> >generates the best code I could possibly imagine.
> >
> >    while ((*s = *t) != '\0') {s++; t++;}
> >
> >is considerably worse. Try it with register variables on your compiler.
>
> Ok, I did. The second sequence generates less code than the first sequence
> on our machine (Perkin-Elmer). This is due to the fact that our machine
> doesn't support auto-increment in hardware. The C compiler has to "fake" it.
>
> regards,
> joe

On the Sun the first generates the obvious two instruction loop while
the second generates a five instruction loop:

    register char *s, *t;

    while (*s++ = *t++) ;
L14:
    movb    a4@+,a5@+
    jne     L14

    while ((*s = *t) != '\0') {s++; t++;}
L17:
    movb    a4@,a5@
    jeq     LE12
    addql   #1,a5
    addql   #1,a4
    jra     L17
LE12:

Note that the two loops differ in the values of s and t at loop
termination. I consider the first loop to be more obvious; it is,
after all, the standard C idiom for copying a string. If you "optimize"
your code for one machine by writing the second loop, you may be
pessimizing it for other machines. Such is the nature of C and
C programmers.

Bill Shannon
Sun Microsystems, Inc.
moroney@jon.DEC (Mike Moroney) (03/12/85)
I think this is cute, how VAX/VMS beats Unix at its own game. The VMS C
compiler generates code as good as or better than anything I have seen posted
so far!
    char *s,*t;
    while (*s++ = *t++);
generates (on a VAX 780):
        movb    (r2)+,(r1)+
        beql    sym.2
    sym.1:
        movb    (r2)+,(r1)+
        bneq    sym.1
    sym.2:
This is as fast as you can get.
    char *s,*t;
    while ((*s = *t) != '\0')
    {
        s++;
        t++;
    }
generates:
        movb    (r2),(r1)
        beql    sym.4
    sym.3:
        incl    r1
        incl    r2
        movb    (r2),(r1)
        bneq    sym.3
    sym.4:
The default settings of the C compiler were used (that is I didn't select any
"generate warp speed code" flags). Notice I did NOT use "register char *"
since VAX C is at least as intelligent as you are when it decides what should
or should not go into registers. In fact version 1 of the C compiler ignored
"register" definitions for that reason. They put it back in V2.0 to appease
those who think they are smarter than the compiler (which treats it as a
"hint"). I have also seen a benchmark program where the identical C program
was compiled and run on identical VAX hardware, one running Unix, and one
running VMS. The Unix program took 3 times as long to run as the VMS. This
program (which did all integer arithmetic) used static variables, so it didn't
even have the benefit of automatically placing auto's in registers when
possible. I would think Unix, being 95% written in C, would at least have a
d@mn good C compiler. Want to improve the throughput of your Unix system?
Recompile it in VAX/VMS C!
These are not the views of Digital, although I am sure Digital agrees with me.
Mike Moroney
..!decwrl!rhea!jon!moroney
atbowler@watmath.UUCP (Alan T. Bowler [SDG]) (03/12/85)
Actually my favourite string copy idiom is
    strcpy(s, t);
On most machines this generates a smaller sequence, and I can
frequently count on the C implementor having supplied a code sequence
(possibly written in assembler) that has been optimized for the
particular machine. At worst, if performance is a real problem and
the implementor didn't, I can write my own assembler routine once
for this machine, rather than having to find all the places I coded
my own string copy. On most large machines I know, there are character
vector instructions that allow an implementation of string copy that
is faster than any sequence I can code in C.
anton@ucbvax.ARPA (Jeff Anton) (03/12/85)
Daniel J. Salomon writes:
>The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
>to the less efficient sequence:
>    while (*s++ = *t++);
>Perhaps it should be changed.
>In the meantime, if your application does a great deal of
>string copying, and you want to minimize execution time,
>you should write your own string copy routine.

I believe you are mistaken. The 4.2BSD VAX strcpy is an assembly
routine that uses the VAX instruction movc3. Perhaps your library
was built improperly or you defined a new strcpy routine. If you looked
at source code you must be careful, since routines in assembly often
have C counterparts to be used if the assembly is suspect.
--
C knows no bounds.
Jeff Anton
U.C.Berkeley
ucbvax!anton
anton@berkeley.ARPA
rpw3@redwood.UUCP (Rob Warnock) (03/13/85)
Please note that the semantics of

    while (*s++ = *t++) ;
and
    while ((*s = *t) != '\0') {s++; t++;}

are NOT the same; therefore, the generated code CANNOT be the same!
(I noticed this while comparing the code generated on the 68000 compiler
I use.)

The first statement leaves "t" pointing at the byte AFTER the null, while the
second leaves "t" pointing to the null. Auto-incrementing cannot be used in
the second case, unless your compiler generates code to "back out" the final
incrementation (an optimization I have on occasion applied by hand to tight
assembly code, but have never seen a compiler use).

The following two ARE equivalent (by the definition of "true" in boolean tests
and due to the "usual conversions" applied to '\0' before the comparison), and
the compiler I use indeed generates the same code for both cases:

    while (*s++ = *t++) ;
and
    while ((*s++ = *t++) != '\0') ;

Rob Warnock
Systems Architecture Consultant

UUCP:  {ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:   (415)572-2607
USPS:  510 Trinidad Lane, Foster City, CA  94404
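The difference in final pointer values can be checked directly. The following is a minimal sketch, not code from the thread: the helper names advance_compact and advance_long are hypothetical, and each simply reports how far the source pointer moved during the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy with the compact idiom and report how far
 * the source pointer advanced. */
size_t advance_compact(char *s, const char *t)
{
    const char *start = t;

    while ((*s++ = *t++) != '\0')
        ;
    return (size_t)(t - start);     /* length + 1: t is past the null */
}

/* Hypothetical helper: same copy with the longer loop. */
size_t advance_long(char *s, const char *t)
{
    const char *start = t;

    while ((*s = *t) != '\0') {
        s++;
        t++;
    }
    return (size_t)(t - start);     /* length: t points at the null */
}
```

For a three-character source, the first helper reports an advance of 4 and the second an advance of 3, although both leave the same bytes in the destination.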
minow@decvax.UUCP (Martin Minow) (03/13/85)
On a vax, the fastest string copy is probably the non-obvious

    length = strlen(input);
    strncpy(output, input, length);

(Assuming both routines are expanded in-line so you're not hit by
the subroutine call overhead.)

Note the following VMS Macro code from a routine in the Decus C library.

        .entry  strcpy,^M<r2,r3,r4,r5>
        movl    8(ap),r2        ; source string
        locc    #0,#-1,(r2)     ; find null at end of source
        subl2   r2,r1           ; length of source
        incl    r1              ; plus 1 for the null byte
        movc3   r1,@8(ap),@4(ap); copy the string
        movl    4(ap),r0        ; r0 -> destination
        .end
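The same two-pass structure can be written in portable C. This is a sketch under assumptions: the name copy_by_length is hypothetical, and memcpy stands in for the block move. Note that the C version must copy length + 1 bytes, as the assembly does with its incl, because strncpy called with exactly strlen(input) would leave the destination without a terminating null.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name; mirrors the locc + movc3 structure of the listing. */
char *copy_by_length(char *dst, const char *src)
{
    size_t len = strlen(src);       /* pass 1: find the null (like locc) */

    memcpy(dst, src, len + 1);      /* pass 2: block move, +1 for the null */
    return dst;                     /* same return convention as strcpy */
}
```

Whether two passes beat one byte-at-a-time loop depends on how well the library's strlen and memcpy are tuned for the machine, which is the point the post is making about locc and movc3.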
hammond@petrus.UUCP (03/13/85)
> I think this is cute, how VAX/VMS beats Unix at its own game. The VMS C
> compiler generates code as good as or better than anything I have seen posted
> so far!
>
> ... I have also seen a benchmark program where the identical C program
> was compiled and run on identical VAX hardware, one running Unix, and one
> running VMS. The Unix program took 3 times as long to run as the VMS. This
> program (which did all integer arithmetic) used static variables, so it didn't
> even have the benefit of automatically placing auto's in registers when
> possible. I would think Unix, being 95% written in C, would at least have a
> d@mn good C compiler. Want to improve the throughput of your Unix system?
> Recompile it in VAX/VMS C!
>
> These are not the views of Digital, although I am sure Digital agrees with me.
>
> Mike Moroney
> ..!decwrl!rhea!jon!moroney

Careful, things are not as simple as they seem. It turns out that while
the early DEC VMS C compiler was better at compiling expressions and
statements, it lost out to the UNIX C compiler in procedure calls,
since it followed the VMS standard (which either saves more registers
or uses slower instructions, I'm not sure which). I expect that the
newer VMS compilers would have the same problem, even if they were
even better at optimizing code generation for expressions/statements.

So, while your example program ran faster, troff/nroff runs slower if
compiled with the early VMS C compiler (this info from Steve Johnson's
course on PCC2 about 2 years ago). I am not sure what the net result
would be if you recompiled all of UNIX with the VMS C compiler, but
I wouldn't bet either way.

Besides, if you were looking at the BSD C compiler, I am fairly certain
that the newer compilers within BTL show improved performance.

Rich Hammond    {ihnp4 | decvax | ucbvax } !bellcore!hammond
regisc@tekgvs.UUCP (Regis Crinon) (03/13/85)
> ><
> > Posted from ark@alice.UUCP (Andrew Koenig)
> > If s and t are char pointers in registers,
> >
> >    while (*s++ = *t++) ;
> >
> > generates the best code I could possibly imagine.
> >
> >    while ((*s = *t) != '\0') {s++; t++;}
> >
> > is considerably worse. Try it with register variables on your compiler.
>
> Ok, here's the results. This was done on a Pyramid 90x ("-O" on the cc command;
> disassembled the results).
>
> ---First Method---
> Code:
>     while (*s++ = *t++)
>         ;
>
> Assembly:
>     00000058: 01000870  movw   lr1,tr0
>     0000005c: 14000061  addw   $1,lr1
>     00000060: 01000831  movw   lr0,tr1
>     00000064: 14000060  addw   $1,lr0
>     00000068: 81100c31  movb   (tr0),(tr1)
>     0000006c: 32200c70  cvtbw  (tr1),tr0
>     00000070: f024fffa  bfc    z,0x58
>
> ---Second Method---
> Code:
>     while ((*s = *t) != '\0') {
>         s++;
>         t++;
>     }
>
> Assembly:
>     00000054: f0200003  br     0x60
>     00000058: 14000060  addw   $1,lr0
>     0000005c: 14000061  addw   $1,lr1
>     00000060: 81100860  movb   (lr1),(lr0)
>     00000064: 32200830  cvtbw  (lr0),tr0
>     00000068: f024fffc  bfc    z,0x58
>
> ------
> Seems that the second method (the longer C version) actually takes one less
> instruction and also uses one less register (the first one used tr0 & tr1; the
> second only needed tr0). [For the unfamiliar, at any given time, the Pyramid
> has available 16 global registers (gr0-gr15), 16 parameter registers
> (pr0-pr15), 16 local registers (lr0-lr15) and 16 temporary registers
> (tr0-tr15).]
>
> I think what matters here is whether your machine has an auto-increment/
> decrement type of instruction. I'm not sure if the Pyramid does or not,
> but obviously, the C compiler doesn't use it, so I think I can safely assume
> that it does not. Anyone want to try this on a 68000?
>
> robert
> --
> Robert Viduya
> Georgia Institute of Technology
>
> ...!{akgua,allegra,amd,hplabs,ihnp4,masscomp,ut-ngp}!gatech!gitpyr!robert
> ...!{rlgvax,sb1,uf-cgrl,unmvax,ut-sally}!gatech!gitpyr!robert

Ok. Following are the 68000 compiler results:

i) Using while(*s++ = *t++);

    07530 12D8      MOVE.B  (A0)+,(A1)+
    07532 6600FFFC  BNE.L   $007530
    07536

ii) Using while((*s = *t) != '\0'){s++;t++;}

    07530 1290      MOVE.B  (A0),(A1)
    07532 6706      BEQ.S   $00753A
    07534 5286      ADDQ.L  #1,A1
    07536 5288      ADDQ.L  #1,A0
    07538 60F6      BRA.S   $007530
    0753A

It seems to me that version i) is faster and requires less memory.
--
crinon
oz@yetti.UUCP (Ozan Yigit) (03/14/85)
> Careful, things are not as simple as they seem. It turns out that while
> the early DEC VMS C compiler was better at compiling expressions and
> statements, it lost out to the UNIX C compiler in procedure calls,
> since it followed the VMS standard (which either saves more registers
> or uses slower instructions, I'm not sure which)...

The procedure calls are perhaps slower in the VMS C compiler, due to
saving all the registers that the VMS C compiler so greedily uses. One
should remember, however, that if the programmer were to hand-specify
as many registers as he/she possibly can to improve the program, this
would result in an identical procedure call on, say, the C compiler of
4.2, which would be just as slow or fast, however you look at it.

> So, while your example program ran faster, troff/nroff runs slower if
> compiled with the early VMS C compiler (this info from Steve Johnson's
> course on PCC2 about 2 years ago). I am not sure what the net result
> would be if you recompiled all of UNIX with the VMS C compiler, but
> I wouldn't bet either way.

True, I would not bet on it either. But, aside from the greedy register
algorithm in VMS C, if the BSD C compiler was as smart as VMS C in
optimization and architecture utilization, you would have a *MUCH*
faster 4.2 !!

Oz (wizard of something or another, no doubt)
Dept. of Computer Science
York University
{utzoo | utcs}!yetti!oz
chris@umcp-cs.UUCP (Chris Torek) (03/16/85)
I didn't really want to make a fuss over it, but robert@gitpyr is
correct; the reason "*s++ = *t++" is "harder" on Pyramids and other
such machines is that they have no auto-increment addressing modes.

If you write

    while (*s = *t) s++, t++;

then both s and t point *to* the final '\0'; if you write

    while (*s++ = *t++) /* null */;

then both s and t point *past* the final '\0'.

The obvious way to generate the sequence "*s++" on a machine without
autoincrement, when the value of the assignment might also be required,
is

    copy s to temp
    incr s
    indirect using temp

This is, of course, not always the best way. The increment can often
be deferred, saving the temporary register (or top of stack) and the
copy.

In the particular code samples for strcpy() used earlier, a truly
optimizing compiler would "realize" that both codings have the same
semantics, and generate the most optimal instruction sequence for
both. (They are the same since the values of s and t are not used
except for indirections---at which time they have the same values---and,
in theory, are inaccessible outside the function itself. In actuality
they are usually in some registers or stack frame and thus accessible,
but such possibilities are normally ignored during optimization [which
of course is the reason for introducing the "volatile" keyword].)

I am surprised that the VMS C compiler doesn't realize that all it
needs to do is generate a loop of inline "locc" and "movc3"
instructions :-).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet: chris@umcp-cs    ARPA: chris@maryland
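The "copy to temp, increment, indirect" expansion and the deferred-increment alternative can both be spelled out in C. This is only an illustrative sketch: the function names copy_naive and copy_deferred are hypothetical, and a real compiler makes these choices on its intermediate form rather than on source text.

```c
#include <assert.h>

/* Naive expansion of "*s++ = *t++" for a machine with no auto-increment:
 * each pointer is saved in a temporary before it is bumped. */
char *copy_naive(char *s, const char *t)
{
    char c;

    do {
        const char *tt = t;     /* copy t to temp */
        char *ts = s;           /* copy s to temp */

        t = t + 1;              /* incr t */
        s = s + 1;              /* incr s */
        c = *tt;                /* indirect using the temps */
        *ts = c;
    } while (c != '\0');
    return s;                   /* one past the copied null */
}

/* Deferred increments: the pointer temporaries disappear. */
char *copy_deferred(char *s, const char *t)
{
    char c;

    do {
        c = *t;                 /* indirect through the live pointers */
        *s = c;
        s = s + 1;              /* increments moved after the use */
        t = t + 1;
    } while (c != '\0');
    return s;                   /* same final value as copy_naive */
}
```

Both functions return the same pointer for the same input, which is the sense in which a compiler is free to pick either ordering.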
chris@umcp-cs.UUCP (Chris Torek) (03/16/85)
> Note the following VMS Macro code from a routine in the Decus C library.
>         .entry  strcpy,^M<r2,r3,r4,r5>
>         movl    8(ap),r2        ; source string
>         locc    #0,#-1,(r2)     ; find null at end of source
>         subl2   r2,r1           ; length of source
>         incl    r1              ; plus 1 for the null byte
>         movc3   r1,@8(ap),@4(ap); copy the string
>         movl    4(ap),r0        ; r0 -> destination
>         .end

Unfortunately, this fails for strings longer than 65535 characters.
This can be corrected by looping over the locc and movc3 instructions
until the final 64K segment is reached (I believe locc sets the
condition codes depending on whether it found the character or not).

Here's a stab at it (without referring to the "black book" so I don't
know if I've got things right):

_strcpy:.globl  _strcpy
        .word   0x0
        movl    8(ap),r3        # source => r3
        movl    4(ap),r5        # dest => r5 (movc3 requires this arrangement)
        brb     1f              # go count up length
0:      movc3   $65534,(r3),(r5)# copy current 64K segment
1:      locc    $0,$65534,(r3)  # find null
        bne     0b              # if not found, copy next 64K seg
        subl2   r3,r1           # r1 = len
        incl    r1              # +1 for null, without overflow (65534+1 max)
        movc3   r1,(r3),(r5)    # move final chunk
        movl    4(ap),r0        # dest => r0 as return val
        ret

(I'm not sure if movc3'ing 65532 bytes would be better, so as to stay
longword aligned when the arguments were originally....)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet: chris@umcp-cs    ARPA: chris@maryland
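The looping idea (search a bounded window for the null, block-copy the window if none is found, repeat) can be sketched in portable C. Everything named here is an assumption for illustration: chunked_copy and its chunk parameter are hypothetical, memchr plays the role of locc and memcpy the role of movc3. The sketch also relies on memchr stopping at the first match, which the C11 standard guarantees; chunk must be greater than zero.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch of the chunked copy; chunk would be 65534 to mirror
 * the listing, but it is a parameter here so small cases can be tried. */
char *chunked_copy(char *dst, const char *src, size_t chunk)
{
    char *d = dst;
    const char *nul;

    /* No null in the next chunk bytes: copy the whole chunk and go on. */
    while ((nul = memchr(src, '\0', chunk)) == NULL) {
        memcpy(d, src, chunk);
        d += chunk;
        src += chunk;
    }
    /* Final piece: bytes up to and including the null. */
    memcpy(d, src, (size_t)(nul - src) + 1);
    return dst;
}
```

With a chunk of 4, copying "hello world" takes two full-chunk moves followed by a final 4-byte move that carries the null.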
cottrell@NBS-VMS (03/20/85)
/*
> Unfortunately, this fails for strings longer than 65535 characters.
Tsk tsk. Your strings are too long! Make them shorter please.
Also, doesn't the VAX have some kind of `translate & copy until a
given character is seen' instruxion?
jim cottrell@nbs
*/
dwd@ccice6.UUCP (David W. Donald) (03/21/85)
> The idiom "while (*s++ = *t++);" generates the fastest possible code
> if s and t are declared to be registers ...

Beware of saying things like "the fastest possible".
I found a better way for any string longer than two characters.
The following is from 4.2BSD on a VAX.

main()
{
    register char *s, *d;
    while( *d++ = *s++ );                   /* ok */
    if (*d) do {} while ( *d++ = *s++ );    /* <-- FAST -- */
}

_main:
L16:    /* ok: a 3 instruction loop */
    movb    (r11)+,(r10)+
    jeql    L17
    jbr     L16
L17:    /* FAST: a two instruction loop */
    tstb    (r10)
    jeql    L18
L21:
L20:
    movb    (r11)+,(r10)+
    jneq    L21
L19:
L18:
    ret
guy@rlgvax.UUCP (Guy Harris) (03/21/85)
> Beware of saying things like "the fastest possible".
> I found a better way for any string longer than two characters.
> The following is from 4.2BSD on a VAX.

Geepers, didn't they teach you to use the "-O" option in school? Running
your example through "cc" with the "-O" flag, I get the same code for
both examples except that the second "FAST" version has an extra test
at the beginning. "-O" does a fair bit of code motion, especially when
strange sequences of jumps are involved; basically, it's covering PCC's
tracks.
ron@brl-tgr.ARPA (Ron Natalie <ron>) (03/22/85)
> Beware of saying things like "the fastest possible".
> I found a better way for any string longer than two characters.
> The following is from 4.2BSD on a VAX.
>
> main()
> {
>     register char *s, *d;
>     while( *d++ = *s++ );                   /* ok */
>     if (*d) do {} while ( *d++ = *s++ );    /* <-- FAST -- */
> }

Beware of posting solutions that don't work. If *d == 0 you don't
copy anything at all. I'm not quite sure why you put the test in;
all it does is make the loop dependent on what's in the destination
string before you copy into it.

-Ron
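The failure is easy to reproduce. Below is a minimal sketch that wraps the guarded loop from the quoted post in a function; the name guarded_copy is hypothetical, and the only change to the loop is the explicit comparison against '\0'.

```c
#include <assert.h>

/* The "FAST" form from the quoted post, wrapped in a function.
 * The copy happens only if the destination's first byte is nonzero. */
void guarded_copy(char *d, const char *s)
{
    if (*d)
        do { } while ((*d++ = *s++) != '\0');
}
```

Called with a destination buffer that already holds "zzzz", it copies the source as expected; called with a destination whose first byte is '\0', it copies nothing and the destination stays empty.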