[net.lang.c] String copy idiom.

djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)

A recent article on language idioms gave the following C code
for a string copy:

     while (*s++ = *t++);

Serious C hackers should know that on VAX 4.2 BSD UNIX this
code produces a loop with 50% more assembler instructions
than the slightly clearer sequence:

     while ((*s = *t) != '\0')
     {   s++;
         t++;
     }

This is true whether or not the object-code improver
is invoked, and may be true on other machines as well.

djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)

The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
to the less efficient sequence:  while (*s++ = *t++);
Perhaps it should be changed.
In the meantime, if your application does a great deal of
string copying and you want to minimize execution time,
you should write your own string copy routine.

djsalomon@watdaisy.UUCP (Daniel J. Salomon) (03/08/85)

> The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
> to the less efficient sequence:  while (*s++ = *t++);
> Perhaps it should be changed.
> 

SORRY for this error.
The idiom "while (*s++ = *t++);" generates the fastest possible code
if s and t are declared to be registers, which they are in the
system version of strcpy.  But note that if s and t are not in
registers then the sequence:
    while (*s = *t) {s++; t++;}
is more efficient.

henry@utzoo.UUCP (Henry Spencer) (03/08/85)

> In the meantime, if your application does a great deal of
> string copying and you want to minimize execution time,
> you should write your own string copy routine.

If you do so, please put a comment on your private routine to
explain why it's there, so the people maintaining it fifteen years
from now don't tear their hair out trying to discover why you re-
implemented a standard library routine.  Also, please put a note about
it in the documentation for the application.  Your routine may well
be slower on the next machine it gets ported to.
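
A minimal sketch of what such a documented private routine might look
like in present-day C.  The name app_strcpy and the macro
APP_USE_LIBRARY_STRCPY are hypothetical, chosen only for illustration;
the point is the comment block and the one-line escape hatch back to
the library version.

```c
#include <assert.h>
#include <string.h>

/*
 * app_strcpy: private string copy (hypothetical name, for illustration).
 *
 * WHY THIS EXISTS: record here the machine, compiler, date and the
 * measurement that showed the library strcpy() to be a bottleneck.
 * On any other machine this hand-written loop may well be SLOWER than
 * the library routine; define APP_USE_LIBRARY_STRCPY to fall back.
 */
char *app_strcpy(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
#ifdef APP_USE_LIBRARY_STRCPY
    return strcpy(dst, src);
#else
    {
        char *d = dst;
        while ((*d++ = *src++) != '\0')
            ;                       /* copies the null, then stops */
        return dst;
    }
#endif
}
```

Callers use it exactly like strcpy(), so a port to a new machine only
has to revisit this one function.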
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

ark@alice.UUCP (Andrew Koenig) (03/09/85)

If s and t are char pointers in registers,

	while (*s++ = *t++) ;

generates the best code I could possibly imagine.

	while ((*s = *t) != '\0') {s++; t++;}

is considerably worse.  Try it with register variables on your compiler.

ggs@ulysses.UUCP (Griff Smith) (03/09/85)

1) I don't think the second idiom is any clearer.

2) On my VAX compiler, the code for the "easier to read" idiom is worse
   than that for the compact one.  The only advantage of the long-winded
   idiom is that it doesn't change much if you neglect to declare register
   variables; it's a bit slow either way.  The following test cases show
   the difference:

-----

test(osp, isp)
char *osp, *isp;
{
register char *ra, *rb;
char *ma, *mb;

ra = osp;
rb = isp;

/* case 1, register pointers */

while (*ra++ = *rb++);

/* assembly code

L16:
	movb	(r10)+,(r11)+
	jeql	L17
	jbr 	L16
L17:

*/

ma = osp;
mb = isp;

/* case 2, memory pointers */

while (*ma++ = *mb++);

/* assembly code

L18:
	movl	-8(fp),r0
	incl	-8(fp)
	movl	-4(fp),r1
	incl	-4(fp)
	movb	(r0),(r1)
	jeql	L19
	jbr 	L18
L19:

*/

ra = osp;
rb = isp;

/* case 3, register pointers, "easy to read" loop */

while ((*ra = *rb) != '\0')
{
    ra++;
    rb++;
}

/* assembly code

L20:
	movb	(r10),(r11)
	jeql	L21
	incl	r11
	incl	r10
	jbr 	L20
L21:

*/

ma = osp;
mb = isp;

/* case 4, memory pointers, "easy to read" loop */

while ((*ma = *mb) != '\0')
{
    ma++;
    mb++;
}

/* assembly code

L22:
	movb	*-8(fp),*-4(fp)
	jeql	L23
	incl	-4(fp)
	incl	-8(fp)
	jbr 	L22
L23:

*/

}
-- 

Griff Smith	AT&T Bell Laboratories, Murray Hill
Phone:		(201) 582-7736
Internet:	ggs@ulysses.uucp
UUCP:		ulysses!ggs  ( {allegra|ihnp4}!ulysses!ggs )

chris@umcp-cs.UUCP (Chris Torek) (03/09/85)

>    while (*s++ = *t++);

> ... this code produces a loop with 50% more assembler instructions
> than the slightly clearer sequence:

>    while ((*s = *t) != '\0')
>    {   s++;
>	 t++;
>    }

Not necessarily.  It depends on whether s and t are "register" variables.

(The casual reader should type 'n' at this point . . . .)

Proof:

f(s, t)
register char *s, *t;
{
	while (*s++ = *t++);
}

generates (optimized):

	.globl	_f
_f:	.word	0xc00
	movl	4(ap),r11
	movl	8(ap),r10
L16:	movb	(r10)+,(r11)+
	jneq	L16
	ret

while

g(s,t)
char *s, *t;
{
	while (*s = *t) s++, t++;
}

generates (also optimized):

	.globl	_g
_g:	.word	0
	jbr	L16
L2000001:
	incl	4(ap)
	incl	8(ap)
L16:	movb	*8(ap),*4(ap)
	jneq	L2000001
	ret

Changing s and t above to "register char *" gives

	.globl	_g
_g:	.word	0xc00
	movl	4(ap),r11
	movl	8(ap),r10
	jbr	L16
L2000001:
	incl	r11
	incl	r10
L16:	movb	(r10),(r11)
	jneq	L2000001
	ret

which is faster most of the time (for strings of length 0 and 1 it's
probably slower).

It is true, however, that using postincrement on non-register pointer
variables is generally less efficient than doing the same thing "by hand",
since the compiler has to put the original value in a scratch register
so that the increment doesn't clobber the condition codes.
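
For readers who do not read VAX assembler, the layout of the optimized
g() listings (jump to the test, body placed before it, conditional
branch back) can be sketched in C with gotos.  This is only an
illustration of the loop shape; copy_rotated is a made-up name, and the
comments map each line to the instruction it roughly corresponds to.

```c
#include <assert.h>
#include <stddef.h>

/* Loop with the test at the bottom, mirroring the jbr/L2000001/L16 layout. */
void copy_rotated(char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    goto test;                  /* jbr L16: enter at the test */
body:
    s++;                        /* incl */
    t++;                        /* incl */
test:
    if ((*s = *t) != '\0')      /* movb, then jneq back to the body */
        goto body;
}
```

Each pass through the loop then costs one branch instead of two, at the
price of the extra jump on entry.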
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

joe@petsd.UUCP (Joe Orost) (03/09/85)

In article <3448@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>If s and t are char pointers in registers,
>
>	while (*s++ = *t++) ;
>
>generates the best code I could possibly imagine.
>
>	while ((*s = *t) != '\0') {s++; t++;}
>
>is considerably worse.  Try it with register variables on your compiler.

Ok, I did.  The second sequence generates less code than the first sequence
on our machine (Perkin-Elmer).  This is due to the fact that our machine
doesn't support auto-increment in hardware.  The C compiler has to "fake" it.

					regards,
					joe

--
Full-Name:  Joseph M. Orost
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!joe
ARPA:	    vax135!petsd!joe@BERKELEY
US Mail:    MS 313; Perkin-Elmer; 106 Apple St; Tinton Falls, NJ 07724
Phone:      (201) 870-5844
Location:   40 19'49" N / 74 04'37" W

gwyn@Brl-Vld.ARPA (VLD/VMB) (03/09/85)

Whoopee do.  For long strings,
	(void)strcpy( s, t );
is often a big win and is even clearer.

gam@amdahl.UUCP (G A Moffett) (03/10/85)

As long as we're comparing compilers, the UTS C compiler produces
basically the same code with either construction (with slightly
different register usage), except that the instruction ordering
is different.  The code is the same size in either case.

This seems like a reasonable thing to expect, since the code IS
doing the same thing, only in slightly different ways.  (370
architectures do not have auto-increment like DEC and other
machines).
-- 
Gordon A. Moffett		...!{ihnp4,hplabs,sun}!amdahl!gam

robert@gitpyr.UUCP (Robert Viduya) (03/11/85)

Posted from  ark@alice.UUCP (Andrew Koenig)
> If s and t are char pointers in registers,
> 
> 	while (*s++ = *t++) ;
> 
> generates the best code I could possibly imagine.
> 
> 	while ((*s = *t) != '\0') {s++; t++;}
> 
> is considerably worse.  Try it with register variables on your compiler.

Ok, here's the results.  This was done on a Pyramid 90x ("-O" on the cc command;
disassembled the results).

---First Method---
Code:
    while (*s++ = *t++)
	;

Assembly:
 00000058: 01000870                movw     lr1,tr0
 0000005c: 14000061                addw     $1,lr1
 00000060: 01000831                movw     lr0,tr1
 00000064: 14000060                addw     $1,lr0
 00000068: 81100c31                movb     (tr0),(tr1)
 0000006c: 32200c70                cvtbw    (tr1),tr0
 00000070: f024fffa                bfc      z,0x58

---Second Method---
Code:
    while ((*s = *t) != '\0') {
	s++;
	t++;
    }

Assembly:
 00000054: f0200003                br       0x60
 00000058: 14000060                addw     $1,lr0
 0000005c: 14000061                addw     $1,lr1
 00000060: 81100860                movb     (lr1),(lr0)
 00000064: 32200830                cvtbw    (lr0),tr0
 00000068: f024fffc                bfc      z,0x58

------
 Seems that the second method (the longer C version) actually takes one less
 instruction and also uses one less register (the first one used tr0 & tr1; the
 second only needed tr0). [For the unfamiliar, at any given time, the Pyramid
 has available 16 global registers (gr0-gr15), 16 parameter registers
 (pr0-pr15), 16 local registers (lr0-lr15) and 16 temporary registers
 (tr0-tr15).]

 I think what matters here is whether your machine has an auto-increment/
 decrement type of instruction.  I'm not sure if the Pyramid does or not,
 but obviously, the C compiler doesn't use it, so I think I can safely assume
 that it does not.  Anyone want to try this on a 68000?

				robert
-- 
Robert Viduya
Georgia Institute of Technology

...!{akgua,allegra,amd,hplabs,ihnp4,masscomp,ut-ngp}!gatech!gitpyr!robert
...!{rlgvax,sb1,uf-cgrl,unmvax,ut-sally}!gatech!gitpyr!robert

shannon@sun.uucp (Bill Shannon) (03/11/85)

> In article <3448@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >If s and t are char pointers in registers,
> >
> >	while (*s++ = *t++) ;
> >
> >generates the best code I could possibly imagine.
> >
> >	while ((*s = *t) != '\0') {s++; t++;}
> >
> >is considerably worse.  Try it with register variables on your compiler.
> 
> Ok, I did.  The second sequence generates less code than the first sequence
> on our machine (Perkin-Elmer).  This is due to the fact that our machine
> doesn't support auto-increment in hardware.  The C compiler has to "fake" it.
> 
> 					regards,
> 					joe

On the Sun the first generates the obvious two instruction loop
while the second generates a five instruction loop:

	register char *s, *t;
	while (*s++ = *t++) ;

L14:
	movb	a4@+,a5@+
	jne	L14

	while ((*s = *t) != '\0') {s++; t++;}
L17:
	movb	a4@,a5@
	jeq	LE12
	addql	#1,a5
	addql	#1,a4
	jra	L17
LE12:

Note that the two loops differ in the values of s and t at loop
termination.  I consider the first loop to be more obvious; it
is, after all, the standard C idiom for copying a string.  If you
"optimize" your code for one machine by writing the second loop,
you may be pessimizing it for other machines.  Such is the nature
of C and C programmers.

					Bill Shannon
					Sun Microsystems, Inc.

moroney@jon.DEC (Mike Moroney) (03/12/85)

I think this is cute, how VAX/VMS beats Unix at its own game.  The VMS C
compiler generates code as good as or better than anything I have seen posted
so far!

	char *s,*t;
	while (*s++ = *t++);

generates (on a VAX 780):

		movb	(r2)+,(r1)+
		beql	sym.2
	sym.1:
		movb	(r2)+,(r1)+
		bneq	sym.1
	sym.2:

This is as fast as you can get.

	char *s,*t;
	while ((*s = *t) != '\0')
	{
	s++;
	t++;
	}

generates:

		movb	(r2),(r1)
		beql	sym.4
	sym.3:
		incl	r1
		incl	r2
		movb	(r2),(r1)
		bneq	sym.3
	sym.4:

The default settings of the C compiler were used (that is I didn't select any
"generate warp speed code" flags).  Notice I did NOT use "register char *"
since VAX C is at least as intelligent as you are when it decides what should
or should not go into registers.  In fact version 1 of the C compiler ignored
"register" definitions for that reason.  They put it back in V2.0 to appease
those who think they are smarter than the compiler (which treats it as a
"hint").  I have also seen a benchmark program where the identical C program
was compiled and run on identical VAX hardware, one running Unix, and one
running VMS.  The Unix program took 3 times as long to run as the VMS.  This
program (which did all integer arithmetic) used static variables, so it didn't
even have the benefit of automatically placing auto's in registers when
possible. I would think Unix, being 95% written in C, would at least have a
d@mn good C compiler.  Want to improve the throughput of your Unix system?
Recompile it in VAX/VMS C!

These are not the views of Digital, although I am sure Digital agrees with me.

						Mike Moroney
					..!decwrl!rhea!jon!moroney

atbowler@watmath.UUCP (Alan T. Bowler [SDG]) (03/12/85)

Actually my favourite string copy idiom is
     strcpy(s, t);
On most machines this generates a smaller sequence, and I can frequently
count on the C implementor having supplied a code sequence
(possibly written in assembler) that has been optimized for the particular
machine.  At worst if performance is a real problem, and the implementor
didn't, I can write my own assembler routine once for this machine,
rather than having to find all the places I coded my own string copy.
   On most large machines I know there are character vector instructions
that will allow implementation of a string copy that is faster than
any sequence I can code in C.

anton@ucbvax.ARPA (Jeff Anton) (03/12/85)

Daniel J. Salomon writes:
>The VAX 4.2 BSD UNIX library routine 'strcpy' uses code equivalent
>to the less efficient sequence:  while (*s++ = *t++);
>Perhaps it should be changed.
>In the meantime, if your application does a great deal of
>string copying and you want to minimize execution time,
>you should write your own string copy routine.

I believe you are mistaken.  The 4.2BSD VAX strcpy is an assembly routine
that uses the VAX instruction movc3.  Perhaps, your library was built
improperly or you defined a new strcpy routine.  If you looked at source
code you must be careful, since routines in assembly often have C
counterparts to be used if the assembly is suspect.
-- 
C knows no bounds.
					Jeff Anton
					U.C.Berkeley
					ucbvax!anton
					anton@berkeley.ARPA

rpw3@redwood.UUCP (Rob Warnock) (03/13/85)

Please note that the semantics of

	while (*s++ = *t++) ;
and
	while ((*s = *t) != '\0') {s++; t++;}

are NOT the same; therefore, the generated code CANNOT be the same!
(I noticed this while comparing the code generated on the 68000 compiler
I use.) The first statement leaves "t" pointing at the byte AFTER the null,
while the second leaves "t" pointing to the null. Auto-incrementing cannot
be used in the second case, unless your compiler generates code to "back out"
the final incrementation (an optimization I have on occasion applied by hand
to tight assembly code, but have never seen a compiler use).

The following two ARE equivalent (by the definition of "true" in boolean tests
and due to the "usual conversions" applied to '\0' before the comparison), and
the compiler I use indeed generates the same code for both cases:

	while (*s++ = *t++) ;
and
	while ((*s++ = *t++) != '\0') ;
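
The difference in where the pointers end up can be checked directly.
The sketch below wraps each loop in a function that reports how far the
source pointer moved; the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy with post-increment; return how far the source pointer moved. */
size_t advance_postinc(char *s, const char *t)
{
    const char *start = t;
    assert(s != NULL && t != NULL);
    while ((*s++ = *t++) != '\0')
        ;
    return (size_t)(t - start);     /* t is one past the null */
}

/* Copy with the increments in the loop body; same return convention. */
size_t advance_explicit(char *s, const char *t)
{
    const char *start = t;
    assert(s != NULL && t != NULL);
    while ((*s = *t) != '\0') {
        s++;
        t++;
    }
    return (size_t)(t - start);     /* t is on the null */
}
```

For the three-character string "abc" the first function returns 4 and
the second returns 3, although both leave the same bytes in the
destination.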


Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404

minow@decvax.UUCP (Martin Minow) (03/13/85)

On a vax, the fastest string copy is probably the non-obvious

	length = strlen(input);
	strncpy(output, input, length);

(Assuming both routines are expanded in-line so you're not
hit by the subroutine call overhead.)  Note the following
VMS Macro code from a routine in the Decus C library.

	.entry	strcpy,^M<r2,r3,r4,r5>
	movl	8(ap),r2	; source string
	locc	#0,#-1,(r2)	; find null at end of source
	subl2	r2,r1		; length of source
	incl	r1		; plus 1 for the null byte
	movc3	r1,@8(ap),@4(ap); copy the string
	movl	4(ap),r0	; r0 -> destination
	ret			; return to caller
	.end
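
In portable C the same two-pass idea might be sketched as below.  Note
that strncpy(output, input, length) with length equal to strlen(input)
does not write the terminating null; the assembler routine adds one to
the count for that reason, and this sketch does the same with memcpy.
The name copy_two_pass is hypothetical, and the buffers are assumed not
to overlap.

```c
#include <assert.h>
#include <string.h>

/* Two-pass copy: find the length first, then move the bytes as one block. */
char *copy_two_pass(char *dst, const char *src)
{
    size_t len;

    assert(dst != NULL && src != NULL);
    len = strlen(src);              /* pass 1: like locc, find the null */
    memcpy(dst, src, len + 1);      /* pass 2: like movc3, +1 for the null */
    return dst;
}
```

Whether two passes beat one byte-at-a-time pass depends entirely on how
well the library's strlen and memcpy are tuned for the machine.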

hammond@petrus.UUCP (03/13/85)

> I think this is cute, how VAX/VMS beats Unix at its own game.  The VMS C
> compiler generates code as good as or better than anything I have seen posted
> so far!
> 
> ...       I have also seen a benchmark program where the identical C program
> was compiled and run on identical VAX hardware, one running Unix, and one
> running VMS.  The Unix program took 3 times as long to run as the VMS.  This
> program (which did all integer arithmetic) used static variables, so it didn't
> even have the benefit of automatically placing auto's in registers when
> possible. I would think Unix, being 95% written in C, would at least have a
> d@mn good C compiler.  Want to improve the throughput of your Unix system?
> Recompile it in VAX/VMS C!
> 
> These are not the views of Digital, although I am sure Digital agrees with me.
> 
> 						Mike Moroney
> 					..!decwrl!rhea!jon!moroney

Careful, things are not as simple as they seem.  It turns out that while
the early DEC VMS C compiler was better at compiling expressions and
statements, it lost out to the UNIX C compiler in procedure calls,
since it followed the VMS standard (which either saves more registers
or uses slower instructions, I'm not sure which).  I expect that the
newer VMS compilers would have the same problem, even if they were
even better at optimizing code generation for expressions/statements.

So, while your example program ran faster, troff/nroff runs slower if
compiled with the early VMS C compiler (this info from Steve Johnson's
course on PCC2 about 2 years ago).  I am not sure what the net result
would be if you recompiled all of UNIX with the VMS C compiler, but
I wouldn't bet either way.

Besides, if you were looking at the BSD C compiler, I am fairly certain
that the newer compilers within BTL show improved performance.

Rich Hammond	{ihnp4 | decvax | ucbvax } !bellcore!hammond

regisc@tekgvs.UUCP (Regis Crinon) (03/13/85)

> Posted from  ark@alice.UUCP (Andrew Koenig)
> > If s and t are char pointers in registers,
> > 
> > 	while (*s++ = *t++) ;
> > 
> > generates the best code I could possibly imagine.
> > 
> > 	while ((*s = *t) != '\0') {s++; t++;}
> > 
> > is considerably worse.  Try it with register variables on your compiler.
> 
> Ok, here's the results.  This was done on a Pyramid 90x ("-O" on the cc command;
> disassembled the results).
> 
> ---First Method---
> Code:
>     while (*s++ = *t++)
> 	;
> 
> Assembly:
>  00000058: 01000870                movw     lr1,tr0
>  0000005c: 14000061                addw     $1,lr1
>  00000060: 01000831                movw     lr0,tr1
>  00000064: 14000060                addw     $1,lr0
>  00000068: 81100c31                movb     (tr0),(tr1)
>  0000006c: 32200c70                cvtbw    (tr1),tr0
>  00000070: f024fffa                bfc      z,0x58
> 
> ---Second Method---
> Code:
>     while ((*s = *t) != '\0') {
> 	s++;
> 	t++;
>     }
> 
> Assembly:
>  00000054: f0200003                br       0x60
>  00000058: 14000060                addw     $1,lr0
>  0000005c: 14000061                addw     $1,lr1
>  00000060: 81100860                movb     (lr1),(lr0)
>  00000064: 32200830                cvtbw    (lr0),tr0
>  00000068: f024fffc                bfc      z,0x58
> 
> ------
>  Seems that the second method (the longer C version) actually takes one less
>  instruction and also uses one less register (the first one used tr0 & tr1; the
>  second only needed tr0). [For the unfamiliar, at any given time, the Pyramid
>  has available 16 global registers (gr0-gr15), 16 parameter registers
>  (pr0-pr15), 16 local registers (lr0-lr15) and 16 temporary registers
>  (tr0-tr15).]
> 
>  I think what matters here is whether your machine has an auto-increment/
>  decrement type of instruction.  I'm not sure if the Pyramid does or not,
>  but obviously, the C compiler doesn't use it, so I think I can safely assume
>  that it does not.  Anyone want to try this on a 68000?
> 
> 				robert
> -- 
> Robert Viduya
> Georgia Institute of Technology
> 
> ...!{akgua,allegra,amd,hplabs,ihnp4,masscomp,ut-ngp}!gatech!gitpyr!robert
> ...!{rlgvax,sb1,uf-cgrl,unmvax,ut-sally}!gatech!gitpyr!robert


Ok. Following are the 68000 compiler results:

	i) Using while(*s++ = *t++);

07530	12D8		MOVE.B	(A0)+,(A1)+
07532	6600FFFC	BNE.L	$007530
07536

	ii) Using while((*s = *t) != '\0'){s++;t++;}

07530	1290		MOVE.B	(A0),(A1)
07532	6706		BEQ.S	$00753A
07534	5286		ADDQ.L	#1,A1
07536	5288		ADDQ.L	#1,A0
07538	60F6		BRA.S 	$007530
0753A

	It seems to me that version i) is faster and requires less memory.

-- 
crinon

oz@yetti.UUCP (Ozan Yigit) (03/14/85)

> 
> Careful, things are not as simple as they seem.  It turns out that while
> the early DEC VMS C compiler was better at compiling expressions and
> statements, it lost out to the UNIX C compiler in procedure calls,
> since it followed the VMS standard (which either saves more registers
> or uses slower instructions, I'm not sure which)...
	Procedure calls are perhaps slower with the VMS C compiler
	because it saves all the registers it so greedily uses.
	Remember, however, that if the programmer hand-declared as many
	register variables as possible to speed up the program, the
	4.2 C compiler, say, would generate an identical procedure call,
	which would be just as slow or fast, however you look at it.
> 
> So, while your example program ran faster, troff/nroff runs slower if
> compiled with the early VMS C compiler (this info from Steve Johnson's
> course on PCC2 about 2 years ago).  I am not sure what the net result
> would be if you recompiled all of UNIX with the VMS C compiler, but
> I wouldn't bet either way.

	True, I would not bet on it either. But, aside from the greedy
	register algorithm in VMS C, if the BSD C compiler were as smart
	as VMS C in optimization and architecture utilization, you would
	have a *MUCH* faster 4.2 !!


	Oz	(wizard of something or another, no doubt)
	Dept. of Computer Science
	York University
	{utzoo | utcs}!yetti!oz

chris@umcp-cs.UUCP (Chris Torek) (03/16/85)

I didn't really want to make a fuss over it, but robert@gitpyr is
correct; the reason "*s++ = *t++" is "harder" on Pyramids and other
such machines is that they have no auto-increment addressing modes.  If
you write

    while (*s = *t) s++, t++;

then both s and t point *to* the final '\0'; if you write

    while (*s++ = *t++) /* null */;

then both s and t point *past* the final '\0'.  The obvious way to
generate the sequence "*s++" on a machine without autoincrement, when
the value of the assignment might also be required, is

    copy s to temp
    incr s
    indirect using temp

This is, of course, not always the best way.  The increment can often
be deferred, saving the temporary register (or top of stack) and the copy.
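
Written out by hand in C, the three-step expansion and the
deferred-increment variant might look like the sketch below.  The
function names are invented, and a real compiler would of course do
this on its intermediate form rather than on source text.

```c
#include <assert.h>
#include <stddef.h>

/* "Obvious" expansion: save each pointer, bump it, go through the copy. */
void copy_with_temps(char *s, const char *t)
{
    char c;

    assert(s != NULL && t != NULL);
    do {
        const char *tt = t;     /* copy t to temp */
        char *ts = s;           /* copy s to temp */
        t = t + 1;              /* incr t */
        s = s + 1;              /* incr s */
        c = *tt;
        *ts = c;                /* indirect using the temps */
    } while (c != '\0');
}

/* Deferred increment: move the byte first, bump the pointers afterwards. */
void copy_deferred(char *s, const char *t)
{
    char c;

    assert(s != NULL && t != NULL);
    do {
        c = *t;
        *s = c;                 /* indirect through the pointers themselves */
        s = s + 1;              /* increments happen after the move, */
        t = t + 1;              /* so no temporaries are needed */
    } while (c != '\0');
}
```

Both functions copy the same bytes; the second simply needs two fewer
temporaries per iteration.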

In the particular code samples for strcpy() used earlier, a truly
optimizing compiler would "realize" that both codings have the same
semantics, and generate the most optimal instruction sequence for
both.  (They are the same since the values of s and t are not used
except for indirections---at which time they have the same
values---and, in theory, are inaccessible outside the function itself.
In actuality they are usually in some registers or stack frame and thus
accessible, but such possibilities are normally ignored during
optimization [which of course is the reason for introducing the
"volatile" keyword].)

I am surprised that the VMS C compiler doesn't realize that all it
needs to do is generate a loop of inline "locc" and "movc3"
instructions :-).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

chris@umcp-cs.UUCP (Chris Torek) (03/16/85)

> Note the following VMS Macro code from a routine in the Decus C library.

>	.entry	strcpy,^M<r2,r3,r4,r5>
>	movl	8(ap),r2	; source string
>	locc	#0,#-1,(r2)	; find null at end of source
>	subl2	r2,r1		; length of source
>	incl	r1		; plus 1 for the null byte
>	movc3	r1,@8(ap),@4(ap); copy the string
>	movl	4(ap),r0	; r0 -> destination
>	.end

Unfortunately, this fails for strings longer than 65535 characters.
This can be corrected by looping over the locc and movc3 instructions
until the final 64K segment is reached (I believe locc sets the
condition codes depending on whether it found the character or not).

Here's a stab at it (without referring to the "black book" so I don't
know if I've got things right):

_strcpy:.globl	_strcpy
	.word	0x0
	movl	8(ap),r3	# source => r3
	movl	4(ap),r5	# dest => r5 (movc3 requires this arrangement)
	brb	1f		# go count up length
0:	movc3	$65534,(r3),(r5)# copy current 64K segment
1:	locc	$0,$65534,(r3)	# find null
	bne	0b		# if not found, copy next 64K seg
	subl2	r3,r1		# r1 = len
	incl	r1		# +1 for null, without overflow (65534+1 max)
	movc3	r1,(r3),(r5)	# move final chunk
	movl	4(ap),r0	# dest => r0 as return val
	ret

(I'm not sure if movc3'ing 65532 bytes would be better, so as to stay
longword aligned when the arguments were originally....)
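
The chunking idea can also be sketched in portable C, with memchr
standing in for locc and memcpy for movc3.  This is an illustration
under assumptions, not a translation of the assembler: copy_chunked and
its chunk parameter are hypothetical (the parameter exists so the loop
can be exercised with small values), and the code relies on memchr
stopping at the first match, which the C standard spells out from C11
onward.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy a null-terminated string in pieces of at most `chunk` bytes.
 * `chunk` stands in for the 16-bit length limit of the VAX string
 * instructions.
 */
char *copy_chunked(char *dst, const char *src, size_t chunk)
{
    char *d = dst;
    const char *nul;

    assert(dst != NULL && src != NULL && chunk > 0);
    /* Look for the null within the current chunk. */
    while ((nul = memchr(src, '\0', chunk)) == NULL) {
        memcpy(d, src, chunk);      /* not found: move the whole chunk */
        d += chunk;
        src += chunk;
    }
    /* Final piece: everything up to and including the null. */
    memcpy(d, src, (size_t)(nul - src) + 1);
    return dst;
}
```

With chunk set to 4, the string "abcdefghij" is moved as two full
chunks followed by a final piece of three bytes ("ij" plus the null).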
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

cottrell@NBS-VMS (03/20/85)

/*
> Unfortunately, this fails for strings longer than 65535 characters.

Tsk tsk. Your strings are too long! Make them shorter please.
Also, doesn't the VAX have some kind of `translate & copy until a
given character is seen' instruction?

	jim		cottrell@nbs
*/

dwd@ccice6.UUCP (David W. Donald) (03/21/85)

>
> The idiom "while (*s++ = *t++);" generates the fastest possible code
> if s and t are declared to be registers ...
>

Beware of saying things like "the fastest possible".
I found a better way for any string longer than two characters.
The following is from 4.2BSD on a VAX.

main()
{
	register char *s, *d;
	while( *d++ = *s++ );			/* ok */
	if (*d) do {} while ( *d++ = *s++ );	/* <-- FAST -- */
}

_main:
L16:	/* ok: a 3 instruction loop */
	movb	(r11)+,(r10)+
	jeql	L17
	jbr 	L16
L17:
	/* FAST: a two instruction loop */
	tstb	(r10)
	jeql	L18
L21:
L20:
	movb	(r11)+,(r10)+
	jneq	L21
L19:
L18:
	ret

guy@rlgvax.UUCP (Guy Harris) (03/21/85)

> Beware of saying things like "the fastest possible".
> I found a better way for any string longer than two characters.
> The following is from 4.2BSD on a VAX.

Geepers, didn't they teach you to use the "-O" option in school?  Running
your example through "cc" with the "-O" flag, I get the same code for
both examples except that the second "FAST" version has an extra test
at the beginning.

"-O" does a fair bit of code motion, especially when strange sequences
of jumps are involved; basically, it's covering PCC's tracks.

ron@brl-tgr.ARPA (Ron Natalie <ron>) (03/22/85)

> Beware of saying things like "the fastest possible".
> I found a better way for any string longer than two characters.
> The following is from 4.2BSD on a VAX.
> 
> main()
> {
> 	register char *s, *d;
> 	while( *d++ = *s++ );			/* ok */
> 	if (*d) do {} while ( *d++ = *s++ );	/* <-- FAST -- */
> }
> 
Beware of posting solutions that don't work.  If *d == 0 you don't
copy anything at all.  I'm not quite sure why you put the test in,
all it does is make the loop dependent on what's in the destination
string before you copy into it.
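
The failure is easy to reproduce.  In the sketch below (function names
invented for the example) the first routine keeps the guard as posted,
and the second keeps the do-while shape but drops the guard, so that it
depends only on the source string.

```c
#include <assert.h>
#include <string.h>

/* The posted form: the guard reads the destination before it is written. */
char *copy_guarded(char *d, const char *s)
{
    char *start = d;

    assert(d != NULL && s != NULL);
    if (*d)
        do { } while ((*d++ = *s++) != '\0');
    return start;
}

/* Same do-while shape without the guard. */
char *copy_do_while(char *d, const char *s)
{
    char *start = d;

    assert(d != NULL && s != NULL);
    do { } while ((*d++ = *s++) != '\0');
    return start;
}
```

Calling copy_guarded on a destination whose first byte is already zero
leaves the destination untouched; copy_do_while copies the string
regardless of what the destination held.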

-Ron