[comp.arch] String Handling -- Incompetence of run-time libraries

reiter@endor.UUCP (03/31/87)

In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out
that printf could be recoded to run much faster.

This made me curious, and I started playing around with alternate versions of
strcpy.  I did some timings on a microVAX II (cc -O, 4.3BSD) using the library
strcpy, a straightforward version of my own (strcpy2), and a version which
tried to minimize branches out of the instruction sequence (strcpy3).  I ran
strcpy2 and strcpy3 both as procedure calls and as in-line expansions.  The
timings, for 1,000,000 executions of copying "Test string 1", were

routine	  inline?	time (sec)

strcpy	  no		134.7
strcpy2   no		 65.2
strcpy3   no		 51.6
strcpy2   yes		 43.3
strcpy3   yes		 30.8

The most shocking thing about the above is that the library strcpy is half the
speed of a very straightforward C implementation of the routine!  Perhaps
we should spend less time arguing about hardware support for string-handling
routines, and more time making sure that the people who implement the
run-time library are somewhat competent.  It also certainly appears that
in-line expansion of simple library routines is a big win.

I also ran some tests on a SUN-3/180 (SUN UNIX 4.2, cc -O).  The results were:

routine	  inline?	time (sec)

strcpy	  no		 22.0
strcpy2	  no		 21.1
strcpy3	  no		 20.1
strcpy2	  yes		 15.1
strcpy3   yes		 13.7

The SUN library routines seem much more competently coded, but even here,
perhaps a bit of tuning would help (note that my timing was pretty sloppy,
so small differences may not be significant), and in-line expansion would
certainly be a big win.

Routine codes:

strcpy2(ss1,ss2)
char *ss1,*ss2;
{	register char *s1,*s2;

	s1 = ss1;
	s2 = ss2;
	while (*s1++ = *s2++);	}

strcpy3(ss1,ss2)
char *ss1,*ss2;
{	register char *s1,*s2;

	s1 = ss1;
	s2 = ss2;
	while (*s1++ = *s2++)
	{	if (!(*s1++ = *s2++)) break;
		if (!(*s1++ = *s2++)) break;
	< the above statement is repeated 20 times >
			}	}

***************************************************************************
				Ehud Reiter
				reiter@harvard	(ARPA, BITNET, UUCP)

reiter@endor.UUCP (03/31/87)

As a brief addition to my previous posting, when I used my own strcpy and strcmp
routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from
1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp).  Dhrystone seems to
be highly sensitive to the quality of the library's string-handling routines.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)

avie@wb1.cs.cmu.edu.UUCP (03/31/87)

In article <1531@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>As a brief addition to my previous posting, when I used my own strcpy and strcmp
>routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from
>1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp).  Dhrystone seems to
>be highly sensitive to the quality of the library's string-handling routines.
>
>					Ehud Reiter
>					reiter@harvard	(ARPA,BITNET,UUCP)

You've got to be careful when you run benchmarks.  If you are indeed running
4.3, then you are most likely using the 4.3 libc, which has been carefully
optimized to use the fancy VAX instructions for the string routines.
Unfortunately, some of these instructions are not implemented by the
MicroVAX-II hardware.  As it turns out, what is happening is that your tests
(including Dhrystone) are causing kernel traps to emulate those
instructions!

	Avie

guy@gorodish.UUCP (03/31/87)

> This made me curious, and I started playing around with alternate versions of
> strcpy.  I did some timings on a microVAX II (cc -O, 4.3BSD) using the
> library strcpy...

Note that the library "strcpy" uses "locc" to find the length of the
source string and then does a "movc3" to copy it.  This requires two
passes over the source string.  Whether the whizzo VAX string
twiddling instructions are a win or not depends on how long the
strings are.  (Also, which of the whizzo string instructions does the
microVAX II implement in microcode and which are handled in
software?)

> I also ran some tests on a SUN-3/180 (SUN UNIX 4.2

Which release?   The "UNIX 4.2" is a conventional phrase.  The 4.2
refers to 4.2BSD; it's not the release number.

> routine	  inline?	time (sec)

> strcpy	  no		 22.0
> strcpy2	  no		 21.1

This is almost certainly not significant.  The code to the SunOS 3.0
version of "strcpy":

	char *
	strcpy(s1, s2)
		register char *s1, *s2;
	{
		register char *os1;

		os1 = s1;
		while (*s1++ = *s2++)
			;
		return (os1);
	}

The only difference is that yours doesn't return a pointer to the
original string (which it has to if it's to be compatible), so the
differences are almost certainly insignificant.  The SunOS 3.2
version is:

	ENTRY(strcpy)
		movl	PARAM,a0	| s1
		movl	PARAM2,a1	| s2
		movl	a0,d0		| return s1 at the end

		moveq	#-1,d1		| count of 65535
	| The following loop runs in loop mode on 68010
	1$:	movb	a1@+,a0@+	| move byte from s2 to s1
		dbeq	d1,1$
		bne	1$		| if zero byte seen, done

		RET

which is more-or-less the same thing, only using the 68010's moral
equivalent of whizzo string instructions.  In the cases I tested it
on, it was faster than the C version.  (Thanks to John Gilmore and
Vaughan Pratt for the little "bne 1$" trick at the end there.)
Unrolling the loop, as you did, might be a bigger win, especially on
the 68020 where even the unrolled loop would probably fit in the
instruction cache.

davidsen@steinmetz.UUCP (03/31/87)

In article <1530@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
$In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out
$that printf could be recoded to run much faster.
...
$strcpy2(ss1,ss2)
$char *ss1,*ss2;<---------------------------- put "register" on this line
${	register char *s1,*s2;<-------------- as many compilers will use this
$					      as local variables overriding
$	s1 = ss1;			      the arguments.
$	s2 = ss2;
$	while (*s1++ = *s2++);	}

I'm going to try these on some of the machines I use, and see if this
will speed up things.

-- 
bill davidsen			sixhub \
      ihnp4!seismo!rochester!steinmetz ->  crdos1!davidsen
				chinet /
ARPA: davidsen%crdos1.uucp@ge-crd.ARPA (or davidsen@ge-crd.ARPA)

reiter@endor.UUCP (03/31/87)

The point I was trying to make in my previous posting was the general one
that while lots of people worry about optimizing hardware architectures
and compilers, fewer people seem to worry about optimizing run-time libraries,
with the microVAX II 4.3BSD library apparently being a particularly bad
example.  Is it too much too ask that, for common routines (certainly printf,
the str* routines, etc), the following be done:

1) As has been suggested previously, complex routines like printf should be
optimized to run fastest in the most common cases.

2) Simple routines like strcpy should be adjusted to perform well on a
particular architecture (if the microVAX doesn't have a hardware locc
instruction, then is it too much to ask that the run-time library supplied
for the microVAX be changed not to use locc, at least in small and frequently
used routines like strcpy?).  A moderate degree of loop unrolling should be
considered.

3) Simple routines like strcpy should be recoded in assembler, at least to
the degree of having their procedure prologues simplified, and so that they
use registers which don't have to be restored.

4) In-line expansion of common (and simple) library routines should be
considered.

The above points are all fairly simple and obvious, but I suspect that
in many cases they have not been done.  My impression has been that far more
care and thought go into hardware design and compiler design than go into
run-time library design, although it may be easier (perhaps because of this!)
to squeeze an extra 10-20% performance from the system by improving the library
than by improving hardware or the compiler.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)

P.S. A few people have pointed out that saying a SUN-3 runs 4.2 isn't saying
much.  Sorry about that!  My SUN-3/180 runs 3.1.

bright@dataio.UUCP (03/31/87)

In article <5@wb1.cs.cmu.edu] avie@wb1.cs.cmu.edu (Avadis Tevanian) writes:
] In article <1531@husc6.UUCP] reiter@harvard.UUCP (Ehud Reiter) writes:
]] As a brief addition to my previous posting, when I used my own strcpy and strcmp
]] routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from
]] 1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp).
] You've got to be careful when you run benchmarks.  If you are indeed running
] 4.3, then you are most likely using the 4.3 libc, which has been carefully
] optimized to use the fancy VAX instructions for the string routines.
] Unfortunately, some of these instructions are not implemented by the
] MicroVAX-II hardware.  As it turns out, what is happening is that your tests
] (including Dhrystone) are causing kernel traps to emulate those
] instructions!

The fix, of course, is:
	o Have the startup module determine what version of the CPU is running
	  and set a global.
	o Have the string functions use different instruction sequences,
	  depending on which CPU is being used.
This is similar to how the 8087 is used on the PC.

robert@sri-spam.UUCP (04/01/87)

In article <1359@steinmetz.steinmetz.UUCP> davidsen@kbsvax.steinmetz.UUCP (William E. Davidsen Jr) writes:
>$In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out
>$that printf could be recoded to run much faster.
>...
>$strcpy2(ss1,ss2)
>$char *ss1,*ss2;<---------------------------- put "register" on this line
>${	register char *s1,*s2;<-------------- as many compilers will use this
>$					      as local variables overriding
>$	s1 = ss1;			      the arguments.
>$	s2 = ss2;
>$	while (*s1++ = *s2++);	}
>
>I'm going to try these on some of the machines I use, and see if this
>will speed up things.

To further speed this up you could write it as:

	strcpy2(ss1,ss2)
	register char *ss1, *ss2;
	{
	    while (*s1++ = *s2++)
	    ;
	}

thus eliminating two assignments and two local variables.

-- 
---------------------------------------------------------
    Glover (upon discovering that Gibson is his new partner):
	"God hates me, that's what the problem is!"

    Gibson (blowing cigarette smoke out his nose):
	"Hate 'em back, it works for me!"

    Gibson (later in the film, with handgun):
	"PUNTA, PUNTA, PUNTA!!"

Robert Allen,
robert@spam.istc.sri.com
---------------------------------------------------------

msb@sq.UUCP (04/01/87)

> Note that the library "strcpy" uses "locc" to find the length of the
> source string and then does a "movc3" to copy it.  This requires two
> passes over the source string.  Whether the whizzo VAX string
> twiddling instructions are a win or not depends on how long the
> strings are.

Note also that a two-pass strcpy() should not be used in any environment
where two processes may share data space... or in "ANSI C" jargon, where
the source string is volatile.  If you did that, you might find that you
have scribbled on bytes after the terminating null, or produced a
destination string with no terminating null at all.  (The latter effect
can of course be avoided by inserting the null separately rather than
copying it.)

When I had to do some work on VMS a couple of years ago, we found that
the VMS implementation of strcpy() was considerably slower than locc-movc3
for the string lengths we were working on, and we decided that this was
the reason.

Mark Brader

chris@mimsy.UUCP (04/01/87)

[To comp.arch readers who just want to read about architectures:
skip this.  It is just correcting various points in other articles,
none of which had much to do with architecture in the first place.]

In some article, someone adds these arrows and notes:
>>$strcpy2(ss1,ss2)
>>$char *ss1,*ss2;<---------------------------- put "register" on this line
>>${	register char *s1,*s2;<-------------- as many compilers will use this
>>$					      as local variables overriding
>>$	s1 = ss1;			      the arguments.
	[etc]

In article <9996@sri-spam.istc.sri.com> robert@sri-spam.istc.sri.com
(Robert Allen) writes:
>To further speed this up you could write it as:
>	strcpy2(ss1,ss2)
>	register char *ss1, *ss2;
>	{
>	    while (*s1++ = *s2++)
>	    ;
>	}
>thus eliminating two assignments and two local variables.

Now these people all seem to be talking about Vaxen; I feel I must
point out the various errors here.  It is clear that neither the
original someone nor Robert Allen actually compiled these to
assembly.  (Use `cc -O -S' to do this.)

First, the notes attached to the arrows are wrong.  Compilers cannot
override the arguments when the names differ.  The loop copies from
s2 to s1; the parameters were ss2 and ss1.  Second, the code produced
for the second example (once corrected to `while (*ss1++ = *ss2++)')
is *identical* to the code for the first!  (Vax running 32V, 3BSD,
4BSD, and, no doubt, Sys3, Sys5, V8, and V9.)  The two assignments
the second version attempts to eliminate must still occur, for
`ss1' and `ss2' are passed on the stack, and must be copied into
the two registers---the same two registers allocated in the first
version.

The proper way to speed strcpy() on a MicroVAX-II is no doubt to
use the following assembly code:

	_strcpy:.globl	_strcpy
		.word	0		# save no registers
		movq	4(ap),r1	# get s1 and s2 into r1 and r2
		movl	r1,r0		# save s1
	1:	movb	(r2)+,(r1)+	# *s1++ = *s2++
		bneq	1b		# loop until a zero is moved
		ret			# return original s1 in r0

Note that this is remarkably similar to the compiler's output
for the original code, modified to have the proper return value:

	char *
	strcpy(ss1, ss2)
		char *ss1;
		char *ss2;
	{
		register char *s1 = ss1, *s2 = ss2;

		while ((*s1++ = *s2++) != 0)
			/* void */;
		return (ss1);		/* must return the original value */
	}

		.globl	_strcpy
	_strcpy:
		.word	0xc00		# save r11 and r10
		movl	4(ap),r11	# here is s1
		movl	8(ap),r10	# and s2
	L16:	movb	(r10)+,(r11)+	# *s1++ = *s2++
		jneq	L16		# (this assembles to a `bneq')
		movl	4(ap),r0	# return the original s1
		ret

All one can improve on the locc-poor MicroVAX-II is the register
usage and the parameter grabbing.  (c2, at least from 32V to 4.3BSD,
will never turn two `movl's into a `movq'.  Ah well.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!mimsy!chris	ARPA/CSNet:	chris@mimsy.umd.edu

radford@calgary.UUCP (04/02/87)

In article <6071@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:

> The proper way to speed strcpy() on a MicroVAX-II is no doubt to
> use the following assembly code:
> 
> 	_strcpy:.globl	_strcpy
> 		.word	0		# save no registers
> 		movq	4(ap),r1	# get s1 and s2 into r1 and r2
> 		movl	r1,r0		# save s1
> 	1:	movb	(r2)+,(r1)+	# *s1++ = *s2++
> 		bneq	1b		# loop until a zero is moved
> 		ret			# return original s1 in r0
> 
> Note that this is remarkably similar to the compiler's output
> for the original code, modified to have the proper return value:

> All one can improve on the locc-poor MicroVAX-II is the register
> usage and the parameter grabbing.  (c2, at least from 32V to 4.3BSD,
> will never turn two `movl's into a `movq'.  Ah well.)

Replacing your loop with:

	1:	movb	(r2)+,(r1)+
		bequ	2f
		movb	(r2)+,(r1)+
		bneq	1b
	2:

will almost certainly speed things up (say 15%). I haven't actually
tried it, but I've tried entirely analogous cases. Loop unrolling can
produce speed-up even when the instruction count is unchanged, if
taken branches are replaced by untaken branches. 

    Radford Neal

henry@utzoo.UUCP (Henry Spencer) (04/03/87)

> Unrolling the loop, as you did, might be a bigger win, especially on
> the 68020 where even the unrolled loop would probably fit in the
> instruction cache.

Mmmm, not necessarily.  I had to study this sort of thing in another
context recently.  When the number of iterations is short -- remembering
that strcpy is often used for quite short strings -- an unrolled loop can
be slower than a simple one.  The extra instruction fetches on the first
iteration of the unrolled loop can cost more than the extra loop control
in the simple loop.

(When the number of iterations is long, there is an optimal degree of
unrolling.  For simple cache designs, the curve of "time taken for a long
copy" against the log of the unrolling factor is beautifully symmetrical,
with a definite and specific minimum.  I haven't analyzed the situation
for more complex caches, although I conjecture vaguely similar results.)
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

levy@ttrdc.UUCP (04/05/87)

In the discussion about strcpy, nobody has noticed that strcpy is SUPPOSED
TO RETURN THE FIRST POINTER PASSED TO IT.  This is one reason why the library
version is "slower" than 

	void strcpy(s1,s2) register char *s1,*s2; { while (*s1++ = *s2++); }


-- 
|------------dan levy------------|  Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa,
|         an engihacker @        |		vax135}!ttrdc!ttrda!levy
| at&t computer systems division |  Disclaimer:  try datclaimer.
|--------skokie, illinois--------|

adam@miduet.UUCP (04/10/87)

In article <9996@sri-spam.istc.sri.com> robert@sri-spam.UUCP (Robert Allen) writes:
>
>To further speed this up you could write it as:
>
>	strcpy2(ss1,ss2)
>	register char *ss1, *ss2;
>	{
>	    while (*s1++ = *s2++)
>	    ;
>	}
>
>thus eliminating two assignments and two local variables.
>
If your machine doesn't complain at char-int alignment, try

strcmp2(s1, s2)
register short int *s1, *s2;
{
	while (*s1 == *s2++)
		if (!((*s1++ & 255) && (*(char *)s1)))
			return(0);
	return(*s1 - *--s2);
}

This relies on the fact that it doesn't matter if you compare the NULLs at the
end of two equal strings or not. We go along the string twice as fast because
of the 16 bit operations, although there is a slight penalty due to the extra
test. The alignment problem only comes in if either of 
the strings begins on an odd boundary, which is fairly unusual anyway. 
Also watch the last part of the 'if' condition - this won't work on 
some machines, you may have to add 1 to the pointer.
-- 
       -Adam.

/* If at first it don't compile, kludge, kludge again.*/

aeusemrs@csun.UUCP (Mike Stump) (04/13/87)

In article <1987Apr1.121330.15976@sq.uucp> msb@sq.UUCP (Mark Brader) writes:
|Note also that a two-pass strcpy() should not be used in any environment
|where two processes may share data space... or in "ANSI C" jargon, where
|the source string is volatile.  If you did that, you might find that you
|have scribbled on bytes after the terminating null, or produced a
|destination string with no terminating null at all.  (The latter effect
|can of course be avoided by inserting the null separately rather than
|copying it.)
|
|Mark Brader


Well, not to pick on you, but, have you ever done ANY programming using shared
memory, succesfully, and bug free?  Tell me, how would you do something like a
strcpy in such an environment?  Please include an exact example, so that I can
show you what is wrong with it.  In orther words, why limit a two-pass strcpy
to the above?  You should just say:  ``No one should use shared memory, unless
he/she knows what they are doing''
-- 
Mike Stump, Cal State Univ, Northridge Comp Sci Department
uucp: {sdcrdcf, ihnp4, hplabs, ttidca, psivax, csustan}!csun!aeusemrs

john@viper.UUCP (04/14/87)

In article <531@gec-mi-at.co.uk> adam@gec-mi-at.co.uk (Adam Quantrill) writes:
 >
 >strcmp2(s1, s2)
 >register short int *s1, *s2;
 >{
 >	while (*s1 == *s2++)
 >		if (!((*s1++ & 255) && (*(char *)s1)))
 >			return(0);
 >	return(*s1 - *--s2);
 >}
 >
 >This relies on the fact that it doesn't matter if you compare the NULLs at the
 >end of two equal strings or not. We go along the string twice as fast because
 >of the 16 bit operations, although there is a slight penalty due to the extra
 >test. The alignment problem only comes in if either of 
 >the strings begins on an odd boundary, which is fairly unusual anyway. 
 >Also watch the last part of the 'if' condition - this won't work on 
 >some machines, you may have to add 1 to the pointer.
 >-- 
 >       -Adam.
 >
 >/* If at first it don't compile, kludge, kludge again.*/

This might work on some machines, but don't even think about it being
portable..  It isn't...  The most obvious problem will occur on any
machine where the order of bytes in an int variable is [low, high].
This will cause the last like to fail to return the correct greater/
lessthan result....

There's also the problem you mentioned about the odd int boundry.  It may
be "fairly unusual anyway" because your coding style tends to place
strings on even boundrys without even thinking about it.  I admit that
it hasn't happend in most of my code either, but if you just once tried
porting something that assumed even string boundrys to a compiler that
packs the static stings end to end, you'd realize this isn't a good idea.

}

--- 
John Stanley (john@viper.UUCP)
Software Consultant - DynaSoft Systems
UUCP: ...{amdahl,ihnp4,rutgers}!{meccts,dayton}!viper!john

flaps@utcsri.UUCP (04/15/87)

In article <531@gec-mi-at.co.uk> adam@gec-mi-at.co.uk (Adam Quantrill) writes:
>The alignment problem only comes in if either of 
>the strings begins on an odd boundary, which is fairly unusual anyway. 

This is extremely usual!  Suppose you have a string where the first character
indicates what kind of string it is, I have this in a big program I'm working
on now.  I therefore often say "dosomething(&string[1])".

In other words, the problem with these kinds of assumptions is that you can
add odd numbers to char pointers to get final substrings.

-- 

Alan J Rosenthal

flaps@csri.toronto.edu, {seismo!utai or utzoo}!utcsri!flaps,
flaps@toronto on csnet, flaps at utorgpu on bitnet.

"Probably the best operating system in the world is the [operating system]
made for the PDP-11 by Bell Laboratories." - Ted Nelson, October 1977