reiter@endor.UUCP (03/31/87)
In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out that printf could be recoded to run much faster. This made me curious, and I started playing around with alternate versions of strcpy. I did some timings on a microVAX II (cc -O, 4.3BSD) using the library strcpy, a straightforward version of my own (strcpy2), and a version which tried to minimize branches out of the instruction sequence (strcpy3). I ran strcpy2 and strcpy3 both as procedure calls and as in-line expansions. The timings, for 1,000,000 executions of copying "Test string 1", were routine inline? time (sec) strcpy no 134.7 strcpy2 no 65.2 strcpy3 no 51.6 strcpy2 yes 43.3 strcpy3 yes 30.8 The most shocking thing about the above is that the library strcpy is half the speed of a very straightforward C implementation of the routine! Perhaps we should spend less time arguing about hardware support for string-handling routines, and more time making sure that the people who implement the run-time library are somewhat competent. It also certainly appears that in-line expansion of simple library routines is a big win. I also ran some tests on a SUN-3/180 (SUN UNIX 4.2, cc -O). The results were: routine inline? time (sec) strcpy no 22.0 strcpy2 no 21.1 strcpy3 no 20.1 strcpy2 yes 15.1 strcpy3 yes 13.7 The SUN library routines seem much more competently coded, but even here, perhaps a bit of tuning would help (note that my timing was pretty sloppy, so small differences may not be significant), and in-line expansion would certainly be a big win. Routine codes: strcpy2(ss1,ss2) char *ss1,*ss2; { register char *s1,*s2; s1 = ss1; s2 = ss2; while (*s1++ = *s2++); } strcpy3(ss1,ss2) char *ss1,*ss2; { register char *s1,*s2; s1 = ss1; s2 = ss2; while (*s1++ = *s2++) { if (!(*s1++ = *s2++)) break; if (!(*s1++ = *s2++)) break; < the above statement is repeated 20 times > } } *************************************************************************** Ehud Reiter reiter@harvard (ARPA, BITNET, UUCP)
reiter@endor.UUCP (03/31/87)
As a brief addition to my previous posting, when I used my own strcpy and strcmp routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from 1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp). Dhrystone seems to be highly sensitive to the quality of the library's string-handling routines. Ehud Reiter reiter@harvard (ARPA,BITNET,UUCP)
avie@wb1.cs.cmu.edu.UUCP (03/31/87)
In article <1531@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes: >As a brief addition to my previous posting, when I used my own strcpy and strcmp >routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from >1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp). Dhrystone seems to >be highly sensitive to the quality of the library's string-handling routines. > > Ehud Reiter > reiter@harvard (ARPA,BITNET,UUCP) You've got to be careful when you run benchmarks. If you are indeed running 4.3, then you are most likely using the 4.3 libc, which has been carefully optimized to use the fancy VAX instructions for the string routines. Unfortunately, some of these instructions are not implemented by the MicroVAX-II hardware. As it turns out, what is happening is that your tests (including Dhrystone) are causing kernel traps to emulate those instructions! Avie
guy@gorodish.UUCP (03/31/87)
> This made me curious, and I started playing around with alternate versions of > strcpy. I did some timings on a microVAX II (cc -O, 4.3BSD) using the > library strcpy... Note that the library "strcpy" uses "locc" to find the length of the source string and then does a "movc3" to copy it. This requires two passes over the source string. Whether the whizzo VAX string twiddling instructions are a win or not depends on how long the strings are. (Also, which of the whizzo string instructions does the microVAX II implement in microcode and which are handled in software?) > I also ran some tests on a SUN-3/180 (SUN UNIX 4.2 Which release? The "UNIX 4.2" is a conventional phrase. The 4.2 refers to 4.2BSD; it's not the release number. > routine inline? time (sec) > strcpy no 22.0 > strcpy2 no 21.1 This is almost certainly not significant. The code to the SunOS 3.0 version of "strcpy": char * strcpy(s1, s2) register char *s1, *s2; { register char *os1; os1 = s1; while (*s1++ = *s2++) ; return (os1); } The only difference is that yours doesn't return a pointer to the original string (which it has to if it's to be compatible), so the differences are almost certainly insignificant. The SunOS 3.2 version is: ENTRY(strcpy) movl PARAM,a0 | s1 movl PARAM2,a1 | s2 movl a0,d0 | return s1 at the end moveq #-1,d1 | count of 65535 | The following loop runs in loop mode on 68010 1$: movb a1@+,a0@+ | move byte from s2 to s1 dbeq d1,1$ bne 1$ | if zero byte seen, done RET which is more-or-less the same thing, only using the 68010's moral equivalent of whizzo string instructions. In the cases I tested it on, it was faster than the C version. (Thanks to John Gilmore and Vaughan Pratt for the little "bne 1$" trick at the end there.) Unrolling the loop, as you did, might be a bigger win, especially on the 68020 where even the unrolled loop would probably fit in the instruction cache.
davidsen@steinmetz.UUCP (03/31/87)
In article <1530@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
$In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out
$that printf could be recoded to run much faster.
...
$strcpy2(ss1,ss2)
$char *ss1,*ss2;<---------------------------- put "register" on this line
${ register char *s1,*s2;<-------------- as many compilers will use this
$ as local variables overriding
$ s1 = ss1; the arguments.
$ s2 = ss2;
$ while (*s1++ = *s2++); }
I'm going to try these on some of the machines I use, and see if this
will speed up things.
--
bill davidsen sixhub \
ihnp4!seismo!rochester!steinmetz -> crdos1!davidsen
chinet /
ARPA: davidsen%crdos1.uucp@ge-crd.ARPA (or davidsen@ge-crd.ARPA)
reiter@endor.UUCP (03/31/87)
The point I was trying to make in my previous posting was the general one that while lots of people worry about optimizing hardware architectures and compilers, fewer people seem to worry about optimizing run-time libraries, with the microVAX II 4.3BSD library apparently being a particularly bad example. Is it too much too ask that, for common routines (certainly printf, the str* routines, etc), the following be done: 1) As has been suggested previously, complex routines like printf should be optimized to run fastest in the most common cases. 2) Simple routines like strcpy should be adjusted to perform well on a particular architecture (if the microVAX doesn't have a hardware locc instruction, then is it too much to ask that the run-time library supplied for the microVAX be changed not to use locc, at least in small and frequently used routines like strcpy?). A moderate degree of loop unrolling should be considered. 3) Simple routines like strcpy should be recoded in assembler, at least to the degree of having their procedure prologues simplified, and so that they use registers which don't have to be restored. 4) In-line expansion of common (and simple) library routines should be considered. The above points are all fairly simple and obvious, but I suspect that in many cases they have not been done. My impression has been that far more care and thought go into hardware design and compiler design than go into run-time library design, although it may be easier (perhaps because of this!) to squeeze an extra 10-20% performance from the system by improving the library than by improving hardware or the compiler. Ehud Reiter reiter@harvard (ARPA,BITNET,UUCP) P.S. A few people have pointed out that saying a SUN-3 runs 4.2 isn't saying much. Sorry about that! My SUN-3/180 runs 3.1.
bright@dataio.UUCP (03/31/87)
In article <5@wb1.cs.cmu.edu] avie@wb1.cs.cmu.edu (Avadis Tevanian) writes: ] In article <1531@husc6.UUCP] reiter@harvard.UUCP (Ehud Reiter) writes: ]] As a brief addition to my previous posting, when I used my own strcpy and strcmp ]] routines, the Dhrystone rating of my microVAX II (cc -O, 4.3BSD) jumped from ]] 1000 (library strcpy, strcmp) to 1400 (my strcpy, strcmp). ] You've got to be careful when you run benchmarks. If you are indeed running ] 4.3, then you are most likely using the 4.3 libc, which has been carefully ] optimized to use the fancy VAX instructions for the string routines. ] Unfortunately, some of these instructions are not implemented by the ] MicroVAX-II hardware. As it turns out, what is happening is that your tests ] (including Dhrystone) are causing kernel traps to emulate those ] instructions! The fix, of course, is: o Have the startup module determine what version of the CPU is running and set a global. o Have the string functions use different instruction sequences, depending on which CPU is being used. This is similar to how the 8087 is used on the PC.
robert@sri-spam.UUCP (04/01/87)
In article <1359@steinmetz.steinmetz.UUCP> davidsen@kbsvax.steinmetz.UUCP (William E. Davidsen Jr) writes: >$In article <1944@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) pointed out >$that printf could be recoded to run much faster. >... >$strcpy2(ss1,ss2) >$char *ss1,*ss2;<---------------------------- put "register" on this line >${ register char *s1,*s2;<-------------- as many compilers will use this >$ as local variables overriding >$ s1 = ss1; the arguments. >$ s2 = ss2; >$ while (*s1++ = *s2++); } > >I'm going to try these on some of the machines I use, and see if this >will speed up things. To further speed this up you could write it as: strcpy2(ss1,ss2) register char *ss1, *ss2; { while (*s1++ = *s2++) ; } thus eliminating two assignments and two local variables. -- --------------------------------------------------------- Glover (upon discovering that Gibson is his new partner): "God hates me, that's what the problem is!" Gibson (blowing cigarette smoke out his nose): "Hate 'em back, it works for me!" Gibson (later in the film, with handgun): "PUNTA, PUNTA, PUNTA!!" Robert Allen, robert@spam.istc.sri.com ---------------------------------------------------------
msb@sq.UUCP (04/01/87)
> Note that the library "strcpy" uses "locc" to find the length of the > source string and then does a "movc3" to copy it. This requires two > passes over the source string. Whether the whizzo VAX string > twiddling instructions are a win or not depends on how long the > strings are. Note also that a two-pass strcpy() should not be used in any environment where two processes may share data space... or in "ANSI C" jargon, where the source string is volatile. If you did that, you might find that you have scribbled on bytes after the terminating null, or produced a destination string with no terminating null at all. (The latter effect can of course be avoided by inserting the null separately rather than copying it.) When I had to do some work on VMS a couple of years ago, we found that the VMS implementation of strcpy() was considerably slower than locc-movc3 for the string lengths we were working on, and we decided that this was the reason. Mark Brader
chris@mimsy.UUCP (04/01/87)
[To comp.arch readers who just want to read about architectures: skip this. It is just correcting various points in other articles, none of which had much to do with architecture in the first place.] In some article, someone adds these arrows and notes: >>$strcpy2(ss1,ss2) >>$char *ss1,*ss2;<---------------------------- put "register" on this line >>${ register char *s1,*s2;<-------------- as many compilers will use this >>$ as local variables overriding >>$ s1 = ss1; the arguments. [etc] In article <9996@sri-spam.istc.sri.com> robert@sri-spam.istc.sri.com (Robert Allen) writes: >To further speed this up you could write it as: > strcpy2(ss1,ss2) > register char *ss1, *ss2; > { > while (*s1++ = *s2++) > ; > } >thus eliminating two assignments and two local variables. Now these people all seem to be talking about Vaxen; I feel I must point out the various errors here. It is clear that neither the original someone nor Robert Allen actually compiled these to assembly. (Use `cc -O -S' to do this.) First, the notes attached to the arrows are wrong. Compilers cannot override the arguments when the names differ. The loop copies from s2 to s1; the parameters were ss2 and ss1. Second, the code produced for the second example (once corrected to `while (*ss1++ = *ss2++)') is *identical* to the code for the first! (Vax running 32V, 3BSD, 4BSD, and, no doubt, Sys3, Sys5, V8, and V9.) The two assignments the second version attempts to eliminate must still occur, for `ss1' and `ss2' are passed on the stack, and must be copied into the two registers---the same two registers allocated in the first version. The proper way to speed strcpy() on a MicroVAX-II is no doubt to use the following assembly code: _strcpy:.globl _strcpy .word 0 # save no registers movq 4(ap),r1 # get s1 and s2 into r1 and r2 movl r1,r0 # save s1 1: movb (r2)+,(r1)+ # *s1++ = *s2++ bneq 1b # loop until a zero is moved ret # return original s1 in r0 Note that this is remarkably similar to the compiler's output for the original code, modified to have the proper return value: char * strcpy(ss1, ss2) char *ss1; char *ss2; { register char *s1 = ss1, *s2 = ss2; while ((*s1++ = *s2++) != 0) /* void */; return (ss1); /* must return the original value */ } .globl _strcpy _strcpy: .word 0xc00 # save r11 and r10 movl 4(ap),r11 # here is s1 movl 8(ap),r10 # and s2 L16: movb (r10)+,(r11)+ # *s1++ = *s2++ jneq L16 # (this assembles to a `bneq') movl 4(ap),r0 # return the original s1 ret All one can improve on the locc-poor MicroVAX-II is the register usage and the parameter grabbing. (c2, at least from 32V to 4.3BSD, will never turn two `movl's into a `movq'. Ah well.) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690) UUCP: seismo!mimsy!chris ARPA/CSNet: chris@mimsy.umd.edu
radford@calgary.UUCP (04/02/87)
In article <6071@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes: > The proper way to speed strcpy() on a MicroVAX-II is no doubt to > use the following assembly code: > > _strcpy:.globl _strcpy > .word 0 # save no registers > movq 4(ap),r1 # get s1 and s2 into r1 and r2 > movl r1,r0 # save s1 > 1: movb (r2)+,(r1)+ # *s1++ = *s2++ > bneq 1b # loop until a zero is moved > ret # return original s1 in r0 > > Note that this is remarkably similar to the compiler's output > for the original code, modified to have the proper return value: > All one can improve on the locc-poor MicroVAX-II is the register > usage and the parameter grabbing. (c2, at least from 32V to 4.3BSD, > will never turn two `movl's into a `movq'. Ah well.) Replacing your loop with: 1: movb (r2)+,(r1)+ bequ 2f movb (r2)+,(r1)+ bneq 1b 2: will almost certainly speed things up (say 15%). I haven't actually tried it, but I've tried entirely analogous cases. Loop unrolling can produce speed-up even when the instruction count is unchanged, if taken branches are replaced by untaken branches. Radford Neal
henry@utzoo.UUCP (Henry Spencer) (04/03/87)
> Unrolling the loop, as you did, might be a bigger win, especially on > the 68020 where even the unrolled loop would probably fit in the > instruction cache. Mmmm, not necessarily. I had to study this sort of thing in another context recently. When the number of iterations is short -- remembering that strcpy is often used for quite short strings -- an unrolled loop can be slower than a simple one. The extra instruction fetches on the first iteration of the unrolled loop can cost more than the extra loop control in the simple loop. (When the number of iterations is long, there is an optimal degree of unrolling. For simple cache designs, the curve of "time taken for a long copy" against the log of the unrolling factor is beautifully symmetrical, with a definite and specific minimum. I haven't analyzed the situation for more complex caches, although I conjecture vaguely similar results.) -- "We must choose: the stars or Henry Spencer @ U of Toronto Zoology the dust. Which shall it be?" {allegra,ihnp4,decvax,pyramid}!utzoo!henry
levy@ttrdc.UUCP (04/05/87)
In the discussion about strcpy, nobody has noticed that strcpy is SUPPOSED TO RETURN THE FIRST POINTER PASSED TO IT. This is one reason why the library version is "slower" than void strcpy(s1,s2) register char *s1,*s2; { while (*s1++ = *s2++); } -- |------------dan levy------------| Path: ..!{akgua,homxb,ihnp4,ltuxa,mvuxa, | an engihacker @ | vax135}!ttrdc!ttrda!levy | at&t computer systems division | Disclaimer: try datclaimer. |--------skokie, illinois--------|
adam@miduet.UUCP (04/10/87)
In article <9996@sri-spam.istc.sri.com> robert@sri-spam.UUCP (Robert Allen) writes: > >To further speed this up you could write it as: > > strcpy2(ss1,ss2) > register char *ss1, *ss2; > { > while (*s1++ = *s2++) > ; > } > >thus eliminating two assignments and two local variables. > If your machine doesn't complain at char-int alignment, try strcmp2(s1, s2) register short int *s1, *s2; { while (*s1 == *s2++) if (!((*s1++ & 255) && (*(char *)s1))) return(0); return(*s1 - *--s2); } This relies on the fact that it doesn't matter if you compare the NULLs at the end of two equal strings or not. We go along the string twice as fast because of the 16 bit operations, although there is a slight penalty due to the extra test. The alignment problem only comes in if either of the strings begins on an odd boundary, which is fairly unusual anyway. Also watch the last part of the 'if' condition - this won't work on some machines, you may have to add 1 to the pointer. -- -Adam. /* If at first it don't compile, kludge, kludge again.*/
aeusemrs@csun.UUCP (Mike Stump) (04/13/87)
In article <1987Apr1.121330.15976@sq.uucp> msb@sq.UUCP (Mark Brader) writes: |Note also that a two-pass strcpy() should not be used in any environment |where two processes may share data space... or in "ANSI C" jargon, where |the source string is volatile. If you did that, you might find that you |have scribbled on bytes after the terminating null, or produced a |destination string with no terminating null at all. (The latter effect |can of course be avoided by inserting the null separately rather than |copying it.) | |Mark Brader Well, not to pick on you, but, have you ever done ANY programming using shared memory, succesfully, and bug free? Tell me, how would you do something like a strcpy in such an environment? Please include an exact example, so that I can show you what is wrong with it. In orther words, why limit a two-pass strcpy to the above? You should just say: ``No one should use shared memory, unless he/she knows what they are doing'' -- Mike Stump, Cal State Univ, Northridge Comp Sci Department uucp: {sdcrdcf, ihnp4, hplabs, ttidca, psivax, csustan}!csun!aeusemrs
john@viper.UUCP (04/14/87)
In article <531@gec-mi-at.co.uk> adam@gec-mi-at.co.uk (Adam Quantrill) writes: > >strcmp2(s1, s2) >register short int *s1, *s2; >{ > while (*s1 == *s2++) > if (!((*s1++ & 255) && (*(char *)s1))) > return(0); > return(*s1 - *--s2); >} > >This relies on the fact that it doesn't matter if you compare the NULLs at the >end of two equal strings or not. We go along the string twice as fast because >of the 16 bit operations, although there is a slight penalty due to the extra >test. The alignment problem only comes in if either of >the strings begins on an odd boundary, which is fairly unusual anyway. >Also watch the last part of the 'if' condition - this won't work on >some machines, you may have to add 1 to the pointer. >-- > -Adam. > >/* If at first it don't compile, kludge, kludge again.*/ This might work on some machines, but don't even think about it being portable.. It isn't... The most obvious problem will occur on any machine where the order of bytes in an int variable is [low, high]. This will cause the last like to fail to return the correct greater/ lessthan result.... There's also the problem you mentioned about the odd int boundry. It may be "fairly unusual anyway" because your coding style tends to place strings on even boundrys without even thinking about it. I admit that it hasn't happend in most of my code either, but if you just once tried porting something that assumed even string boundrys to a compiler that packs the static stings end to end, you'd realize this isn't a good idea. } --- John Stanley (john@viper.UUCP) Software Consultant - DynaSoft Systems UUCP: ...{amdahl,ihnp4,rutgers}!{meccts,dayton}!viper!john
flaps@utcsri.UUCP (04/15/87)
In article <531@gec-mi-at.co.uk> adam@gec-mi-at.co.uk (Adam Quantrill) writes: >The alignment problem only comes in if either of >the strings begins on an odd boundary, which is fairly unusual anyway. This is extremely usual! Suppose you have a string where the first character indicates what kind of string it is, I have this in a big program I'm working on now. I therefore often say "dosomething(&string[1])". In other words, the problem with these kinds of assumptions is that you can add odd numbers to char pointers to get final substrings. -- Alan J Rosenthal flaps@csri.toronto.edu, {seismo!utai or utzoo}!utcsri!flaps, flaps@toronto on csnet, flaps at utorgpu on bitnet. "Probably the best operating system in the world is the [operating system] made for the PDP-11 by Bell Laboratories." - Ted Nelson, October 1977