[comp.lang.c] String Processing Instruction

bcase@amdcad.UUCP (03/25/87)

There was a discussion a few months ago about processing strings more
efficiently than a byte at a time.  The Am29000 takes one of the possible
approaches to improving string processing performance....


One unique feature of the Am29000 architecture is a special instruction.
This instruction is intended to speed up string processing,
but my guess is that other uses will be discovered.  The instruction is
called "compare-bytes" and works like this:

Compare-bytes specifies two source register operands and one destination
register operand.  The 4 pairs of corresponding bytes of the two 32-bit
source operands are compared for equality (i.e., the two most-significant
bytes are compared, the two next-most-significant bytes are compared, etc.).
If any of the four pairs are equal, then the destination register is set
to the value "TRUE" (which on the Am29000 is a one in the most-significant
bit with all other bits cleared to zero).  If none of the four pairs are
equal, then the destination register is set to "FALSE" (all bits cleared).
(Am29000 conditional branch instructions test only the most significant bit of
a register; condition codes are not used, so we get a free "test for negative.")
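
In C terms, the operation is (a model of the semantics just described,
not Am29000 code; the name "cpb" here is made up for illustration):

	/* model of compare-bytes: TRUE (msb set) if any of the four
	   byte pairs of a and b are equal, FALSE (zero) otherwise */
	unsigned long
	cpb(a, b)
	unsigned long a, b;
	{
		int i;

		for (i = 0; i < 4; i++)
			if (((a >> (8*i)) & 0xff) == ((b >> (8*i)) & 0xff))
				return 0x80000000;
		return 0;
	}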

So, if one of the source operands is set to all zeros (four null characters)
(which can be specified in the instruction by choosing the second operand
as the zero-extended eight-bit constant zero) and the other operand is a
word of the character string being dealt with (say for copying or comparing),
the Am29000 can, in one cycle (not counting the branch), determine if the word
contains the end of string character (according to the C language definition
of string).  If the word does not contain the end of string character, then
the four bytes in the word can be manipulated (e.g. loaded or stored) as a
unit.  Word operations on the Am29000 are much more efficient than character
operations (this is true of most machines though).
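For instance, the inner loop of a word-at-a-time strlen() might look
roughly like this (a sketch in terms of the cpb model above; it assumes
a word-aligned string and ignores the misaligned cases mentioned below):

	int
	strlen_w(s)
	char *s;			/* assumed word-aligned */
	{
		register unsigned long *w = (unsigned long *)s;
		register char *p;

		while (cpb(*w, 0) == 0)	/* no null byte in this word, */
			w++;		/* so skip all 4 bytes at once */
		p = (char *)w;		/* the null is somewhere in *w; */
		while (*p)		/* finish a byte at a time */
			p++;
		return p - s;
	}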

There are, of course, special circumstances to deal with (such as misaligned
strings, and we have a funnel shifter to help in those cases), but by using
the compare-bytes instruction in the library routines strcpy() and strcmp()
(and strlen() too, but we haven't bothered since it seems to never be used in
the programs we have encountered), significant improvements in the run-time
of many C programs can be realized.  Another thing which really helps is to
have the compiler word-align literal strings (and I have implemented this),
but even with word-alignment, some substrings will begin on strange boundaries
and must be dealt with correctly.

My approach to using this instruction consisted of re-writing the library
routines in C with function calls wherever the compare-bytes instruction
should go.  I compiled this C code with my compiler, changed the assembly code
to eliminate the function calls in favor of the compare-bytes instruction, and
assembled it into the library (actually a module of code that gets included in
all final links, but that is just a detail of our simple environment).  Since
most C programs (especially utilities and other systems programs) do a lot of
string processing, this one instruction is really worth the small
implementation cost.  It often improves run times by 15% to 20% (just goes to
show that the impact of processing C language strings has been long ignored).
It implements just the right semantics and probably has other applications for
specialized pattern matching.  

I just thought some of you would be interested.

   bcase

lamaster@pioneer.UUCP (03/26/87)

In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>There was a discussion a few months ago about processing strings more
>efficiently than a byte at a time.  The Am29000 takes one of the possible
>approaches to improving string processing performance....
>

:

>
>I just thought some of you would be interested.
>
>   bcase

I appreciate seeing submissions like this.  Designing a computer instruction
set is a little like backpacking: There is an art to figuring out exactly
what you really need to take with you. 

My question is this:  How likely is it that a compiler itself will be able to
detect the case when it can use an instruction like this and generate code
automatically to use it?  One of the positive points to the RISC debate is
that it brought out the point that useful instructions which are hard for a
compiler to generate are not always a win.  The old STAR-100 had a number of
instructions, including string processing instructions, which were left out of
the Cyber 205 because it turned out that the compilers never generated them.
They were neat instructions though.  So, my question is, is this instruction
easily usable by a C, Fortran, Pascal, or Ada compiler and if so, under what
conditions?


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

baum@apple.UUCP (03/26/87)

--------
[]
>
>One unique feature of the Am29000 architecture is a special instruction.
>This instruction is intended to be used to speed-up string processing,
>but my guess is that other uses will be discovered.  The instruction is
>called "compare-bytes" and works like this:.....

The HP Spectrum has three instructions that are very similar.
 Unit XOR will Exclusive-OR two registers and let you skip if any byte or
   halfword is equal or not equal.

 Unit AddComplement[&Trap on Condition] will add the ones complement of
   one register to another. This is sort of subtraction without the forced
   carry. That way, all bytes, halfwords & nibbles essentially have the
   same carry-in (although a carry out of one stage could force a carry-in
   to another, it turns out this doesn't affect the major uses of the 
   instruction). Then, either a trap will occur or the next instruction will
   be skipped if the conditions are met. The conditions are any halfword/
   byte/nibble carry, any halfword/byte zero, and their complements.

Among other things, these instructions can be used for string searching a
word at a time, decimal/byte/halfword range checking/validation a byte at
a time, and decimal addition pre-biasing.
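
In C terms, the "any byte equal" condition of Unit XOR amounts to a
zero-byte test on the XOR of the two operands; a model, not Spectrum code:

	/* model of Unit XOR's "any byte equal" skip condition */
	int
	any_byte_equal(a, b)
	unsigned long a, b;
	{
		unsigned long x = a ^ b;	/* equal bytes become 0 */
		int i;

		for (i = 0; i < 4; i++)
			if (((x >> (8*i)) & 0xff) == 0)
				return 1;
		return 0;
	}

The halfword and not-equal variants are analogous.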

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mark@mimsy.UUCP (03/26/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>...How likely is it that a compiler [can] detect the case when it can use 
> an instruction like this and generate code automatically to use it?

This is of course a good point, and it seems that this is going to be
one of those hard cases for compiler writers.  But it opens the issue
of subroutine libraries as providing an environment which also needs
support in the instruction set.  The Unix string routines are a good
example, because they are even now often recoded in assembly for efficiency.
Various pieces of Unix kernel frequently suffer the same fate
(e.g. tty buffers).  But once one is into instructions which are good
for certain operating systems or environments, rather than certain languages,
one has narrowed one's market.  For instance, the Am29000 special string
compare is useless if strings are encoded as a length followed by the chars.  
-mark
Background: I happen to think Unix-style strings are a big market, and I love
things which make Unix faster (because there is a positive feedback cycle
which makes more Unix yet more available and desirable.).
-- 
Spoken: Mark Weiser 	ARPA:	mark@mimsy.umd.edu	Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com

bcase@amdcad.UUCP (03/26/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>I appreciate seeing submissions like this.  Designing a computer instruction
>set is a little like backpacking: There is an art to figuring out exactly
>what you really need to take with you. 
Thanks for the positive feedback.  An interesting point of view!

>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it?  One of the positive points to the RISC debate is
>that it brought out the point that useful instructions which are hard for a
>compiler to generate are not always a win.  The old STAR-100 had a number of
>instructions, including string processing instructions, which were left out of
>the Cyber 205 because it turned out that the compilers never generated them.
>They were neat instructions though.  So, my question is, is this instruction
>easily usable by a C, Fortran, Pascal, or Ada compiler and if so, under what
>conditions?

This is an excellent question, one that I intended to address in the original
posting.  However, I forgot to include it when the damn article got kinda
long.

The fact of the matter is:  A C compiler is going to have a real hard time
generating this instruction.  A Pascal compiler might have a slightly
easier time since the concept of "string" is a little more integrated with
the language.  In general, a compiler for a language without built-in types
and functions for dealing with strings will have a hard time with our
compare-bytes instruction.  We realized this when we added it to the
architecture (and it was truly a last-minute addition).  However, it has
such a significant impact on C programs in general that we were not deterred
by the realization that this instruction will probably occur only a few
times in the strcpy(), strcmp(), and strlen() library functions.  Out of
all the megabytes of code in a UNIX system, this instruction will account
for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough
guess).  But if it improves the running time of all the string-oriented
system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
worth it.  And the implementation cost was so small.  Also, there are
some instructions that must be present just to administer the system,
like return from interrupt, move-from-special-register, etc.  These
are not generated by a compiler either.  Just to reiterate a point:  RISC
is not reducing the instruction set but is improving performance.

Ok, so you don't believe the above?  How about "It improved dhrystone
a hell of a lot."

    bcase

guy@gorodish.UUCP (03/26/87)

>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it?

In most C implementations, "strcpy", "strcmp", etc. are library
routines; if the speedup from these instructions is sufficiently
large, it doesn't matter if the compiler can generate them because
the compiler can generate calls to those routines, and those routines
can be written in assembler language.  You could also imagine a
compiler that would recognize names like "_builtin_strcpy" and
generate the assembler code for those routines in line, and a
<string.h> that contains

	#define	strcpy	_builtin_strcpy

(see the footnote on page 86 of the October 1, 1986 ANSI C draft).
Another possibility might be to have something like the WE32000 C
compiler's "asm" directive, which permits you to define "functions"
that are expanded in line to assembler code.

Similar techniques can probably be used in other languages.  If they
do string processing in routines, the routines can be written in
assembler language to use these instructions.  If they generate
stereotyped sequences for these string operations, these stereotyped
sequences can use these instructions.

However, I don't know how difficult it would be to recognize various
special cases and generate non-stereotyped sequences that use these
instructions, or how much of a speedup this would give (if these
special cases weren't common, speeding them up might not buy you
enough to worry about it).

jaw@ames.UUCP (03/27/87)

> [....] Since
> most C programs (especially utilities and other systems programs) do a lot of
> string processing, this one instruction is really worth the small
> implementation cost.  It often improves run times by 15% to 20% (just goes to
> show that the impact of processing C language strings has been long- ignored).

just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
certainly, 'vn' comes to mind as devoting much cpu time to these functions.

is the 15-20% claimed due at all to either

	(a) significantly slow byte addressing to begin with (ala cray)?
	(b) in-line compilation of the string(3) stuff into the application?

if (a), then improving memory byte access speed in the architecture is a
more general solution with more payoff overall than the compare gate hack.
what is the risc chip cost for byte vs. word addressability, anyway?

if (b), then maybe function call speed is the culprit rather than dearth of
the specialized instruction.

at any rate,
for cray unix, buffer copy ops in the kernel were vastly improved when
re-written for words instead of bytes, even more so when vectorized
(the only place in the kernel with vectorization, i think).

of course, table lookup using only 2^16 locations would be a joke
software solution for super-intensive null-char-in-16-bit-smallword
compare code.  drastic, but saves a test the amd chip appears worried about.
personally, i'm a fan of branch folding ...

ames!jaw

mash@mips.UUCP (03/27/87)

In article <15304@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>...
>for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough
>guess).  But if it improves the running time of all the string-oriented
>system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
>worth it.  And the implementation cost was so small.  Also, there are
This is clearly true.
>some instructions that must be present just to administer the system,
>like return from interrupt, move-from-special-register, etc.  These
>are not generated by a compiler either.  Just to reiterate a point:  RISC
>is not reducing the instruction set but is improving performance.
Absolutely! There are almost always "structural" instructions that
are hard to generate, but are needed.
>
>Ok, so you don't believe the above?  How about "It improved dhrystone
>a hell of a lot."

Unfortunately, Dhrystone is an artificial benchmark.  I couldn't resist
doing a quick test.  [We have some pretty neat profiling tools for taking
an already-compiled program & turning it into one that does procedure &
statement counts, then gives you instruction cycle counts, which would be,
typically, 60% of the total cycles, the other 40% being cache misses,
TLB-miss overhead, memory path interference, etc.]
OK: here's a quiz: I did some simple tests [ones we use in our standard
Performance Brief suite, which includes diff, grep, yacc, and nroff],
plus csh.  I won't pretend this is definitive, since I just whipped
it up.  How much time would you guess these programs spend doing
strlen, strcmp, strcpy [the high-runners]? [Just guess the instruction
cycle %].  Try 0-1%, 2-5%, 6-10%, 11-20%, 21%-up.



















ANSWERS:
program		strlen		strcmp		strcpy
		% cycs	cy/call	% cycs	cy/call	% cycs	cy/call
diff		.03%	-	0	-	0	-
grep		none
yacc		.04%	-	.59%	-	0	-
nroff		0	-	0	-	<0.1%	-
csh		 1.71%	20	 1.27%	9	 1.84%	21
  % of total
  func calls		3.76%		6.11%		3.75%

Dhrystone	<.01%	19	16.94%	103	22.36%	136

Bottom-line: 
	1) Dhrystone isn't remotely representative of some
	common string-pushing programs in UNIX.
	2) For most of these, the total is <1% of instruction cycles,
	hence <0.6% of fully-degraded cycles. Maybe you can save 20%
	of this, or about .1%.
	3) For csh: what's going on is that these routines are called
	very frequently, but for short strings: 3-6 characters;
	strcmp's obviously fail quickly [2nd or 3rd character].
	I think the implication is that maybe you can get rid of 20% of
	the cycles, which would be a 1% instruction cycle saving,
	or about a <0.6% fully-degraded cycle saving for csh.
	4) Given all of this, maybe what you get can be grossly estimated
	as about .3%, maybe.  [Again, this was something whipped up in
	half an hour, so hardly definitive].
	5) Note that Dhrystone spends a great deal of its time copying and
	comparing long strings.  Hence, it's well worth a little extra
	setup time for Dhrystone to lessen the cost per loop.
	[In fact, we only got around to installing an assembler strcpy
	when we noticed how much time was spent there in Dhrystone.]

Thus, I'd say it's still an open issue [for those of you who happen
to be designing computer architectures at this instant!]
	Dhrystone: says it's important, but is totally unrepresentative.
	csh: says it might get you .6%
	others: surprise you by using str* so little
	general evidence: says it might be useful, but if you follow our
		1% rule [and HP used something similar, as I recall],
		there is as yet insufficient evidence to include it.
If I were right now designing a system, I'd be tempted to do a lot
of data-gathering.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

shebanow@ji.Berkeley.EDU.UUCP (03/27/87)

Every time I trace a program which uses stdio to any great degree, I find
printf to be the worst offender (with regard to time consumed). In fact,
I have seen programs which spend 30% of their time in printf (actually
_doprnt). In the code I have optimized, I usually try to eliminate the
use of printf (which can make the code rather opaque). This does result
in 10-20% improvements in speed. The most obvious use of new string
processing instructions would be to improve the performance of printf.
Has anyone done this?

Mike Shebanow

cmcmanis@sun.UUCP (03/27/87)

In article <15304@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
> 
> Ok, so you don't believe the above?  How about "It improved dhrystone
> a hell of a lot."
> 
>     bcase

This I believe, I was wondering when someone would point that out. I 
think it is time for some *real* optimizing compilers that can detect
when they are compiling a benchmark and optimize for the appropriate 
instruction! :-) :-) :-)

-- 
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@sun.com
These opinions are my own and no one elses, but you knew that didn't you.

metzger@heathcliff.columbia.edu.UUCP (03/27/87)

In article <18036@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
>Every time I trace a program which uses stdio to any great degree, I find
>printf to be the worst offender (with regard to time consumed). In fact,
>I have seen programs which spend 30% of their time in printf (actually
>_doprnt). In the code I have optimized, I usually try to eliminate the
>use of printf (which can make the code rather opaque). This does result
>in 10-20% improvements in speed. The most obvious use of new string
>processing instructions would be to improve the performance of printf.
>Has anyone done this?

Here at Project Synthesis we found that out a while back. It is pretty
strange to think of it that way, but yes, printf and friends are
compute-bound, not I/O-bound. We have experimented with a number of
solutions, including a hardware printf coprocessor for the 68020.
Predictably, it improved performance quite a bit. We also have tried
using specially coded versions of printf in assembly language that we
passed through a (very time expensive) exhaustive search optimiser and
the results were also pretty good.

Perry Metzger

bcase@amdcad.UUCP (03/28/87)

In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
>	(a) significantly slow byte addressing to begin with (ala cray)?
>	(b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressability, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate,
>for cray unix, buffer copy ops in the kernel were vastly improved when
>re-written for words instead of bytes, even more so when vectorized
>(the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code.  drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...
>

Re:  (a) above.  NO.  Implementing a byte addressable (especially writable)
memory slows down all memory accesses for a slight improvement in byte
processing efficiency.  For my own part, I can say that from the beginning
of the Am29000 project, I was firmly against anything but a word-oriented
interface.  You have to realize that byte accessing requires an alignment
network somewhere:  it will add some nanoseconds to all memory accesses; you
can put the alignment network in its own pipeline stage, but even then, it
will *always* slow down every memory access, there is nothing you can do about
it!  (The same reasoning leads to our addressing-mode-less load and store
instructions:  if addressing modes are there, then the instructions always
pay the cost even when the compiler knows better!)  Thus, this is not strictly
a RISC issue, but a maximum throughput issue regardless of RISCness.  (Note
that the Am29000 extract-byte and insert-byte instructions are essentially
putting the alignment network in a pipeline stage, but the compiler can
decide to pay or not pay the penalty for a given memory access (depending
upon whether it is a byte access or not)).
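
(For the record, extract-byte does the work a compiler otherwise
synthesizes from a shift and a mask; roughly, and assuming the Am29000's
big-endian byte numbering:

	/* a byte load on a word-only interface: one word access
	   plus an extract step (the part extract-byte does) */
	unsigned int
	load_byte(mem, addr)
	unsigned long *mem, addr;
	{
		unsigned long w = mem[addr >> 2];	/* word access */
		int shift = (3 - (addr & 3)) * 8;	/* byte position */

		return (w >> shift) & 0xff;
	}

The extract cost is paid only on byte accesses, not on every access.)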

Re:  (b) above.  NO.  We do not do inlining of the strcmp(), strcpy(), or
strlen() routines (I wish we could, it would be even better!).  The Am29000
has one of the fastest calling conventions around.  The performance
improvement we saw with the special instruction is an order of magnitude
(ok, I'm taking a guess here, but it is probably pretty close) greater than
what would have been gained by in-lining in this case.

Re:  Branch folding.  I like all optimizations.  Give me more MIPS!  More.
MORE.  MORE!

    bcase
---------------

Back again by popular demand:

Sung to the tune of "Timewarp:"

	It's just a shift the left,
	And then a shift to the ri-i-ight!
	Get a load of those MIPS
	And the code's real ti-i-ght!
	This lock-step pipeline
	Really drives me insa-a-a-e-a-ane
	Let's do the I-fetch again!

bcase@amdcad.UUCP (03/28/87)

In article <5969@mimsy.UUCP> mark@mimsy.UUCP (Mark Weiser) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>>...How likely is it that a compiler [can] detect the case when it can use 
>> an instruction like this and generate code automatically to use it?
>
>This is of course a good point, and it seems that this is going to be
>one of those hard cases for compiler writers.  But it opens the issue
>of subroutine libraries as providing an environment which also needs
>support in the instruction set.  The Unix string routines are a good
>example, because they are even now often recoded in assembly for efficiency.
>Various pieces of Unix kernel frequently suffer the same fate
>(e.g. tty buffers).  But once one is into instructions which are good
>for certain operating systems or environments, rather than certain languages,
>one has narrowed one's market.  For instance, the Am29000 special string
>compare is useless if strings are encoded as a length followed by the chars.  

Yes, you are right:  when the length is known, things are *much* easier.
But, the instruction is still a useful primitive:  it can be used to find
out, in one cycle, whether or not a 32-bit word contains a byte of
interest.  The compiler may not generate it, but it can be used in hand-
coded inner loops for this purpose.
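
Concretely: replicate the byte of interest into all four byte positions
of one operand, and compare-bytes does the rest.  A sketch, again in
terms of the cpb model posted earlier:

	/* does word w contain byte c anywhere? */
	int
	contains_byte(w, c)
	unsigned long w;
	unsigned char c;
	{
		unsigned long cccc = c;

		cccc |= cccc << 8;	/* replicate c into all */
		cccc |= cccc << 16;	/* four byte positions */
		return cpb(w, cccc) != 0;
	}

The replication is loop-invariant, so it costs nothing per word.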

    bcase

clif@intelca.UUCP (03/28/87)

> In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
> >conditions?


[ Discussion of the history of string instructions ]


> The fact of the matter is:  A C compiler is going to have a real hard time
> generating this instruction.  A Pascal compiler might have a slightly
> easier time since the concept of "string" is a little more integrated with
> the language.  In general, a compiler for a language without built-in types
> and functions for dealing with strings will have a hard time with our
> compare-bytes instruction.  We realized this when we added it to the
> architecture (and it was truly a last-minute addition).  However, it has
> such a significant impact on C programs in general that we were not deterred
> by the realization that this instruction will probably occur only a few
> times in the strcpy(), strcmp(), and strlen() library functions.  Out of
> all the megabytes of code in a UNIX system, this instruction will account
> for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough
> guess).  But if it improves the running time of all the string-oriented
> system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
> worth it.  And the implementation cost was so small.  Also, there are
> some instructions that must be present just to administer the system,
> like return from interrupt, move-from-special-register, etc.  These
> are not generated by a compiler either.  Just to reiterate a point:  RISC
> is not reducing the instruction set but is improving performance.
> 
> Ok, so you don't believe the above?  How about "It improved dhrystone
> a hell of a lot."
> 
>     bcase

I think that it makes sense to add instructions like this even if
they only get put into the standard libraries; presumably they will
prove to be a modest win for most applications written in C.  I
vaguely remember reading that calls to string routines represented
about .5% of all C code.  

As for Brian's second statement about Dhrystone being helped by 
string instructions:  He is certainly right about that.

-- 
Clif Purkiser, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif

These views are my own property.  However anyone who wants them can have 
them for a nominal fee.
	

rpw3@amdcad.UUCP (03/28/87)

Well, John Mashey's numbers were certainly counter-intuitive to ME, given
that those programs bang away at characters all the time, but we should
look at one more thing:

When were nroff/grep/yacc written compared to when str* was standardized
in the library? After all, the str* libraries didn't spring full-blown on
day one with those exact semantics and calling sequences (I'm guessing).
I'll also guess (could easily be wrong) that there is a lot more string
stuff going on than you see in the str* statistics, and that it's hand
coded "in line" (in C, of course).

Counter point: What do similar NEW programs look like? Take some code
written by someone (or a team) that started with the "modern" libraries.
Run the statistics on that, and see...

Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
beat on those really hard (at least in System-V), since way down at the
bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
entries (go look at "getpwent" & friends).

I know that I sometimes avoid str* because I'm afraid it might be slow,
particularly when I know that keeping a count (a.k.a. length) can avoid
a lot of searching for the end. The solution is usually a high-level change
in the application's basic algorithms so that so much copying & searching
is simply avoided. A simple example is when you are building various
strings up out of pieces (as in "speeding up" shell scripts by recoding
in C). Even USING the libraries, sometimes instead of:

	strcpy(buf, "piece");
	strcat(buf, "another piece");
	strcat(buf, "still another piece");

I find myself doing:

	char *p = &buf[0];
	int n;

	strcpy(p, "piece");		p += strlen(p);
	strcpy(p, "another piece");	p += strlen(p);
	sprintf(p, "format%d%s", args...); p+= strlen(p);
	strcpy(p, "still another piece"); p += strlen(p);
	etc...

Then when you're done, you can assert that strlen(buf) == (p - buf);


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

guy@gorodish.UUCP (03/28/87)

>>The most obvious use of new string processing instructions would be to
>>improve the performance of printf.  Has anyone done this?

The version of "_doprnt" that comes with 4BSD and comes in the VAX System III
source (although whether it's the one that's in the library or not,
I don't know), and may very well have come with UNIX 32V, is written
in assembler.  The 4.3BSD version, at least, does use the string
*and* EDITPC instructions, although it doesn't use the string
instructions for handling things like "%s".

VAX System V doesn't use this version (and no, it's probably not
because the assembly version doesn't handle all the formatting
operators in the S5 version - I know it handles most of them, and I
think it handles all of them), so I'm curious whether they dropped it
because it didn't buy much.

>We have experimented with a number of solutions, including a hardware
>printf coprocessor for the 68020.  Predictably, it improved performance
>quite a bit. We also have tried using specially coded versions of printf
>in assembly language that we passed through a (very time expensive)
>exhaustive search optimiser and the results were also pretty good.

How much did the coprocessor cost, and how much better was it than the
assembly-language version?

mash@mips.UUCP (03/29/87)

In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>Well, John Mashey's numbers were certainly counter-intuitive to ME, given
>that those programs bang away at characters all the time, but we should
>look at one more thing:
>
>When were nroff/grep/yacc written compared to when str* was standardized
>in the library? After all, the str* libraries didn't spring full-blown on
>day one with those exact sematics and calling sequences (I'm guessing).
>I'll also guess (could easily be wrong) that there is a lot more string
>stuff going on than you see in the str* statistics, and that it's hand
>coded "in line" (in C, of course).
Yes, those were old programs.  I'd be happy to hear of suggestions for
specific programs to check.  It's absolutely sure that there was plenty
of string stuff going on, coded in C. [Of course, some of the string
stuff is algorithmic work that hardware isn't likely to support for
a while, i.e., regular-expression matching.]
>
>Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
>beat on those really hard (at least in System-V), since way down at the
>bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
>entries (go look at "getpwent" & friends).
I generally looked for str* routines; the others were generally in the noise.
Following are the stats for SYS V : ls -l /usr
Profile listing generated Sat Mar 28 14:19:29 1987 with:
----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *
----------------------------------------------------------------------------
1139801 cycles

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

    511480   44.87  44.87       6558     22 getname (ls.c)
    310413   27.23  72.11         16     40 fgetc (../fgetc.c)
     99006    8.69  80.79         43     28 _flsbuf (../flsbuf.c)
     79820    7.00  87.80        340     17 _doprnt (../doprnt.c)
     18830    1.65  89.45         54     23 select (ls.c)
     15073    1.32  90.77        321     19 gmtime (../ctime.c)
....
      1655    0.15  98.79         10      5 strcmp (../strcmp.s)
....
      1338    0.12  99.18         18      5 strlen (../strlen.s)
....
        63    0.01  99.96          8     10 strchr (../strchr.c)

[Surprise! str* stuff isn't on the map in this one.  Just goes to show
that even experienced folks' [and Rob is experienced] intuition needs
checking.  Well, this surprised me also.]

This whole string-processing discussion has been quite worthwhile,
if only to remind us that we often discuss pretty esoteric things
in this newsgroup, while at the same time, we sometimes don't have
the foggiest ideas on even basic program behavior!!
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rbutterworth@watmath.UUCP (03/29/87)

In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> I find myself doing:
> 	char *p = &buf[0];
> 	int n;
> 	strcpy(p, "piece");		p += strlen(p);
> 	strcpy(p, "another piece");	p += strlen(p);
> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
> 	strcpy(p, "still another piece"); p += strlen(p);
> 	etc...

Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
had been designed to return a pointer to the END of the string,
(or a count of the number of characters in the string)?

They already have this information for free, but instead they return
their first argument, something that the caller already knows.

earl@mips.UUCP (03/29/87)

Actually the MIPS architecture does contain string processing
instructions.  They go by the names ADDU and XOR.  For example, the
inner loop of my strlen routine, which searches for a word containing a
zero byte, is

l1:	lw	t2,1(a1)
	addu	a1,4
	addu	t3,t2,t0
	xor	t3,t2
	and	t3,t1
	beq	t3,t1,l1

Quiz: what are the magic constants that need to be loaded into t0 and
t1 to make this work?

This is 7 cycles per word, or 1.75 cycles per byte.

-Earl

gemini@homxb.UUCP (03/30/87)

Options to "ls" have a great affect upon "str*" usages.  The following
results were obtained on an Amdahl 580 running uts 5.2.5:

./ls /bin /usr/bin [listed 64 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  22.1   10.72   10.72   80190      0.1336  strcmp
   3.3    1.60   40.25   21577      0.0742  memcpy
   0.7    0.33   47.65   10688      0.0312  strlen
   0.3    0.13   47.97                      strchr
   0.1    0.05   48.40                      memchr
   0.1    0.03   48.53                      strcpy

./ls -CF /bin /usr/bin [listed 64 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
   7.9   10.67  100.55   80191      0.1330  strcmp
   1.2    1.58  128.38   32234      0.0491  memcpy
   0.3    0.42  133.08   21440      0.0194  strlen
   0.1    0.12  134.05       2     58.      strchr
   0.1    0.08  134.23                      memchr
   0.0    0.02  134.50                      strcpy

./ls -l /bin /usr/bin [listed 8 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
   0.4    1.42  338.83   10023      0.1413  strcmp
   0.3    1.18  341.32   18761      0.0631  memcpy
   0.0    0.15  345.45    1336      0.112   strlen
   0.0    0.10  345.92                      memchr
   0.0    0.08  346.18                      strcpy
   0.0    0.03  346.35       1     33.      strchr

Note: on the Amdahl, strcmp and friends are in assembler.

Rick Richardson, PC Research, Inc: (201) 922-1134  ..!ihnp4!castor!pcrat!rick
	         when at AT&T-CPL: (201) 834-1378  ..!ihnp4!castor!pollux!rer

karl@haddock.UUCP (03/31/87)

In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>[Sometimes instead of using strcat] I find myself doing:
>	char *p = &buf[0];
>	strcpy(p, "piece");		p += strlen(p);
>	strcpy(p, "another piece");	p += strlen(p);
>	sprintf(p, "format%d%s", args...); p+= strlen(p);
>	strcpy(p, "still another piece"); p += strlen(p);
>	etc...

Me too.  I wish strcpy et al would return a pointer to the *end* of the copied
string (fgets too, as someone else brought up).  In fact, one could argue that
the low-level routine should not even bother to write the final '\0'; then
your example would become
	*_strcpy(_sprintf(_strcpy(_strcpy(p, "piece"), "another piece"),
		"format%d%s", args...), "still another piece") = '\0';

(Which can still be written using temporaries if the code gets too opaque!)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

tim@amdcad.UUCP (03/31/87)

In article <985@mips.UUCP>, earl@mips.UUCP (Earl Killian) writes:
> Quiz: what are the magic constants that need to be loaded into t0 and
> t1 to make this work?

Let's see... How about

	t0 = 0x7efefeff
	t1 = 0x81010100

Is that right?  If so, it appears to work only for 7 bit characters (but
we admit that such a restriction is reasonable in real life)...

(This is a fun problem but probably a bit too hard for use in interviews.)
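
The claim is easy to check in C, by the way.  A quick sketch over a few
sample words (an exhaustive run over all 2^32 values works too):

	#include <stdio.h>

	int
	main()
	{
		unsigned long t0 = 0x7efefeff, t1 = 0x81010100;
		unsigned long w, t3, tests[4];
		int i;

		tests[0] = 0x41424344;	/* no zero byte */
		tests[1] = 0x41420044;	/* zero in byte 1 */
		tests[2] = 0x00414243;	/* zero in byte 3 */
		tests[3] = 0x41424300;	/* zero in byte 0 */
		for (i = 0; i < 4; i++) {
			w = tests[i];
			t3 = ((w + t0) ^ w) & t1;	/* the loop body */
			printf("%08lx -> %s\n", w,
			    t3 == t1 ? "no zero byte" : "has a zero byte");
		}
		return 0;
	}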

Mike Johnson
Brian Case
Tim Olson

henry@utzoo.UUCP (Henry Spencer) (03/31/87)

> ... I think it is time for some *real* optimizing compilers that can detect
> when they are compiling a benchmark and optimize for the appropriate 
> instruction! :-) :-) :-)

Don't laugh!  It's not unknown for an optimizing-compiler group to put extra
attention into making common benchmarks run fast.  I'm told that there is at
least one compiler that has a specialized loop-unrolling optimization whose
major purpose is to quietly turn the "looping" version of Linpack into the
"unrolled" version.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

rpw3@amdcad.UUCP (03/31/87)

In article <238@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Following are the stats for SYS V : ls -l /usr
>    cycles %cycles  cum %     cycles  bytes procedure (file)
>                               /call  /line
>    511480   44.87  44.87       6558     22 getname (ls.c)
>    310413   27.23  72.11         16     40 fgetc (../fgetc.c)
>     99006    8.69  80.79         43     28 _flsbuf (../flsbuf.c)
>     79820    7.00  87.80        340     17 _doprnt (../doprnt.c)
>	...	...	...	...	 ...	...

Hey, John, wanna know something really interesting? Go look at 4.2 BSD
"ls", at routine getname(), and you'll discover it's full of getpwent()'s
(plus {set,end}pwent(), etc.) and strncpy()'s (hidden in the "SCPYN" macro).
And of course getpwent() uses the str* routines, though it also has its
own pwskip() instead of index().  (Have you got a 4.{2,3} BSD "ls" that you
could run your profiler on?)

Then look at the Sys V "ls", and lo & behold, all of those neat library
routines have been unfolded into the getname() routine, and hand-coded
into a most glorious mess, which calls only (guess)...  fgetc()!!!

(No wonder my jaw dropped when I saw your numbers!)

Now other parts of Sys V use {set,get,end}pwent(), so I guess somebody
decided that "ls" was running too slow (or had an "intuition"?) and ripped
it apart. Given the looks of the code, I'm surprised that even fgetc()
survived.

This reminds me of Gerry Weinberg's "Railroad Paradox":

	"We don't need fast strlib routines, because no one ever uses them."

Yeah... 'cause they're not fast, everyone codes around them!

We (me, too, I just realized I got suckered in also) must be *VERY* careful
to look inside any major piece of code we are going to try and use to prove
something or other, in order to avoid simply confirming the {good,bad,weird}
practices of the past.

I am now completely unsure whether we can get ANY reasonable statistics on
what the proper usage (percent, etc.) of "str*" should/could be in NEW code,
if we base it on code that has been hacked on as long as the various Unices.

Anybody want to cough up a "Stryngstone" benchmark for us to throw rocks at?


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

ricker@bunker.UUCP (03/31/87)

>>My question is this:  How likely is it that a compiler itself will be able to
>>detect the case when it can use an instruction like this and generate code
>>automatically to use it....

The Wang Labs VS processor also has some very useful assembler (not machine)
instructions for string processing. The manner in which this is handled is
by adding an attribute to functions called "builtin". This enables the C
(or other language) compiler to generate in-line code for functions that are
in the instruction set.

Example:

builtin char	*memcpy();

main()
{
	struct	stbuf {
		char	buf[8];
		int		i;
	} structure1;

	struct stbuf	structure2;

	/** initialize structure1 with code here **/

	/** This function generates a single VS assembler instruction **/
	memcpy(&structure2,&structure1,sizeof(structure1));

}

The builtins are clearly listed in the reference manual and include all
the standard System 5 string functions. If one does not wish to use builtins,
then the function is declared explicitly without the builtin attribute.
I am not sure, but I think I recall that if the function is not declared
explicitly and the function is a builtin, then it is treated as a builtin.
Portability is not a problem with this scheme.

Your basic ricker.

earl@mips.UUCP (04/01/87)

In article <15331@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> > Quiz: what are the magic constants that need to be loaded into t0 and
> > t1 to make this work?
> Let's see... How about
> 	t0 = 0x7efefeff
> 	t1 = 0x81010100
> Is that right?  If so, it appears to work only for 7 bit characters (but
> we admit that such a restriction is reasonable in real life)...

Yes, you've got it.  Congratulations. (Gee, do I owe you lunch or
something now?)  Actually, it works for 8-bit characters in bytes 0,
1, and 2 and a 7-bit one in byte 3 (I'm using little-endian byte
numbers here).  Thus it may accept '\200' as a zero in byte 3.  But
this is easily handled in the follow-on code that has to detect which
byte is nonzero by actually testing byte 3 and branching back to the
loop if it is nonzero.  Actually, here is that code.  Note how after
the word loop the carries also conveniently let you detect the byte.
This part only works for little-endian byte ordering.  (Note: my
personal convention is to indent instructions in branch delay slots
once space):

	## search for word with zero byte
l1:	lw	t2,1(a1)	# read next word
	addu	a1,4
	addu	t3,t2,t0	# cause carry out of any nonzero byte
	xor	t3,t2		# xor to find carry bits
	and	t3,t1		# look at carries out of bytes
	beq	t3,t1,l1	# all bytes carried => none zero
	## found word with zero byte, now find the byte
	 sll	t4,t3,23	# carry out of byte 0?
l3:	beq	t4,0,x0		# if not, go compute length
	 sll	t4,t3,15	# carry out of byte 1?
	bge	t4,0,x1		# if not, go compute length
	 sll	t4,t3,7		# carry out of byte 2?
	bge	t4,0,x2		# if not, go compute length
	 srl	t4,t2,24	# byte 3 zero?
	bne	t4,0,l1		# no, re-enter loop

john@frog.UUCP (04/01/87)

> Well, John Mashey's numbers were certainly counter-intuitive to ME...

Really sophisticated programs spend most of their time doing really
sophisticated things; simple grunt routines like str*() aren't universally
applicable.

> When were nroff/grep/yacc written compared to when str* was standardized
> in the library?  After all, the str* libraries didn't spring full-blown on
> day one...

I don't know (offhand) about these, but I recall once being curious about
why lex was slow.  It turned out to use subroutines for isalpha(), isspace(),
isdigit(), etc, instead of the (presumably newer) ctype.h macros.  HOWEVER,
even after making that change to a version of lex, it didn't help all that
much, because lex has so much else to do.

> Even USING the libraries, sometimes instead of:
> 	strcpy(buf, "piece");
> 	strcat(buf, "another piece");
> 	strcat(buf, "still another piece");
> I find myself doing:
> 	char *p = &buf[0];
> 	int n;
> 	strcpy(p, "piece");		p += strlen(p);
> 	strcpy(p, "another piece");	p += strlen(p);
> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
> 	strcpy(p, "still another piece"); p += strlen(p);
> 	etc...
> 
WHOOPS!  That's about as much work (first you copy the string, then
you step over it) as the use of strcat().  If you don't mind being
non-portable to oddball architectures, you can write a routine like
"strcatl()", which takes a destination buffer as a first argument, and
a NULL-terminated list of strings to concatenate into that buffer (or
you can be slightly portable and do it with <varargs.h>).   The easiest
way might be, however, to use:

	char *strcpya(p1, p2)	/* copy and return end address */
	char *p1, *p2;
	{
		while (*p1++ = *p2++) /**/;
		return p1-1;
	}

and then
	strcpya(strcpya(strcpya(buf, "first piece"), " second piece"),
		" and the last");
or if you prefer
	char *p,buf[127 /* a prime number*/ ];
	p = strcpya(buf, "first piece");
	p = strcpya(p, " second piece");
	p = strcpya(p, " third piece");

Either way, you only touch each byte once.

Sometimes, the question you need to ask yourself about library routines is
not "Could they be better coded with this screwball assembly instruction?"
but rather "Does it *really* do what I want?".

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"Happiness is the planet Earth
	in your rear-view mirror."
		- Sam Hurt, in "Eyebeam, Therefore I Am"

meissner@dg_rtp.UUCP (04/02/87)

In article <6448@watmath.UUCP> rbutterworth@watmath.UUCP writes:
> Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
> had been designed to return a pointer to the END of the string,
> (or a count of the number of characters in the string)?

The System V.3 (and ANSI) *{scanf,printf} routines have a format specifier
%n which stores the number of bytes {read, written} so far into the int
pointed to by the current argument.  Also in System V.xx and ANSI, sprintf
returns the total number of bytes written (not the pointer to the string
like 4.2 BSD does).
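
With those semantics, the concatenation idiom from earlier in the thread
needs no strlen() calls at all; a sketch (the arguments are illustrative):

	#include <stdio.h>

	void
	build(buf)
	char *buf;
	{
		char *p = buf;

		p += sprintf(p, "piece");	/* count, not pointer */
		p += sprintf(p, "%d%s", 42, "x");
		p += sprintf(p, "still another piece");
	}
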
-- 
	Michael Meissner, Data General	Uucp: ...mcnc!rti-sel!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?

stubbs@ncr-sd.UUCP (04/04/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>
>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it?

In this case, these instructions would be used in the hand-coded
strcpy and strcmp C library routines, so compilers need not be concerned.
My guess is they are still a win, because these libraries are used a bunch.


Jan Stubbs    ....sdcsvax!ncr-sd!stubbs
619 485-3052
NCR Corporation
Advanced Development
16550 W. Bernardo Drive MS4010
San Diego, CA. 92127

pec@necntc.UUCP (04/06/87)

How likely is it that a compiler can make use of string copy
instructions?  NEC's re-targeting of the portable C compiler to the 
V60 architecture frequently takes advantage of one of the V60's string 
copy instructions.  It uses a string copy for  structure assignments!  

Of course, these instructions are also used in the library routines.   
Another use that comes to mind is when the loader zeros out a BSS area...
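
The structure-assignment case is the nice one, since it is ordinary C
with nothing special for the programmer to write; a sketch:

	struct rec { char name[32]; int count; };

	void
	copyrec(dst, src)
	struct rec *dst, *src;
	{
		*dst = *src;	/* plain structure assignment; per the
				   above, the compiler can emit a single
				   string-copy instruction for this */
	}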

paulsc@orca.UUCP (04/06/87)

In article <7843@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... I think it is time for some *real* optimizing compilers that can detect
>> when they are compiling a benchmark and optimize for the appropriate 
>> instruction! :-) :-) :-)
>Don't laugh!  It's not unknown for an optimizing-compiler group to put extra
>attention into making common benchmarks run fast.  I'm told that there is at
>least one compiler that has a specialized loop-unrolling optimization whose
>major purpose is to quietly turn the "looping" version of Linpack into the
>"unrolled" version.

It isn't always a conscious effort to trick customers into
buying the compiler.  The compiler group might only have a
handful of "test programs" for their compiler.  If the compiler
does any peephole optimizations, the compiler group probably
compiles whatever test programs they have to see how their
compiler might generate better code.  If any of their test
programs are benchmarks, they will be spending time making the
benchmarks run fast.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
tektronix!orca!paulsc

eugene@pioneer.UUCP (04/07/87)

In article <2344@orca.TEK.COM> paulsc@orca.UUCP (Paul Scherf) writes:
>
>It isn't always a conscious effort to trick customers into
>buying the compiler.  The compiler group might only have a
>handful of "test programs" for their compiler.
>
>Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA

This is part of the point of my research: to get everyone writing
compilers to have a large consistent set of programs to check
optimizations, just like we have programs (validation suites) to check
conformance to language standards.  I hope I can get the various companies
interested in a set of open performance measurement tools.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

am@cl-jenny.UUCP (04/08/87)

In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>One unique feature of the Am29000 architecture is a special instruction....
> ... [ to speed up detecting null bytes in words].
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
   #define has_nullbyte_(x) (((x) - 0x01010101) & ~(x) & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
   has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).

On Acorn's ARM risc machine, using this feature as Brian Case suggests speeded
up dhrystone by 20-25%, which may comment on dhrystone or risc. This gives
some 5-6K dhrystones on the slow-clocked pre-production 3 micron chips.
(Wonder what state of the art fabrication can give....)

karl@haddock.UUCP (04/11/87)

In article <486@danews.ATT.COM> lvc@danews.ATT.COM (Larry Cipriani) writes:
>In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
>> I find myself doing:
>> 	char *p = &buf[0];
>> 	int n;
>> 	strcpy(p, "piece");		p += strlen(p);
>> 	strcpy(p, "another piece");	p += strlen(p);
>> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
>> 	strcpy(p, "still another piece"); p += strlen(p);
>> 	etc...
>
>It seems to me that the code would be far more readable and faster
>if it were written as:   p = cat(p1, p2, ..., (char*)0);

Or, to better mimic the original semantics,  p = cat(buf, p1, ...) with no
malloc involved.  (Yeah, the malloc version is useful too; I've got one of my
own.)

However, this is only useful if each operation is a strcpy or strcat; it
doesn't work if there are intermixed one-char stores (*p++ = c) or more
complex stuff (the embedded sprintf in the original example).

Of course, one could always write the whole thing with sprintf, using "%s%s%s"
format instead of strcat(strcat(strcpy(...))), but sometimes it's better to
use the low-level stuff.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

stevew@karau.UUCP (04/11/87)

In article <486@danews.ATT.COM>, lvc@danews.ATT.COM (Larry Cipriani) writes:
> In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> > I find myself doing:
> > 	char *p = &buf[0];
> > 	int n;
> > 	strcpy(p, "piece");		p += strlen(p);
> > 	strcpy(p, "another piece");	p += strlen(p);
> > 	etc...
> It seems to me that the code would be far more readable and faster
> if it were written as:
> 	p = cat(p1, p2, ..., (char*)0);
> where all the p's are pointers to characters.  Cat will malloc
> enough space to store a concatenation of all the pn's and a '\0'.

Another possible solution would be:
	cat(p, n, p1, p2, ..., (char*)0);
Where p == destination string,  n == amount of space available,
p1 == the first pointer,  etc...

You might have two versions, one with n, one without.  The debate is now
what it should return:
	1. pointer to the initial string.
	2. the number of characters cat()'ed.
	3. a pointer to the terminating null.
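
For concreteness, here is one way the unbounded variant might look (a
sketch using ANSI <stdarg.h>; it picks answer 3, a pointer to the
terminating null):

	#include <stdarg.h>
	#include <stddef.h>

	char *
	cat(char *buf, ...)
	{
		va_list ap;
		char *s;

		va_start(ap, buf);
		while ((s = va_arg(ap, char *)) != NULL)
			while (*buf = *s++)	/* copy, overwriting */
				buf++;		/* the previous null */
		va_end(ap);
		*buf = '\0';	/* in case the list was empty */
		return buf;
	}

Called as p = cat(buf, p1, p2, (char *)0); the version with n just
threads a count through the inner loop.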

jack@mcvax.UUCP (04/15/87)

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
>   #define has_nullbyte_(x) (((x) - 0x01010101) & ~(x) & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>   has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

I was so impressed by this new trick (well, to *me* it is new:-)
that I immediately decided to try it.  On my Whitechapel MG-1,
a 32016-based machine, the results were impressive.

I coded strcpy() using this method, and the results were
great. Break-even with normal strcpy() at 4-char strings, performance
slightly worse with 5/6/7-char strings, and getting better and
better from there on. For strings with length 4N (N>=4) performance
was twice that of the old strcpy(). This is the routine:


#define hasnull(x) (((x)-0x01010101) & ~(x) & 0x80808080)

strcpy(at,f)
    long *at;
    register long *f;
{
    register long d;
    register long *t = at;
    register char *fc, *tc;

    do {
	d = *f++;
	if( !hasnull(d) ) {	/* no null byte in this word, */
	    *t++ = d;		/* so copy all 4 bytes at once */
	    continue;
	}
	tc = (char *)t;		/* the null is in this word; */
	fc = (char *)(f-1);	/* finish a byte at a time */
	while( *tc++ = *fc++);
	return(at);
    } while(1);
}

Coding in assembler caused a 30% decrease in time for small (10-char)
strings (fewer registers to save, t/tc and f/fc in the same reg, etc).
Something I haven't explained yet is that unaligned strings give the
*same* performance. Maybe the extra fetches are noise wrt the
number of instruction fetches?

Note that the 32016 is a 32 bit machine with a 16 bit bus, 
so that is probably why I found twice the speed, instead of four
times.

Anyway, the next thing I thought of is "Wow! This is *great* for
strcmp() on big-endians. Comparing 4 bytes in one go through
the loop!". But, of course, I don't have a big-endian handy.

Anyone care to try this?
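
Something like this, I imagine (an untested sketch: both strings
word-aligned, 32-bit longs, and the same hasnull() as above; the
word-equality scan is endian-independent, but a big-endian machine
could even take the ordering straight from the word compare):

	int
	strcmp_w(a, b)
	register char *a, *b;
	{
		register long *wa = (long *)a, *wb = (long *)b;

		while (*wa == *wb && !hasnull(*wa)) {
			wa++;			/* words equal and no */
			wb++;			/* null: 4 bytes/trip */
		}
		a = (char *)wa;			/* difference or null is */
		b = (char *)wb;			/* here; finish bytewise */
		while (*a && *a == *b) {
			a++;
			b++;
		}
		return *(unsigned char *)a - *(unsigned char *)b;
	}
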
-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.

greg@utcsri.UUCP (04/17/87)

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>   #define has_nullbyte_(x) (((x) - 0x01010101) & ~(x) & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>   has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

If one of the bytes contains 0x80, then has_nullbyte() will
say 'yes'. This can be circumvented by a more thorough test
after this one to see if there really is a null there.

Jack Jansen's subroutine does this apparently by accident;
there is a 'while( *tc++ = *fc++ );' after the test finds a 'null'
so the only effect of having a 0x80 byte will be to revert to
the normal strcpy for the rest of the string.

Someone posted a similar but usually better test on comp.arch ( I think )
a little while ago:

#define has_nullbyte(e)  (((((e)+0x7efefeff)^(e))&0x81010100) != 0x81010100)

This one is only 'fooled' by an 0x80 in the most significant byte,
which makes the following test much simpler ( a sign test ).
In either case ( especially in this one with two identical
constants ) it helps a lot if the constants are in registers while
the loop is running. When using C, you can do this explicitly:

register k1=0x7efefeff, k2=0x81010100;

-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

lcc.rich-wiz@CS.UCLA.EDU (Richard Mathews) (04/20/87)

> In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
> >You might be interested to know that such detection of null bytes in words
> >can be done in 3 or 4 instructions on almost any hardware (nay even in C).
> >(Code that follows relies on x being a 32 bit unsigned (or 2's complement
> >int with overflow ignored)...)
> >   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
> >Then if e is an expression without side effects (e.g. variable)
> >   has_nullbyte_(e)
> >is nonzero iff the value of e has a null byte.

> I was so impressed by this new trick (well, to *me* it is new :-)
> that I immediately decided to try it on my Whitechapel MG-1,
> a 32016-based machine.

Be careful if you use this.  It does not work correctly for strings near
the end of memory.  Consider an (unaligned) string which is 21 bytes long
and is right at the end of data space (or any data segment on machines
which have such nonsense).  The posted code will fault on the last read.
On machines where data space is always a multiple of 4 bytes long, this
method will work if you first copy enough bytes to get the source pointer
to a longword boundary and then start copying 4 bytes at a time.
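
Concretely, the prologue might look like this ( a sketch; the name
astrcpy is illustrative, and the word store through 't' assumes either
an aligned destination or a machine that tolerates unaligned stores ):

    #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

    char *
    astrcpy(to, from)
        char *to, *from;
    {
        register char *t = to, *f = from;
        register long *wf;
        register long w;

        /* copy bytes until the source is word-aligned, so the
         * word reads below can never run off the end of data space
         */
        while ((long)f & 3) {
            if ((*t++ = *f++) == 0)
                return to;
        }
        wf = (long *)f;
        while (w = *wf++, !has_nullbyte_(w)) {
            *(long *)t = w;
            t += 4;
        }
        /* the last word holds the terminator */
        f = (char *)(wf - 1);
        while (*t++ = *f++)
            ;
        return to;
    }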

Someone else claimed that the has_nullbyte_ macro above does not work
correctly if its argument contains a byte equal to 0x80.  I do not see
this.  Consider x == 0x17801717 (writing ONES for 0x01010101):
	x - ONES	 == 0x167f1616
	~x		 == 0xe87fe8e8
	(x - ONES) & ~x	 == 0x007f0000
	has_nullbyte_(x) == 0x00000000
The claim that has_nullbyte_(x) will be TRUE appears to be incorrect.
(I was able to prove to myself that the macro always gives the correct
answer.  In outline: a borrow can only start at a zero byte, so if x
has no zero byte, each byte of x - ONES is simply the byte minus one;
its sign bit is set only for bytes of 0x81 and up, and for those ~x has
the sign bit clear, so the result is 0.  Conversely, the lowest zero
byte of x gets no borrow-in, becomes 0xff, and passes both masks.)

Richard M. Mathews
Locus Computing Corporation		       lcc.richard@LOCUS.UCLA.EDU
					       lcc.richard@UCLA-CS
				 {ihnp4,trwrb}!lcc!richard
       {randvax,sdcrdcf,ucbvax,trwspp}!ucla-cs!lcc!richard

bjorn@alberta.UUCP (04/22/87)

In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>>   has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

You're mistaken; the "has_nullbyte_(x)" defined above works
as advertised for all 32-bit x.  That is to say, it returns
a nonzero result if and only if x contains a null byte.

>Someone posted a similar but usually better test on comp.arch ( I think )
>a little while ago:
>
>#define has_nullbyte(e)  (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>
>This one is only 'fooled' by an 0x80 in the most significant byte,
>which makes the following test much simpler ( a sign test ).

You are right, this one does not always tell the truth.  Besides,
it's a lot more effort.

However for tricks like these to work reliably (especially on
systems with memory protection) you had best get help from
your local linker and 'malloc'.  Otherwise one day a program
is going to read off the edge of an accessible hunk of memory
and flush your program straight down the drain.  Making sure
that no data item ends closer than n-1 bytes (if you're reading
n bytes at a time) from a boundary (between accessible and
inaccessible memory) fixes this.
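
A trivial wrapper over 'malloc' would do, for instance ( a sketch; the
name padmalloc is illustrative ):

    /* Pad each block so that 4-byte reads may run up to 3 bytes
     * past the caller's data without leaving the block.
     */
    char *
    padmalloc(n)
        unsigned n;
    {
        extern char *malloc();

        return malloc(n + 3);
    }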


			Bjorn R. Bjornsson
			{mnetor,ihnp4}!alberta!bjorn

greg@utcsri.UUCP (04/23/87)

In article <299@pembina.UUCP> bjorn@alberta.UUCP (Bjorn R. Bjornsson) writes:
>In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>>Then if e is an expression without side effects (e.g. variable)
>>>   has_nullbyte_(e)
>>>is nonzero iff the value of e has a null byte.
>>
>>If one of the bytes contains 0x80, then has_nullbyte() will	[BUNK]
>>say 'yes'. This can be circumvented by a more thorough test	[BUNK]
>>after this one to see if there really is a null there.	[BUNK]
>
>You're mistaken the "has_nullbyte_(x)" defined above works
>as advertised for all 32 bit x.  That is to say it returns
>a nonzero result if and only if x contains a null byte.

Right. I apologize for any inconvenience I may have caused anybody.
I also apologize for posting bunk in a sleepy haze and then vanishing
for a week.
>
>>Someone posted a similar but usually better test on comp.arch ( I think )
>>a little while ago:
>>
>>#define has_nullbyte(e)  (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>>
>>This one is only 'fooled' by an 0x80 in the most significant byte,
>>which makes the following test much simpler ( a sign test ).
>
>You are right this one does not always tell the truth.  Besides
>it's a lot more effort.

It seems to be about the same, in the loop.  Below is the subtract
version, and the add version, on an NS32k:

	movd	0(r0),r1	;fetch the word
	subd	r2,r1		;r1 = x - 0x01010101
	bicd	0(r0),r1	;r1 &= ~x
	andd	r3,r1		;r2 & r3 contain constants
	cmpqd	$0,r1		;need this on ns32k

	movd	0(r0),r1	;fetch the word
	addd	r2,r1		;r1 = x + 0x7efefeff
	xord	0(r0),r1	;r1 ^= x
	andd	r3,r1		;mask down to the carry bits
	cmpd	r3,r1		;all still set => no null byte

I can't remember why I thought it was more efficient. I guess it
is if you haven't got 'bic'. ( and if 'and' doesn't set the Z flag :-)
The original poster of this method was describing a RISC which
had a 'compare registers and branch if equal' op.
>
>However for tricks like these to work reliably (especially on
>systems with memory protection) you had best get help from
>your local linker and 'malloc'.  Otherwise one day a program
>is going to read off the edge of an accessible hunk of memory
>and flush your program straight down the drain.  Making sure
>that no data item ends closer than n-1 bytes (if you're reading
>n bytes at time) from a boundary (between accessible and
>inaccessible memory) fixes this.

Not generally a problem. The whole point of this is that the 32-bit
accesses are constrained to 32-bit boundaries to make them fast.
Furthermore, in most systems, your access-controlled segments or
whatever are rounded up to a multiple of the machine word.  (The MMU
does not usually deal with address lines a0,a1 on a
32-bit machine). Thus if it is legal to read the first byte in a word,
it is legal to read the whole word, and thus the last byte.
-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

am@cl.cam.ac.uk (Alan Mycroft) (04/27/87)

In article <4605@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>>   has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

Untrue.  It would be true if I had mistakenly used ^ instead of &~:
an 0x80 byte becomes 0x7f after the subtract, and 0x7f ^ 0x80 has the
sign bit set, whereas 0x7f & ~0x80 does not.