[comp.arch] String Processing Instruction

bcase@amdcad.UUCP (03/25/87)

There was a discussion a few months ago about processing strings more
efficiently than a byte at a time.  The Am29000 takes one of the possible
approaches to improving string processing performance....


One unique feature of the Am29000 architecture is a special instruction.
This instruction is intended to be used to speed-up string processing,
but my guess is that other uses will be discovered.  The instruction is
called "compare-bytes" and works like this:

Compare bytes specifies two source register operands and one destination
register operand.  The 4 pairs of corresponding bytes of the two 32-bit
source operands are compared for equality (i.e., the two most-significant
bytes are compared, the two next-most-significant bytes are compared, etc.).
If any of the four pairs are equal, then the destination register is set
to the value "TRUE" (which on the Am29000 is a one in the most-significant
bit with all other bits cleared to zero).  If none of the four pairs are
equal, then the destination register is set to "FALSE" (all bits cleared).
(Am29000 conditional branch instructions test only the most significant bit of
a register; condition codes are not used, so we get a free "test for negative.")

So, if one of the source operands is set to all zeros (four null characters)
(which can be specified in the instruction by choosing the second operand
as the zero-extended eight-bit constant zero) and the other operand is a
word of the character string being dealt with (say for copying or comparing),
the Am29000 can, in one cycle (not counting the branch), determine if the word
contains the end of string character (according to the C language definition
of string).  If the word does not contain the end of string character, then
the four bytes in the word can be manipulated (e.g. loaded or stored) as a
unit.  Word operations on the Am29000 are much more efficient than character
operations (this is true of most machines though).
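
To make the idea concrete, here is a rough C sketch; cmp_bytes() is a
hypothetical stand-in for the compare-bytes instruction (written out
below as a portable byte loop so the sketch is self-contained), the
string is assumed to be word-aligned, and of course the real library
routines are assembly code that also handles the misaligned cases:

	typedef unsigned long word;	/* 32 bits assumed */

	/* stand-in for compare-bytes: TRUE if any of the four byte
	   pairs of a and b are equal (one cycle on the Am29000) */
	static int cmp_bytes(word a, word b)
	{
	    int i;

	    for (i = 0; i < 4; i++)
	        if (((a >> (8 * i)) & 0xff) == ((b >> (8 * i)) & 0xff))
	            return 1;
	    return 0;
	}

	/* word-at-a-time strlen: skip whole words until one contains
	   the terminating null, then finish a byte at a time */
	int word_strlen(const char *s)
	{
	    const char *p;
	    const word *w = (const word *) s;

	    while (!cmp_bytes(*w, 0))
	        w++;
	    for (p = (const char *) w; *p != '\0'; p++)
	        ;
	    return p - s;
	}

A word-at-a-time strcpy() works the same way: test the source word with
compare-bytes, store all four bytes if no null is found, and drop back
to byte-at-a-time copying for the final word.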

There are, of course, special circumstances to deal with (such as misaligned
strings, and we have a funnel shifter to help in those cases), but by using
the compare-bytes instruction in the library routines strcpy() and strcmp()
(and strlen() too, but we haven't bothered since it seems to never be used in
the programs we have encountered), significant improvements in the run-time
of many C programs can be realized.  Another thing which really helps is to
have the compiler word-align literal strings (and I have implemented this),
but even with word-alignment, some substrings will begin on strange boundaries
and must be dealt with correctly.

My approach to using this instruction consisted of re-writing the library
routines in C with function calls wherever the compare-bytes instruction
should go.  I compiled this C code with my compiler, changed the assembly code
to eliminate the function calls in favor of the compare-bytes instruction, and
assembled it into the library (actually a module of code that gets included in
all final links, but that is just a detail of our simple environment).  Since
most C programs (especially utilities and other systems programs) do a lot of
string processing, this one instruction is really worth the small
implementation cost.  It often improves run times by 15% to 20% (just goes to
show that the impact of processing C language strings has been long ignored).
It implements just the right semantics and probably has other applications for
specialized pattern matching.  

I just thought some of you would be interested.

   bcase

lamaster@pioneer.UUCP (03/26/87)

In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>There was a discussion a few months ago about processing strings more
>efficiently than a byte at a time.  The Am29000 takes one of the possible
>approaches to improving string processing performance....
>

:

>
>I just thought some of you would be interested.
>
>   bcase

I appreciate seeing submissions like this.  Designing a computer instruction
set is a little like backpacking: There is an art to figuring out exactly
what you really need to take with you. 

My question is this:  How likely is it that a compiler itself will be able to
detect the case when it can use an instruction like this and generate code
automatically to use it?  One of the positive points to the RISC debate is
that it brought out the point that useful instructions which are hard for a
compiler to generate are not always a win.  The old STAR-100 had a number of
instructions, including string processing instructions, which were left out of
the Cyber 205 because it turned out that the compilers never generated them.
They were neat instructions though.  So, my question is, is this instruction
easily usable by a C, Fortran, Pascal, or ADA compiler and if so, under what
conditions?


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

baum@apple.UUCP (03/26/87)

>
>One unique feature of the Am29000 architecture is a special instruction.
>This instruction is intended to be used to speed-up string processing,
>but my guess is that other uses will be discovered.  The instruction is
>called "compare-bytes" and works like this:.....

The HP Spectrum has three instructions that are very similar.
 Unit XOR will Exclusive-OR two registers and let you skip if any byte or
   halfword is equal or not equal.

 Unit AddComplement[&Trap on Condition] will add the ones complement of
   one register to another. This is sort of subtraction without the forced
   carry. That way, all bytes, halfwords & nibbles essentially have the
   same carry-in (although a carry out of one stage could force a carry-in
   to another, it turns out this doesn't affect the major uses of the 
   instruction). Then, either a trap will occur or the next instruction will
   be skipped if the conditions are met. The conditions are any halfword/
   byte/nibble carry, any halfword/byte zero, and their complements.

Among other things, these instructions can be used for string searching a
word at a time, decimal/byte/halfword range checking/validation a byte at
a time, and decimal addition pre-biasing.
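
In C terms, the string-search case looks something like this (a sketch;
byte_in_word() is written as a plain loop here, where the Unit XOR
instruction does the whole test in one step):

	typedef unsigned long word;	/* 32 bits assumed */

	/* is any byte of w equal to the byte c?  Replicate c into all
	   four byte positions and XOR; equal bytes become zero bytes,
	   which is exactly the condition Unit XOR can skip on. */
	static int byte_in_word(word w, unsigned char c)
	{
	    word x = w ^ (c * 0x01010101UL);
	    int i;

	    for (i = 0; i < 4; i++)
	        if (((x >> (8 * i)) & 0xff) == 0)
	            return 1;
	    return 0;
	}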

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

mark@mimsy.UUCP (03/26/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>...How likely is it that a compiler [can] detect the case when it can use 
> an instruction like this and generate code automatically to use it?

This is of course a good point, and it seems that this is going to be
one of those hard cases for compiler writers.  But it opens the issue
of subroutine libraries as providing an environment which also needs
support in the instruction set.  The Unix string routines are a good
example, because they are even now often recoded in assembly for efficiency.
Various pieces of Unix kernel frequently suffer the same fate
(e.g. tty buffers).  But once one is into instructions which are good
for certain operating systems or environments, rather than certain languages,
one has narrowed one's market.  For instance, the Am29000 special string
compare is useless if strings are encoded as a length followed by the chars.  
-mark
Background: I happen to think Unix-style strings are a big market, and I love
things which make Unix faster (because there is a positive feedback cycle
which makes more Unix yet more available and desirable.).
-- 
Spoken: Mark Weiser 	ARPA:	mark@mimsy.umd.edu	Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com

freeman@spar.UUCP (03/26/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:

> My question is this:  How likely is it that a compiler itself will be able to
> detect the case when it can use an instruction like this and generate code 
> automatically to use it.  One of the positive points to the RISC debate is
> that it brought out the point that useful instructions which are hard for a
> compiler to generate are not always a win.  

Note that this particular instruction may well win even if no compiler EVER
generates it:  BCase's posting stated that hand-modifying the object code
for two string-processing routines in the standard C library to use the
instruction, and arranging that the compiler word-align strings where
possible, effected consequent frequent improvements in run time of 15-20
percent.  That sounds like a substantial win for C programming, and surely a
fair amount of the code that does get written for this processor will be in
C.  That may be reason enough for the instruction's existence.

Although the point from the RISC debate, that LaMaster summarizes above, is
carefully qualified, it is perhaps nonetheless too narrow a view:  Surely
there are many other cases in which a well-planned library of calls to
functions coded with particular care, will allow efficient use of highly
specialized instructions.  Proper analysis of whether or not to include a
given instruction in an instruction set should include the effect of such
cases.

						-- Jay Freeman

bcase@amdcad.UUCP (03/26/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>I appreciate seeing submissions like this.  Designing a computer instruction
>set is a little like backpacking: There is an art to figuring out exactly
>what you really need to take with you. 
Thanks for the positive feedback.  An interesting point of view!

>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.  One of the positive points to the RISC debate is
>that it brought out the point that useful instructions which are hard for a
>compiler to generate are not always a win.  The old STAR-100 had a number of
>instructions, including string processing instructions, which were left out of
>the Cyber 205 because it turned out that the compilers never generated them.
>They were neat instructions though.  So, my question is, is this instruction
>easily usable by a C, Fortran, Pascal, or ADA compiler and if so, under what
>conditions?

This is an excellent question, one that I intended to address in the original
posting.  However, I forgot to include it when the damn article got kinda
long.

The fact of the matter is:  A C compiler is going to have a real hard time
generating this instruction.  A Pascal Compiler might have a slightly
easier time since the conept of "string" is a little more integrated with
the langauge.  In general, a compiler for a language without built-in types
and functions for dealing with strings will have a hard time with our
compare-bytes instruction.  We realized this when we added it to the
architecture (and it was truly a last-minute addition).  However, it has
such a significant impact on C programs in general that we were not deterred
by the realization that this instruction will probably occur only a few
times in the strcpy(), strcmp(), and strlen() library functions.  Out of
all the megabytes of code in a UNIX system, this instruction will account
for maybe 40 bytes/5 Megabytes or .008% of all code (obviously a real rough
guess).  But if it improves the running time of all the string-oriented
system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
worth it.  And the implementation cost was so small.  Also, there are
some instructions that must be present just to administer the system,
like return from interrupt, move-from-special-register, etc.  These
are not generated by a compiler either.  Just to reiterate a point:  RISC
is not reducing the instruction set but is improving performance.

Ok, so you don't believe the above?  How about "It improved dhrystone
a hell of a lot."

    bcase

guy@gorodish.UUCP (03/26/87)

>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.

In most C implementations, "strcpy", "strcmp", etc. are library
routines; if the speedup from these instructions is sufficiently
large, it doesn't matter if the compiler can generate them because
the compiler can generate calls to those routines, and those routines
can be written in assembler language.  You could also imagine a
compiler that would recognize names like "_builtin_strcpy" and
generate the assembler code for those routines in line, and a
<string.h> that contains

	#define	strcpy	_builtin_strcpy

(see the footnote on page 86 of the October 1, 1986 ANSI C draft).
Another possibility might be to have something like the WE32000 C
compiler's "asm" directive, which permits you to define "functions"
that are expanded in line to assembler code.
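
As a hypothetical sketch (the guard macro and the second builtin name
below are invented for illustration, not taken from any actual
compiler), such a header might look like:

	/* hypothetical <string.h> fragment */
	#ifdef COMPILER_HAS_BUILTINS
	#define	strcpy(dst, src)	_builtin_strcpy((dst), (src))
	#define	strcmp(s1, s2)		_builtin_strcmp((s1), (s2))
	#else
	extern char	*strcpy();	/* ordinary library routines */
	extern int	strcmp();
	#endif

A compiler that recognizes the _builtin_ names expands them in line,
possibly using special instructions; any other compiler simply falls
back to the library versions.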

Similar techniques can probably be used in other languages.  If they
do string processing in routines, the routines can be written in
assembler language to use these instructions.  If they generate
stereotyped sequences for these string operations, these stereotyped
sequences can use these instructions.

However, I don't know how difficult it would be to recognize various
special cases and generate non-stereotyped sequences that use these
instructions, or how much of a speedup this would give (if these
special cases weren't common, speeding them up might not buy you
enough to worry about it).

jaw@ames.UUCP (03/27/87)

> [....] Since
> most C programs (especially utilities and other systems programs) do a lot of
> string processing, this one instruction is really worth the small
> implementation cost.  It often improves run times by 15% to 20% (just goes to
> show that the impact of processing C language strings has been long- ignored).

just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
certainly, 'vn' comes to mind as devoting much cpu time to these functions.

is the 15-20% claimed due at all to either

	(a) significantly slow byte addressing to begin with (ala cray)?
	(b) in-line compilation of the string(3) stuff into the application?

if (a), then improving memory byte access speed in the architecture is a
more general solution with more payoff overall than the compare gate hack.
what is the risc chip cost for byte vs. word addressability, anyway?

if (b), then maybe function call speed is the culprit rather than dearth of
the specialized instruction.

at any rate,
for cray unix, buffer copy ops in the kernel were vastly improved when
re-written for words instead of bytes, even more so when vectorized
(the only place in the kernel with vectorization, i think).

of course, table lookup using only 2^16 locations would be a joke
software solution for super-intensive null-char-in-16-bit-smallword
compare code.  drastic, but saves a test the amd chip appears worried about.
personally, i'm a fan of branch folding ...

ames!jaw

mash@mips.UUCP (03/27/87)

In article <909@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>
>> My question is this:  How likely is it that a compiler itself will be able to
>> detect the case when it can use an instruction like this and generate code 
>> automatically to use it.  One of the positive points to the RISC debate is
>> that it brought out the point that useful instructions which are hard for a
>> compiler to generate are not always a win.  
>
>Note that this particular instruction may well win even if no compiler EVER
>generates it:  BCase's posting stated that hand-modifying the object code
>for two string-processing routines in the standard C library to use the
>instruction, and arranging that the compiler word-align strings where
>possible, effected consequent frequent improvements in run time of 15-20
>percent.  That sounds like a substantial win for C programming, and surely a
>fair amount of the code that does get written for this processor will be in
>C.  That may be reason enough for the instruction's existence.
>
>Although the point from the RISC debate, that LaMaster summarizes above, is
>carefully qualified, it is perhaps nonetheless too narrow a view:  Surely
>there are many other cases in which a well-planned library of calls to
>functions coded with particular care, will allow efficient use of highly
>specialized instructions.  Proper analysis of whether or not to include a
>given instruction in an instruction set should include the effect of such
>cases.

This is a fairly interesting and useful discussion, especially if we
can generate a few answers. Amusingly enough, neither MIPS nor IBM
put this instruction in, although Marty Hopkins (IBM) once expressed a desire to
have it (even though it disobeyed the "must be useful from compilers
rule"), and we've thought about it off and on.

At one point, I did some hunting around for statistics on the use of the
str* routines, how much time was spent in them, the distribution of
string lengths, etc [you may have seen postings to such effect in
the last few years in this group].  Weirdly enough, either nobody knows
very much about the global nature of this, or nobody's telling.
We did some rummaging and profiling of code, and were astonished to find
the actual amount of time spent in the str* routines to be pretty small.
[This was very non-intuitive to me.] Maybe we looked at the wrong things.
So, now a few questions:
1) for Brian: could you say which programs you got the big improvements on?
It would be very interesting for people to be able to try the test cases
on some different machines and see how much time is spent on these.
2) for anybody:
	a) What percentage of your machine's daily work is spent doing
	str* routines?
	b) Of that, how much is getting there and getting set up, and
	how much is in the fundamental loops?
	c) For each of strcpy, strcmp, and strlen, what is the distribution
	of lengths of argument?
	d) For strcmp, what is the fraction of times it finds equality?
	e) What fraction of strings begin on non-word boundaries?
	f) For strcpy and strcmp, what fraction of calls start on word
	boundaries, what fraction don't start on words, but at least have
	similar alignment within the word, and what fraction are completely
	misaligned?
[I think that there is actually a useful MS thesis around in here if done
well.]

Finally, regarding the inclusion of hand-coded cases, i.e., of instructions
that can't be gotten to from compilers, but may still be useful, our
sense is that there aren't very many like that.  This one is one of the
few like that [the others we've generally considered have often been
compilable] that at least has seemed plausible to lots of us.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.UUCP (03/27/87)

In article <15304@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>...
>for maybe 40bytes/5Megabytes or .008% of all code (obvioulsy a real rough
>guess).  But if it improves the running time of all the string-oriented
>system utilties (i.e. almost all utilties!!) by 15% to 20%, it seems
>worth it.  And the implementation cost was so small.  Also, there are
This is clearly true.
>some instructions that must be present just to administer the system,
>like return from interrupt, move-from-special-register, etc.  These
>are not generated by a compiler either.  Just to reiterate a point:  RISC
>is not reducing the instruction set but is improving performance.
Absolutely! There are almost always "structural" instructions that
are hard to generate, but are needed.
>
>Ok, so you don't believe the above?  How about "It improved dhrystone
>a hell of lot."

Unfortunately, Dhrystone is an artificial benchmark.  I couldn't resist
doing a quick test [we have some pretty neat profiling tools for taking
an already-compiled program & turning it into one that does procedure &
statement counts, then gives you instruction cycle counts (typically
about 60% of the total cycles, the other 40% being cache misses,
TLB-miss overhead, memory path interference, etc.)].
OK: here's a quiz: I did some simple tests [ones we use in our standard
Performance Brief suite, which includes diff, grep yacc, and nroff],
plus csh.  I won't pretend this is definitive, since I just whipped
it up.  How much time would you guess these programs spend doing
strlen, strcmp, strcpy [the high-runners]? [Just guess the instruction
cycle %].  Try 0-1%, 2-5%, 6-10%, 11-20%, 21%-up.



















ANSWERS:
program		strlen		strcmp		strcpy
		% cycs	cy/call	% cycs	cy/call	% cycs	cy/call
diff		.03%	-	0	-	0	-
grep		none
yacc		.04%	-	.59%	-	0	-
nroff		0	-	0	-	<0.1%	-
csh		 1.71%	20	 1.27%	9	 1.84%	21
  % of total
  func calls		3.76%		6.11%		3.75%

Dhrystone	<.01%	19	16.94%	103	22.36%	136

Bottom-line: 
	1) Dhrystone isn't remotely representative of some
	common string-pushing programs in UNIX.
	2) for most of these, the total is <1% of instruction cycles,
	hence <0.6% of fully-degraded cycles. Maybe you can save 20%
	of this, or about .1%.
	3) For csh: what's going on is that these routines are called
	very frequently, but for short strings: 3-6 characters;
	strcmp's obviously fail quickly [2nd or 3rd character].
	I think the implication is that maybe you can get rid of 20% of
	the cycles, which would be a 1% instruction cycle saving,
	or about <0.6% fully-degraded cycle saving for csh.
	4) Given all of this, maybe what you get can be grossly estimated
	as about .3%, maybe.  [Again, this was something whipped up in
	half an hour, so hardly definitive].
	5) Note that Dhrystone spends a huge amount of its time copying and
	comparing long strings.  Hence, it's well worth a little extra
	setup time for Dhrystone to lessen the cost per loop.
	[In fact, we only got around to installing an assembler strcpy
	when we noticed how much time was spent there in Dhrystone.]

Thus, I'd say it's still an open issue [for those of you who happen
to be designing computer architectures at this instant!]
	Dhrystone: says it's important, but is totally unrepresentative.
	csh: says it might get you .6%
	others: surprise you by using str* so little
	general evidence: says it might be useful, but if you follow our
		1% rule [and HP used something similar, as I recall],
		there is as yet insufficient evidence to include it.
If I were right now designing a system, I'd be tempted to do a lot
of data-gathering.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

shebanow@ji.Berkeley.EDU.UUCP (03/27/87)

Every time I trace a program which uses stdio to any great degree, I find
printf to be the worst offender (with regard to time consumed). In fact,
I have seen programs which spend 30% of their time in printf (actually
_doprnt). In the code I have optimized, I usually try to eliminate the
use of printf (which can make the code rather opaque). This does result
in 10-20% improvements in speed. The most obvious use of new string
processing instructions would be to improve the performance of printf.
Has anyone done this?

Mike Shebanow

cmcmanis@sun.UUCP (03/27/87)

In article <15304@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
> 
> Ok, so you don't believe the above?  How about "It improved dhrystone
> a hell of lot."
> 
>     bcase

This I believe, I was wondering when someone would point that out. I 
think it is time for some *real* optimizing compilers that can detect
when they are compiling a benchmark and optimize for the appropriate 
instruction! :-) :-) :-)

-- 
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@sun.com
These opinions are my own and no one elses, but you knew that didn't you.

metzger@heathcliff.columbia.edu.UUCP (03/27/87)

In article <18036@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
>Every time I trace a program which uses stdio to any great degree, I find
>printf to be the worst offender (with regard to time consumed). In fact,
>I have seen programs which spend 30% of their time in printf (actually
>_doprnt). In the code I have optimized, I usually try to eliminate the
>use of printf (which can make the code rather opaque). This does result
>in 10-20% improvements in speed. The most obvious use of new string
>processing instructions would be to improve the performance of printf.
>Has anyone done this?

Here at Project Synthesis we found that out a while back. It is pretty
strange to think of it that way, but yes, Printf and friends are
compute bound, not I/O bound. We have experimented with a number of
solutions, including a hardware printf coprocessor for the 68020.
Predictably, it improved performance quite a bit. We also have tried
using specially coded versions of printf in assembly language that we
passed through a (very time expensive) exhaustive search optimiser and
the results were also pretty good.

Perry Metzger

bcase@amdcad.UUCP (03/28/87)

In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
>	(a) significantly slow byte addressing to begin with (ala cray)?
>	(b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressibility, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate,
>for cray unix, buffer copy ops in the kernel were vastly improved when
>re-written for words instead of bytes, even more so when vectorized
>(the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code.  drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...
>

Re:  (a) above.  NO.  Implementing a byte addressable (especially writable)
memory slows down all memory accesses for a slight improvement in byte
processing efficiency.  For my own part, I can say that from the beginning
of the Am29000 project, I was firmly against anything but a word-oriented
interface.  You have to realize that byte accessing requires an alignment
network somewhere:  it will add some nanoseconds to all memory accesses; you
can put the alignment network in its own pipeline stage, but even then, it
will *always* slow down every memory access, there is nothing you can do about
it!  (The same reasoning leads to our addressing-mode-less load and store
instructions:  if addressing modes are there, then the instructions always
pay the cost even when the compiler knows better!)  Thus, this is not strictly
a RISC issue, but a maximum throughput issue regardless of RISCness.  (Note
that the Am29000 extract-byte and insert-byte instructions are essentially
putting the alignment network in a pipeline stage, but the compiler can
decide to pay or not pay the penalty for a given memory access (depending
upon whether it is a byte access or not)).
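
In C terms, the tradeoff looks roughly like this (an illustrative
sketch only, not the actual Am29000 code sequence; big-endian byte
numbering within the word is assumed):

	typedef unsigned long word;	/* 32 bits assumed */

	/* On a word-oriented memory interface, a byte load is a full
	   word load followed by an extract step; the extract (the
	   alignment network in software) is paid for only when the
	   access really is a byte access. */
	unsigned char load_byte(const word *mem, unsigned long byte_addr)
	{
	    word w    = mem[byte_addr >> 2];	/* one word load */
	    int shift = (int) ((3 - (byte_addr & 3)) * 8);

	    return (unsigned char) ((w >> shift) & 0xff);
	}

A plain word access skips the extract step entirely, which is the whole
point: the shift cost shows up only where the compiler knows it really
is dealing with bytes.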

Re:  (b) above.  NO.  We do not do inlining of the strcmp(), strcpy(), or
strlen() routines (I wish we could, it would be even better!).  The Am29000
has one of the fastest calling conventions around.  The performance
improvement we saw with the special instruction is an order of magnitude
(ok, I'm taking a guess here, but it is probably pretty close) greater than
what would have been gained by in-lining in this case.

Re:  Branch folding.  I like all optimizations.  Give me more MIPS!  More.
MORE.  MORE!

    bcase
---------------

Back again by popular demand:

Sung to the tune of "Timewarp:"

	It's just a shift the left,
	And then a shift to the ri-i-ight!
	Get a load of those MIPS
	And the code's real ti-i-ght!
	This lock-step pipeline
	Really drives me insa-a-a-e-a-ane
	Let's do the I-fetch again!

bcase@amdcad.UUCP (03/28/87)

In article <5969@mimsy.UUCP> mark@mimsy.UUCP (Mark Weiser) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>>...How likely is it that a compiler [can] detect the case when it can use 
>> an instruction like this and generate code automatically to use it?
>
>This is of course a good point, and it seems that this is going to be
>one of those hard cases for compiler writers.  But it opens the issue
>of subroutine libraries as providing an environment which also needs
>support in the instruction set.  The Unix string routines are a good
>example, because they are even now often recoded in assembly for efficiency.
>Various pieces of Unix kernel frequently suffer the same fate
>(e.g. tty buffers).  But once one is into instructions which are good
>for certain operating systems or environments, rather than certain languages,
>one has narrowed one's market.  For instance, the Am29000 special string
>compare is useless if strings are encoded as a length followed by the chars.  

Yes, you are right:  when the length is known, things are *much* easier.
But, the instruction is still a useful primitive:  it can be used to find
out, in one cycle, whether or not a 32-bit word contains a byte of
interest.  The compiler may not generate it, but it can be used in hand-
coded inner loops for this purpose.

    bcase

bcase@amdcad.UUCP (03/28/87)

In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>At one point, I did some hunting around for statistics on the use of the
>str* routines, how much time was spent in them, the distribution of
>string lengths, etc [you may have seen postings to such effect in
>the last few years in this group].  Weirdly enough, either nobody knows
>very much about the global nature of this, or nobody's telling.
>We did some rummaging and profiling of code, and were astonished to find
>the actual amount of time spent in the str* routines to be pretty small.
>[This was very non-intuitive to me.] Maybe we looked at the wrong things.
>So, now a few questions:
>1) for Brian: could you say which programs you got the big improvements on?

Well, there is, of course, dhrystone.  Now, now, calm down, we here know
very well that dhrystone is *quite* skewed away from reality.  But we
have to pay homage to dhrystone for pointing this out to us and making us
find a solution (I think it is so typical that something we thought was
at least a little original happens to have been thought of by many other
people long before us!  Oh well, "All the good ideas have already been
had.")  We are at this moment dredging up our early results.  However,
most have been deleted by this time.  But, and I think I remember
correctly, dhrystone and our assembler were the two big ones.  However,
I think the compiler will also show a benefit as will any program that
does some comparing against fixed strings (e.g. I usually implement opcode
lookups and such with a binary search through a sorted array of strings).

So, I would like to turn the question around, John, and ask you to do
what we might find difficult:  look at the dynamic execution of your
compiler.

Well, Tim just resimulated nroff:  the conclusion is that strcpy() isn't
worth talking about.

Sigh, it may be that there are just a few important programs that will
benefit.  Out of the 100 most popular programs, maybe two benefit sub-
stantially from the compare-bytes instruction.  But if those two are the
most popular (or close to it), it might still be worthwhile to have the
instruction.  Dhrystone is either important or not depending upon how
much time you spend running benchmarks or how paranoid one is. :-)

Ok, so now Tim tells me that the assembler spends about 17% of its time
in strcmp() (strcpy() is less than 1% in this case).  And note that this
17% is *including* the effect of our compare-bytes instruction, without
the instruction the percentage would be much larger (how much?  I don't
think we have the old version of strcmp() sitting around.  Sorry).

I agree, there is an MS thesis here (and the CS community would thank
you a lot!).

    bcase

clif@intelca.UUCP (03/28/87)

> In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
> >conditions?


[ Discussion of best history of string instructions]


> The fact of the matter is:  A C compiler is going to have a real hard time
> generating this instruction.  A Pascal Compiler might have a slightly
> easier time since the conept of "string" is a little more integrated with
> the langauge.  In general, a compiler for a language without built-in types
> and functions for dealing with strings will have a hard time with our
> compare-bytes instruction.  We realized this when we added it to the
> architecture (and it was truly a last-minute addition).  However, it has
> such a significant impact on C programs in general that we were not deterred
> by the realization that this instruction will probably occur only a few
> times in the strcpy(), strcmp(), and strlen() library functions.  Out of
> all the megabytes of code in a UNIX system, this instruction will account
> for maybe 40bytes/5Megabytes or .008% of all code (obvioulsy a real rough
> guess).  But if it improves the running time of all the string-oriented
> system utilties (i.e. almost all utilties!!) by 15% to 20%, it seems
> worth it.  And the implementation cost was so small.  Also, there are
> some instructions that must be present just to administer the system,
> like return from interrupt, move-from-special-register, etc.  These
> are not generated by a compiler either.  Just to reiterate a point:  RISC
> is not reducing the instruction set but is improving performance.
> 
> Ok, so you don't believe the above?  How about "It improved dhrystone
> a hell of lot."
> 
>     bcase

I think that it makes sense to add instructions like this even if
they only get put into the standard libraries, presumably they will
prove to be a modest win for most applications written in C.  I
vaguely remember reading that  calls to string routines represented
about .5% of all C code.  

As for Brian's second statement about Dhrystone being helped by 
string instructions:  He is certainly right about that.

-- 
Clif Purkiser, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif

These views are my own property.  However anyone who wants them can have 
them for a nominal fee.
	

freeman@spar.UUCP (03/28/87)

In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>
>In article <909@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>>                            [...]
>>Note that this particular instruction may well win even if no compiler EVER
>>generates it:
>>                            [...]
>>         other cases in which ... functions coded with particular
>>care, will allow efficient use of highly specialized instructions.

> [Thoughtful queries on cycles consumed by string-processing.]

I offer personal anecdotal evidence that highly optimized string-processing
might be perceived as valuable by users:  One noticeable irritation in the
editors I use daily on various machines (Sun 3/160, Sun 3/75, Symbolics
3600), is the time-delay for string-search in text buffers:  For editors
that work by stuffing large pieces of a file into memory as contiguous
characters, (rather than those that use a linked list of lines), this time
delay can be reduced a *lot* by hand-coding the search routines to use
low-level high-speed string-search primitives.

(My evidence for "*lot*" is also anecdotal; I once did a screen editor in
the Intel *86 architecture:  With all text as contiguous bytes in memory,
the difference between the usual C loop "while( *buf++ != c ) ..."  and
careful use of Intel's REP SCASB string-search assembly-language idiom, was
markedly noticeable in buffers only a few 10s of KBytes long on my 5 MHz
8088 machine.)

Note that I address user satisfaction, not cycles consumed.

>Finally, regarding the inclusion of hand-coded cases, i.e., of instructions
>that can't be gotten to from compilers, but may still be useful, our
>sense is that there aren't very many like that.

How about special floating-point instructions to support "cortic" (have I
got that spelled right?)  algorithms for transcendental functions?  (I
acknowledge that a general-purpose RISC machine would probably curdle its
silicon at the mere thought of having floating-point hardware on chip, but
consider a CISC manufacturer who is doing a floating-point coprocessor or
auxiliary processor.)

How about test-bit-field-and-branch instructions to support Lisp?

How about "atomic" test-and-set operations for semaphores, or bus-locking
primitives for dealing with various specialized peripheral devices.


Jay Freeman, Schlumberger Palo Alto Research <canonical disclaimer>

rpw3@amdcad.UUCP (03/28/87)

Well, John Mashey's numbers were certainly counter-intuitive to ME, given
that those programs bang away at characters all the time, but we should
look at one more thing:

When were nroff/grep/yacc written compared to when str* was standardized
in the library? After all, the str* libraries didn't spring full-blown on
day one with those exact semantics and calling sequences (I'm guessing).
I'll also guess (could easily be wrong) that there is a lot more string
stuff going on than you see in the str* statistics, and that it's hand
coded "in line" (in C, of course).

Counter point: What do similar NEW programs look like? Take some code
written by someone (or a team) that started with the "modern" libraries.
Run the statistics on that, and see...

Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
beat on those really hard (at least in System-V), since way down at the
bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
entries (go look at "getpwent" & friends).

I know that I sometimes avoid str* because I'm afraid it might be slow,
particularly when I know that keeping a count (a.k.a. length) can avoid
a lot of searching for the end. The solution is usually a high-level change
in the application's basic algorithms so that so much copying & searching
is simply avoided. A simple example is when you are building various
strings up out of pieces (as in "speeding up" shell scripts by recoding
in C). Even USING the libraries, sometimes instead of:

	strcpy(buf, "piece");
	strcat(buf, "another piece");
	strcat(buf, "still another piece");

I find myself doing:

	char *p = &buf[0];
	int n;

	strcpy(p, "piece");		p += strlen(p);
	strcpy(p, "another piece");	p += strlen(p);
	sprintf(p, "format%d%s", args...); p+= strlen(p);
	strcpy(p, "still another piece"); p += strlen(p);
	etc...

Then when you're done, you can assert that strlen(buf) == (p - buf);


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

guy@gorodish.UUCP (03/28/87)

>>The most obvious use of new string processing instructions would be to
>>improve the performance of printf.  Has anyone done this?

The version of "_doprnt" that comes with 4BSD and comes in the VAX System III
source (althought whether it's the one that's in the library or not,
I don't know), and may very well have come with UNIX 32V, is written
in assembler.  The 4.3BSD version, at least, does use the string
*and* EDITPC instructions, although it doesn't use the string
instructions for handling things like "%s".

VAX System V doesn't use this version (and no, it's probably not
because the assembly version doesn't handle all the formatting
operators in the S5 version - I know it handles most of them, and I
think it handles all of them), so I'm curious whether they dropped it
because it didn't buy much.

>We have experimented with a number of solutions, including a hardware
>printf coprocessor for the 68020.  Predictably, it improved performance
>quite a bit. We also have tried using specially coded versions of printf
>in assembly language that we passed through a (very time expensive)
>exhaustive search optimiser and the results were also pretty good.

How much did the coprocessor cost, and how much better was it than the
assembly-language version?

mash@mips.UUCP (03/29/87)

In article <15315@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>So, I would like to turn the question around, John, and ask you to do
>what we might find difficult:  look at the dynamic execution of your
>compiler.
Will do.  For the time being, I did find one program that used them
enough to be interesting: (4.3BSD /bin/ls /bin /usr/bin)
function	%CPU	cycles/call
strlen		7.35	18
strcat		3.24	57
strcmp		2.95	11

Again: strcmp's clearly fail pretty quickly, and strlen's don't look at
very big strings [later, I'll show the disassembled code for the
strlen & strcmp functions (the others are a bit longer)].

For shorter ls (like ls of current directory with 10 files), or for SYS V ls,
the numbers were down in the noise. GIANT CAVEAT: remember these are the
undegraded (for cache/tlb/memory interference) numbers, and even
more importantly, remember that these are for our architecture,
i.e., my previous estimates are for how much WE would save by having
the instruction.  Note that any architectural addition must always be
evaluated in the context of what's already in the architecture, i.e.,
it's not possible to SAY "such and such instruction is worth having"
without considering what it's being added onto.
For example, Brian's first posting on this topic didn't mention that the
AMD was a WORD-ADDRESSED machine; that comment came later.  It wouldn't
surprise me at all that the special string instruction is quite worthwhile
for such machines, even if it's probably [from the evidence I've got
so far] below the cutoff on ours [although still useful enough not to
reject without thought.]

>Sigh, it may be that there are just a few important programs that will
>benefit.  Out of the 100 most popular programs, maybe two benefit sub-
>stantially from the compare-bytes instruction.  But if those two are the
>most popular (or close to it), it might still be worthwhile to have the
>instruction.  Dhrystone is either important or not depending upon how
>much time you spend running benchmarks or how paranoid one is. :-)
As noted: the instruction is almost certainly worth something on the AMD part.
[The word-addressing thing is a whole separate issue that's worth a LOT
of discussion, but we might as well finish this one first.]
As usual, computer architecture needs a great deal of measurement,
because intuition can often be wrong.  You still need intuition and
experience for the stuff that's so hard to measure.
>
>Ok, so now Tim tells me that the assembler spends about 17% of its time
>in strcmp() (strcpy() is less than 1% in this case).  And note that this
>17% is *including* the effect of our compare-bytes instruction, without
Is it possible for you to post the code for strcmp/strlen?
(It would be especially interesting to see how you handle pointers
to bytes & shorts).

Rest of this is details of strlen/strcmp mentioned above.
	strlen:
0x404000:	addiu	v1,a0,1
0x404004:	lbu	v0,0(a0)
0x404008:	addiu	a0,a0,1
0x40400c:	bne	v0,zero,0x404004
0x404010:	nop
0x404014:	jr	ra
0x404018:	subu	v0,a0,v1
In ls, this was 7.35%, 18 cycles/call.  This says that the mean case
is addiu, 4 times around the loop (4 instrs each), plus the 2 instrs
(jr, subu) on exit. I.e., the general cost for this routine is 3 + 4N
(where N is the size).  All of this implies what the size distribution must
be: highly skewed, with a mode of 1-2, and a long tail towards longer
strings.  Thus, the tradeoff is between a routine with minimal setup that
runs quickly on short strings, but is not as efficient on longer strings,
versus ones that use more setup. [As you can see, proper design of all
this REALLY depends on knowing the statistics: we wrote several versions
of each of these for comparison.  Again, I wish somebody would do
a definitive study!]

	strcmp:
0x403f50:	lbu	t0,0(a0)				|
0x403f54:	lbu	t1,0(a1)				|
0x403f58:	beq	t0,zero,0x403f84			|-> null str, ex
0x403f5c:	addi	a0,a0,2				|
0x403f60:	bne	t0,t1,0x403f8c			|-> 1st char !=, exit
0x403f64:	lbu	t2,-1(a0)		|
0x403f68:	lbu	t1,1(a1)		|
0x403f6c:	beq	t2,zero,0x403f84	|-> 2nd char null, exit
0x403f70:	addi	a1,a1,2		|
0x403f74:	beq	t2,t1,0x403f54	|---^ characters equal, loop
0x403f78:	lbu	t0,0(a0)
0x403f7c:	jr	ra
0x403f80:	subu	v0,t2,t1
0x403f84:	jr	ra
0x403f88:	subu	v0,zero,t1
0x403f8c:	jr	ra
0x403f90:	subu	v0,t0,t1

This is a 2-way unrolled version of the simplest strcmp.  In ls,
this was 2.95% of the instruction cycles, 11 cycles/call average.
There seem to be 2 "likely" cases here: 
	11 instrs: take the "2nd char null, exit" branch and return
	13 instrs: find 2nd char !=, i.e., execute 13 instrs with no branch
What this says is that once again, there's a very skewed distribution,
since some longer strings must compare equal. [To be honest, this
even surprises me somewhat].

IF this is representative, this says that the special instruction
wouldn't do US much good on these routines. [It might help on
strcpy/strcat: I'll look at that later.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.UUCP (03/29/87)

In article <948@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>
>In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>
>I offer personal anecdotal evidence that highly optimized string-processing
>might be perceived as valuable by users:  ....
>the Intel *86 architecture:  With all text as contiguous bytes in memory,
>the difference between the usual C loop "while( *buf++ != c ) ..."  and
>careful use of Intel's REP SCASB string-search assembly-language idiom, was
>markedly noticeable in buffers only a few 10s of KBytes long on my 5 MHz
>8088 machine.)
	I believe that if you have a microcoded machine, you should have
	at least a few of the critical string routines, although note that
	it is very difficult to get a minimal set that covers all of the
	interesting cases, i.e., many CPU's string instructions don't do C.
	If you have a machine that provides them, you're crazy not to use
	them.
>
>
>>Finally, regarding the inclusion of hand-coded cases, i.e., of instructions
>>that can't be gotten to from compilers, but may still be useful, our
>>sense is that there aren't very many like that.
>
>How about special floating-point instructions to support "cortic" (have I
>got that spelled right?)  algorithms for transcendental functions?  (I
I think it's "quartic", and I think there may be some justification
for them, depending on the silicon tradeoffs, i.e., does it cost you
time in add or multiply to put the quartics on.  Depending on the
target applications, it may or may not be worth it.
>acknowledge that a general-purpose RISC machine would probably curdle its
>silicon at the mere thought of having floating-point hardware on chip, but
>consider a CISC manufacturer who is doing a floating-point coprocessor or
>auxiliary processor.)
I don't know why its silicon would be curdled.  Again, it really depends
on the target application mix: for example, for some applications, I'd
be tempted to put an FP unit on the CPU chip. [We didn't because we
wanted ferocious FP performance, and you cannot yet get enough silicon
area to do that on the CPU chip, certainly not at 2 micron, and probably
not at 1.2.]
>
>How about test-bit-field-and-branch instructions to support Lisp?
We did some looking at that for LISP & Smalltalk.  Some of them
really want to be traps on bitfields for tag-checking. However, these
are in the category of "compiler can generate them".
>
>How about "atomic" test-and-set operations for semaphores, or bus-locking
>primitives for dealing with various specialized peripheral devices.
Some operations of this class are required, although there are
different ways to implement them.  Most of the typical implementations
don't extend very well into large N shared-memory multis, i.e., bus-lock
is the LAST thing you want to be going on very much.

Anyway, some good points: there certainly exist instructions that are
hard for compilers to generate, that are still useful, and depending on
the application targets, people may decide or not to include them.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.UUCP (03/29/87)

In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>Well, John Mashey's numbers were certainly counter-intuitive to ME, given
>that those programs bang away at characters all the time, but we should
>look at one more thing:
>
>When were nroff/grep/yacc written compared to when str* was standardized
>in the library? After all, the str* libraries didn't spring full-blown on
>day one with those exact sematics and calling sequences (I'm guessing).
>I'll also guess (could easily be wrong) that there is a lot more string
>stuff going on than you see in the str* statistics, and that it's hand
>coded "in line" (in C, of course).
Yes, those were old programs.  I'd be happy to hear of suggestions for
specific programs to check.  It's absolutely sure that there was plenty
of string stuff going on, coded in C. [Of course, some of the string
stuff is algorithmic work that hardware isn't likely to support for
a while, i.e., regular-expression matching.]
>
>Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
>beat on those really hard (at least in System-V), since way down at the
>bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
>entries (go look at "getpwent" & friends).
I generally looked for str* routines; the others were generally in the noise.
Following are the stats for SYS V : ls -l /usr
Profile listing generated Sat Mar 28 14:19:29 1987 with:
----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *
----------------------------------------------------------------------------
1139801 cycles

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

    511480   44.87  44.87       6558     22 getname (ls.c)
    310413   27.23  72.11         16     40 fgetc (../fgetc.c)
     99006    8.69  80.79         43     28 _flsbuf (../flsbuf.c)
     79820    7.00  87.80        340     17 _doprnt (../doprnt.c)
     18830    1.65  89.45         54     23 select (ls.c)
     15073    1.32  90.77        321     19 gmtime (../ctime.c)
....
      1655    0.15  98.79         10      5 strcmp (../strcmp.s)
....
      1338    0.12  99.18         18      5 strlen (../strlen.s)
....
        63    0.01  99.96          8     10 strchr (../strchr.c)

[Surprise! str* stuff isn't on the map in this one.  Just goes to show
that even experienced [and Rob is] folks' intuition needs checking.
Well, this surprised me also.]

This whole string-processing discussion has been quite worthwhile,
if only to remind us that we often discuss pretty esoteric things
in this newsgroup, while at the same time, we sometimes don't have
the foggiest ideas on even basic program behavior!!
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rbutterworth@watmath.UUCP (03/29/87)

In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> I find myself doing:
> 	char *p = &buf[0];
> 	int n;
> 	strcpy(p, "piece");		p += strlen(p);
> 	strcpy(p, "another piece");	p += strlen(p);
> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
> 	strcpy(p, "still another piece"); p += strlen(p);
> 	etc...

Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
had been designed to return a pointer to the END of the string
(or a count of the number of characters in the string)?

They already have this information for free, but instead they return
their first argument, something that the caller already knows.
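
Such a routine is only a few lines of C (a sketch; the name here is
invented):

	/* like strcpy(), but returns a pointer to the terminating null
	   it just wrote, so the caller can keep appending without
	   re-scanning the string */
	char *strecpy(char *dst, const char *src)
	{
	    while ((*dst = *src++) != '\0')
	        dst++;
	    return dst;
	}

With something like that, the example quoted above collapses to
p = strecpy(p, "piece"); p = strecpy(p, "another piece"); and so on.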

earl@mips.UUCP (03/29/87)

Actually the MIPS architecture does contain string processing
instructions.  They go by the names ADDU and XOR.  For example, the
inner loop of my strlen routine, which searches for a word with a
nonzero byte, is

l1:	lw	t2,1(a1)
	addu	a1,4
	addu	t3,t2,t0
	xor	t3,t2
	and	t3,t1
	beq	t3,t1,l1

Quiz: what are the magic constants that need to be loaded into t0 and
t1 to make this work?

This is 7 cycles per word, or 1.75 cycles per byte.
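
The same kind of test can be written portably in C; whether the
constants below are the ones the quiz has in mind is left open, but
this well-known form is nonzero exactly when a 32-bit word contains a
zero byte:

	typedef unsigned long word32;	/* 32 bits assumed */

	#define HAS_ZERO_BYTE(w) \
		((((w) - 0x01010101UL) & ~(w) & 0x80808080UL) != 0)

	/* scan forward a word at a time until some word contains a
	   zero byte (the pointer is assumed to be word-aligned) */
	const word32 *find_zero_word(const word32 *p)
	{
	    while (!HAS_ZERO_BYTE(*p))
	        p++;
	    return p;
	}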

-Earl

gemini@homxb.UUCP (03/29/87)

I got curious myself, and profiled "ls", "awk", "cpp", "lln", and
several SGS's and local applications programs for str* and mem* usage.
In most cases, the str* numbers were in the 0-3% range.  Diff didn't
use strcat/cpy/cmp at all, though it obviously does lots of string work.
But I found heavy usage of the mem* routines in many of the newer programs
I tried.  7-20% was the range!

If the C convention for strings had been length oriented, I suspect that
Dhrystone would come out with a better approximation of reality.  This
would be due to the heavier usage of mem*-like routines for moving and
comparing the strings.

I believe that Dhrystone would have been better off doing 'stringy' things
with some custom code.  I'm certain that lots of string work gets done in
the 'average' program, yet it appears that not much of it gets done using
the str* routines.  On the other hand, in my observations, I've seen lots
of novice C programmers go overboard using str*, so perhaps profiling the
(presumably) finely tuned UNIX utilities isn't showing us reality either.
It's certain that novice programmers will be sharing the machine and
their programs with me for a long time to come.

Rick Richardson, PC Research, Inc: (201) 922-1134  ..!ihnp4!castor!pcrat!rick
	         when at AT&T-CPL: (201) 834-1378  ..!ihnp4!castor!polux!rer

mash@mips.UUCP (03/29/87)

In article <182@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>I got curious myself, and profiled "ls", "awk", "cpp", "lln", and
>several SGS's and local applications programs for str* and mem* usage.
>In most cases, the str* numbers were in the 0-3% range.  Diff didn't
>use strcat/cpy/cmp at all, though it obviously does lots of string work.
>But I found heavy usage of the mem* routines in many of the newer programs
>I tried.  7-20% was the range!
>
>If the C convention for strings had been length oriented, I suspect that
>Dhrystone would come out with a better approximation of reality.  This
>would be due to the heavier usage of mem*-like routines for moving and
>comparing the strings.

My current sense of things says:
	a) strcmp finds a difference very quickly, and strlen sees fairly
	short strings, hence for them, it's most important to optimize
	the short cases.
	b) strcpy appears to copy longer strings than a), not as long as c).
	c) people use memcpy/bcopy for copying bigger hunks of memory,
	hence they need to be optimized for the long case: fortunately,
	since you have length as a parameter, you can make correct decisions
	easily, unlike the str* things, which must guess.  There's one
	instance of bcopy that needs to be especially tuned for long copies,
	which is the one inside the kernel.
None of the above is definitive, just the sort of patterns I've seen when
I've looked.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

gemini@homxb.UUCP (03/30/87)

Options to "ls" have a great affect upon "str*" usages.  The following
results were obtained on an Amdahl 580 running uts 5.2.5:

./ls /bin /usr/bin [listed 64 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  22.1   10.72   10.72   80190      0.1336  strcmp
   3.3    1.60   40.25   21577      0.0742  memcpy
   0.7    0.33   47.65   10688      0.0312  strlen
   0.3    0.13   47.97                      strchr
   0.1    0.05   48.40                      memchr
   0.1    0.03   48.53                      strcpy

./ls -CF /bin /usr/bin [listed 64 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
   7.9   10.67  100.55   80191      0.1330  strcmp
   1.2    1.58  128.38   32234      0.0491  memcpy
   0.3    0.42  133.08   21440      0.0194  strlen
   0.1    0.12  134.05       2     58.      strchr
   0.1    0.08  134.23                      memchr
   0.0    0.02  134.50                      strcpy

./ls -l /bin /usr/bin [listed 8 times]
 %Time Seconds Cumsecs  #Calls   msec/call  Name
   0.4    1.42  338.83   10023      0.1413  strcmp
   0.3    1.18  341.32   18761      0.0631  memcpy
   0.0    0.15  345.45    1336      0.112   strlen
   0.0    0.10  345.92                      memchr
   0.0    0.08  346.18                      strcpy
   0.0    0.03  346.35       1     33.      strchr

Note: on the Amdahl, strcmp and friends are in assembler.

Rick Richardson, PC Research, Inc: (201) 922-1134  ..!ihnp4!castor!pcrat!rick
	         when at AT&T-CPL: (201) 834-1378  ..!ihnp4!castor!pollux!rer

ka@rpiacm.UUCP (03/30/87)

In article <230@winchester.mips.UUCP>, mash@mips.UUCP (John Mashey) writes:
> We did some rummaging and profiling of code, and were astonished to find
> the actual amount of time spent in the str* routines to be pretty small.

Probably a major reason for this is that programmers use in-line code for
lots of string operations in C.  The early UN*X code was pretty efficient.
People who learned their coding styles from reading early UN*X code simply
don't include calls to the str* routines inside tight loops.
				Kenneth Almquist

guy@gorodish.UUCP (03/30/87)

>If the C convention for strings had been length oriented, I suspect that
>Dhrystone would come out with a better approximation of reality.  This
>would be due to the heavier usage of mem*-like routines for moving and
>comparing the strings.

Why would people use mem*-like routines for manipulating counted
strings, but not use str*-like routines for manipulating
null-terminated ones?  The only reason I can think of is that since
finding the length/end of a string is a constant-time operation with
counted strings, but is linear in the length of the string with
null-terminated strings, people would be less tempted to write their
own string manipulation code that found the end of a string as a side
effect of the processing.  Is this what you had in mind?

I'd be interested to know how much string processing is, in fact,
done by *ad hoc* code because the string routines available on the
system didn't quite match what was needed.  "strtok" and company weren't in
V7, so people would usually have done their own string-chopping.

howard@cpocd2.UUCP (03/30/87)

In article <948@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>How about special floating-point instructions to support "cortic" (have I
>got that spelled right?)  algorithms for transcendental functions?

In article <237@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>I think it's "quartic", ...

No, it's "Cordic", and is usable for FFT as well.  One reason people are
interested in it is that it maps well into silicon, so custom Cordic chips
may be practical for some applications.
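
For those who haven't met it: each Cordic step is just shifts, adds, and a
small table lookup, which is why it maps well into silicon.  The sketch
below is purely illustrative (my own fixed-point scaling and table, angles
in radians times 2^16), not code from any real part:

#include <stdint.h>

#define ITER 16
static const int32_t atan_tab[ITER] = {     /* atan(2^-i), same scale */
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256, 128, 64, 32, 16, 8, 4, 2
};

void cordic_rotate(int32_t *x, int32_t *y, int32_t z)   /* rotate (x,y) by z */
{
    for (int i = 0; i < ITER; i++) {
        int32_t dx = *y >> i, dy = *x >> i;
        if (z >= 0) { *x -= dx; *y += dy; z -= atan_tab[i]; }
        else        { *x += dx; *y -= dy; z += atan_tab[i]; }
    }
    /* the result carries the usual Cordic gain (about 1.647);
       pre-scale the input vector to compensate */
}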
-- 

Copyright (c) 1987 Howard A. Landman.  Transmission of this material
constitutes permission from the intermediary to all recipients to freely
retransmit the material within USENET.  All other rights reserved.

bcase@amdcad.UUCP (03/31/87)

In article <236@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>For example, Brian's first posting on this topic didn't mention that the
>AMD was a WORD-ADDRESSED machine, that comment was later.  It wouldn't
>surprise me at all that the special string instruction is quite worthwhile
>for such machines, even if it's probably [from the evidence I've got
>so far] below the cutoff on ours [although still useful enough not to
>reject without thought.]

Like oops:  the fact that we use the 29000 as a word-addressed machine in
our simulations is so familiar to me that I forgot to mention it as an
assumption.  It is important to note that the 29000 can be used as either
a word-addressed or as a byte-addressed machine:  there are 3 "user bits"
in the load/store control fields that simply drive some pins on the chip.
Thus, they can be used to encode things like load/store byte/halfword/word
if a system implementor so chooses.  However, we think it is stupid because
such things slow down memory systems and thus the entire system.  But if
you really want to....

>As noted: the instruction is almost certainly worth something on the AMD part.
>[The word-addressing thing is a whole separate issue that's worth a LOT
>of discussion, but we might as well finish this one first.]

Yes, I would really like to get some opinions/facts from the net regarding
the word-address/byte-address issue.

>Is it possible for you to post the code for strcmp/strlen?
>(It would be especially interesting to see how you handle pointers
>to bytes & shorts).

Yes!  Remember:  This code was generated by hand tweaking the output
of my C compiler!  Thus, it isn't the best (Please don't hurt my feelings
by complaining about the quality of my compiler.  I am quite aware of
its shortcomings).  I am posting both the C source and the Am29000 code.
Most of it should be easy to understand (but note that the extract is
the 32-bits-from-64-bits funnel-shift function).  The instruction format
is <op> <dest>,<source1>,<source2>, where source2 can be an 8-bit positive
constant.

Well, this is so damn long that I'll just post strcpy.  strcmp is easy
to write once you see how this works, and strlen is something we haven't
bothered with yet (because nothing has needed it).  Sorry in advance
about the length.  If you have better algorithms than the one I have
implemented below, please tell me (via mail).  I'll summarize to net
if appropriate.
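
For reference, here is a rough C model of the two intrinsics used below.
This is a sketch of the intended semantics only (assuming the big-endian
byte order we run with), not a transcription from the data book; the real
cpbyte produces TRUE in the most-significant bit, but any nonzero value
serves the C below:

#include <stdint.h>

int has_a_zero(uint32_t w)          /* is any of the 4 bytes zero? */
{
    return ((w >> 24) & 0xff) == 0 || ((w >> 16) & 0xff) == 0 ||
           ((w >>  8) & 0xff) == 0 || ( w        & 0xff) == 0;
}

uint32_t bit_extract(uint32_t word0, uint32_t word1, unsigned n)
{
    /* take 32 bits out of the 64-bit value word0:word1, starting
       n bits (n = 0..31) down from the most-significant end */
    uint64_t both = ((uint64_t)word0 << 32) | word1;
    return (uint32_t)((both << n) >> 32);
}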

strcpy (s, t)
char *s, *t;
{
    int word0, word1, word2;

    if (((int)s & 3) == 0)
    {
	if (((int)t & 3) == 0)
	{
	    /* can do fastest */
	    /* has_a_zero is the compare-bytes instruction */

	    while (has_a_zero (word0 = *(int *)t) == 0)
	    {
		*(int *)s = word0;
		s += 4;
		t += 4;
	    }
	    while (*s = *t)
		++s, ++t;
	    return;
	}

	/* destination is word aligned, but source isn't */

	word0 = *(int *)t;
	while (1)
	{
	    t += 4;
	    word1 = *(int *)t;

	    /* bit_extract is the funnel-shift instruction */

	    word2 = bit_extract (word0, word1, ((int)t & 3) << 3);
	    if (has_a_zero (word2))
		break;

	    *(int *)s = word2;
	    s += 4;
	    word0 = word1;
	}
	t -= 4;
	while (*s = *t)
	    ++s, ++t;
	return;
    }

    /* both strings unaligned */

   if (((int)s & 3) == ((int)t & 3))
   {
	/* both strings have the same mis-alignment */

	while (1)
	{
	    if ((*s = *t) == 0)
		return;
	    ++s, ++t;
	    if (((int)s & 3) == 0)
		break;
	}
	/* now transfer a word at a time */

	while (1)
	{
	    if (has_a_zero (word0 = *(int *)t))
		break;
	    *(int *)s = word0;
	    s += 4;
	    t += 4;
	}
	while (*s = *t)
	    ++s, ++t;
	return;
    }

    /* neither string aligned and both have different alignment */

    while (1)
    {
	if ((*s = *t) == 0)
	    return;
	++s, ++t;
	if (((int)s & 3) == 0)
	    break;
    }
    word0 = *(int *)t;
    while (1)
    {
	t += 4;
	word1 = *(int *)t;
	word2 = bit_extract (word0, word1, ((int)t & 3) << 3);
	if (has_a_zero (word2))
	    break;

	*(int *)s = word2;
	s += 4;
	word0 = word1;
    }
    t -= 4;
    while (*s = *t)
	++s, ++t;
    return;
}
----------------------------------------
$L0:
	.use	data
	.align
	.use	code
	.global	_strcpy
_strcpy:
	sub	gr01,gr01,40
	asgeu	OVERTRAP,gr01,gr40
	add	lr01,gr01,56
;.6	"strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$16
	and	lr08,lr0c,3
;.8	"strcpy.c"
	and	lr08,lr0d,3
	eq	lr08,lr08,0
	jmpf	lr08,$17
	or	lr00,lr00,0
	jmp	$LI1
	or	lr00,lr00,0
$LT19:
;.14	"strcpy.c"
	store	16,lr05,lr0c
;.15	"strcpy.c"
	add	lr0c,lr0c,4
;.16	"strcpy.c"
	add	lr0d,lr0d,4
$LI1:
;.12	"strcpy.c"
	load	16,lr05,lr0d
	cpbyte	lr02,lr05,0
	neq	lr08,lr02,0
	jmpf	lr08,$LT19
	or	lr00,lr00,0
$20:
	jmp	$LI2
	or	lr00,lr00,0
$LT21:
;.19	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI2:
;.18	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT21
	or	lr00,lr00,0
$22:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$17:
;.25	"strcpy.c"
	load	16,lr05,lr0d
$LT23:
;.28	"strcpy.c"
	add	lr0d,lr0d,4
;.29	"strcpy.c"
	load	16,lr06,lr0d
;.30	"strcpy.c"
	and	lr08,lr0d,3
	lls	lr04,lr08,3
	mtsp	FS,lr04
	extract	lr07,lr05,lr06
;.31	"strcpy.c"
	cpbyte	lr02,lr07,0
	neq	lr08,lr02,0
	jmpf	lr08,$26
	or	lr00,lr00,0
	jmp	$24
	or	lr00,lr00,0
$26:
;.34	"strcpy.c"
	store	16,lr07,lr0c
;.35	"strcpy.c"
	add	lr0c,lr0c,4
;.36	"strcpy.c"
	jmp	$LT23
	add	lr05,lr06,0
$24:
;.38	"strcpy.c"
	jmp	$LI3
	sub	lr0d,lr0d,4
$LT27:
;.40	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI3:
;.39	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT27
	or	lr00,lr00,0
$28:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$16:
;.46	"strcpy.c"
	and	lr09,lr0d,3
	eq	lr08,lr08,lr09
	jmpf	lr08,$29
	or	lr00,lr00,0
$LT30:
;.52	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$32
	or	lr00,lr00,0
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$32:
;.54	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
;.55	"strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$33
	or	lr00,lr00,0
	jmp	$31
	or	lr00,lr00,0
$33:
	jmp	$LT30
	or	lr00,lr00,0
$31:
$LT34:
;.62	"strcpy.c"
	load	16,lr05,lr0d
	cpbyte	lr02,lr05,0
	neq	lr08,lr02,0
	jmpf	lr08,$36
	or	lr00,lr00,0
	jmp	$35
	or	lr00,lr00,0
$36:
;.64	"strcpy.c"
	store	16,lr05,lr0c
;.65	"strcpy.c"
	add	lr0c,lr0c,4
;.66	"strcpy.c"
	jmp	$LT34
	add	lr0d,lr0d,4
$35:
	jmp	$LI4
	or	lr00,lr00,0
$LT37:
;.69	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI4:
;.68	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT37
	or	lr00,lr00,0
$38:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$29:
$LT39:
;.77	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$41
	or	lr00,lr00,0
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$41:
;.79	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
;.80	"strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$42
	or	lr00,lr00,0
	jmp	$40
	or	lr00,lr00,0
$42:
	jmp	$LT39
	or	lr00,lr00,0
$40:
;.83	"strcpy.c"
	load	16,lr05,lr0d
$LT43:
;.86	"strcpy.c"
	add	lr0d,lr0d,4
;.87	"strcpy.c"
	load	16,lr06,lr0d
;.88	"strcpy.c"
	and	lr08,lr0d,3
	lls	lr04,lr08,3
	mtsp	FS,lr04
	extract	lr07,lr05,lr06
;.89	"strcpy.c"
	cpbyte	lr02,lr07,0
	neq	lr08,lr02,0
	jmpf	lr08,$45
	or	lr00,lr00,0
	jmp	$44
	or	lr00,lr00,0
$45:
;.92	"strcpy.c"
	store	16,lr07,lr0c
;.93	"strcpy.c"
	add	lr0c,lr0c,4
;.94	"strcpy.c"
	jmp	$LT43
	add	lr05,lr06,0
$44:
;.96	"strcpy.c"
	jmp	$LI5
	sub	lr0d,lr0d,4
$LT46:
;.98	"strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI5:
;.97	"strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT46
	or	lr00,lr00,0
$47:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
	.use	data
	.align
-----------------------------------------------------

bcase@amdcad.UUCP (03/31/87)

In article <15325@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>Like oops:  the fact that we use the 29000 as a word-addressed machine in
>our simulations is so familiar to me that I forgot to mention it as an
>assumption.  It is important to note that the 29000 can be used as either
>a word-addressed or as a byte-addressed machine:  there are 3 "user bits"
>in the load/store control fields that simply drive some pins on the chip.
>Thus, they can be used to encode things like load/store byte/halfword/word
>if a system implementor so chooses.  However, we think it is stupid because
>such things slow down memory systems and thus the entire system.  But if
>you really want to....

Jeeze louise, I forgot to mention that the Am29000 always drives the
entire byte address on the pins.  It's just that the preferred memory
system implementation ignores the lower two bits and always returns a full
word.  In this case, the byte-extract and byte-insert instructions are
used once the word gets on chip.  (For RISC historians:  The paper
"Hardware/Software Tradeoffs for Increased Performance" talks about
word-addressed machines.  The information in this and other papers helped
shape our implementation.)
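
To make that concrete, here is a rough sketch (semantics only; big-endian
byte numbering assumed) of what a byte load amounts to when the memory
system ignores the low two address bits and always returns a full word:

#include <stdint.h>

unsigned char load_byte(const uint32_t *mem, uint32_t byte_addr)
{
    uint32_t word  = mem[byte_addr >> 2];        /* memory returns the whole word */
    unsigned shift = (3 - (byte_addr & 3)) * 8;  /* where the addressed byte sits */
    return (unsigned char)((word >> shift) & 0xff);   /* byte-extract, done on chip */
}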

    bcase

tim@amdcad.UUCP (03/31/87)

In article <985@mips.UUCP>, earl@mips.UUCP (Earl Killian) writes:
> Quiz: what are the magic constants that need to be loaded into t0 and
> t1 to make this work?

Let's see... How about

	t0 = 0x7efefeff
	t1 = 0x81010100

Is that right?  If so, it appears to work only for 7 bit characters (but
we admit that such a restriction is reasonable in real life)...

(This is a fun problem but probably a bit too hard for use in interviews.)
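
For the curious, here is the test the loop performs, rendered as a C
sketch (little-endian byte numbering assumed; as noted, a '\200' in the
top byte can fool it, so a byte-level recheck is still needed):

#include <stdint.h>

int word_may_have_zero_byte(uint32_t w)
{
    uint32_t t0 = 0x7efefeffu;   /* addend meant to carry out of each nonzero byte */
    uint32_t t1 = 0x81010100u;   /* the carry bits we examine */
    return (((w + t0) ^ w) & t1) != t1;   /* some byte failed to carry */
}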

Mike Johnson
Brian Case
Tim Olson

henry@utzoo.UUCP (Henry Spencer) (03/31/87)

> ... I think it is time for some *real* optimizing compilers that can detect
> when they are compiling a benchmark and optimize for the appropriate 
> instruction! :-) :-) :-)

Don't laugh!  It's not unknown for an optimizing-compiler group to put extra
attention into making common benchmarks run fast.  I'm told that there is at
least one compiler that has a specialized loop-unrolling optimization whose
major purpose is to quietly turn the "looping" version of Linpack into the
"unrolled" version.
-- 
"We must choose: the stars or	Henry Spencer @ U of Toronto Zoology
the dust.  Which shall it be?"	{allegra,ihnp4,decvax,pyramid}!utzoo!henry

rpw3@amdcad.UUCP (03/31/87)

In article <238@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Following are the stats for SYS V : ls -l /usr
>    cycles %cycles  cum %     cycles  bytes procedure (file)
>                               /call  /line
>    511480   44.87  44.87       6558     22 getname (ls.c)
>    310413   27.23  72.11         16     40 fgetc (../fgetc.c)
>     99006    8.69  80.79         43     28 _flsbuf (../flsbuf.c)
>     79820    7.00  87.80        340     17 _doprnt (../doprnt.c)
>	...	...	...	...	 ...	...

Hey, John, wanna know something really interesting? Go look at 4.2 BSD
"ls", at routine getname(), and you'll discover it's full of getpwent()'s
(plus {set,end}pwent(), etc.) and strncpy()'s (hidden in the "SCPYN" macro).
And of course getpwent() uses the str* routines, though it also has its
own pwskip() instead of index().  (Have you got a 4.{2,3} BSD "ls" that you
could run your profiler on?)

Then look at the Sys V "ls", and lo & behold, all of those neat library
routines have been unfolded into the getname() routine, and hand-coded
into a most glorious mess, which calls only (guess)...  fgetc()!!!

(No wonder my jaw dropped when I saw your numbers!)

Now other parts of Sys V use {set,get,end}pwent(), so I guess somebody
decided that "ls" was running too slow (or had an "intuition"?) and ripped
it apart. Given the looks of the code, I'm surprised that even fgetc()
survived.

This reminds me of Gerry Weinberg's "Railroad Paradox":

	"We don't need fast strlib routines, because no one ever uses them."

Yeah... 'cause they're not fast, everyone codes around them!

We (me, too, I just realized I got suckered in also) must be *VERY* careful
to look inside any major piece of code we are going to try and use to prove
something or other, in order to avoid simply confirming the {good,bad,weird}
practices of the past.

I am now completely unsure whether we can get ANY reasonable statistics on
what the proper usage (percent, etc.) of "str*" should/could be in NEW code,
if we base it on code that has been hacked on as long as the various Unices.

Anybody want to cough up a "Stryngstone" benchmark for us to throw rocks at?


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

earl@mips.UUCP (04/01/87)

In article <15331@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> > Quiz: what are the magic constants that need to be loaded into t0 and
> > t1 to make this work?
> Let's see... How about
> 	t0 = 0x7efefeff
> 	t1 = 0x81010100
> Is that right?  If so, it appears to work only for 7 bit characters (but
> we admit that such a restriction is reasonable in real life)...

Yes, you've got it.  Congratulations. (Gee, do I owe you lunch or
something now?)  Actually, it works for 8-bit characters in bytes 0,
1, and 2 and a 7-bit one in byte 3 (I'm using little-endian byte
numbers here).  Thus it may accept '\200' as a zero in byte 3.  But
this is easily handled in the follow-on code that has to detect which
byte is nonzero by actually testing byte 3 and branching back to the
loop if it is nonzero.  Actually, here is that code.  Note how after
the word loop the carries also conveniently let you detect the byte.
This part only works for little-endian byte ordering.  (Note: my
personal convention is to indent instructions in branch delay slots
one space):

	## search for word with zero byte
l1:	lw	t2,1(a1)	# read next word
	addu	a1,4
	addu	t3,t2,t0	# cause carry out of any nonzero byte
	xor	t3,t2		# xor to find carry bits
	and	t3,t1		# look at carries out of bytes
	beq	t3,t1,l1	# all bytes carried => none zero
	## found word with zero byte, now find the byte
	 sll	t4,t3,23	# carry out of byte 0?
l3:	beq	t4,0,x0		# if not, go compute length
	 sll	t4,t3,15	# carry out of byte 1?
	bge	t4,0,x1		# if not, go compute length
	 sll	t4,t3,7		# carry out of byte 2?
	bge	t4,0,x2		# if not, go compute length
	 srl	t4,t2,24	# byte 3 zero?
	bne	t4,0,l1		# no, re-enter loop
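
Putting the loop and the follow-on check together, here is a hedged C
rendering of the whole idea (little-endian assumed; since the word test
only over-approximates, each candidate word is rechecked a byte at a time):

#include <stddef.h>
#include <stdint.h>

size_t strlen_words(const char *s)
{
    const char *p = s;

    while (((uintptr_t)p & 3) != 0) {       /* get to a word boundary first */
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }
    for (const uint32_t *w = (const uint32_t *)p; ; w++) {
        uint32_t v = *w;
        if ((((v + 0x7efefeffu) ^ v) & 0x81010100u) != 0x81010100u) {
            const char *q = (const char *)w;    /* candidate word: recheck bytes */
            for (int i = 0; i < 4; i++)
                if (q[i] == '\0')
                    return (size_t)(&q[i] - s);
            /* false alarm from a '\200' in byte 3; keep scanning */
        }
    }
}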

jesup@steinmetz.UUCP (04/01/87)

In article <15829@sun.uucp> guy@sun.UUCP (Guy Harris) writes:
>
>I'd be interested to know how much string processing is, in fact,
>done by *ad hoc* code because it didn't quite match the string
>routines available on the system?  "strtok" and company weren't in
>V7, so people would usually have done their own string-chopping.

	For the shell I wrote for the Amiga (and for other programs as
well), I have often used a set of string routines I wrote that always
returns a pointer to the END of the string, not the beginning.  This way
I rarely, if ever, have to call strcat.  I usually know what the beginning
of the string is (except for things like strcpy(malloc(SIZE),string)).

	Randell Jesup
	jesup@steinmetz.uucp
	jesup@ge-crd.arpa

john@frog.UUCP (04/01/87)

> Well, John Mashey's numbers were certainly counter-intuitive to ME...

Really sophisticated programs spend most of their time doing really
sophisticated things; simple grunt routines like str*() aren't universally
applicable.

> When were nroff/grep/yacc written compared to when str* was standardized
> in the library?  After all, the str* libraries didn't spring full-blown on
> day one...

I don't know (offhand) about these, but I recall once being curious about
why lex was slow.  It turned out to use subroutines for isalpha(), isspace(),
isdigit(), etc, instead of the (presumably newer) ctype.h macros.  HOWEVER,
even after making that change to a version of lex, it didn't help all that
much, because lex has so much else to do.

> Even USING the libraries, sometimes instead of:
> 	strcpy(buf, "piece");
> 	strcat(buf, "another piece");
> 	strcat(buf, "still another piece");
> I find myself doing:
> 	char *p = &buf[0];
> 	int n;
> 	strcpy(p, "piece");		p += strlen(p);
> 	strcpy(p, "another piece");	p += strlen(p);
> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
> 	strcpy(p, "still another piece"); p += strlen(p);
> 	etc...
> 
WHOOPS!  That's about as much work (first you copy the string, then
you step over it) as the use of strcat().  If you don't mind being
non-portable to oddball architectures, you can write a routine like
"strcatl()", which takes a destination buffer as a first argument, and
a NULL-terminated list of strings to concatenate into that buffer (or
you can be slightly portable and do it with <varargs.h>; a sketch of
strcatl() appears after the examples below).  The easiest way might be,
however, to use:

	char *strcpya(p1, p2)	/* strcat and return end address */
	char *p1, *p2;
	{
		while (*p1++ = *p2++) /**/;
		return p1-1;
	}

and then
	strcpya(strcpya(strcpya(buf, "first piece"), " second piece"),
		" and the last");
or if you prefer
	char *p,buf[127 /* a prime number*/ ];
	p = strcpya(buf, "first piece");
	p = strcpya(p, " second piece");
	p = strcpya(p, " third piece");

Either way, you only touch each byte once.
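
Here, for what it's worth, is a hedged sketch of the strcatl() idea
mentioned above, written against the ANSI <stdarg.h> interface; the name,
the return value, and the null-pointer terminator are my own choices, not
an existing library routine:

#include <stdarg.h>
#include <stddef.h>

char *strcatl(char *dst, ...)      /* concatenate strings, return the final null */
{
    va_list ap;
    const char *src;
    char *p = dst;

    va_start(ap, dst);
    while ((src = va_arg(ap, const char *)) != NULL)
        while ((*p = *src++) != '\0')   /* copy; leave p on the null so the */
            p++;                        /* next string overwrites it */
    va_end(ap);
    return p;
}

You would call it as strcatl(buf, "first piece", " second piece",
" and the last", (char *)0); since it hands back a pointer to the
terminating null, later appends stay cheap.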

Sometimes, the question you need to ask yourself about library routines is
not "Could they be better coded with this screwball assembly instruction?"
but rather "Does it *really* do what I want?".

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"Happiness is the planet Earth
	in your rear-view mirror."
		- Sam Hurt, in "Eyebeam, Therefore I Am"

kent@xanth.UUCP (04/01/87)

In article <15325@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>Well, this is so damn long that I'll just post strcpy.

Brian,
	I almost choked when I saw your code, right after seeing the same job
	done in a handful of lines in a posting that hit here a few articles
	earlier.  You couldn't have written a more cogent argument for byte
	level addressing in any hardware which does text processing if it had
	been your intent.  Programmers are _the_most_expensive_part_ of the
	total lifecycle (including execution) costs of all but the most widely
	distributed, embedded code.  Who could afford to pay the staff to
	write code ten or fifteen times as long (I didn't count) to do the
	same job in a comparable time?  The discussion so far has brought out
	quite a lot of evidence that applications level programmers aren't
	using the library string routines much, yet Unix stream files, and
	the masses of editors, word processors, post-processors, compiler
	parsers and so on in use testify that the job is being done.  The
	implication is that the application programmers are writing the
	string software themselves, to suit the particular case at hand.
	I sure don't want to pay them to write code like that you showed,
	at the application level, to replace
	
		while (*s1++ = *s2++);
	
	Yeeks!
	
	You've talked _me_ into byte instructions and byte addressing.
	Comments?

Kent.							(kent@xanth.UUCP)

meissner@dg_rtp.UUCP (04/02/87)

In article <6448@watmath.UUCP> rbutterworth@watmath.UUCP writes:
> Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
> had been designed to return a pointer to the END of the string,
> (or a count of the number of characters in the string)?

The System V.3 (and ANSI) *{scanf printf} routines have a format specifier
%n which stores the number of bytes {read, written} so far into the int
pointed to by the current argument.  Also in System V.xx and ANSI, sprintf
returns the total number of bytes written (not the pointer to the string
like 4.2 BSD does).
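
A small, hedged example of both features (the buffer and strings are mine,
purely for illustration):

#include <stdio.h>

int main(void)
{
    char buf[64];
    int n1 = 0;
    int total = sprintf(buf, "piece%n-another", &n1);
    /* n1 is 5 (bytes written when %n was seen); total is the full
       length, so buf+total is ready for the next append */
    printf("n1=%d total=%d buf=%s\n", n1, total, buf);
    return 0;
}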
-- 
	Michael Meissner, Data General	Uucp: ...mcnc!rti-sel!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?

stubbs@ncr-sd.UUCP (04/04/87)

In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>
>My question is this:  How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.  

In this case, these instructions would be used in the hand coded
strcpy and strcmp C library routines, so compilers need not be concerned.
My guess is they are still a win, because these libraries are used a bunch.


Jan Stubbs    ....sdcsvax!ncr-sd!stubbs
619 485-3052
NCR Corporation
Advanced Development
16550 W. Bernardo Drive MS4010
San Diego, CA. 92127

pec@necntc.UUCP (04/06/87)

How likely is it that a compiler can make use of string copy
instructions?  NEC's re-targeting of the portable C compiler to the 
V60 architecture frequently takes advantage of one of the V60's string 
copy instructions.  It uses a string copy for  structure assignments!  

Of course, these instructions are also used in the library routines.   
Another use that comes to mind is when the loader zeros out a BSS area...

paulsc@orca.UUCP (04/06/87)

In article <7843@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... I think it is time for some *real* optimizing compilers that can detect
>> when they are compiling a benchmark and optimize for the appropriate 
>> instruction! :-) :-) :-)
>Don't laugh!  It's not unknown for an optimizing-compiler group to put extra
>attention into making common benchmarks run fast.  I'm told that there is at
>least one compiler that has a specialized loop-unrolling optimization whose
>major purpose is to quietly turn the "looping" version of Linpack into the
>"unrolled" version.

It isn't always a conscious effort to trick customers into
buying the compiler.  The compiler group might only have a
handful of "test programs" for their compiler.  If the compiler
does any peephole optimizations, the compiler group probably
compiles whatever test programs they have to see how their
compiler might generate better code.  If any of their test
programs are benchmarks, they will be spending time making the
benchmarks run fast.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
tektronix!orca!paulsc

eugene@pioneer.UUCP (04/07/87)

In article <2344@orca.TEK.COM> paulsc@orca.UUCP (Paul Scherf) writes:
>
>It isn't always a conscious effort to trick customers into
>buying the compiler.  The compiler group might only have a
>handful of "test programs" for their compiler.
>
>Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA

This is part of the point of my research: to get everyone writing
compilers to have a large consistent set of programs to check
optimizations just like we have programs (validation suites) to check
conformance to language standards.  I hope I can get the various companies
interested in a set of open performance measurement tools.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

am@cl-jenny.UUCP (04/08/87)

In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>One unique feature of the Am29000 architecture is a special instruction....
> ... [ to speed up detecting null bytes in words].
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
   has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).

On Acorn's ARM risc machine using this feature as Brian Case suggests speeded
up dhrystone by 20-25%, which may comment on dhrystone or risc. This gives
some 5-6K dhrystones on the slow clocked pre-production 3 micron chips.
(Wonder what state of the art fabrication can give....)

karl@haddock.UUCP (04/11/87)

In article <486@danews.ATT.COM> lvc@danews.ATT.COM (Larry Cipriani) writes:
>In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
>> I find myself doing:
>> 	char *p = &buf[0];
>> 	int n;
>> 	strcpy(p, "piece");		p += strlen(p);
>> 	strcpy(p, "another piece");	p += strlen(p);
>> 	sprintf(p, "format%d%s", args...); p+= strlen(p);
>> 	strcpy(p, "still another piece"); p += strlen(p);
>> 	etc...
>
>It seems to me that the code would be far more readable and faster
>if it were written as:   p = cat(p1, p2, ..., (char*)0);

Or, to better mimic the original semantics,  p = cat(buf, p1, ...) with no
malloc involved.  (Yeah, the malloc version is useful too; I've got one of my
own.)

However, this is only useful if each operation is a strcpy or strcat; it
doesn't work if there are intermixed one-char stores (*p++ = c) or more
complex stuff (the embedded sprintf in the original example).

Of course, one could always write the whole thing with sprintf, using "%s%s%s"
format instead of strcat(strcat(strcpy(...))), but sometimes it's better to
use the low-level stuff.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

stevew@karau.UUCP (04/11/87)

In article <486@danews.ATT.COM>, lvc@danews.ATT.COM (Larry Cipriani) writes:
> In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> > I find myself doing:
> > 	char *p = &buf[0];
> > 	int n;
> > 	strcpy(p, "piece");		p += strlen(p);
> > 	strcpy(p, "another piece");	p += strlen(p);
> > 	etc...
> It seems to me that the code would be far more readable and faster
> if it were written as:
> 	p = cat(p1, p2, ..., (char*)0);
> where all the p's are pointers to characters.  Cat will malloc
> enough space to store a concatenation of all the pn's and a '\0'.

Another possible solution would be:
	cat(p, n, p1, p2, ..., (char*)0);
Where p == destination string,  n == amount of space available,
p1 == the first pointer,  etc...

You might have two versions, one with n and one without.  The debate is now
what it should return:
	1. pointer to the initial string.
	2. the number of characters cat()'ed.
	3. a pointer to the terminating null.

jack@mcvax.UUCP (04/15/87)

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>   has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

I was so impressed by this new trick (well, to *me* it is new:-)
that I immediately decided to try it on my Whitechapel MG-1,
a 32016 based machine, and the results were impressive.

I coded strcpy() using this method, and the results were
great. Break-even with the normal strcpy() at 4-char strings, performance
slightly worse with 5/6/7-char strings, and getting better and
better from there on. For strings with length 4N (N>=4) performance
was twice that of the old strcpy(). This is the routine:


#define hasnull(x) (((x)-0x01010101) & ~(x) & 0x80808080)

char *strcpy(at,f)
    char *at;
    register long *f;
{
    register long d;
    register long *t = (long *)at;
    register char *fc, *tc;

    do {
	d = *f++;
	if( !hasnull(d) ) {	/* no null byte here: copy the whole word */
	    *t++ = d;
	    continue;
	}
	tc = (char *)t;		/* null byte in this word: finish by bytes */
	fc = (char *)(f-1);
	while( *tc++ = *fc++);
	return(at);
    } while(1);
}

Coding in assembler caused a 30% decrease in time for small (10-char)
strings (less registers to save, t/tc and f/fc in the same reg, etc).
Something I haven't explained yet is that unaligned strings give the
*same* performance. Maybe the extra fetches are noise wrt the
number of instruction fetches?

Note that the 32016 is a 32 bit machine with a 16 bit bus, 
so that is probably why I found twice the speed, instead of four
times.

Anyway, the next thing I thought of is "Wow! This is *great* for
strcmp() on big-endians. Comparing 4 bytes in one go through
the loop!". But, of course, I don't have a big-endian handy.

Anyone care to try this?
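
For concreteness, here is the sort of thing I have in mind, as an untested
sketch.  It assumes both strings are word aligned; dropping back to bytes
for the final word keeps it right on either byte order, though a big-endian
machine could compare the differing words directly:

#include <stdint.h>

#define hasnull(x) (((x) - 0x01010101) & ~(x) & 0x80808080)

int strcmp_words(const char *s1, const char *s2)
{
    const uint32_t *a = (const uint32_t *)s1;
    const uint32_t *b = (const uint32_t *)s2;

    while (*a == *b && !hasnull(*a)) {      /* equal so far, no terminator yet */
        a++;
        b++;
    }
    /* finish by bytes: the words differ, or one holds the terminator */
    const char *p = (const char *)a, *q = (const char *)b;
    while (*p == *q && *p != '\0') {
        p++;
        q++;
    }
    return (unsigned char)*p - (unsigned char)*q;
}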
-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.

greg@utcsri.UUCP (04/17/87)

In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>   has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

If one of the bytes contains 0x80, then has_nullbyte() will
say 'yes'. This can be circumvented by a more thorough test
after this one to see if there really is a null there.

Jack Jansen's subroutine does this apparently by accident;
there is a 'while( *tc++ = *fc++ );' after the test finds a 'null'
so the only effect of having a 0x80 byte will be to revert to
the normal strcpy for the rest of the string.

Someone posted a similar but usually better test on comp.arch ( I think )
a little while ago:

#define has_nullbyte(e)  (((e+0x7efefeff)^e)&0x81010100) != 0x81010100

This one is only 'fooled' by an 0x80 in the most significant byte,
which makes the following test much simpler ( a sign test ).
In either case ( especially in this one with two identical
constants ) it helps a lot if the constants are in registers while
the loop is running. When using C, you can do this explicitly:

register k1=0x7efefeff, k2=0x81010100;

-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

bjorn@alberta.UUCP (04/22/87)

In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>>   has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

You're mistaken; the "has_nullbyte_(x)" defined above works
as advertised for all 32 bit x.  That is to say it returns
a nonzero result if and only if x contains a null byte.

>Someone posted a similar but usually better test on comp.arch ( I think )
>a little while ago:
>
>#define has_nullbyte(e)  (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>
>This one is only 'fooled' by an 0x80 in the most significant byte.
>which makes the following test much simpler ( a sign test ).

You are right, this one does not always tell the truth.  Besides
it's a lot more effort.

However for tricks like these to work reliably (especially on
systems with memory protection) you had best get help from
your local linker and 'malloc'.  Otherwise one day a program
is going to read off the edge of an accessible hunk of memory
and flush your program straight down the drain.  Making sure
that no data item ends closer than n-1 bytes (if you're reading
n bytes at time) from a boundary (between accessible and
inaccessible memory) fixes this.


			Bjorn R. Bjornsson
			{mnetor,ihnp4}!alberta!bjorn

greg@utcsri.UUCP (04/23/87)

In article <299@pembina.UUCP> bjorn@alberta.UUCP (Bjorn R. Bjornsson) writes:
>In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>>Then if e is an expression without side effects (e.g. variable)
>>>   has_nullbyte_(e)
>>>is nonzero iff the value of e has a null byte.
>>
>>If one of the bytes contains 0x80, then has_nullbyte() will	[BUNK]
>>say 'yes'. This can be circumvented by a more thorough test	[BUNK]
>>after this one to see if there really is a null there.	[BUNK]
>
>You're mistaken the "has_nullbyte_(x)" defined above works
>as advertised for all 32 bit x.  That is to say it returns
>a nonzero result if and only if x contains a null byte.

Right. I apologize for any inconvenience I may have caused anybody.
I also apologize for posting bunk in a sleepy haze and then vanishing
for a week.
>
>>Someone posted a similar but usually better test on comp.arch ( I think )
>>a little while ago:
>>
>>#define has_nullbyte(e)  (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>>
>>This one is only 'fooled' by an 0x80 in the most significant byte.
>>which makes the following test much simpler ( a sign test ).
>
>You are right this one does not always tell the truth.  Besides
>it's a lot more effort.

It seems to be about the same, in the loop. Below is the subtract version, and
the add version on an NS32k:

	movd	0(r0),r1
	subd	r2,r1
	bicd	0(r0),r1
	andd	r3,r1		;r2 & r3 contain constants
	cmpqd	$0,r1		;need this on ns32k

	movd	0(r0),r1
	addd	r2,r1
	xord	0(r0),r1
	andd	r3,r1
	cmpd	r3,r1

I can't remember why I thought it was more efficient. I guess it
is if you haven't got 'bic'. ( and if 'and' doesn't set the Z flag :-)
The original poster of this method was describing a RISC which
had a 'compare registers and branch if equal' op.
>
>However for tricks like these to work reliably (especially on
>systems with memory protection) you had best get help from
>your local linker and 'malloc'.  Otherwise one day a program
>is going to read off the edge of an accessible hunk of memory
>and flush your program straight down the drain.  Making sure
>that no data item ends closer than n-1 bytes (if you're reading
>n bytes at time) from a boundary (between accessible and
>inaccessible memory) fixes this.

Not generally a problem. The whole point of this is that the 32-bit
accesses are constrained to 32-bit boundaries to make them fast.
Furthermore, in most systems, your access-controlled segments or
whatever are rounded up to a size at least as round as the machine
word. (The MMU does not usually deal with address lines a0,a1 on a
32-bit machine). Thus if it is legal to read the first byte in a word,
it is legal to read the whole word, and thus the last byte.
-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

am@cl.cam.ac.uk (Alan Mycroft) (04/27/87)

In article <4605@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>   #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>>   has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

Untrue.  It would be true if I had mistakenly used ^ instead of &~.