bcase@amdcad.UUCP (03/25/87)
There was a discussion a few months ago about processing strings more efficiently than a byte at a time. The Am29000 takes one of the possible approaches to improving string processing performance....

One unique feature of the Am29000 architecture is a special instruction. This instruction is intended to be used to speed up string processing, but my guess is that other uses will be discovered. The instruction is called "compare-bytes" and works like this: compare-bytes specifies two source register operands and one destination register operand. The 4 pairs of corresponding bytes of the two 32-bit source operands are compared for equality (i.e., the two most-significant bytes are compared, the two next-most-significant bytes are compared, etc.). If any of the four pairs are equal, then the destination register is set to the value "TRUE" (which on the Am29000 is a one in the most-significant bit with all other bits cleared to zero). If none of the four pairs are equal, then the destination register is set to "FALSE" (all bits cleared). (Am29000 conditional branch instructions test only the most significant bit of a register; condition codes are not used, so we get a free "test for negative.")

So, if one of the source operands is set to all zeros (four null characters) (which can be specified in the instruction by choosing the second operand as the zero-extended eight-bit constant zero) and the other operand is a word of the character string being dealt with (say for copying or comparing), the Am29000 can, in one cycle (not counting the branch), determine if the word contains the end-of-string character (according to the C language definition of string). If the word does not contain the end-of-string character, then the four bytes in the word can be manipulated (e.g. loaded or stored) as a unit. Word operations on the Am29000 are much more efficient than character operations (this is true of most machines though). There are, of course, special circumstances to deal with (such as misaligned strings, and we have a funnel shifter to help in those cases), but by using the compare-bytes instruction in the library routines strcpy() and strcmp() (and strlen() too, but we haven't bothered since it seems to never be used in the programs we have encountered), significant improvements in the run-time of many C programs can be realized. Another thing which really helps is to have the compiler word-align literal strings (and I have implemented this), but even with word-alignment, some substrings will begin on strange boundaries and must be dealt with correctly.

My approach to using this instruction consisted of re-writing the library routines in C with function calls wherever the compare-bytes instruction should go. I compiled this C code with my compiler, changed the assembly code to eliminate the function calls in favor of the compare-bytes instruction, and assembled it into the library (actually a module of code that gets included in all final links, but that is just a detail of our simple environment).

Since most C programs (especially utilities and other systems programs) do a lot of string processing, this one instruction is really worth the small implementation cost. It often improves run times by 15% to 20% (just goes to show that the impact of processing C language strings has been long ignored). It implements just the right semantics and probably has other applications for specialized pattern matching. I just thought some of you would be interested.

    bcase
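A rough C sketch of the idea (this is not AMD's library code): the word-at-a-time copy loop that compare-bytes is meant to sit inside. CMP_BYTES_WITH_ZERO() is a made-up stand-in for the one-cycle instruction; the portable definition given here uses a carry trick that comes up later in this thread. Both pointers are assumed word-aligned, and the misaligned cases (handled with the funnel shifter on the real part) are omitted.

    /* Stand-in for compare-bytes with a zero second operand:
     * nonzero iff any byte of w is zero.  On the Am29000 this
     * would be the single compare-bytes instruction. */
    #define CMP_BYTES_WITH_ZERO(w)  (((w) - 0x01010101) & ~(w) & 0x80808080)

    typedef unsigned long word;                /* 32 bits here */

    char *sketch_strcpy(char *dst, char *src)  /* both word-aligned */
    {
        word *ws = (word *)src;
        word *wd = (word *)dst;
        word w;

        while (w = *ws++, !CMP_BYTES_WITH_ZERO(w))
            *wd++ = w;                 /* no terminator here: copy 4 bytes at once */

        /* the terminator is somewhere in w: finish a byte at a time */
        {
            char *s = (char *)(ws - 1);
            char *d = (char *)wd;
            while ((*d++ = *s++) != '\0')
                ;
        }
        return dst;
    }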
lamaster@pioneer.UUCP (03/26/87)
In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>There was a discussion a few months ago about processing strings more
>efficiently than a byte at a time. The Am29000 takes one of the possible
>approaches to improving string processing performance....
> :
>
>I just thought some of you would be interested.
>
> bcase

I appreciate seeing submissions like this. Designing a computer instruction set is a little like backpacking: there is an art to figuring out exactly what you really need to take with you.

My question is this: how likely is it that a compiler itself will be able to detect the case when it can use an instruction like this and generate code automatically to use it? One of the positive points of the RISC debate is that it brought out the point that useful instructions which are hard for a compiler to generate are not always a win. The old STAR-100 had a number of instructions, including string processing instructions, which were left out of the Cyber 205 because it turned out that the compilers never generated them. They were neat instructions though. So, my question is, is this instruction easily usable by a C, Fortran, Pascal, or Ada compiler and, if so, under what conditions?

  Hugh LaMaster, m/s 233-9,     UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center            ames!pioneer!lamaster
  Moffett Field, CA 94035       ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117         ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the author and do not represent the opinions of NASA or the U.S. Government")
baum@apple.UUCP (03/26/87)
-------- []
>
>One unique feature of the Am29000 architecture is a special instruction.
>This instruction is intended to be used to speed up string processing,
>but my guess is that other uses will be discovered. The instruction is
>called "compare-bytes" and works like this:.....

The HP Spectrum has three instructions that are very similar. Unit XOR will Exclusive-OR two registers and let you skip if any byte or halfword is equal or not equal. Unit AddComplement[&Trap on Condition] will add the ones complement of one register to another. This is sort of subtraction without the forced carry. That way, all bytes, halfwords & nibbles essentially have the same carry-in (although a carry out of one stage could force a carry-in to another, it turns out this doesn't affect the major uses of the instruction). Then, either a trap will occur or the next instruction will be skipped if the conditions are met. The conditions are any halfword/byte/nibble carry, any halfword/byte zero, and their complements.

Among other things, these instructions can be used for string searching a word at a time, decimal/byte/halfword range checking/validation a byte at a time, and decimal addition pre-biasing.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
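For what it is worth, the effect of "Unit XOR ... skip if any byte equal" can be illustrated in C (this is not HP code, just a sketch): corresponding bytes of two words are equal exactly when their XOR contains a zero byte, which is the same any-byte-zero test discussed elsewhere in this thread.

    /* Illustration only: nonzero iff some corresponding bytes of a and b
     * are equal, i.e. iff a ^ b contains a zero byte. */
    #define ANY_BYTE_ZERO(x)  (((x) - 0x01010101) & ~(x) & 0x80808080)

    int any_byte_equal(unsigned long a, unsigned long b)
    {
        return ANY_BYTE_ZERO(a ^ b) != 0;
    }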
mark@mimsy.UUCP (03/26/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>...How likely is it that a compiler [can] detect the case when it can use
> an instruction like this and generate code automatically to use it?

This is of course a good point, and it seems that this is going to be one of those hard cases for compiler writers. But it opens the issue of subroutine libraries as providing an environment which also needs support in the instruction set. The Unix string routines are a good example, because they are even now often recoded in assembly for efficiency. Various pieces of the Unix kernel frequently suffer the same fate (e.g. tty buffers). But once one is into instructions which are good for certain operating systems or environments, rather than certain languages, one has narrowed one's market. For instance, the Am29000 special string compare is useless if strings are encoded as a length followed by the chars.

-mark

Background: I happen to think Unix-style strings are a big market, and I love things which make Unix faster (because there is a positive feedback cycle which makes more Unix yet more available and desirable).

--
Spoken: Mark Weiser    ARPA: mark@mimsy.umd.edu    Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com
bcase@amdcad.UUCP (03/26/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>I appreciate seeing submissions like this. Designing a computer instruction
>set is a little like backpacking: There is an art to figuring out exactly
>what you really need to take with you.

Thanks for the positive feedback. An interesting point of view!

>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it. One of the positive points to the RISC debate is
>that it brought out the point that useful instructions which are hard for a
>compiler to generate are not always a win. The old STAR-100 had a number of
>instructions, including string processing instructions, which were left out of
>the Cyber 205 because it turned out that the compilers never generated them.
>They were neat instructions though. So, my question is, is this instruction
>easily usable by a C, Fortran, Pascal, or Ada compiler and if so, under what
>conditions?

This is an excellent question, one that I intended to address in the original posting. However, I forgot to include it when the damn article got kinda long. The fact of the matter is: a C compiler is going to have a real hard time generating this instruction. A Pascal compiler might have a slightly easier time since the concept of "string" is a little more integrated with the language. In general, a compiler for a language without built-in types and functions for dealing with strings will have a hard time with our compare-bytes instruction. We realized this when we added it to the architecture (and it was truly a last-minute addition). However, it has such a significant impact on C programs in general that we were not deterred by the realization that this instruction will probably occur only a few times in the strcpy(), strcmp(), and strlen() library functions. Out of all the megabytes of code in a UNIX system, this instruction will account for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough guess). But if it improves the running time of all the string-oriented system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems worth it. And the implementation cost was so small.

Also, there are some instructions that must be present just to administer the system, like return from interrupt, move-from-special-register, etc. These are not generated by a compiler either. Just to reiterate a point: RISC is not reducing the instruction set but is improving performance.

Ok, so you don't believe the above? How about "It improved dhrystone a hell of a lot."

    bcase
guy@gorodish.UUCP (03/26/87)
>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.

In most C implementations, "strcpy", "strcmp", etc. are library routines; if the speedup from these instructions is sufficiently large, it doesn't matter if the compiler can generate them, because the compiler can generate calls to those routines, and those routines can be written in assembler language.

You could also imagine a compiler that would recognize names like "_builtin_strcpy" and generate the assembler code for those routines in line, and a <string.h> that contains

	#define strcpy _builtin_strcpy

(see the footnote on page 86 of the October 1, 1986 ANSI C draft). Another possibility might be to have something like the WE32000 C compiler's "asm" directive, which permits you to define "functions" that are expanded in line to assembler code.

Similar techniques can probably be used in other languages. If they do string processing in routines, the routines can be written in assembler language to use these instructions. If they generate stereotyped sequences for these string operations, these stereotyped sequences can use these instructions. However, I don't know how difficult it would be to recognize various special cases and generate non-stereotyped sequences that use these instructions, or how much of a speedup this would give (if these special cases weren't common, speeding them up might not buy you enough to worry about it).
jaw@ames.UUCP (03/27/87)
> [....]  Since
> most C programs (especially utilities and other systems programs) do a lot of
> string processing, this one instruction is really worth the small
> implementation cost.  It often improves run times by 15% to 20% (just goes to
> show that the impact of processing C language strings has been long ignored).

just curious which unix utilities use str(cpy|cmp|len) in their inner loops? certainly, 'vn' comes to mind as devoting much cpu time to these functions.

is the 15-20% claimed due at all to either

	(a) significantly slow byte addressing to begin with (ala cray)?
	(b) in-line compilation of the string(3) stuff into the application?

if (a), then improving memory byte access speed in the architecture is a more general solution with more payoff overall than the compare gate hack. what is the risc chip cost for byte vs. word addressability, anyway?

if (b), then maybe function call speed is the culprit rather than dearth of the specialized instruction.

at any rate, for cray unix, buffer copy ops in the kernel were vastly improved when re-written for words instead of bytes, even more so when vectorized (the only place in the kernel with vectorization, i think).

of course, table lookup using only 2^16 locations would be a joke software solution for super-intensive null-char-in-16-bit-smallword compare code. drastic, but saves a test the amd chip appears worried about. personally, i'm a fan of branch folding ...

ames!jaw
mash@mips.UUCP (03/27/87)
In article <15304@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>...
>for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough
>guess). But if it improves the running time of all the string-oriented
>system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
>worth it. And the implementation cost was so small. Also, there are

This is clearly true.

>some instructions that must be present just to administer the system,
>like return from interrupt, move-from-special-register, etc. These
>are not generated by a compiler either. Just to reiterate a point: RISC
>is not reducing the instruction set but is improving performance.

Absolutely! There are almost always "structural" instructions that are hard to generate, but are needed.

>
>Ok, so you don't believe the above? How about "It improved dhrystone
>a hell of a lot."

Unfortunately, Dhrystone is an artificial benchmark. I couldn't resist doing a quick test [we have some pretty neat profiling tools for taking an already-compiled program & turning it into one that does procedure & statement counts, then gives you instruction cycle counts (which would be, typically, 60% of the total cycles, the other 40% being cache misses, TLB-miss overhead, memory path interference, etc.)].

OK: here's a quiz: I did some simple tests [ones we use in our standard Performance Brief suite, which includes diff, grep, yacc, and nroff], plus csh. I won't pretend this is definitive, since I just whipped it up. How much time would you guess these programs spend doing strlen, strcmp, strcpy [the high-runners]? [Just guess the instruction cycle %]. Try 0-1%, 2-5%, 6-10%, 11-20%, 21%-up.

ANSWERS:
                   strlen            strcmp            strcpy
  program          % cycs  cy/call   % cycs  cy/call   % cycs  cy/call
  diff             .03%    -         0       -         0       -
  grep             none
  yacc             .04%    -         .59%    -         0       -
  nroff            0       -         0       -         <0.1%   -
  csh              1.71%   20        1.27%   9         1.84%   21
    % of total
    func calls     3.76%             6.11%              3.75%
  Dhrystone        <.01%   19        16.94%  103        22.36%  136

Bottom line:
1) Dhrystone isn't remotely representative of some common string-pushing programs in UNIX.
2) In most of these, the total is <1% of instruction cycles, hence <0.6% of fully-degraded cycles. Maybe you can save 20% of this, or about .1%.
3) For csh: what's going on is that these routines are called very frequently, but for short strings: 3-6 characters; strcmp's obviously fail quickly [2nd or 3rd character]. I think the implication is that maybe you can get rid of 20% of the cycles, which would be a 1% instruction cycle saving, or about <0.6% fully-degraded cycle saving for csh.
4) Given all of this, maybe what you get can be grossly estimated as about .3%, maybe. [Again, this was something whipped up in half an hour, so hardly definitive].
5) Note that Dhrystone spends a huge lot of its time copying and comparing long strings. Hence, it's well worth a little extra setup time for Dhrystone to lessen the cost per loop. [In fact, we only got around to installing an assembler strcpy when we noticed how much time was spent there in Dhrystone.]

Thus, I'd say it's still an open issue [for those of you who happen to be designing computer architectures at this instant!]

Dhrystone: says it's important, but is totally unrepresentative.
csh: says it might get you .6%
others: surprise you by using str* so little
general evidence: says it might be useful, but if you follow our 1% rule [and HP used something similar, as I recall], there is as yet insufficient evidence to include it.

If I were right now designing a system, I'd be tempted to do a lot of data-gathering.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
shebanow@ji.Berkeley.EDU.UUCP (03/27/87)
Every time I trace a program which uses stdio to any great degree, I find printf to be the worst offender (with regard to time consumed). In fact, I have seen programs which spend 30% of their time in printf (actually _doprnt). In the code I have optimized, I usually try to eliminate the use of printf (which can make the code rather opaque). This does result in 10-20% improvements in speed. The most obvious use of new string processing instructions would be to improve the performance of printf. Has anyone done this? Mike Shebanow
cmcmanis@sun.UUCP (03/27/87)
In article <15304@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
>
> Ok, so you don't believe the above? How about "It improved dhrystone
> a hell of a lot."
>
> bcase

This I believe; I was wondering when someone would point that out. I think it is time for some *real* optimizing compilers that can detect when they are compiling a benchmark and optimize for the appropriate instruction! :-) :-) :-)

--
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis   ARPAnet: cmcmanis@sun.com
These opinions are my own and no one else's, but you knew that didn't you.
metzger@heathcliff.columbia.edu.UUCP (03/27/87)
In article <18036@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
>Every time I trace a program which uses stdio to any great degree, I find
>printf to be the worst offender (with regard to time consumed). In fact,
>I have seen programs which spend 30% of their time in printf (actually
>_doprnt). In the code I have optimized, I usually try to eliminate the
>use of printf (which can make the code rather opaque). This does result
>in 10-20% improvements in speed. The most obvious use of new string
>processing instructions would be to improve the performance of printf.
>Has anyone done this?

Here at Project Synthesis we found that out a while back. It is pretty strange to think of it that way, but yes, printf and friends are compute bound, not I/O bound. We have experimented with a number of solutions, including a hardware printf coprocessor for the 68020. Predictably, it improved performance quite a bit. We also have tried using specially coded versions of printf in assembly language that we passed through a (very time-expensive) exhaustive search optimiser, and the results were also pretty good.

Perry Metzger
bcase@amdcad.UUCP (03/28/87)
In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
>	(a) significantly slow byte addressing to begin with (ala cray)?
>	(b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressability, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate, for cray unix, buffer copy ops in the kernel were vastly
>improved when re-written for words instead of bytes, even more so when
>vectorized (the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code. drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...
>

Re: (a) above. NO. Implementing a byte-addressable (especially writable) memory slows down all memory accesses for a slight improvement in byte processing efficiency. For my own part, I can say that from the beginning of the Am29000 project, I was firmly against anything but a word-oriented interface. You have to realize that byte accessing requires an alignment network somewhere: it will add some nanoseconds to all memory accesses; you can put the alignment network in its own pipeline stage, but even then, it will *always* slow down every memory access, and there is nothing you can do about it! (The same reasoning leads to our addressing-mode-less load and store instructions: if addressing modes are there, then the instructions always pay the cost even when the compiler knows better!) Thus, this is not strictly a RISC issue, but a maximum-throughput issue regardless of RISCness. (Note that the Am29000 extract-byte and insert-byte instructions essentially put the alignment network in a pipeline stage, but the compiler can decide to pay or not pay the penalty for a given memory access, depending upon whether it is a byte access or not.)

Re: (b) above. NO. We do not do inlining of the strcmp(), strcpy(), or strlen() routines (I wish we could, it would be even better!). The Am29000 has one of the fastest calling conventions around. The performance improvement we saw with the special instruction is an order of magnitude (ok, I'm taking a guess here, but it is probably pretty close) greater than what would have been gained by in-lining in this case.

Re: Branch folding. I like all optimizations. Give me more MIPS! More. MORE. MORE!

    bcase
---------------
Back again by popular demand: Sung to the tune of "Timewarp:"

	It's just a shift to the left,
	And then a shift to the ri-i-ight!
	Get a load of those MIPS
	And the code's real ti-i-ght!
	This lock-step pipeline
	Really drives me insa-a-a-e-a-ane
	Let's do the I-fetch again!
bcase@amdcad.UUCP (03/28/87)
In article <5969@mimsy.UUCP> mark@mimsy.UUCP (Mark Weiser) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>>...How likely is it that a compiler [can] detect the case when it can use
>> an instruction like this and generate code automatically to use it?
>
>This is of course a good point, and it seems that this is going to be
>one of those hard cases for compiler writers. But it opens the issue
>of subroutine libraries as providing an environment which also needs
>support in the instruction set. The Unix string routines are a good
>example, because they are even now often recoded in assembly for efficiency.
>Various pieces of the Unix kernel frequently suffer the same fate
>(e.g. tty buffers). But once one is into instructions which are good
>for certain operating systems or environments, rather than certain languages,
>one has narrowed one's market. For instance, the Am29000 special string
>compare is useless if strings are encoded as a length followed by the chars.

Yes, you are right: when the length is known, things are *much* easier. But the instruction is still a useful primitive: it can be used to find out, in one cycle, whether or not a 32-bit word contains a byte of interest. The compiler may not generate it, but it can be used in hand-coded inner loops for this purpose.

    bcase
clif@intelca.UUCP (03/28/87)
> In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
> >conditions?
	[ Discussion of best history of string instructions ]
> The fact of the matter is: a C compiler is going to have a real hard time
> generating this instruction. A Pascal compiler might have a slightly
> easier time since the concept of "string" is a little more integrated with
> the language. In general, a compiler for a language without built-in types
> and functions for dealing with strings will have a hard time with our
> compare-bytes instruction. We realized this when we added it to the
> architecture (and it was truly a last-minute addition). However, it has
> such a significant impact on C programs in general that we were not deterred
> by the realization that this instruction will probably occur only a few
> times in the strcpy(), strcmp(), and strlen() library functions. Out of
> all the megabytes of code in a UNIX system, this instruction will account
> for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a real rough
> guess). But if it improves the running time of all the string-oriented
> system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
> worth it. And the implementation cost was so small. Also, there are
> some instructions that must be present just to administer the system,
> like return from interrupt, move-from-special-register, etc. These
> are not generated by a compiler either. Just to reiterate a point: RISC
> is not reducing the instruction set but is improving performance.
>
> Ok, so you don't believe the above? How about "It improved dhrystone
> a hell of a lot."
>
> bcase

I think that it makes sense to add instructions like this even if they only get put into the standard libraries; presumably they will prove to be a modest win for most applications written in C. I vaguely remember reading that calls to string routines represented about .5% of all C code.

As for Brian's second statement, about Dhrystone being helped by string instructions: he is certainly right about that.

--
Clif Purkiser, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif

These views are my own property. However anyone who wants them can have them for a nominal fee.
rpw3@amdcad.UUCP (03/28/87)
Well, John Mashey's numbers were certainly counter-intuitive to ME, given that those programs bang away at characters all the time, but we should look at one more thing:

When were nroff/grep/yacc written compared to when str* was standardized in the library? After all, the str* libraries didn't spring full-blown on day one with those exact semantics and calling sequences (I'm guessing). I'll also guess (could easily be wrong) that there is a lot more string stuff going on than you see in the str* statistics, and that it's hand coded "in line" (in C, of course).

Counter point: What do similar NEW programs look like? Take some code written by someone (or a team) that started with the "modern" libraries. Run the statistics on that, and see...

Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l" beat on those really hard (at least in System-V), since way down at the bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd entries (go look at "getpwent" & friends).

I know that I sometimes avoid str* because I'm afraid it might be slow, particularly when I know that keeping a count (a.k.a. length) can avoid a lot of searching for the end. The solution is usually a high-level change in the application's basic algorithms so that so much copying & searching is simply avoided. A simple example is when you are building various strings up out of pieces (as in "speeding up" shell scripts by recoding in C). Even USING the libraries, sometimes instead of:

	strcpy(buf, "piece");
	strcat(buf, "another piece");
	strcat(buf, "still another piece");

I find myself doing:

	char *p = &buf[0];
	int n;

	strcpy(p, "piece");			p += strlen(p);
	strcpy(p, "another piece");		p += strlen(p);
	sprintf(p, "format%d%s", args...);	p += strlen(p);
	strcpy(p, "still another piece");	p += strlen(p);
	etc...

Then when you're done, you can assert that strlen(buf) == (p - buf);

Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA 94403
guy@gorodish.UUCP (03/28/87)
>>The most obvious use of new string processing instructions would be to
>>improve the performance of printf. Has anyone done this?

The version of "_doprnt" that comes with 4BSD and comes in the VAX System III source (although whether it's the one that's in the library or not, I don't know), and may very well have come with UNIX 32V, is written in assembler. The 4.3BSD version, at least, does use the string *and* EDITPC instructions, although it doesn't use the string instructions for handling things like "%s". VAX System V doesn't use this version (and no, it's probably not because the assembly version doesn't handle all the formatting operators in the S5 version - I know it handles most of them, and I think it handles all of them), so I'm curious whether they dropped it because it didn't buy much.

>We have experimented with a number of solutions, including a hardware
>printf coprocessor for the 68020. Predictably, it improved performance
>quite a bit. We also have tried using specially coded versions of printf
>in assembly language that we passed through a (very time-expensive)
>exhaustive search optimiser and the results were also pretty good.

How much did the coprocessor cost, and how much better was it than the assembly-language version?
mash@mips.UUCP (03/29/87)
In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>Well, John Mashey's numbers were certainly counter-intuitive to ME, given
>that those programs bang away at characters all the time, but we should
>look at one more thing:
>
>When were nroff/grep/yacc written compared to when str* was standardized
>in the library? After all, the str* libraries didn't spring full-blown on
>day one with those exact semantics and calling sequences (I'm guessing).
>I'll also guess (could easily be wrong) that there is a lot more string
>stuff going on than you see in the str* statistics, and that it's hand
>coded "in line" (in C, of course).

Yes, those were old programs. I'd be happy to hear of suggestions for specific programs to check. It's absolutely sure that there was plenty of string stuff going on, coded in C. [Of course, some of the string stuff is algorithmic work that hardware isn't likely to support for a while, i.e., regular-expression matching.]

>
>Finally, don't forget str{,r}chr (a.k.a. {,r}index). Things like "ls -l"
>beat on those really hard (at least in System-V), since way down at the
>bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
>entries (go look at "getpwent" & friends).

I generally looked for str* routines; the others were generally in the noise. Following are the stats for SYS V: ls -l /usr

Profile listing generated Sat Mar 28 14:19:29 1987 with:
----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                  *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *
----------------------------------------------------------------------------
1139801 cycles

    cycles  %cycles   cum %   cycles  bytes  procedure (file)
                               /call  /line
    511480    44.87   44.87     6558     22  getname (ls.c)
    310413    27.23   72.11       16     40  fgetc (../fgetc.c)
     99006     8.69   80.79       43     28  _flsbuf (../flsbuf.c)
     79820     7.00   87.80      340     17  _doprnt (../doprnt.c)
     18830     1.65   89.45       54     23  select (ls.c)
     15073     1.32   90.77      321     19  gmtime (../ctime.c)
      ....
      1655     0.15   98.79       10      5  strcmp (../strcmp.s)
      ....
      1338     0.12   99.18       18      5  strlen (../strlen.s)
      ....
        63     0.01   99.96        8     10  strchr (../strchr.c)

[Surprise! str* stuff isn't on the map in this one. Just goes to show that even experienced [and Rob is] folks' intuition needs checking. Well, this surprised me also.]

This whole string-processing discussion has been quite worthwhile, if only to remind us that we often discuss pretty esoteric things in this newsgroup, while at the same time, we sometimes don't have the foggiest ideas on even basic program behavior!!

--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rbutterworth@watmath.UUCP (03/29/87)
In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> I find myself doing:
>	char *p = &buf[0];
>	int n;
>	strcpy(p, "piece");			p += strlen(p);
>	strcpy(p, "another piece");		p += strlen(p);
>	sprintf(p, "format%d%s", args...);	p += strlen(p);
>	strcpy(p, "still another piece");	p += strlen(p);
>	etc...

Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc. had been designed to return a pointer to the END of the string (or a count of the number of characters in the string)? They already have this information for free, but instead they return their first argument, something that the caller already knows.
earl@mips.UUCP (03/29/87)
Actually the MIPS architecture does contain string processing instructions. They go by the names ADDU and XOR. For example, the inner loop of my strlen routine, which searches for a word with a zero byte, is

	l1:	lw	t2,1(a1)
		addu	a1,4
		addu	t3,t2,t0
		xor	t3,t2
		and	t3,t1
		beq	t3,t1,l1

Quiz: what are the magic constants that need to be loaded into t0 and t1 to make this work?

This is 7 cycles per word, or 1.75 cycles per byte.

-Earl
gemini@homxb.UUCP (03/30/87)
Options to "ls" have a great affect upon "str*" usages. The following results were obtained on an Amdahl 580 running uts 5.2.5: ./ls /bin /usr/bin [listed 64 times] %Time Seconds Cumsecs #Calls msec/call Name 22.1 10.72 10.72 80190 0.1336 strcmp 3.3 1.60 40.25 21577 0.0742 memcpy 0.7 0.33 47.65 10688 0.0312 strlen 0.3 0.13 47.97 strchr 0.1 0.05 48.40 memchr 0.1 0.03 48.53 strcpy ./ls -CF /bin /usr/bin [listed 64 times] %Time Seconds Cumsecs #Calls msec/call Name 7.9 10.67 100.55 80191 0.1330 strcmp 1.2 1.58 128.38 32234 0.0491 memcpy 0.3 0.42 133.08 21440 0.0194 strlen 0.1 0.12 134.05 2 58. strchr 0.1 0.08 134.23 memchr 0.0 0.02 134.50 strcpy ./ls -l /bin /usr/bin [listed 8 times] %Time Seconds Cumsecs #Calls msec/call Name 0.4 1.42 338.83 10023 0.1413 strcmp 0.3 1.18 341.32 18761 0.0631 memcpy 0.0 0.15 345.45 1336 0.112 strlen 0.0 0.10 345.92 memchr 0.0 0.08 346.18 strcpy 0.0 0.03 346.35 1 33. strchr Note: on the Amdahl, strcmp and friends are in assembler. Rick Richardson, PC Research, Inc: (201) 922-1134 ..!ihnp4!castor!pcrat!rick when at AT&T-CPL: (201) 834-1378 ..!ihnp4!castor!pollux!rer
karl@haddock.UUCP (03/31/87)
In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>[Sometimes instead of using strcat] I find myself doing:
>	char *p = &buf[0];
>	strcpy(p, "piece");			p += strlen(p);
>	strcpy(p, "another piece");		p += strlen(p);
>	sprintf(p, "format%d%s", args...);	p += strlen(p);
>	strcpy(p, "still another piece");	p += strlen(p);
>	etc...

Me too. I wish strcpy et al. would return a pointer to the *end* of the copied string (fgets too, as someone else brought up). In fact, one could argue that the low-level routine should not even bother to write the final '\0'; then your example would become

	*_strcpy(_sprintf(_strcpy(_strcpy(p, "piece"),
	    "another piece"), "format%d%s", args...),
	    "still another piece") = '\0';

(Which can still be written using temporaries if the code gets too opaque!)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
tim@amdcad.UUCP (03/31/87)
In article <985@mips.UUCP>, earl@mips.UUCP (Earl Killian) writes:
> Quiz: what are the magic constants that need to be loaded into t0 and
> t1 to make this work?

Let's see... How about

	t0 = 0x7efefeff
	t1 = 0x81010100

Is that right? If so, it appears to work only for 7-bit characters (but we admit that such a restriction is reasonable in real life)...

(This is a fun problem but probably a bit too hard for use in interviews.)

	Mike Johnson
	Brian Case
	Tim Olson
henry@utzoo.UUCP (Henry Spencer) (03/31/87)
> ... I think it is time for some *real* optimizing compilers that can detect > when they are compiling a benchmark and optimize for the appropriate > instruction! :-) :-) :-) Don't laugh! It's not unknown for an optimizing-compiler group to put extra attention into making common benchmarks run fast. I'm told that there is at least one compiler that has a specialized loop-unrolling optimization whose major purpose is to quietly turn the "looping" version of Linpack into the "unrolled" version. -- "We must choose: the stars or Henry Spencer @ U of Toronto Zoology the dust. Which shall it be?" {allegra,ihnp4,decvax,pyramid}!utzoo!henry
rpw3@amdcad.UUCP (03/31/87)
In article <238@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Following are the stats for SYS V: ls -l /usr
>
>    cycles  %cycles   cum %   cycles  bytes  procedure (file)
>                               /call  /line
>    511480    44.87   44.87     6558     22  getname (ls.c)
>    310413    27.23   72.11       16     40  fgetc (../fgetc.c)
>     99006     8.69   80.79       43     28  _flsbuf (../flsbuf.c)
>     79820     7.00   87.80      340     17  _doprnt (../doprnt.c)
>       ...      ...     ...      ...    ...  ...

Hey, John, wanna know something really interesting? Go look at 4.2 BSD "ls", at routine getname(), and you'll discover it's full of getpwent()'s (plus {set,end}pwent(), etc.) and strncpy()'s (hidden in the "SCPYN" macro). And of course getpwent() uses the str* routines, though it also has its own pwskip() instead of index(). (Have you got a 4.{2,3} BSD "ls" that you could run your profiler on?)

Then look at the Sys V "ls", and lo & behold, all of those neat library routines have been unfolded into the getname() routine, and hand-coded into a most glorious mess, which calls only (guess)... fgetc()!!! (No wonder my jaw dropped when I saw your numbers!) Now other parts of Sys V use {set,get,end}pwent(), so I guess somebody decided that "ls" was running too slow (or had an "intuition"?) and ripped it apart. Given the looks of the code, I'm surprised that even fgetc() survived.

This reminds me of Gerry Weinberg's "Railroad Paradox": "We don't need fast strlib routines, because no one ever uses them." Yeah... 'cause they're not fast, everyone codes around them!

We (me, too, I just realized I got suckered in also) must be *VERY* careful to look inside any major piece of code we are going to try and use to prove something or other, in order to avoid simply confirming the {good,bad,weird} practices of the past. I am now completely unsure whether we can get ANY reasonable statistics on what the proper usage (percent, etc.) of "str*" should/could be in NEW code, if we base it on code that has been hacked on as long as the various Unices.

Anybody want to cough up a "Stryngstone" benchmark for us to throw rocks at?

Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA 94403
ricker@bunker.UUCP (03/31/87)
>>My question is this: How likely is it that a compiler itself will be able to
>>detect the case when it can use an instruction like this and generate code
>>automatically to use it....

The Wang Labs VS processor also has some very useful assembler (not machine) instructions for string processing. The manner in which this is handled is by adding an attribute to functions called "builtin". This enables the C (or other language) compiler to generate in-line code for functions that are in the instruction set. Example:

	builtin char *memcpy();

	main()
	{
		struct stbuf {
			char buf[8];
			int i;
		} structure1;
		struct stbuf structure2;

		/** initialize structure1 with code here **/

		/** This function generates a single VS assembler instruction **/
		memcpy(&structure2, &structure1, sizeof(structure1));
	}

The builtins are clearly listed in the reference manual and include all the standard System 5 string functions. If one does not wish to use builtins, then the function is declared explicitly without the builtin attribute. I am not sure, but I think I recall that if the function is not declared explicitly and the function is a builtin, then it is treated as a builtin.

Portability is not a problem with this scheme.

Your basic ricker.
earl@mips.UUCP (04/01/87)
In article <15331@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> > Quiz: what are the magic constants that need to be loaded into t0 and
> > t1 to make this work?
> Let's see... How about
>	t0 = 0x7efefeff
>	t1 = 0x81010100
> Is that right? If so, it appears to work only for 7-bit characters (but
> we admit that such a restriction is reasonable in real life)...

Yes, you've got it. Congratulations. (Gee, do I owe you lunch or something now?)

Actually, it works for 8-bit characters in bytes 0, 1, and 2 and a 7-bit one in byte 3 (I'm using little-endian byte numbers here). Thus it may accept '\200' as a zero in byte 3. But this is easily handled in the follow-on code that has to detect which byte is nonzero by actually testing byte 3 and branching back to the loop if it is nonzero. Actually, here is that code. Note how after the word loop the carries also conveniently let you detect the byte. This part only works for little-endian byte ordering. (Note: my personal convention is to indent instructions in branch delay slots one space):

	## search for word with zero byte
	l1:	lw	t2,1(a1)	# read next word
		addu	a1,4
		addu	t3,t2,t0	# cause carry out of any nonzero byte
		xor	t3,t2		# xor to find carry bits
		and	t3,t1		# look at carries out of bytes
		beq	t3,t1,l1	# all bytes carried => none zero

	## found word with zero byte, now find the byte
		 sll	t4,t3,23	# carry out of byte 0?
	l3:	beq	t4,0,x0		# if not, go compute length
		 sll	t4,t3,15	# carry out of byte 1?
		bge	t4,0,x1		# if not, go compute length
		 sll	t4,t3,7		# carry out of byte 2?
		bge	t4,0,x2		# if not, go compute length
		 srl	t4,t2,24	# byte 3 zero?
		bne	t4,0,l1		# no, re-enter loop
john@frog.UUCP (04/01/87)
> Well, John Mashey's numbers were certainly counter-intuitive to ME...

Really sophisticated programs spend most of their time doing really sophisticated things; simple grunt routines like str*() aren't universally applicable.

> When were nroff/grep/yacc written compared to when str* was standardized
> in the library? After all, the str* libraries didn't spring full-blown on
> day one...

I don't know (offhand) about these, but I recall once being curious about why lex was slow. It turned out to use subroutines for isalpha(), isspace(), isdigit(), etc., instead of the (presumably newer) ctype.h macros. HOWEVER, even after making that change to a version of lex, it didn't help all that much, because lex has so much else to do.

> Even USING the libraries, sometimes instead of:
>	strcpy(buf, "piece");
>	strcat(buf, "another piece");
>	strcat(buf, "still another piece");
> I find myself doing:
>	char *p = &buf[0];
>	int n;
>	strcpy(p, "piece");			p += strlen(p);
>	strcpy(p, "another piece");		p += strlen(p);
>	sprintf(p, "format%d%s", args...);	p += strlen(p);
>	strcpy(p, "still another piece");	p += strlen(p);
>	etc...
>

WHOOPS! That's about as much work (first you copy the string, then you step over it) as the use of strcat(). If you don't mind being non-portable to oddball architectures, you can write a routine like "strcatl()", which takes a destination buffer as a first argument, and a NULL-terminated list of strings to concatenate into that buffer (or you can be slightly portable and do it with <varargs.h>). The easiest way might be, however, to use:

	char *strcpya(p1, p2)	/* strcat and return end address */
	char *p1, *p2;
	{
		while (*p1++ = *p2++)
			/**/;
		return p1-1;
	}

and then

	strcpya(strcpya(strcpya(buf, "first piece"), " second piece"),
		" and the last");

or if you prefer

	char *p, buf[127 /* a prime number */];

	p = strcpya(buf, "first piece");
	p = strcpya(p, " second piece");
	p = strcpya(p, " third piece");

Either way, you only touch each byte once. Sometimes, the question you need to ask yourself about library routines is not "Could they be better coded with this screwball assembly instruction?" but rather "Does it *really* do what I want?".

--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"Happiness is the planet Earth in your rear-view mirror."
	- Sam Hurt, in "Eyebeam, Therefore I Am"
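As a sketch of the strcatl() idea mentioned above (illustrative only; written against the ANSI-draft <stdarg.h> interface rather than <varargs.h>, and with no length checking):

	#include <stdarg.h>

	/* Concatenate a NULL-terminated list of strings into buf,
	 * touching each source byte only once. */
	char *strcatl(char *buf, ...)
	{
		va_list ap;
		char *d = buf;
		char *s;

		va_start(ap, buf);
		while ((s = va_arg(ap, char *)) != (char *)0) {
			while (*s != '\0')
				*d++ = *s++;
		}
		*d = '\0';
		va_end(ap);
		return buf;
	}

which would be called as

	strcatl(buf, "first piece", " second piece", " and the last", (char *)0);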
meissner@dg_rtp.UUCP (04/02/87)
In article <6448@watmath.UUCP> rbutterworth@watmath.UUCP writes:
> Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
> had been designed to return a pointer to the END of the string,
> (or a count of the number of characters in the string)?

The System V.3 (and ANSI) *{scanf, printf} routines have a format specifier %n which stores the number of bytes {read, written} so far into the int pointed to by the current argument. Also, in System V.xx and ANSI, sprintf returns the total number of bytes written (not the pointer to the string like 4.2 BSD does).

--
Michael Meissner, Data General		Uucp: ...mcnc!rti-sel!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?
stubbs@ncr-sd.UUCP (04/04/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>
>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.

In this case, these instructions would be used in the hand-coded strcpy and strcmp C library routines, so compilers need not be concerned. My guess is they are still a win, because these libraries are used a bunch.

Jan Stubbs	....sdcsvax!ncr-sd!stubbs
619 485-3052
NCR Corporation
Advanced Development
16550 W. Bernardo Drive MS4010
San Diego, CA. 92127
pec@necntc.UUCP (04/06/87)
How likely is it that a compiler can make use of string copy instructions? NEC's re-targeting of the portable C compiler to the V60 architecture frequently takes advantage of one of the V60's string copy instructions. It uses a string copy for structure assignments! Of course, these instructions are also used in the library routines. Another use that comes to mind is when the loader zeros out a BSS area...
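For instance (a hypothetical illustration, not NEC's code), a structure assignment is exactly the kind of fixed-length block copy a compiler can lower to a string-copy instruction:

	struct record { char name[32]; int count; };

	void update(struct record *dst, struct record *src)
	{
		*dst = *src;	/* whole-struct assignment: the compiler may emit a
				   block/string copy of sizeof(struct record) bytes */
	}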
paulsc@orca.UUCP (04/06/87)
In article <7843@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... I think it is time for some *real* optimizing compilers that can detect
>> when they are compiling a benchmark and optimize for the appropriate
>> instruction! :-) :-) :-)
>Don't laugh! It's not unknown for an optimizing-compiler group to put extra
>attention into making common benchmarks run fast. I'm told that there is at
>least one compiler that has a specialized loop-unrolling optimization whose
>major purpose is to quietly turn the "looping" version of Linpack into the
>"unrolled" version.

It isn't always a conscious effort to trick customers into buying the compiler. The compiler group might only have a handful of "test programs" for their compiler. If the compiler does any peephole optimizations, the compiler group probably compiles whatever test programs they have to see how their compiler might generate better code. If any of their test programs are benchmarks, they will be spending time making the benchmarks run fast.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
tektronix!orca!paulsc
eugene@pioneer.UUCP (04/07/87)
In article <2344@orca.TEK.COM> paulsc@orca.UUCP (Paul Scherf) writes:
>
>It isn't always a conscious effort to trick customers into
>buying the compiler. The compiler group might only have a
>handful of "test programs" for their compiler.
>
>Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA

This is part of the point of my research: to get everyone writing compilers to have a large consistent set of programs to check optimizations, just like we have programs (validation suites) to check conformance to language standards. I hope I can get the various companies interested in a set of open performance measurement tools.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
am@cl-jenny.UUCP (04/08/87)
In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>One unique feature of the Am29000 architecture is a special instruction....
> ... [ to speed up detecting null bytes in words].

You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay, even in C). (The code that follows relies on x being a 32-bit unsigned (or 2's complement int with overflow ignored)...)

	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

Then if e is an expression without side effects (e.g. a variable),

	has_nullbyte_(e)

is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).

On Acorn's ARM risc machine, using this feature as Brian Case suggests speeded up dhrystone by 20-25%, which may comment on dhrystone or risc. This gives some 5-6K dhrystones on the slow-clocked pre-production 3 micron chips. (Wonder what state-of-the-art fabrication can give....)
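A minimal sketch of how the macro can drive a word-at-a-time strlen() on a 32-bit machine (not Acorn's code; the macro argument is parenthesized here for safety, the string is assumed to start on a word boundary, and reading a little past the terminator is assumed to be safe, a point taken up later in the thread):

	#define HASNULL(x)  (((x) - 0x01010101) & ~(x) & 0x80808080)

	unsigned long wstrlen(char *s)		/* s assumed word-aligned */
	{
		unsigned long *w = (unsigned long *)s;
		char *p;

		while (!HASNULL(*w))		/* skip whole words with no zero byte */
			w++;
		p = (char *)w;			/* the terminator is in this word */
		while (*p != '\0')
			p++;
		return p - s;
	}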
karl@haddock.UUCP (04/11/87)
In article <486@danews.ATT.COM> lvc@danews.ATT.COM (Larry Cipriani) writes:
>In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
>> I find myself doing:
>>	char *p = &buf[0];
>>	int n;
>>	strcpy(p, "piece");			p += strlen(p);
>>	strcpy(p, "another piece");		p += strlen(p);
>>	sprintf(p, "format%d%s", args...);	p += strlen(p);
>>	strcpy(p, "still another piece");	p += strlen(p);
>>	etc...
>
>It seems to me that the code would be far more readable and faster
>if it were written as: p = cat(p1, p2, ..., (char*)0);

Or, to better mimic the original semantics, p = cat(buf, p1, ...) with no malloc involved. (Yeah, the malloc version is useful too; I've got one of my own.) However, this is only useful if each operation is a strcpy or strcat; it doesn't work if there are intermixed one-char stores (*p++ = c) or more complex stuff (the embedded sprintf in the original example). Of course, one could always write the whole thing with sprintf, using "%s%s%s" format instead of strcat(strcat(strcpy(...))), but sometimes it's better to use the low-level stuff.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
stevew@karau.UUCP (04/11/87)
In article <486@danews.ATT.COM>, lvc@danews.ATT.COM (Larry Cipriani) writes:
> In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> > I find myself doing:
> >	char *p = &buf[0];
> >	int n;
> >	strcpy(p, "piece");		p += strlen(p);
> >	strcpy(p, "another piece");	p += strlen(p);
> >	etc...
> It seems to me that the code would be far more readable and faster
> if it were written as:
>	p = cat(p1, p2, ..., (char*)0);
> where all the p's are pointers to characters. Cat will malloc
> enough space to store a concatenation of all the pn's and a '\0'.

Another possible solution would be:

	cat(p, n, p1, p2, ..., (char*)0);

where p == destination string, n == amount of space available, p1 == the first pointer, etc... You might have two versions, one with n, one without.

The debate is now what it should return:
	1. a pointer to the initial string.
	2. the number of characters cat()'ed.
	3. a pointer to the terminating null.
jack@mcvax.UUCP (04/15/87)
In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
>	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>	has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

I was so impressed by this new trick (well, to *me* it is new:-) that I immediately decided to try it. On my Whitechapel MG-1, a 32016-based machine, the results were impressive. I coded strcpy() using this method, and the results were great. Break-even with normal strcpy() at 4-char strings, performance slightly worse with 5/6/7-char strings, and getting better and better from there on. For strings with length 4N (N>=4) performance was twice that of the old strcpy(). This is the routine:

	#define hasnull(x) ((x-0x01010101) & ~(x) & 0x80808080)

	strcpy(at,f)
	long *at;
	register long *f;
	{
	    register long d;
	    register long *t = at;
	    register char *fc, *tc;

	    do {
		d = *f++;
		if( !hasnull(d) ) {
		    *t++ = d;
		    continue;
		}
		tc = (char *)t;
		fc = (char *)(f-1);
		while( *tc++ = *fc++ );
		return(at);
	    } while(1);
	}

Coding in assembler caused a 30% decrease in time for small (10-char) strings (less registers to save, t/tc and f/fc in the same reg, etc). Something I haven't explained yet is that unaligned strings give the *same* performance. Maybe the extra fetches are noise wrt the number of instruction fetches?

Note that the 32016 is a 32-bit machine with a 16-bit bus, so that is probably why I found twice the speed, instead of four times.

Anyway, the next thing I thought of is "Wow! This is *great* for strcmp() on big-endians. Comparing 4 bytes in one go through the loop!". But, of course, I don't have a big-endian handy. Anyone care to try this?

--
Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
The shell is my oyster.
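An untested sketch of the strcmp() idea asked about above (illustrative only; both strings assumed word-aligned, and reading the word containing the terminator assumed safe). Words are compared four bytes at a time as long as they are equal and contain no null byte; the first word that differs or holds the terminator is re-examined a byte at a time, which also sidesteps the byte-order question:

	#define hasnull(x) (((x) - 0x01010101) & ~(x) & 0x80808080)

	int wstrcmp(char *s1, char *s2)
	{
		unsigned long *w1 = (unsigned long *)s1;
		unsigned long *w2 = (unsigned long *)s2;

		while (*w1 == *w2 && !hasnull(*w1))	/* equal, no terminator: skip 4 bytes */
			w1++, w2++;

		s1 = (char *)w1;			/* difference or '\0' is in this word */
		s2 = (char *)w2;
		while (*s1 == *s2 && *s1 != '\0')
			s1++, s2++;
		return (unsigned char)*s1 - (unsigned char)*s2;
	}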
greg@utcsri.UUCP (04/17/87)
In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
>	has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

If one of the bytes contains 0x80, then has_nullbyte() will say 'yes'. This can be circumvented by a more thorough test after this one to see if there really is a null there. Jack Jansen's subroutine does this apparently by accident; there is a 'while( *tc++ = *fc++ );' after the test finds a 'null', so the only effect of having a 0x80 byte will be to revert to the normal strcpy for the rest of the string.

Someone posted a similar but usually better test on comp.arch ( I think ) a little while ago:

	#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100

This one is only 'fooled' by an 0x80 in the most significant byte, which makes the following test much simpler ( a sign test ). In either case ( especially in this one with two identical constants ) it helps a lot if the constants are in registers while the loop is running. When using C, you can do this explicitly:

	register k1=0x7efefeff, k2=0x81010100;

--
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
lcc.rich-wiz@CS.UCLA.EDU (Richard Mathews) (04/20/87)
> In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
> >You might be interested to know that such detection of null bytes in words
> >can be done in 3 or 4 instructions on almost any hardware (nay even in C).
> >(Code that follows relies on x being a 32 bit unsigned (or 2's complement
> >int with overflow ignored)...)
> >	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
> >Then if e is an expression without side effects (e.g. variable)
> >	has_nullbyte_(e)
> >is nonzero iff the value of e has a null byte.
> I was so impressed by this new trick (well, to *me* it is new:-)
> that I immediately decided to try it. On my Whitechapel MG-1,
> a 32016-based machine, the results were impressive.

Be careful if you use this. It does not work correctly for strings near the end of memory. Consider an (unaligned) string which is 21 bytes long and is right at the end of data space (or any data segment on machines which have such nonsense). The posted code will fault on the last read. On machines where data space is always a multiple of 4 bytes long, this method will work if you first copy enough bytes to get the source pointer to a longword boundary and then start copying 4 bytes at a time.

Someone else claimed that the has_nullbyte macro above does not work correctly if its argument contains a byte equal to 0x80. I do not see this. Consider x == 0x17801717:

	x - 0x01010101   == 0x167f1616
	~x               == 0xe87fe8e8
	(x - ONES) & ~x  == 0x007f0000
	has_nullbyte(x)  == 0x00000000

The claim that has_nullbyte(x) will be TRUE appears to be incorrect. (I was able to prove to myself that the macro always gives the correct answer, but I won't bother to write it up and post it.)

Richard M. Mathews
Locus Computing Corporation
lcc.richard@LOCUS.UCLA.EDU
lcc.richard@UCLA-CS
{ihnp4,trwrb}!lcc!richard
{randvax,sdcrdcf,ucbvax,trwspp}!ucla-cs!lcc!richard
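A sketch of the fix described above (not Locus code): copy a byte at a time until the source pointer reaches a word boundary, and only then switch to word reads, so the word loop never touches a word outside an accessible region whose size is a multiple of four bytes. Word stores are only used when the destination happens to be aligned as well; all names here are illustrative.

	#define hasnull(x) (((x) - 0x01010101) & ~(x) & 0x80808080)

	char *safe_strcpy(char *dst, char *src)
	{
		char *d = dst;
		unsigned long w, *ws, *wd;

		/* byte copy until src is word-aligned (or the string ends) */
		while (((unsigned long)src & 3) != 0)
			if ((*d++ = *src++) == '\0')
				return dst;

		if (((unsigned long)d & 3) == 0) {	/* dst aligned too: go by words */
			ws = (unsigned long *)src;
			wd = (unsigned long *)d;
			for (;;) {
				w = *ws;
				if (hasnull(w))		/* terminator is in this word */
					break;
				*wd++ = w;
				ws++;
			}
			src = (char *)ws;
			d = (char *)wd;
		}

		while ((*d++ = *src++) != '\0')		/* finish a byte at a time */
			;
		return dst;
	}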
bjorn@alberta.UUCP (04/22/87)
In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>>	has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

You're mistaken; the "has_nullbyte_(x)" defined above works as advertised for all 32-bit x. That is to say, it returns a nonzero result if and only if x contains a null byte.

>Someone posted a similar but usually better test on comp.arch ( I think )
>a little while ago:
>
>	#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>
>This one is only 'fooled' by an 0x80 in the most significant byte,
>which makes the following test much simpler ( a sign test ).

You are right, this one does not always tell the truth. Besides, it's a lot more effort.

However, for tricks like these to work reliably (especially on systems with memory protection) you had best get help from your local linker and 'malloc'. Otherwise one day a program is going to read off the edge of an accessible hunk of memory and flush your program straight down the drain. Making sure that no data item ends closer than n-1 bytes (if you're reading n bytes at a time) from a boundary (between accessible and inaccessible memory) fixes this.

		Bjorn R. Bjornsson
		{mnetor,ihnp4}!alberta!bjorn
greg@utcsri.UUCP (04/23/87)
In article <299@pembina.UUCP> bjorn@alberta.UUCP (Bjorn R. Bjornsson) writes:
>In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>>	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>>Then if e is an expression without side effects (e.g. variable)
>>>	has_nullbyte_(e)
>>>is nonzero iff the value of e has a null byte.
>>
>>If one of the bytes contains 0x80, then has_nullbyte() will	[BUNK]
>>say 'yes'. This can be circumvented by a more thorough test	[BUNK]
>>after this one to see if there really is a null there.	[BUNK]
>
>You're mistaken; the "has_nullbyte_(x)" defined above works
>as advertised for all 32-bit x. That is to say, it returns
>a nonzero result if and only if x contains a null byte.

Right. I apologize for any inconvenience I may have caused anybody. I also apologize for posting bunk in a sleepy haze and then vanishing for a week.

>
>>Someone posted a similar but usually better test on comp.arch ( I think )
>>a little while ago:
>>
>>	#define has_nullbyte(e) (((e+0x7efefeff)^e)&0x81010100) != 0x81010100
>>
>>This one is only 'fooled' by an 0x80 in the most significant byte,
>>which makes the following test much simpler ( a sign test ).
>
>You are right, this one does not always tell the truth. Besides,
>it's a lot more effort.

It seems to be about the same, in the loop. Below is the subtract version, and the add version, on an NS32k:

	movd	0(r0),r1
	subd	r2,r1
	bicd	0(r0),r1
	andd	r3,r1		;r2 & r3 contain constants
	cmpqd	$0,r1		;need this on ns32k

	movd	0(r0),r1
	addd	r2,r1
	xord	0(r0),r1
	andd	r3,r1
	cmpd	r3,r1

I can't remember why I thought it was more efficient. I guess it is if you haven't got 'bic'. ( and if 'and' doesn't set the Z flag :-) The original poster of this method was describing a RISC which had a 'compare registers and branch if equal' op.

>
>However, for tricks like these to work reliably (especially on
>systems with memory protection) you had best get help from
>your local linker and 'malloc'. Otherwise one day a program
>is going to read off the edge of an accessible hunk of memory
>and flush your program straight down the drain. Making sure
>that no data item ends closer than n-1 bytes (if you're reading
>n bytes at a time) from a boundary (between accessible and
>inaccessible memory) fixes this.

Not generally a problem. The whole point of this is that the 32-bit accesses are constrained to 32-bit boundaries to make them fast. Furthermore, in most systems, your access-controlled segments or whatever are rounded up to a size at least as round as the machine word. (The MMU does not usually deal with address lines a0,a1 on a 32-bit machine). Thus if it is legal to read the first byte in a word, it is legal to read the whole word, and thus the last byte.

--
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
am@cl.cam.ac.uk (Alan Mycroft) (04/27/87)
In article <4605@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes: >In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes: >> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080) >>Then if e is an expression without side effects (e.g. variable) >> has_nullbyte_(e) >>is nonzero iff the value of e has a null byte. > >If one of the bytes contains 0x80, then has_nullbyte() will >say 'yes'. This can be circumvented by a more thorough test >after this one to see if there really is a null there. Untrue. It would be true if I had mistakenly used ^ instead of &~.