bcase@amdcad.UUCP (03/25/87)
There was a discussion a few months ago about processing strings more efficiently than a byte at a time. The Am29000 takes one of the possible approaches to improving string processing performance....

One unique feature of the Am29000 architecture is a special instruction. This instruction is intended to be used to speed up string processing, but my guess is that other uses will be discovered. The instruction is called "compare-bytes" and works like this: compare-bytes specifies two source register operands and one destination register operand. The four pairs of corresponding bytes of the two 32-bit source operands are compared for equality (i.e., the two most-significant bytes are compared, the two next-most-significant bytes are compared, etc.). If any of the four pairs are equal, then the destination register is set to the value "TRUE" (which on the Am29000 is a one in the most-significant bit with all other bits cleared to zero). If none of the four pairs are equal, then the destination register is set to "FALSE" (all bits cleared). (Am29000 conditional branch instructions test only the most-significant bit of a register; condition codes are not used, so we get a free "test for negative.")

So, if one of the source operands is set to all zeros (four null characters), which can be specified in the instruction by choosing the second operand as the zero-extended eight-bit constant zero, and the other operand is a word of the character string being dealt with (say for copying or comparing), the Am29000 can, in one cycle (not counting the branch), determine if the word contains the end-of-string character (according to the C language definition of string). If the word does not contain the end-of-string character, then the four bytes in the word can be manipulated (e.g., loaded or stored) as a unit. Word operations on the Am29000 are much more efficient than character operations (this is true of most machines, though).
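For concreteness, the semantics just described can be modeled in C. This is only an illustrative sketch of the behavior (on the Am29000 it is a single-cycle instruction, not a function):

```c
#include <stdint.h>

/* C model of the Am29000 "compare-bytes" instruction described above
 * (illustrative only).  Returns TRUE (most-significant bit set, all
 * other bits clear) if any of the four corresponding byte pairs of
 * a and b are equal, else FALSE (all bits clear). */
uint32_t compare_bytes(uint32_t a, uint32_t b)
{
    for (int shift = 0; shift < 32; shift += 8)
        if (((a >> shift) & 0xFFu) == ((b >> shift) & 0xFFu))
            return 0x80000000u;   /* TRUE */
    return 0;                     /* FALSE */
}

/* With the second operand zero (four null characters), this tests
 * whether a word of a C string contains the terminating NUL: */
int word_has_nul(uint32_t word)
{
    return compare_bytes(word, 0) != 0;
}
```

A conditional branch on the result then needs to test only the most-significant bit, matching the Am29000 branch convention described above.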
There are, of course, special circumstances to deal with (such as misaligned strings, and we have a funnel shifter to help in those cases), but by using the compare-bytes instruction in the library routines strcpy() and strcmp() (and strlen() too, but we haven't bothered since it seems never to be used in the programs we have encountered), significant improvements in the run time of many C programs can be realized. Another thing which really helps is to have the compiler word-align literal strings (and I have implemented this), but even with word alignment, some substrings will begin on strange boundaries and must be dealt with correctly.

My approach to using this instruction consisted of re-writing the library routines in C with function calls wherever the compare-bytes instruction should go. I compiled this C code with my compiler, changed the assembly code to eliminate the function calls in favor of the compare-bytes instruction, and assembled it into the library (actually a module of code that gets included in all final links, but that is just a detail of our simple environment).

Since most C programs (especially utilities and other systems programs) do a lot of string processing, this one instruction is really worth the small implementation cost. It often improves run times by 15% to 20% (just goes to show that the impact of processing C language strings has been long ignored). It implements just the right semantics and probably has other applications for specialized pattern matching. I just thought some of you would be interested.

	bcase
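The word-at-a-time loop described above can be sketched in C. This is a model, not the actual Am29000 library code: compare_bytes_zero stands in for the compare-bytes instruction with a zero second operand, and the string is assumed word-aligned (a misaligned head would need the byte-by-byte or funnel-shifter handling just mentioned):

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for compare-bytes with the second operand zero:
 * does the word contain a zero byte anywhere? */
static int compare_bytes_zero(uint32_t w)
{
    for (int shift = 0; shift < 32; shift += 8)
        if (((w >> shift) & 0xFFu) == 0)
            return 1;
    return 0;
}

/* Word-at-a-time strlen() sketch: skip four bytes per test until the
 * word holding the NUL is found, then locate the NUL within it.
 * Assumes s is word-aligned. */
size_t strlen_wordwise(const char *s)
{
    const uint32_t *p = (const uint32_t *)s;
    while (!compare_bytes_zero(*p))
        p++;                            /* whole word is NUL-free */
    const char *q = (const char *)p;    /* find NUL within the word */
    while (*q)
        q++;
    return (size_t)(q - s);
}
```

The same skeleton (test a word, then handle four bytes as a unit) carries over to strcpy() and strcmp().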
lamaster@pioneer.UUCP (03/26/87)
In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>There was a discussion a few months ago about processing strings more
>efficiently than a byte at a time. The Am29000 takes one of the possible
>approaches to improving string processing performance....
> :
>I just thought some of you would be interested.
>	bcase

I appreciate seeing submissions like this. Designing a computer instruction set is a little like backpacking: there is an art to figuring out exactly what you really need to take with you.

My question is this: how likely is it that a compiler itself will be able to detect the case when it can use an instruction like this and generate code automatically to use it? One of the positive points of the RISC debate is that it brought out the point that useful instructions which are hard for a compiler to generate are not always a win. The old STAR-100 had a number of instructions, including string processing instructions, which were left out of the Cyber 205 because it turned out that the compilers never generated them. They were neat instructions, though. So, my question is: is this instruction easily usable by a C, Fortran, Pascal, or Ada compiler, and if so, under what conditions?

  Hugh LaMaster, m/s 233-9,       UUCP:  {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center              ames!pioneer!lamaster
  Moffett Field, CA 94035         ARPA:  lamaster@ames-pioneer.arpa
  Phone: (415)694-6117            ARPA:  lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the author and do not represent the opinions of NASA or the U.S. Government")
baum@apple.UUCP (03/26/87)
--------
>One unique feature of the Am29000 architecture is a special instruction.
>This instruction is intended to be used to speed-up string processing,
>but my guess is that other uses will be discovered. The instruction is
>called "compare-bytes" and works like this:.....

The HP Spectrum has three instructions that are very similar. Unit XOR will Exclusive-OR two registers and let you skip if any byte or halfword is equal or not equal. Unit AddComplement[&Trap on Condition] will add the ones complement of one register to another. This is sort of subtraction without the forced carry. That way, all bytes, halfwords, and nibbles essentially have the same carry-in (although a carry out of one stage could force a carry-in to another, it turns out this doesn't affect the major uses of the instruction). Then, either a trap will occur or the next instruction will be skipped if the conditions are met. The conditions are any halfword/byte/nibble carry, any halfword/byte zero, and their complements.

Among other things, these instructions can be used for string searching a word at a time, decimal/byte/halfword range checking/validation a byte at a time, and decimal addition pre-biasing.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum  (408)973-3385
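The carry-propagation idea behind Unit AddComplement has a well-known all-software analogue that may make the mechanism clearer. A sketch in C, assuming 32-bit words (this is not HP's exact semantics, just the same trick done with ordinary arithmetic):

```c
#include <stdint.h>

/* Software analogue of the carry-based zero-byte test: subtracting
 * 0x01 from each byte borrows out of a byte's high bit only when the
 * byte was zero; the ~w mask filters out borrows that merely rippled
 * in from a lower byte. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Combined with an XOR first (the Unit XOR step), the same test finds
 * an arbitrary byte value c, since matching bytes become zero: */
int has_byte(uint32_t w, unsigned char c)
{
    uint32_t x = w ^ (0x01010101u * c);
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}
```

The hardware versions, of course, do this per-byte (or per-halfword or per-nibble) carry bookkeeping without the masking gymnastics, and fold the skip/trap decision into the same cycle.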
mark@mimsy.UUCP (03/26/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>...How likely is it that a compiler [can] detect the case when it can use
> an instruction like this and generate code automatically to use it?

This is of course a good point, and it seems that this is going to be one of those hard cases for compiler writers. But it opens the issue of subroutine libraries as providing an environment which also needs support in the instruction set. The Unix string routines are a good example, because they are even now often recoded in assembly for efficiency. Various pieces of the Unix kernel frequently suffer the same fate (e.g. tty buffers). But once one is into instructions which are good for certain operating systems or environments, rather than certain languages, one has narrowed one's market. For instance, the Am29000 special string compare is useless if strings are encoded as a length followed by the chars.

-mark

Background: I happen to think Unix-style strings are a big market, and I love things which make Unix faster (because there is a positive feedback cycle which makes more Unix yet more available and desirable).

--
Spoken: Mark Weiser 	ARPA: mark@mimsy.umd.edu 	Phone: +1-301-454-7817
After May 1, 1987: weiser@xerox.com
freeman@spar.UUCP (03/26/87)
<*munch*>
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
> My question is this: How likely is it that a compiler itself will be able to
> detect the case when it can use an instruction like this and generate code
> automatically to use it. One of the positive points to the RISC debate is
> that it brought out the point that useful instructions which are hard for a
> compiler to generate are not always a win.

Note that this particular instruction may well win even if no compiler EVER generates it: BCase's posting stated that hand-modifying the object code for two string-processing routines in the standard C library to use the instruction, and arranging that the compiler word-align strings where possible, produced frequent improvements in run time of 15-20 percent. That sounds like a substantial win for C programming, and surely a fair amount of the code that gets written for this processor will be in C. That may be reason enough for the instruction's existence.

Although the point from the RISC debate that LaMaster summarizes above is carefully qualified, it is perhaps nonetheless too narrow a view: surely there are many other cases in which a well-planned library of calls to functions coded with particular care will allow efficient use of highly specialized instructions. Proper analysis of whether or not to include a given instruction in an instruction set should include the effect of such cases.

-- Jay Freeman
bcase@amdcad.UUCP (03/26/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>I appreciate seeing submissions like this. Designing a computer instruction
>set is a little like backpacking: There is an art to figuring out exactly
>what you really need to take with you.

Thanks for the positive feedback. An interesting point of view!

>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it. One of the positive points to the RISC debate is
>that it brought out the point that useful instructions which are hard for a
>compiler to generate are not always a win. The old STAR-100 had a number of
>instructions, including string processing instructions, which were left out of
>the Cyber 205 because it turned out that the compilers never generated them.
>They were neat instructions though. So, my question is, is this instruction
>easily usable by a C, Fortran, Pascal, or ADA compiler and if so, under what
>conditions?

This is an excellent question, one that I intended to address in the original posting. However, I forgot to include it when the damn article got kinda long. The fact of the matter is: a C compiler is going to have a real hard time generating this instruction. A Pascal compiler might have a slightly easier time since the concept of "string" is a little more integrated with the language. In general, a compiler for a language without built-in types and functions for dealing with strings will have a hard time with our compare-bytes instruction. We realized this when we added it to the architecture (and it was truly a last-minute addition). However, it has such a significant impact on C programs in general that we were not deterred by the realization that this instruction will probably occur only a few times in the strcpy(), strcmp(), and strlen() library functions.
Out of all the megabytes of code in a UNIX system, this instruction will account for maybe 40 bytes/5 megabytes, or .0008% of all code (obviously a real rough guess). But if it improves the running time of all the string-oriented system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems worth it. And the implementation cost was so small. Also, there are some instructions that must be present just to administer the system, like return-from-interrupt, move-from-special-register, etc. These are not generated by a compiler either. Just to reiterate a point: RISC is not reducing the instruction set but is improving performance.

Ok, so you don't believe the above? How about "It improved dhrystone a hell of a lot."

	bcase
guy@gorodish.UUCP (03/26/87)
>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.

In most C implementations, "strcpy", "strcmp", etc. are library routines; if the speedup from these instructions is sufficiently large, it doesn't matter whether the compiler can generate them, because the compiler can generate calls to those routines, and those routines can be written in assembler language.

You could also imagine a compiler that would recognize names like "_builtin_strcpy" and generate the assembler code for those routines in line, and a <string.h> that contains

	#define strcpy _builtin_strcpy

(see the footnote on page 86 of the October 1, 1986 ANSI C draft). Another possibility might be to have something like the WE32000 C compiler's "asm" directive, which permits you to define "functions" that are expanded in line to assembler code.

Similar techniques can probably be used in other languages. If they do string processing in routines, the routines can be written in assembler language to use these instructions. If they generate stereotyped sequences for these string operations, those stereotyped sequences can use these instructions. However, I don't know how difficult it would be to recognize various special cases and generate non-stereotyped sequences that use these instructions, or how much of a speedup this would give (if these special cases weren't common, speeding them up might not buy you enough to worry about).
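The builtin arrangement described above can be sketched as follows. The names are illustrative, and the "builtin" is modeled here as an ordinary C function so the sketch is self-contained; a real compiler would recognize the name and expand it in line, possibly using a compare-bytes-style instruction:

```c
/* Sketch of mapping the standard name onto a compiler-recognized
 * builtin.  _builtin_strcpy is modeled as a plain function here;
 * the point is the header-level renaming, not this implementation. */
static char *_builtin_strcpy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;                     /* copy up to and including the NUL */
    return dst;
}

/* What the hypothetical <string.h> would contain: */
#define strcpy(d, s) _builtin_strcpy(d, s)
```

User code keeps calling strcpy() unchanged; only the header decides whether the call goes to the library routine or to inline code.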
jaw@ames.UUCP (03/27/87)
> [....] Since
> most C programs (especially utilities and other systems programs) do a lot of
> string processing, this one instruction is really worth the small
> implementation cost. It often improves run times by 15% to 20% (just goes to
> show that the impact of processing C language strings has been long- ignored).

just curious: which unix utilities use str(cpy|cmp|len) in their inner loops? certainly, 'vn' comes to mind as devoting much cpu time to these functions.

is the 15-20% claimed due at all to either

  (a) significantly slow byte addressing to begin with (a la cray)?
  (b) in-line compilation of the string(3) stuff into the application?

if (a), then improving memory byte access speed in the architecture is a more general solution with more payoff overall than the compare gate hack. what is the risc chip cost for byte vs. word addressability, anyway?

if (b), then maybe function call speed is the culprit rather than dearth of the specialized instruction.

at any rate, for cray unix, buffer copy ops in the kernel were vastly improved when re-written for words instead of bytes, even more so when vectorized (the only place in the kernel with vectorization, i think).

of course, table lookup using only 2^16 locations would be a joke software solution for super-intensive null-char-in-16-bit-smallword compare code. drastic, but saves a test the amd chip appears worried about. personally, i'm a fan of branch folding ...

  -- ames!jaw
mash@mips.UUCP (03/27/87)
In article <909@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>
>> My question is this: How likely is it that a compiler itself will be able to
>> detect the case when it can use an instruction like this and generate code
>> automatically to use it. One of the positive points to the RISC debate is
>> that it brought out the point that useful instructions which are hard for a
>> compiler to generate are not always a win.
>
>Note that this particular instruction may well win even if no compiler EVER
>generates it: BCase's posting stated that hand-modifying the object code
>for two string-processing routines in the standard C library to use the
>instruction, and arranging that the compiler word-align strings where
>possible, effected consequent frequent improvements in run time of 15-20
>percent. That sounds like a substantial win for C programming, and surely a
>fair amount of the code that does get written for this processor will be in
>C. That may be reason enough for the instruction's existence.
>
>Although the point from the RISC debate, that LaMaster summarizes above, is
>carefully qualified, it is perhaps nonetheless too narrow a view: Surely
>there are many other cases in which a well-planned library of calls to
>functions coded with particular care, will allow efficient use of highly
>specialized instructions. Proper analysis of whether or not to include a
>given instruction in an instruction set should include the effect of such
>cases.

This is a fairly interesting and useful discussion, especially if we can generate a few answers. Amusingly enough, neither MIPS nor IBM put this instruction in, although Marty Hopkins (IBM) once expressed a desire to have it (even though it disobeyed the "must be useful from compilers" rule), and we've thought about it off and on.
At one point, I did some hunting around for statistics on the use of the str* routines, how much time was spent in them, the distribution of string lengths, etc. [you may have seen postings to such effect in the last few years in this group]. Weirdly enough, either nobody knows very much about the global nature of this, or nobody's telling. We did some rummaging and profiling of code, and were astonished to find the actual amount of time spent in the str* routines to be pretty small. [This was very non-intuitive to me.] Maybe we looked at the wrong things.

So, now a few questions:

1) for Brian: could you say which programs you got the big improvements on? It would be very interesting for people to be able to try the test cases on some different machines and see how much time is spent on these.

2) for anybody:
   a) What percentage of your machine's daily work is spent doing str* routines?
   b) Of that, how much is getting there and getting set up, and how much is in the fundamental loops?
   c) For each of strcpy, strcmp, and strlen, what is the distribution of lengths of arguments?
   d) For strcmp, what is the fraction of times it finds equality?
   e) What fraction of strings begin on non-word boundaries?
   f) For strcpy and strcmp, what fraction of calls start on word boundaries, what fraction don't start on words but at least have similar alignment within the word, and what fraction are completely misaligned?

[I think that there is actually a useful MS thesis around in here if done well.]

Finally, regarding the inclusion of hand-coded cases, i.e., of instructions that can't be gotten to from compilers but may still be useful: our sense is that there aren't very many like that. This one is one of the few like that [the others we've generally considered have often been compilable] that at least has seemed plausible to lots of us.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.UUCP (03/27/87)
In article <15304@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>...
>for maybe 40 bytes/5 megabytes of all code (obviously a real rough
>guess). But if it improves the running time of all the string-oriented
>system utilities (i.e. almost all utilities!!) by 15% to 20%, it seems
>worth it. And the implementation cost was so small.

This is clearly true.

>Also, there are
>some instructions that must be present just to administer the system,
>like return from interrupt, move-from-special-register, etc. These
>are not generated by a compiler either. Just to reiterate a point: RISC
>is not reducing the instruction set but is improving performance.

Absolutely! There are almost always "structural" instructions that are hard to generate, but are needed.

>Ok, so you don't believe the above? How about "It improved dhrystone
>a hell of a lot."

Unfortunately, Dhrystone is an artificial benchmark. I couldn't resist doing a quick test. [We have some pretty neat profiling tools for taking an already-compiled program and turning it into one that does procedure and statement counts, then gives you instruction cycle counts, which would be, typically, 60% of the total cycles, the other 40% being cache misses, TLB-miss overhead, memory path interference, etc.]

OK, here's a quiz: I did some simple tests [ones we use in our standard Performance Brief suite, which includes diff, grep, yacc, and nroff], plus csh. I won't pretend this is definitive, since I just whipped it up. How much time would you guess these programs spend doing strlen, strcmp, strcpy [the high-runners]? [Just guess the instruction cycle %.] Try 0-1%, 2-5%, 6-10%, 11-20%, 21%-up.
ANSWERS:

	program      strlen           strcmp           strcpy
	             % cycs  cy/call  % cycs  cy/call  % cycs  cy/call
	diff          .03%   -         0      -         0      -
	grep          none
	yacc          .04%   -         .59%   -         0      -
	nroff         0      -         0      -        <0.1%   -
	csh          1.71%   20       1.27%   9        1.84%   21
	  % of total
	  func calls 3.76%            6.11%            3.75%
	Dhrystone    <.01%   19      16.94%   103     22.36%   136

Bottom line:
1) Dhrystone isn't remotely representative of some common string-pushing programs in UNIX.
2) For most of these, the total is <1% of instruction cycles, hence <0.6% of fully-degraded cycles. Maybe you can save 20% of this, or about .1%.
3) For csh: what's going on is that these routines are called very frequently, but for short strings: 3-6 characters; strcmp's obviously fail quickly [2nd or 3rd character]. I think the implication is that maybe you can get rid of 20% of the cycles, which would be a 1% instruction cycle saving, or about a <0.6% fully-degraded cycle saving for csh.
4) Given all of this, maybe what you get can be grossly estimated as about .3%, maybe. [Again, this was something whipped up in half an hour, so hardly definitive.]
5) Note that Dhrystone spends a huge lot of its time copying and comparing long strings. Hence, it's well worth a little extra setup time for Dhrystone to lessen the cost per loop. [In fact, we only got around to installing an assembler strcpy when we noticed how much time was spent there in Dhrystone.]

Thus, I'd say it's still an open issue [for those of you who happen to be designing computer architectures at this instant!]:

  Dhrystone: says it's important, but is totally unrepresentative.
  csh: says it might get you .6%.
  others: surprise you by using str* so little.
  general evidence: says it might be useful, but if you follow our 1% rule [and HP used something similar, as I recall], there is as yet insufficient evidence to include it.

If I were right now designing a system, I'd be tempted to do a lot of data-gathering.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
shebanow@ji.Berkeley.EDU.UUCP (03/27/87)
Every time I trace a program which uses stdio to any great degree, I find printf to be the worst offender (with regard to time consumed). In fact, I have seen programs which spend 30% of their time in printf (actually _doprnt). In the code I have optimized, I usually try to eliminate the use of printf (which can make the code rather opaque). This does result in 10-20% improvements in speed. The most obvious use of new string processing instructions would be to improve the performance of printf. Has anyone done this? Mike Shebanow
cmcmanis@sun.UUCP (03/27/87)
In article <15304@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
>
> Ok, so you don't believe the above? How about "It improved dhrystone
> a hell of a lot."
>
> bcase

This I believe; I was wondering when someone would point that out. I think it is time for some *real* optimizing compilers that can detect when they are compiling a benchmark and optimize for the appropriate instruction! :-) :-) :-)

--
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis   ARPAnet: cmcmanis@sun.com
These opinions are my own and no one else's, but you knew that didn't you.
metzger@heathcliff.columbia.edu.UUCP (03/27/87)
In article <18036@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
>Every time I trace a program which uses stdio to any great degree, I find
>printf to be the worst offender (with regard to time consumed). In fact,
>I have seen programs which spend 30% of their time in printf (actually
>_doprnt). In the code I have optimized, I usually try to eliminate the
>use of printf (which can make the code rather opaque). This does result
>in 10-20% improvements in speed. The most obvious use of new string
>processing instructions would be to improve the performance of printf.
>Has anyone done this?

Here at Project Synthesis we found that out a while back. It is pretty strange to think of it that way, but yes, printf and friends are compute-bound, not I/O-bound. We have experimented with a number of solutions, including a hardware printf coprocessor for the 68020. Predictably, it improved performance quite a bit. We also have tried using specially coded versions of printf in assembly language that we passed through a (very time-expensive) exhaustive-search optimiser, and the results were also pretty good.

Perry Metzger
bcase@amdcad.UUCP (03/28/87)
In article <1001@ames.UUCP> jaw@ames.UUCP (James A. Woods) writes:
>just curious which unix utilities use str(cpy|cmp|len) in their inner loops?
>certainly, 'vn' comes to mind as devoting much cpu time to these functions.
>
>is the 15-20% claimed due at all to either
>
> (a) significantly slow byte addressing to begin with (ala cray)?
> (b) in-line compilation of the string(3) stuff into the application?
>
>if (a), then improving memory byte access speed in the architecture is a
>more general solution with more payoff overall than the compare gate hack.
>what is the risc chip cost for byte vs. word addressibility, anyway?
>
>if (b), then maybe function call speed is the culprit rather than dearth of
>the specialized instruction.
>
>at any rate, for cray unix, buffer copy ops in the kernel were vastly
>improved when re-written for words instead of bytes, even more so when
>vectorized (the only place in the kernel with vectorization, i think).
>
>of course, table lookup using only 2^16 locations would be a joke
>software solution for super-intensive null-char-in-16-bit-smallword
>compare code. drastic, but saves a test the amd chip appears worried about.
>personally, i'm a fan of branch folding ...

Re: (a) above: NO. Implementing a byte-addressable (especially byte-writable) memory slows down all memory accesses for a slight improvement in byte processing efficiency. For my own part, I can say that from the beginning of the Am29000 project, I was firmly against anything but a word-oriented interface. You have to realize that byte accessing requires an alignment network somewhere: it will add some nanoseconds to all memory accesses. You can put the alignment network in its own pipeline stage, but even then, it will *always* slow down every memory access; there is nothing you can do about it!
(The same reasoning leads to our addressing-mode-less load and store instructions: if addressing modes are there, then the instructions always pay the cost even when the compiler knows better!) Thus, this is not strictly a RISC issue, but a maximum-throughput issue regardless of RISCness. (Note that the Am29000 extract-byte and insert-byte instructions essentially put the alignment network in a pipeline stage, but the compiler can decide to pay or not pay the penalty for a given memory access, depending upon whether it is a byte access or not.)

Re: (b) above: NO. We do not do inlining of the strcmp(), strcpy(), or strlen() routines (I wish we could, it would be even better!). The Am29000 has one of the fastest calling conventions around. The performance improvement we saw with the special instruction is an order of magnitude (ok, I'm taking a guess here, but it is probably pretty close) greater than what would have been gained by in-lining in this case.

Re: branch folding: I like all optimizations. Give me more MIPS! More. MORE. MORE!

	bcase

---------------
Back again by popular demand: Sung to the tune of "Timewarp:"

It's just a shift to the left,
And then a shift to the ri-i-ight!
Get a load of those MIPS
And the code's real ti-i-ght!
This lock-step pipeline
Really drives me insa-a-a-e-a-ane
Let's do the I-fetch again!
bcase@amdcad.UUCP (03/28/87)
In article <5969@mimsy.UUCP> mark@mimsy.UUCP (Mark Weiser) writes:
>In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>>...How likely is it that a compiler [can] detect the case when it can use
>> an instruction like this and generate code automatically to use it?
>
>This is of course a good point, and it seems that this is going to be
>one of those hard cases for compiler writers. But it opens the issue
>of subroutine libraries as providing an environment which also needs
>support in the instruction set. The Unix string routines are a good
>example, because they are even now often recoded in assembly for efficiency.
>Various pieces of Unix kernel frequently suffer the same fate
>(e.g. tty buffers). But once one is into instructions which are good
>for certain operating systems or environments, rather than certain languages,
>one has narrowed one's market. For instance, the Am29000 special string
>compare is useless if strings are encoded as a length followed by the chars.

Yes, you are right: when the length is known, things are *much* easier. But the instruction is still a useful primitive: it can be used to find out, in one cycle, whether or not a 32-bit word contains a byte of interest. The compiler may not generate it, but it can be used in hand-coded inner loops for this purpose.

	bcase
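That primitive use can be sketched in C. Here any_byte_equal models a compare-bytes test against a word with the byte of interest replicated into each position; the function names are illustrative, and the buffer is assumed word-aligned:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for compare-bytes: does word w contain byte c anywhere? */
static int any_byte_equal(uint32_t w, unsigned char c)
{
    for (int shift = 0; shift < 32; shift += 8)
        if (((w >> shift) & 0xFFu) == c)
            return 1;
    return 0;
}

/* Scan an aligned, counted buffer one word at a time; returns the
 * index of the first word containing byte c, or -1 if none does.
 * Even with length-prefixed strings, this examines four bytes per
 * test rather than one. */
long find_byte_word(const uint32_t *buf, size_t nwords, unsigned char c)
{
    for (size_t i = 0; i < nwords; i++)
        if (any_byte_equal(buf[i], c))
            return (long)i;
    return -1;
}
```

Once the word is found, a short byte-by-byte pass locates the exact position within it, as in the NUL-terminated case.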
bcase@amdcad.UUCP (03/28/87)
In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>At one point, I did some hunting around for statistics on the use of the
>str* routines, how much time was spent in them, the distribution of
>string lengths, etc [you may have seen postings to such effect in
>the last few years in this group]. Weirdly enough, either nobody knows
>very much about the global nature of this, or nobody's telling.
>We did some rummaging and profiling of code, and were astonished to find
>the actual amount of time spent in the str* routines to be pretty small.
>[This was very non-intuitive to me.] Maybe we looked at the wrong things.
>So, now a few questions:
>1) for Brian: could you say which programs you got the big improvements on?

Well, there is, of course, dhrystone. Now, now, calm down; we here know very well that dhrystone is *quite* skewed away from reality. But we have to pay homage to dhrystone for pointing this out to us and making us find a solution. (I think it is so typical that something we thought was at least a little original happens to have been thought of by many other people long before us! Oh well, "All the good ideas have already been had.")

We are at this moment dredging up our early results. However, most have been deleted by this time. But, and I think I remember correctly, dhrystone and our assembler were the two big ones. However, I think the compiler will also show a benefit, as will any program that does some comparing against fixed strings (e.g. I usually implement opcode lookups and such with a binary search through a sorted array of strings). So, I would like to turn the question around, John, and ask you to do what we might find difficult: look at the dynamic execution of your compiler.

Well, Tim just resimulated nroff: the conclusion is that strcpy() isn't worth talking about. Sigh, it may be that there are just a few important programs that will benefit.
Out of the 100 most popular programs, maybe two benefit substantially from the compare-bytes instruction. But if those two are the most popular (or close to it), it might still be worthwhile to have the instruction. Dhrystone is either important or not depending upon how much time you spend running benchmarks, or how paranoid one is. :-)

Ok, so now Tim tells me that the assembler spends about 17% of its time in strcmp() (strcpy() is less than 1% in this case). And note that this 17% is *including* the effect of our compare-bytes instruction; without the instruction the percentage would be much larger (how much? I don't think we have the old version of strcmp() sitting around. Sorry).

I agree, there is an MS thesis here (and the CS community would thank you a lot!).

	bcase
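The opcode-lookup pattern bcase mentions (a binary search through a sorted array of strings, where nearly every probe is a strcmp) can be sketched like this; the table contents are made up for illustration:

```c
#include <string.h>

/* Illustrative sorted mnemonic table; in an assembler, this search
 * is where strcmp() lands in the inner loop. */
static const char *mnemonics[] = {
    "add", "and", "jmp", "load", "or", "store", "sub", "xor"
};

/* Binary search: roughly log2(n) strcmp() calls per lookup, which is
 * why a faster strcmp() shows up directly in assembler run time. */
int opcode_index(const char *name)
{
    int lo = 0;
    int hi = (int)(sizeof mnemonics / sizeof mnemonics[0]) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, mnemonics[mid]);
        if (cmp == 0)
            return mid;
        if (cmp < 0)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return -1;   /* not found */
}
```

Since the mnemonics are short (3-5 characters here), most strcmp() calls fail within the first word, which is exactly the case the compare-bytes test speeds up.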
clif@intelca.UUCP (03/28/87)
> In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
> >conditions?  [ Discussion of best history of string instructions]
> The fact of the matter is: A C compiler is going to have a real hard time
> generating this instruction.  A Pascal compiler might have a slightly
> easier time, since the concept of "string" is a little more integrated
> with the language.  In general, a compiler for a language without built-in
> types and functions for dealing with strings will have a hard time with
> our compare-bytes instruction.  We realized this when we added it to the
> architecture (and it was truly a last-minute addition).  However, it has
> such a significant impact on C programs in general that we were not
> deterred by the realization that this instruction will probably occur only
> a few times in the strcpy(), strcmp(), and strlen() library functions.
> Out of all the megabytes of code in a UNIX system, this instruction will
> account for maybe 40 bytes/5 Megabytes or .0008% of all code (obviously a
> real rough guess).  But if it improves the running time of all the
> string-oriented system utilities (i.e. almost all utilities!!) by 15% to
> 20%, it seems worth it.  And the implementation cost was so small.  Also,
> there are some instructions that must be present just to administer the
> system, like return-from-interrupt, move-from-special-register, etc.
> These are not generated by a compiler either.  Just to reiterate a point:
> RISC is not reducing the instruction set but is improving performance.
>
> Ok, so you don't believe the above?  How about "It improved dhrystone
> a hell of a lot."
>
> bcase

I think that it makes sense to add instructions like this even if they only
get put into the standard libraries; presumably they will prove to be a
modest win for most applications written in C.  I vaguely remember reading
that calls to string routines represented about .5% of all C code.
As for Brian's second statement about Dhrystone being helped by string
instructions: he is certainly right about that.
--
Clif Purkiser, Intel, Santa Clara, Ca.
{pur-ee,hplabs,amd,scgvaxd,dual,idi,omsvax}!intelca!clif
These views are my own property.  However anyone who wants them can have
them for a nominal fee.
freeman@spar.UUCP (03/28/87)
Keywords: <*munch*>

In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>
>In article <909@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>> [...]
>>Note that this particular instruction may well win even if no compiler EVER
>>generates it:
>> [...]
>> other cases in which ... functions coded with particular
>>care, will allow efficient use of highly specialized instructions.
>
> [Thoughtful queries on cycles consumed by string-processing.]

I offer personal anecdotal evidence that highly optimized string processing
might be perceived as valuable by users.  One noticeable irritation in the
editors I use daily on various machines (Sun 3/160, Sun 3/75, Symbolics
3600) is the time delay for string search in text buffers.  For editors
that work by stuffing large pieces of a file into memory as contiguous
characters (rather than those that use a linked list of lines), this delay
can be reduced a *lot* by hand-coding the search routines to use low-level,
high-speed string-search primitives.  (My evidence for "*lot*" is also
anecdotal; I once did a screen editor in the Intel *86 architecture.  With
all text as contiguous bytes in memory, the difference between the usual C
loop "while( *buf++ != c ) ..." and careful use of Intel's REP SCASB
string-search assembly-language idiom was markedly noticeable in buffers
only a few tens of kilobytes long on my 5 MHz 8088 machine.)  Note that I
address user satisfaction, not cycles consumed.

>Finally, regarding the inclusion of hand-coded cases, i.e., of instructions
>that can't be gotten to from compilers, but may still be useful, our
>sense is that there aren't very many like that.

How about special floating-point instructions to support "cortic" (have I
got that spelled right?) algorithms for transcendental functions?
(I acknowledge that a general-purpose RISC machine would probably curdle its
silicon at the mere thought of having floating-point hardware on chip, but
consider a CISC manufacturer who is doing a floating-point coprocessor or
auxiliary processor.)  How about test-bit-field-and-branch instructions to
support Lisp?  How about "atomic" test-and-set operations for semaphores, or
bus-locking primitives for dealing with various specialized peripheral
devices?
Jay Freeman, Schlumberger Palo Alto Research   <canonical disclaimer>
rpw3@amdcad.UUCP (03/28/87)
Well, John Mashey's numbers were certainly counter-intuitive to ME, given
that those programs bang away at characters all the time, but we should
look at one more thing:

When were nroff/grep/yacc written compared to when str* was standardized
in the library?  After all, the str* libraries didn't spring full-blown on
day one with those exact semantics and calling sequences (I'm guessing).
I'll also guess (could easily be wrong) that there is a lot more string
stuff going on than you see in the str* statistics, and that it's hand
coded "in line" (in C, of course).

Counterpoint: What do similar NEW programs look like?  Take some code
written by someone (or a team) that started with the "modern" libraries.
Run the statistics on that, and see...

Finally, don't forget str{,r}chr (a.k.a. {,r}index).  Things like "ls -l"
beat on those really hard (at least in System V), since way down at the
bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
entries (go look at "getpwent" & friends).

I know that I sometimes avoid str* because I'm afraid it might be slow,
particularly when I know that keeping a count (a.k.a. length) can avoid a
lot of searching for the end.  The solution is usually a high-level change
in the application's basic algorithms so that much of the copying and
searching is simply avoided.  A simple example is when you are building
various strings up out of pieces (as in "speeding up" shell scripts by
recoding in C).  Even USING the libraries, sometimes instead of:

	strcpy(buf, "piece");
	strcat(buf, "another piece");
	strcat(buf, "still another piece");

I find myself doing:

	char *p = &buf[0];
	int n;

	strcpy(p, "piece");			p += strlen(p);
	strcpy(p, "another piece");		p += strlen(p);
	sprintf(p, "format%d%s", args...);	p += strlen(p);
	strcpy(p, "still another piece");	p += strlen(p);
	etc...
Then when you're done, you can assert that strlen(buf) == (p - buf);

Rob Warnock
Systems Architecture Consultant
UUCP:	{amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA  94403
guy@gorodish.UUCP (03/28/87)
>>The most obvious use of new string processing instructions would be to
>>improve the performance of printf.  Has anyone done this?

The version of "_doprnt" that comes with 4BSD, comes in the VAX System III
source (although whether it's the one that's in the library or not, I don't
know), and may very well have come with UNIX 32V, is written in assembler.
The 4.3BSD version, at least, does use the string *and* EDITPC instructions,
although it doesn't use the string instructions for handling things like
"%s".  VAX System V doesn't use this version (and no, it's probably not
because the assembly version doesn't handle all the formatting operators in
the S5 version - I know it handles most of them, and I think it handles all
of them), so I'm curious whether they dropped it because it didn't buy much.

>We have experimented with a number of solutions, including a hardware
>printf coprocessor for the 68020.  Predictably, it improved performance
>quite a bit.  We also have tried using specially coded versions of printf
>in assembly language that we passed through a (very time expensive)
>exhaustive search optimiser and the results were also pretty good.

How much did the coprocessor cost, and how much better was it than the
assembly-language version?
mash@mips.UUCP (03/29/87)
In article <15315@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>So, I would like to turn the question around, John, and ask you to do
>what we might find difficult: look at the dynamic execution of your
>compiler.

Will do.  For the time being, I did find one program that used them enough
to be interesting: (4.3BSD /bin/ls /bin /usr/bin)

	function	%CPU	cycles/call
	strlen		7.35	18
	strcat		3.24	57
	strcmp		2.95	11

Again: strcmp's clearly fail pretty quickly, and strlen's don't look at very
big strings.  [Later, I'll show the disassembled code for the strlen and
strcmp functions; the others are a bit longer.]  For shorter ls (like ls of
the current directory with 10 files), or for SYS V ls, the numbers were down
in the noise.

GIANT CAVEAT: remember these are the undegraded (for cache/tlb/memory
interference) numbers, and even more importantly, remember that these are
for our architecture, i.e., my previous estimates are for how much WE would
save by having the instruction.  Note that any architectural addition must
always be evaluated in the context of what's already in the architecture,
i.e., it's not possible to SAY "such and such instruction is worth having"
without considering what it's being added onto.  For example, Brian's first
posting on this topic didn't mention that the AMD was a WORD-ADDRESSED
machine; that comment came later.  It wouldn't surprise me at all that the
special string instruction is quite worthwhile for such machines, even if
it's probably [from the evidence I've got so far] below the cutoff on ours
[although still useful enough not to reject without thought].

>Sigh, it may be that there are just a few important programs that will
>benefit.  Out of the 100 most popular programs, maybe two benefit sub-
>stantially from the compare-bytes instruction.  But if those two are the
>most popular (or close to it), it might still be worthwhile to have the
>instruction.
>Dhrystone is either important or not depending upon how
>much time you spend running benchmarks or how paranoid one is. :-)

As noted: the instruction is almost certainly worth something on the AMD
part.  [The word-addressing thing is a whole separate issue that's worth a
LOT of discussion, but we might as well finish this one first.]  As usual,
computer architecture needs a great deal of measurement, because intuition
can often be wrong.  You still need intuition and experience for the stuff
that's so hard to measure.

>Ok, so now Tim tells me that the assembler spends about 17% of its time
>in strcmp() (strcpy() is less than 1% in this case).  And note that this
>17% is *including* the effect of our compare-bytes instruction, without

Is it possible for you to post the code for strcmp/strlen?  (It would be
especially interesting to see how you handle pointers to bytes & shorts.)

Rest of this is details of strlen/strcmp mentioned above.

strlen:
	0x404000:	addiu	v1,a0,1
	0x404004:	lbu	v0,0(a0)
	0x404008:	addiu	a0,a0,1
	0x40400c:	bne	v0,zero,0x404004
	0x404010:	nop
	0x404014:	jr	ra
	0x404018:	subu	v0,a0,v1

In ls, this was 7.35%, 18 cycles/call.  This says that the mean case is
addiu, 4 times around the loop (4 instrs each), plus the 2 instrs (jr, subu)
on exit.  I.e., the general cost for this routine is 3 + 4N (where N is the
size).  All of this implies what the size distribution must be: highly
skewed, with a mode of 1-2, and a long tail towards longer strings.  Thus,
the tradeoff is between a routine with minimal setup that runs quickly on
short strings, but is not as efficient on longer strings, versus one that
uses more setup.  [As you can see, proper design of all this REALLY depends
on knowing the statistics: we wrote several versions of each of these for
comparison.  Again, I wish somebody would do a definitive study!]
strcmp:
	0x403f50:	lbu	t0,0(a0)		|
	0x403f54:	lbu	t1,0(a1)		|
	0x403f58:	beq	t0,zero,0x403f84	|-> null str, exit
	0x403f5c:	addi	a0,a0,2			|
	0x403f60:	bne	t0,t1,0x403f8c		|-> 1st char !=, exit
	0x403f64:	lbu	t2,-1(a0)		|
	0x403f68:	lbu	t1,1(a1)		|
	0x403f6c:	beq	t2,zero,0x403f84	|-> 2nd char null, exit
	0x403f70:	addi	a1,a1,2			|
	0x403f74:	beq	t2,t1,0x403f54		|---^ characters equal, loop
	0x403f78:	lbu	t0,0(a0)
	0x403f7c:	jr	ra
	0x403f80:	subu	v0,t2,t1
	0x403f84:	jr	ra
	0x403f88:	subu	v0,zero,t1
	0x403f8c:	jr	ra
	0x403f90:	subu	v0,t0,t1

This is a 2-way unrolled version of the simplest strcmp.  In ls, this was
2.95% of the instruction cycles, 11 cycles/call average.  There seem to be
2 "likely" cases here:

	11 instrs: take the "2nd char null, exit" branch and return
	13 instrs: find 2nd char !=, i.e., execute 13 instrs with no branch

What this says is that once again, there's a very skewed distribution, since
some longer strings must compare equal.  [To be honest, this even surprises
me somewhat.]  IF this is representative, this says that the special
instruction wouldn't do US much good on these routines.  [It might help on
strcpy/strcat: I'll look at that later.]
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.UUCP (03/29/87)
In article <948@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>
>In article <230@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>
>I offer personal anecdotal evidence that highly optimized string-processing
>might be perceived as valuable by users: ....
>the Intel *86 architecture: With all text as contiguous bytes in memory,
>the difference between the usual C loop "while( *buf++ != c ) ..." and
>careful use of Intel's REP SCASB string-search assembly-language idiom, was
>markedly noticeable in buffers only a few 10s of KBytes long on my 5 MHz
>8088 machine.)

I believe that if you have a microcoded machine, you should have at least a
few of the critical string routines, although note that it is very difficult
to get a minimal set that covers all of the interesting cases, i.e., many
CPUs' string instructions don't do C.  If you have a machine that provides
them, you're crazy not to use them.

>>Finally, regarding the inclusion of hand-coded cases, i.e., of instructions
>>that can't be gotten to from compilers, but may still be useful, our
>>sense is that there aren't very many like that.
>
>How about special floating-point instructions to support "cortic" (have I
>got that spelled right?) algorithms for transcendental functions?

I think it's "quartic", and I think there may be some justification for
them, depending on the silicon tradeoffs, i.e., does it cost you time in
add or multiply to put the quartics on.  Depending on the target
applications, it may or may not be worth it.

>(I acknowledge that a general-purpose RISC machine would probably curdle its
>silicon at the mere thought of having floating-point hardware on chip, but
>consider a CISC manufacturer who is doing a floating-point coprocessor or
>auxiliary processor.)

I don't know why its silicon would be curdled.  Again, it really depends on
the target application mix: for example, for some applications, I'd be
tempted to put an FP unit on the CPU chip.
[We didn't, because we wanted ferocious FP performance, and you cannot yet
get enough silicon area to do that on the CPU chip, certainly not at 2
microns, and probably not at 1.2.]

>How about test-bit-field-and-branch instructions to support Lisp?

We did some looking at that for LISP & Smalltalk.  Some of them really want
to be traps on bitfields for tag-checking.  However, these are in the
category of "compiler can generate them".

>How about "atomic" test-and-set operations for semaphores, or bus-locking
>primitives for dealing with various specialized peripheral devices.

Some operations of this class are required, although there are different
ways to implement them.  Most of the typical implementations don't extend
very well into large-N shared-memory multis, i.e., bus-lock is the LAST
thing you want to have going on very much.

Anyway, some good points: there certainly exist instructions that are hard
for compilers to generate, that are still useful, and depending on the
application targets, people may decide whether or not to include them.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.UUCP (03/29/87)
In article <15317@amdcad.UUCP> rpw3@amdcad.UUCP (Rob Warnock) writes:
>Well, John Mashey's numbers were certainly counter-intuitive to ME, given
>that those programs bang away at characters all the time, but we should
>look at one more thing:
>
>When were nroff/grep/yacc written compared to when str* was standardized
>in the library?  After all, the str* libraries didn't spring full-blown on
>day one with those exact semantics and calling sequences (I'm guessing).
>I'll also guess (could easily be wrong) that there is a lot more string
>stuff going on than you see in the str* statistics, and that it's hand
>coded "in line" (in C, of course).

Yes, those were old programs.  I'd be happy to hear suggestions for
specific programs to check.  It's absolutely sure that there was plenty of
string stuff going on, coded in C.  [Of course, some of the string stuff is
algorithmic work that hardware isn't likely to support for a while, i.e.,
regular-expression matching.]

>Finally, don't forget str{,r}chr (a.k.a. {,r}index).  Things like "ls -l"
>beat on those really hard (at least in System-V), since way down at the
>bottom a lot of time is spent (I'm guessing) tearing apart /etc/passwd
>entries (go look at "getpwent" & friends).

I generally looked for str* routines; the others were generally in the noise.
Following are the stats for SYS V: ls -l /usr

Profile listing generated Sat Mar 28 14:19:29 1987 with:
----------------------------------------------------------------------------
*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *
----------------------------------------------------------------------------
1139801 cycles

    cycles  %cycles   cum %  cycles/call  bytes/line  procedure (file)
    511480    44.87   44.87         6558          22  getname (ls.c)
    310413    27.23   72.11           16          40  fgetc (../fgetc.c)
     99006     8.69   80.79           43          28  _flsbuf (../flsbuf.c)
     79820     7.00   87.80          340          17  _doprnt (../doprnt.c)
     18830     1.65   89.45           54          23  select (ls.c)
     15073     1.32   90.77          321          19  gmtime (../ctime.c)
	....
      1655     0.15   98.79           10           5  strcmp (../strcmp.s)
	....
      1338     0.12   99.18           18           5  strlen (../strlen.s)
	....
        63     0.01   99.96            8          10  strchr (../strchr.c)

[Surprise!  str* stuff isn't on the map in this one.  Just goes to show
that even experienced [and Rob is] folks' intuition needs checking.  Well,
this surprised me also.]

This whole string-processing discussion has been quite worthwhile, if only
to remind us that we often discuss pretty esoteric things in this
newsgroup, while at the same time, we sometimes don't have the foggiest
ideas about even basic program behavior!!
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rbutterworth@watmath.UUCP (03/29/87)
In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> I find myself doing:
>	char *p = &buf[0];
>	int n;
>	strcpy(p, "piece");			p += strlen(p);
>	strcpy(p, "another piece");		p += strlen(p);
>	sprintf(p, "format%d%s", args...);	p += strlen(p);
>	strcpy(p, "still another piece");	p += strlen(p);
>	etc...

Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc. had been
designed to return a pointer to the END of the string (or a count of the
number of characters in the string)?  They already have this information
for free, but instead they return their first argument, something that the
caller already knows.
earl@mips.UUCP (03/29/87)
Actually the MIPS architecture does contain string processing instructions.
They go by the names ADDU and XOR.  For example, the inner loop of my
strlen routine, which searches for a word containing a zero byte, is

	l1:	lw	t2,1(a1)
		addu	a1,4
		addu	t3,t2,t0
		xor	t3,t2
		and	t3,t1
		beq	t3,t1,l1

Quiz: what are the magic constants that need to be loaded into t0 and t1 to
make this work?  This is 7 cycles per word, or 1.75 cycles per byte.
	-Earl
gemini@homxb.UUCP (03/29/87)
I got curious myself, and profiled "ls", "awk", "cpp", "lln", and several
SGSs and local applications programs for str* and mem* usage.  In most
cases, the str* numbers were in the 0-3% range.  Diff didn't use
strcat/cpy/cmp at all, though it obviously does lots of string work.  But I
found heavy usage of the mem* routines in many of the newer programs I
tried: 7-20% was the range!

If the C convention for strings had been length oriented, I suspect that
Dhrystone would come out with a better approximation of reality.  This
would be due to the heavier usage of mem*-like routines for moving and
comparing the strings.  I believe that Dhrystone would have been better off
doing 'stringy' things with some custom code.  I'm certain that lots of
string work gets done in the 'average' program, yet it appears that not
much of it gets done using the str* routines.  On the other hand, in my
observations, I've seen lots of novice C programmers go overboard using
str*, so perhaps profiling the (presumably) finely tuned UNIX utilities
isn't showing us reality either.  It's certain that novice programmers will
be sharing the machine and their programs with me for a long time to come.

Rick Richardson, PC Research, Inc:	(201) 922-1134
..!ihnp4!castor!pcrat!rick		when at AT&T-CPL: (201) 834-1378
..!ihnp4!castor!polux!rer
mash@mips.UUCP (03/29/87)
In article <182@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>I got curious myself, and profiled "ls", "awk", "cpp", "lln", and
>several SGS's and local applications programs for str* and mem* usage.
>In most cases, the str* numbers were in the 0-3% range.  Diff didn't
>use strcat/cpy/cmp at all, though it obviously does lots of string work.
>But I found heavy usage of the mem* routines in many of the newer programs
>I tried.  7-20% was the range!
>
>If the C convention for strings had been length oriented, I suspect that
>Dhrystone would come out with a better approximation of reality.  This
>would be due to the heavier usage of mem*-like routines for moving and
>comparing the strings.

My current sense of things says:
a) strcmp finds a difference very quickly, and strlen sees fairly short
   strings; hence for them, it's most important to optimize the short cases.
b) strcpy appears to copy longer strings than a), not as long as c).
c) people use memcpy/bcopy for copying bigger hunks of memory, hence they
   need to be optimized for the long case: fortunately, since you have the
   length as a parameter, you can make correct decisions easily, unlike the
   str* things, which must guess.
There's one instance of bcopy that needs to be especially tuned for long
copies, which is the one inside the kernel.  None of the above is
definitive, just the sort of patterns I've seen when I've looked.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  408-720-1700, x253
USPS:	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
gemini@homxb.UUCP (03/30/87)
Options to "ls" have a great effect upon str* usage.  The following results
were obtained on an Amdahl 580 running uts 5.2.5:

./ls /bin /usr/bin	[listed 64 times]

 %Time	Seconds	Cumsecs	#Calls	msec/call  Name
  22.1	  10.72	  10.72	 80190	   0.1336  strcmp
   3.3	   1.60	  40.25	 21577	   0.0742  memcpy
   0.7	   0.33	  47.65	 10688	   0.0312  strlen
   0.3	   0.13	  47.97			   strchr
   0.1	   0.05	  48.40			   memchr
   0.1	   0.03	  48.53			   strcpy

./ls -CF /bin /usr/bin	[listed 64 times]

 %Time	Seconds	Cumsecs	#Calls	msec/call  Name
   7.9	  10.67	 100.55	 80191	   0.1330  strcmp
   1.2	   1.58	 128.38	 32234	   0.0491  memcpy
   0.3	   0.42	 133.08	 21440	   0.0194  strlen
   0.1	   0.12	 134.05	     2	  58.      strchr
   0.1	   0.08	 134.23			   memchr
   0.0	   0.02	 134.50			   strcpy

./ls -l /bin /usr/bin	[listed 8 times]

 %Time	Seconds	Cumsecs	#Calls	msec/call  Name
   0.4	   1.42	 338.83	 10023	   0.1413  strcmp
   0.3	   1.18	 341.32	 18761	   0.0631  memcpy
   0.0	   0.15	 345.45	  1336	   0.112   strlen
   0.0	   0.10	 345.92			   memchr
   0.0	   0.08	 346.18			   strcpy
   0.0	   0.03	 346.35	     1	  33.      strchr

Note: on the Amdahl, strcmp and friends are in assembler.

Rick Richardson, PC Research, Inc:	(201) 922-1134
..!ihnp4!castor!pcrat!rick		when at AT&T-CPL: (201) 834-1378
..!ihnp4!castor!pollux!rer
ka@rpiacm.UUCP (03/30/87)
In article <230@winchester.mips.UUCP>, mash@mips.UUCP (John Mashey) writes:
> We did some rummaging and profiling of code, and were astonished to find
> the actual amount of time spent in the str* routines to be pretty small.

Probably a major reason for this is that programmers use in-line code for
lots of string operations in C.  The early UN*X code was pretty efficient.
People who learned their coding styles from reading early UN*X code simply
don't include calls to the str* routines inside tight loops.
				Kenneth Almquist
guy@gorodish.UUCP (03/30/87)
>If the C convention for strings had been length oriented, I suspect that
>Dhrystone would come out with a better approximation of reality.  This
>would be due to the heavier usage of mem*-like routines for moving and
>comparing the strings.

Why would people use mem*-like routines for manipulating counted strings,
but not use str*-like routines for manipulating null-terminated ones?  The
only reason I can think of is that since finding the length/end of a string
is a constant-time operation with counted strings, but is linear in the
length of the string with null-terminated strings, people would be less
tempted to write their own string manipulation code that found the end of a
string as a side effect of the processing.  Is this what you had in mind?

I'd be interested to know how much string processing is, in fact, done by
*ad hoc* code because it didn't quite match the string routines available
on the system.  "strtok" and company weren't in V7, so people would usually
have done their own string-chopping.
howard@cpocd2.UUCP (03/30/87)
In article <948@spar.SPAR.SLB.COM> freeman@spar.UUCP (Jay Freeman) writes:
>How about special floating-point instructions to support "cortic" (have I
>got that spelled right?) algorithms for transcendental functions?

In article <237@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>I think it's "quartic", ...

No, it's "CORDIC", and it is usable for FFT as well.  One reason people are
interested in it is that it maps well into silicon, so custom CORDIC chips
may be practical for some applications.
--
Copyright (c) 1987 Howard A. Landman.  Transmission of this material
constitutes permission from the intermediary to all recipients to freely
retransmit the material within USENET.  All other rights reserved.
bcase@amdcad.UUCP (03/31/87)
In article <236@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>For example, Brian's first posting on this topic didn't mention that the
>AMD was a WORD-ADDRESSED machine, that comment was later.  It wouldn't
>surprise me at all that the special string instruction is quite worthwhile
>for such machines, even if it's probably [from the evidence I've got
>so far] below the cutoff on ours [although still useful enough not to
>reject without thought.]

Like oops: the fact that we use the 29000 as a word-addressed machine in
our simulations is so familiar to me that I forgot to mention it as an
assumption.  It is important to note that the 29000 can be used as either a
word-addressed or a byte-addressed machine: there are 3 "user bits" in the
load/store control fields that simply drive some pins on the chip.  Thus,
they can be used to encode things like load/store byte/halfword/word if a
system implementor so chooses.  However, we think it is stupid, because
such things slow down memory systems and thus the entire system.  But if
you really want to....

>As noted: the instruction is almost certainly worth something on the AMD part.
>[The word-addressing thing is a whole separate issue that's worth a LOT
>of discussion, but we might as well finish this one first.]

Yes, I would really like to get some opinions/facts from the net regarding
the word-address/byte-address issue.

>Is it possible for you to post the code for strcmp/strlen?
>(It would be especially interesting to see how you handle pointers
>to bytes & shorts).

Yes!  Remember: this code was generated by hand-tweaking the output of my C
compiler!  Thus, it isn't the best (please don't hurt my feelings by
complaining about the quality of my compiler; I am quite aware of its
shortcomings).  I am posting both the C source and the Am29000 code.
Most of it should be easy to understand (but note that bit_extract is the
32-bits-from-64-bits funnel-shift function).  (Instruction format:
<op> <dest>,<source1>,<source2>; source2 can be an 8-bit positive
constant.)  Well, this is so damn long that I'll just post strcpy.  strcmp
is easy to write once you see how this works, and strlen is something we
haven't bothered with yet (because nothing has needed it).  Sorry in
advance about the length.  If you have better algorithms than the one I
have implemented below, please tell me (via mail).  I'll summarize to the
net if appropriate.

strcpy (s, t)
char *s, *t;
{
    int word0, word1, word2;

    if (((int)s & 3) == 0) {
	if (((int)t & 3) == 0) {
	    /* can do fastest */
	    /* has_a_zero is the compare-bytes instruction */
	    while (has_a_zero (word0 = *(int *)t) == 0) {
		*(int *)s = word0;
		s += 4; t += 4;
	    }
	    while (*s = *t) ++s, ++t;
	    return;
	}
	/* destination is word aligned, but source isn't */
	word0 = *(int *)t;
	while (1) {
	    t += 4;
	    word1 = *(int *)t;
	    /* bit_extract is the funnel-shift instruction */
	    word2 = bit_extract (word0, word1, ((int)t & 3) << 3);
	    if (has_a_zero (word2))
		break;
	    *(int *)s = word2;
	    s += 4;
	    word0 = word1;
	}
	t -= 4;
	while (*s = *t) ++s, ++t;
	return;
    }
    /* both strings unaligned */
    if (((int)s & 3) == ((int)t & 3)) {
	/* both strings have the same mis-alignment */
	while (1) {
	    if ((*s = *t) == 0)
		return;
	    ++s, ++t;
	    if (((int)s & 3) == 0)
		break;
	}
	/* now transfer a word at a time */
	while (1) {
	    if (has_a_zero (word0 = *(int *)t))
		break;
	    *(int *)s = word0;
	    s += 4; t += 4;
	}
	while (*s = *t) ++s, ++t;
	return;
    }
    /* neither string aligned and both have different alignment */
    while (1) {
	if ((*s = *t) == 0)
	    return;
	++s, ++t;
	if (((int)s & 3) == 0)
	    break;
    }
    word0 = *(int *)t;
    while (1) {
	t += 4;
	word1 = *(int *)t;
	word2 = bit_extract (word0, word1, ((int)t & 3) << 3);
	if (has_a_zero (word2))
	    break;
	*(int *)s = word2;
	s += 4;
	word0 = word1;
    }
    t -= 4;
    while (*s = *t) ++s, ++t;
    return;
}

----------------------------------------

$L0:
	.use	data
	.align
	.use	code
	.global	_strcpy
_strcpy:
	sub	gr01,gr01,40
	asgeu	OVERTRAP,gr01,gr40
	add	lr01,gr01,56
;.6 "strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$16
	and	lr08,lr0c,3
;.8 "strcpy.c"
	and	lr08,lr0d,3
	eq	lr08,lr08,0
	jmpf	lr08,$17
	or	lr00,lr00,0
	jmp	$LI1
	or	lr00,lr00,0
$LT19:
;.14 "strcpy.c"
	store	16,lr05,lr0c
;.15 "strcpy.c"
	add	lr0c,lr0c,4
;.16 "strcpy.c"
	add	lr0d,lr0d,4
$LI1:
;.12 "strcpy.c"
	load	16,lr05,lr0d
	cpbyte	lr02,lr05,0
	neq	lr08,lr02,0
	jmpf	lr08,$LT19
	or	lr00,lr00,0
$20:
	jmp	$LI2
	or	lr00,lr00,0
$LT21:
;.19 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI2:
;.18 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT21
	or	lr00,lr00,0
$22:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$17:
;.25 "strcpy.c"
	load	16,lr05,lr0d
$LT23:
;.28 "strcpy.c"
	add	lr0d,lr0d,4
;.29 "strcpy.c"
	load	16,lr06,lr0d
;.30 "strcpy.c"
	and	lr08,lr0d,3
	lls	lr04,lr08,3
	mtsp	FS,lr04
	extract	lr07,lr05,lr06
;.31 "strcpy.c"
	cpbyte	lr02,lr07,0
	neq	lr08,lr02,0
	jmpf	lr08,$26
	or	lr00,lr00,0
	jmp	$24
	or	lr00,lr00,0
$26:
;.34 "strcpy.c"
	store	16,lr07,lr0c
;.35 "strcpy.c"
	add	lr0c,lr0c,4
;.36 "strcpy.c"
	jmp	$LT23
	add	lr05,lr06,0
$24:
;.38 "strcpy.c"
	jmp	$LI3
	sub	lr0d,lr0d,4
$LT27:
;.40 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI3:
;.39 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT27
	or	lr00,lr00,0
$28:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$16:
;.46 "strcpy.c"
	and	lr09,lr0d,3
	eq	lr08,lr08,lr09
	jmpf	lr08,$29
	or	lr00,lr00,0
$LT30:
;.52 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$32
	or	lr00,lr00,0
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$32:
;.54 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
;.55 "strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$33
	or	lr00,lr00,0
	jmp	$31
	or	lr00,lr00,0
$33:
	jmp	$LT30
	or	lr00,lr00,0
$31:
$LT34:
;.62 "strcpy.c"
	load	16,lr05,lr0d
	cpbyte	lr02,lr05,0
	neq	lr08,lr02,0
	jmpf	lr08,$36
	or	lr00,lr00,0
	jmp	$35
	or	lr00,lr00,0
$36:
;.64 "strcpy.c"
	store	16,lr05,lr0c
;.65 "strcpy.c"
	add	lr0c,lr0c,4
;.66 "strcpy.c"
	jmp	$LT34
	add	lr0d,lr0d,4
$35:
	jmp	$LI4
	or	lr00,lr00,0
$LT37:
;.69 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI4:
;.68 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT37
	or	lr00,lr00,0
$38:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$29:
$LT39:
;.77 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$41
	or	lr00,lr00,0
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
$41:
;.79 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
;.80 "strcpy.c"
	and	lr08,lr0c,3
	eq	lr08,lr08,0
	jmpf	lr08,$42
	or	lr00,lr00,0
	jmp	$40
	or	lr00,lr00,0
$42:
	jmp	$LT39
	or	lr00,lr00,0
$40:
;.83 "strcpy.c"
	load	16,lr05,lr0d
$LT43:
;.86 "strcpy.c"
	add	lr0d,lr0d,4
;.87 "strcpy.c"
	load	16,lr06,lr0d
;.88 "strcpy.c"
	and	lr08,lr0d,3
	lls	lr04,lr08,3
	mtsp	FS,lr04
	extract	lr07,lr05,lr06
;.89 "strcpy.c"
	cpbyte	lr02,lr07,0
	neq	lr08,lr02,0
	jmpf	lr08,$45
	or	lr00,lr00,0
	jmp	$44
	or	lr00,lr00,0
$45:
;.92 "strcpy.c"
	store	16,lr07,lr0c
;.93 "strcpy.c"
	add	lr0c,lr0c,4
;.94 "strcpy.c"
	jmp	$LT43
	add	lr05,lr06,0
$44:
;.96 "strcpy.c"
	jmp	$LI5
	sub	lr0d,lr0d,4
$LT46:
;.98 "strcpy.c"
	add	lr0c,lr0c,1
	add	lr0d,lr0d,1
$LI5:
;.97 "strcpy.c"
	load	16,lr08,lr0d
	byteex	lr08,lr08,0
	load	16,lr09,lr0c
	bytein	lr09,lr09,lr08
	store	16,lr09,lr0c
	eq	lr08,lr08,0
	jmpf	lr08,$LT46
	or	lr00,lr00,0
$47:
	add	gr01,gr01,40
	or	lr00,lr00,0
	jmpi	lr00
	asleu	UNDERTRAP,lr01,gr41
	.use	data
	.align
-----------------------------------------------------
bcase@amdcad.UUCP (03/31/87)
In article <15325@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>Like oops: the fact that we use the 29000 as a word-addressed machine in
>our simulations is so familiar to me that I forgot to mention it as an
>assumption. It is important to note that the 29000 can be used as either
>a word-addressed or as a byte-addressed machine: there are 3 "user bits"
>in the load/store control fields that simply drive some pins on the chip.
>Thus, they can be used to encode things like load/store byte/halfword/word
>if a system implementor so chooses. However, we think it is stupid because
>such things slow down memory systems and thus the entire system. But if
>you really want to....

Jeeze louise, I forgot to mention that the Am29000 always drives the entire byte address on the pins. It's just that the preferred memory system implementation ignores the lower two bits and always returns a full word. In this case, the byte-extract and byte-insert instructions are used once the word gets on chip. (For RISC historians: the paper "Hardware/Software Tradeoffs for Increased Performance" talks about word-addressed machines. The information in this and other papers helped shape our implementation.)

    bcase
tim@amdcad.UUCP (03/31/87)
In article <985@mips.UUCP>, earl@mips.UUCP (Earl Killian) writes:
> Quiz: what are the magic constants that need to be loaded into t0 and
> t1 to make this work?

Let's see... How about

    t0 = 0x7efefeff
    t1 = 0x81010100

Is that right? If so, it appears to work only for 7-bit characters (but we admit that such a restriction is reasonable in real life)... (This is a fun problem but probably a bit too hard for use in interviews.)

    Mike Johnson
    Brian Case
    Tim Olson
henry@utzoo.UUCP (Henry Spencer) (03/31/87)
> ... I think it is time for some *real* optimizing compilers that can detect
> when they are compiling a benchmark and optimize for the appropriate
> instruction! :-) :-) :-)

Don't laugh! It's not unknown for an optimizing-compiler group to put extra attention into making common benchmarks run fast. I'm told that there is at least one compiler that has a specialized loop-unrolling optimization whose major purpose is to quietly turn the "looping" version of Linpack into the "unrolled" version.
-- 
"We must choose: the stars or		Henry Spencer @ U of Toronto Zoology
the dust. Which shall it be?"		{allegra,ihnp4,decvax,pyramid}!utzoo!henry
rpw3@amdcad.UUCP (03/31/87)
In article <238@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Following are the stats for SYS V : ls -l /usr
>  cycles   %cycles  cum %   cycles  bytes  procedure (file)
>                            /call   /line
>  511480   44.87    44.87   6558    22     getname (ls.c)
>  310413   27.23    72.11   16      40     fgetc (../fgetc.c)
>   99006    8.69    80.79   43      28     _flsbuf (../flsbuf.c)
>   79820    7.00    87.80   340     17     _doprnt (../doprnt.c)
>  ...      ...      ...     ...     ...    ...

Hey, John, wanna know something really interesting? Go look at 4.2 BSD "ls", at routine getname(), and you'll discover it's full of getpwent()'s (plus {set,end}pwent(), etc.) and strncpy()'s (hidden in the "SCPYN" macro). And of course getpwent() uses the str* routines, though it also has its own pwskip() instead of index(). (Have you got a 4.{2,3} BSD "ls" that you could run your profiler on?)

Then look at the Sys V "ls", and lo & behold, all of those neat library routines have been unfolded into the getname() routine, and hand-coded into a most glorious mess, which calls only (guess)... fgetc()!!! (No wonder my jaw dropped when I saw your numbers!) Now other parts of Sys V use {set,get,end}pwent(), so I guess somebody decided that "ls" was running too slow (or had an "intuition"?) and ripped it apart. Given the looks of the code, I'm surprised that even fgetc() survived.

This reminds me of Gerry Weinberg's "Railroad Paradox": "We don't need fast strlib routines, because no one ever uses them." Yeah... 'cause they're not fast, everyone codes around them!

We (me too, I just realized I got suckered in also) must be *VERY* careful to look inside any major piece of code we are going to try and use to prove something or other, in order to avoid simply confirming the {good,bad,weird} practices of the past. I am now completely unsure whether we can get ANY reasonable statistics on what the proper usage (percent, etc.) of "str*" should/could be in NEW code, if we base it on code that has been hacked on as long as the various Unices.
Anybody want to cough up a "Stryngstone" benchmark for us to throw rocks at?

Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA 94403
earl@mips.UUCP (04/01/87)
In article <15331@amdcad.UUCP>, tim@amdcad.UUCP (Tim Olson) writes:
> > Quiz: what are the magic constants that need to be loaded into t0 and
> > t1 to make this work?
> Let's see... How about
>     t0 = 0x7efefeff
>     t1 = 0x81010100
> Is that right? If so, it appears to work only for 7-bit characters (but
> we admit that such a restriction is reasonable in real life)...

Yes, you've got it. Congratulations. (Gee, do I owe you lunch or something now?) Actually, it works for 8-bit characters in bytes 0, 1, and 2 and a 7-bit one in byte 3 (I'm using little-endian byte numbers here). Thus it may accept '\200' as a zero in byte 3. But this is easily handled in the follow-on code that has to detect which byte is nonzero, by actually testing byte 3 and branching back to the loop if it is nonzero. Actually, here is that code. Note how after the word loop the carries also conveniently let you detect the byte. This part only works for little-endian byte ordering. (Note: my personal convention is to indent instructions in branch delay slots one space):

	## search for word with zero byte
l1:	lw	t2,1(a1)	# read next word
	addu	a1,4
	addu	t3,t2,t0	# cause carry out of any nonzero byte
	xor	t3,t2		# xor to find carry bits
	and	t3,t1		# look at carries out of bytes
	beq	t3,t1,l1	# all bytes carried => none zero

	## found word with zero byte, now find the byte
	sll	t4,t3,23	# carry out of byte 0?
l3:	beq	t4,0,x0		# if not, go compute length
	 sll	t4,t3,15	# carry out of byte 1?
	bge	t4,0,x1		# if not, go compute length
	 sll	t4,t3,7		# carry out of byte 2?
	bge	t4,0,x2		# if not, go compute length
	 srl	t4,t2,24	# byte 3 zero?
	bne	t4,0,l1		# no, re-enter loop
jesup@steinmetz.UUCP (04/01/87)
In article <15829@sun.uucp> guy@sun.UUCP (Guy Harris) writes:
>I'd be interested to know how much string processing is, in fact,
>done by *ad hoc* code because it didn't quite match the string
>routines available on the system? "strtok" and company weren't in
>V7, so people would usually have done their own string-chopping.

For the shell I wrote for the Amiga (and for other programs as well), I have often used a set of string routines I wrote that always returns a pointer to the END of the string, not the beginning. This way I rarely, if ever, have to call strcat. I usually know what the beginning of the string is (except for things like strcpy(malloc(SIZE),string)).

    Randell Jesup
    jesup@steinmetz.uucp
    jesup@ge-crd.arpa
john@frog.UUCP (04/01/87)
> Well, John Mashey's numbers were certainly counter-intuitive to ME...

Really sophisticated programs spend most of their time doing really sophisticated things; simple grunt routines like str*() aren't universally applicable.

> When were nroff/grep/yacc written compared to when str* was standardized
> in the library? After all, the str* libraries didn't spring full-blown on
> day one...

I don't know (offhand) about these, but I recall once being curious about why lex was slow. It turned out to use subroutines for isalpha(), isspace(), isdigit(), etc., instead of the (presumably newer) ctype.h macros. HOWEVER, even after making that change to a version of lex, it didn't help all that much, because lex has so much else to do.

> Even USING the libraries, sometimes instead of:
>	strcpy(buf, "piece");
>	strcat(buf, "another piece");
>	strcat(buf, "still another piece");
> I find myself doing:
>	char *p = &buf[0];
>	int n;
>	strcpy(p, "piece"); p += strlen(p);
>	strcpy(p, "another piece"); p += strlen(p);
>	sprintf(p, "format%d%s", args...); p += strlen(p);
>	strcpy(p, "still another piece"); p += strlen(p);
> etc...

WHOOPS! That's about as much work (first you copy the string, then you step over it) as the use of strcat(). If you don't mind being non-portable to oddball architectures, you can write a routine like "strcatl()", which takes a destination buffer as a first argument, and a NULL-terminated list of strings to concatenate into that buffer (or you can be slightly portable and do it with <varargs.h>). The easiest way might be, however, to use:

char *strcpya(p1, p2)	/* strcat and return end address */
char *p1, *p2;
{
	while (*p1++ = *p2++)
		/**/;
	return p1-1;
}

and then

	strcpya(strcpya(strcpya(buf, "first piece"), " second piece"), " and the last");

or, if you prefer,

	char *p, buf[127 /* a prime number */];

	p = strcpya(buf, "first piece");
	p = strcpya(p, " second piece");
	p = strcpya(p, " third piece");

Either way, you only touch each byte once.
Sometimes, the question you need to ask yourself about library routines is not "Could they be better coded with this screwball assembly instruction?" but rather "Does it *really* do what I want?".
-- 
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

"Happiness is the planet Earth in your rear-view mirror."
		- Sam Hurt, in "Eyebeam, Therefore I Am"
kent@xanth.UUCP (04/01/87)
In article <15325@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>Well, this is so damn long that I'll just post strcpy.

Brian, I almost choked when I saw your code, right after seeing the same job done in a handful of lines in a posting that hit here a few articles earlier. You couldn't have written a more cogent argument for byte-level addressing in any hardware which does text processing if it had been your intent.

Programmers are _the_most_expensive_part_ of the total lifecycle (including execution) costs of all but the most widely distributed, embedded code. Who could afford to pay the staff to write code ten or fifteen times as long (I didn't count) to do the same job in a comparable time?

The discussion so far has brought out quite a lot of evidence that applications-level programmers aren't using the library string routines much, yet Unix stream files, and the masses of editors, word processors, post-processors, compiler parsers and so on in use, testify that the job is being done. The implication is that the application programmers are writing the string software themselves, to suit the particular case at hand. I sure don't want to pay them to write code like that you showed, at the application level, to replace

	while (*s1++ = *s2++);

Yeeks! You've talked _me_ into byte instructions and byte addressing. Comments?

Kent. (kent@xanth.UUCP)
meissner@dg_rtp.UUCP (04/02/87)
In article <6448@watmath.UUCP> rbutterworth@watmath.UUCP writes:
> Wouldn't it have been nice if strcpy(), strcat(), sprintf(), etc.
> had been designed to return a pointer to the END of the string,
> (or a count of the number of characters in the string)?

The System V.3 (and ANSI) *{scanf,printf} routines have a format specifier %n which stores the number of bytes {read, written} so far into the int pointed to by the current argument. Also, in System V.xx and ANSI, sprintf returns the total number of bytes written (not the pointer to the string like 4.2 BSD does).
--
Michael Meissner, Data General		Uucp: ...mcnc!rti-sel!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?
stubbs@ncr-sd.UUCP (04/04/87)
In article <978@ames.UUCP> lamaster@pioneer.UUCP (Hugh LaMaster) writes:
>My question is this: How likely is it that a compiler itself will be able to
>detect the case when it can use an instruction like this and generate code
>automatically to use it.

In this case, these instructions would be used in the hand-coded strcpy and strcmp C library routines, so compilers need not be concerned. My guess is they are still a win, because these libraries are used a bunch.

Jan Stubbs	....sdcsvax!ncr-sd!stubbs
619 485-3052
NCR Corporation, Advanced Development
16550 W. Bernardo Drive MS4010
San Diego, CA. 92127
pec@necntc.UUCP (04/06/87)
How likely is it that a compiler can make use of string copy instructions? NEC's re-targeting of the portable C compiler to the V60 architecture frequently takes advantage of one of the V60's string copy instructions. It uses a string copy for structure assignments! Of course, these instructions are also used in the library routines. Another use that comes to mind is when the loader zeros out a BSS area...
paulsc@orca.UUCP (04/06/87)
In article <7843@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> ... I think it is time for some *real* optimizing compilers that can detect
>> when they are compiling a benchmark and optimize for the appropriate
>> instruction! :-) :-) :-)
>Don't laugh! It's not unknown for an optimizing-compiler group to put extra
>attention into making common benchmarks run fast. I'm told that there is at
>least one compiler that has a specialized loop-unrolling optimization whose
>major purpose is to quietly turn the "looping" version of Linpack into the
>"unrolled" version.

It isn't always a conscious effort to trick customers into buying the compiler. The compiler group might only have a handful of "test programs" for their compiler. If the compiler does any peephole optimizations, the compiler group probably compiles whatever test programs they have to see how their compiler might generate better code. If any of their test programs are benchmarks, they will be spending time making the benchmarks run fast.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
tektronix!orca!paulsc
eugene@pioneer.UUCP (04/07/87)
In article <2344@orca.TEK.COM> paulsc@orca.UUCP (Paul Scherf) writes:
>It isn't always a conscious effort to trick customers into
>buying the compiler. The compiler group might only have a
>handful of "test programs" for their compiler.
>
>Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA

This is part of the point of my research: to get everyone writing compilers to have a large, consistent set of programs to check optimizations, just like we have programs (validation suites) to check conformance to language standards. I hope I can get the various companies interested in an open set of performance measurement tools.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups. If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
am@cl-jenny.UUCP (04/08/87)
In article <15292@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>One unique feature of the Am29000 architecture is a special instruction....
> ... [to speed up detecting null bytes in words].

You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay, even in C). (The code that follows relies on x being a 32-bit unsigned (or 2's complement int with overflow ignored)...)

	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

Then if e is an expression without side effects (e.g. a variable),

	has_nullbyte_(e)

is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors, which is hidden from the instruction level.) On Acorn's ARM RISC machine, using this feature as Brian Case suggests speeded up dhrystone by 20-25%, which may say something about dhrystone or about RISC. This gives some 5-6K dhrystones on the slow-clocked pre-production 3 micron chips. (Wonder what state-of-the-art fabrication can give....)
karl@haddock.UUCP (04/11/87)
In article <486@danews.ATT.COM> lvc@danews.ATT.COM (Larry Cipriani) writes:
>In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
>> I find myself doing:
>>	char *p = &buf[0];
>>	int n;
>>	strcpy(p, "piece"); p += strlen(p);
>>	strcpy(p, "another piece"); p += strlen(p);
>>	sprintf(p, "format%d%s", args...); p += strlen(p);
>>	strcpy(p, "still another piece"); p += strlen(p);
>> etc...
>
>It seems to me that the code would be far more readable and faster
>if it were written as: p = cat(p1, p2, ..., (char*)0);

Or, to better mimic the original semantics, p = cat(buf, p1, ...) with no malloc involved. (Yeah, the malloc version is useful too; I've got one of my own.) However, this is only useful if each operation is a strcpy or strcat; it doesn't work if there are intermixed one-char stores (*p++ = c) or more complex stuff (the embedded sprintf in the original example). Of course, one could always write the whole thing with sprintf, using "%s%s%s" format instead of strcat(strcat(strcpy(...))), but sometimes it's better to use the low-level stuff.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
stevew@karau.UUCP (04/11/87)
In article <486@danews.ATT.COM>, lvc@danews.ATT.COM (Larry Cipriani) writes:
> In article <15317@amdcad.UUCP>, rpw3@amdcad.UUCP (Rob Warnock) writes:
> > I find myself doing:
> >	char *p = &buf[0];
> >	int n;
> >	strcpy(p, "piece"); p += strlen(p);
> >	strcpy(p, "another piece"); p += strlen(p);
> > etc...
> It seems to me that the code would be far more readable and faster
> if it were written as:
>	p = cat(p1, p2, ..., (char*)0);
> where all the p's are pointers to characters. Cat will malloc
> enough space to store a concatenation of all the pn's and a '\0'.

Another possible solution would be:

	cat(p, n, p1, p2, ..., (char*)0);

where p == destination string, n == amount of space available, p1 == the first pointer, etc... You might have two versions, one with n, one without. The debate is now what it should return:

	1. pointer to the initial string.
	2. the number of characters cat()'ed.
	3. a pointer to the terminating null.
jack@mcvax.UUCP (04/15/87)
In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>You might be interested to know that such detection of null bytes in words
>can be done in 3 or 4 instructions on almost any hardware (nay even in C).
>(Code that follows relies on x being a 32 bit unsigned (or 2's complement
>int with overflow ignored)...)
> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
> has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

I was so impressed by this new trick (well, to *me* it is new :-) that I immediately decided to try it. On my Whitechapel MG-1, a 32016-based machine, the results were impressive. I coded strcpy() using this method, and the results were great. Break-even with the normal strcpy() at 4-char strings, performance slightly worse with 5/6/7-char strings, and getting better and better from there on. For strings with length 4N (N>=4), performance was twice that of the old strcpy(). This is the routine:

#define hasnull(x) ((x-0x01010101) & ~(x) & 0x80808080)

strcpy(at,f)
long *at;
register long *f;
{
	register long d;
	register long *t = at;
	register char *fc, *tc;

	do {
		d = *f++;
		if( !hasnull(d) ) {
			*t++ = d;
			continue;
		}
		tc = (char *)t;
		fc = (char *)(f-1);
		while( *tc++ = *fc++ );
		return(at);
	} while(1);
	return(at);
}

Coding in assembler caused a 30% decrease in time for small (10-char) strings (fewer registers to save, t/tc and f/fc in the same reg, etc.). Something I haven't explained yet is that unaligned strings give the *same* performance. Maybe the extra fetches are noise wrt the number of instruction fetches? Note that the 32016 is a 32-bit machine with a 16-bit bus, so that is probably why I found twice the speed, instead of four times.

Anyway, the next thing I thought of is "Wow! This is *great* for strcmp() on big-endians. Comparing 4 bytes in one go through the loop!". But, of course, I don't have a big-endian handy. Anyone care to try this?
-- 
Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
The shell is my oyster.
greg@utcsri.UUCP (04/17/87)
In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>Then if e is an expression without side effects (e.g. variable)
> has_nullbyte_(e)
>is nonzero iff the value of e has a null byte.

If one of the bytes contains 0x80, then has_nullbyte() will say 'yes'. This can be circumvented by a more thorough test after this one to see if there really is a null there. Jack Jansen's subroutine does this apparently by accident; there is a 'while( *tc++ = *fc++ );' after the test finds a 'null', so the only effect of having a 0x80 byte will be to revert to the normal strcpy for the rest of the string.

Someone posted a similar but usually better test on comp.arch (I think) a little while ago:

	#define has_nullbyte(e) ((((e+0x7efefeff)^e)&0x81010100) != 0x81010100)

This one is only 'fooled' by an 0x80 in the most significant byte, which makes the following test much simpler (a sign test). In either case (especially in this one with two identical constants) it helps a lot if the constants are in registers while the loop is running. When using C, you can do this explicitly:

	register k1=0x7efefeff, k2=0x81010100;

-- 
----------------------------------------------------------------------
Greg Smith	University of Toronto	UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
bjorn@alberta.UUCP (04/22/87)
In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>> has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

You're mistaken: the "has_nullbyte_(x)" defined above works as advertised for all 32-bit x. That is to say, it returns a nonzero result if and only if x contains a null byte.

>Someone posted a similar but usually better test on comp.arch ( I think )
>a little while ago:
>
>	#define has_nullbyte(e) ((((e+0x7efefeff)^e)&0x81010100) != 0x81010100)
>
>This one is only 'fooled' by an 0x80 in the most significant byte,
>which makes the following test much simpler ( a sign test ).

You are right, this one does not always tell the truth. Besides, it's a lot more effort.

However, for tricks like these to work reliably (especially on systems with memory protection) you had best get help from your local linker and 'malloc'. Otherwise one day a program is going to read off the edge of an accessible hunk of memory and flush your program straight down the drain. Making sure that no data item ends closer than n-1 bytes (if you're reading n bytes at a time) from a boundary (between accessible and inaccessible memory) fixes this.

	Bjorn R. Bjornsson
	{mnetor,ihnp4}!alberta!bjorn
greg@utcsri.UUCP (04/23/87)
In article <299@pembina.UUCP> bjorn@alberta.UUCP (Bjorn R. Bjornsson) writes:
>In article <4605@utcsri.UUCP>, greg@utcsri.UUCP writes:
>>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>>> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>>Then if e is an expression without side effects (e.g. variable)
>>> has_nullbyte_(e)
>>>is nonzero iff the value of e has a null byte.
>>
>>If one of the bytes contains 0x80, then has_nullbyte() will	[BUNK]
>>say 'yes'. This can be circumvented by a more thorough test	[BUNK]
>>after this one to see if there really is a null there.	[BUNK]
>
>You're mistaken the "has_nullbyte_(x)" defined above works
>as advertised for all 32 bit x. That is to say it returns
>a nonzero result if and only if x contains a null byte.

Right. I apologize for any inconvenience I may have caused anybody. I also apologize for posting bunk in a sleepy haze and then vanishing for a week.

>>Someone posted a similar but usually better test on comp.arch ( I think )
>>a little while ago:
>>
>>	#define has_nullbyte(e) ((((e+0x7efefeff)^e)&0x81010100) != 0x81010100)
>>
>>This one is only 'fooled' by an 0x80 in the most significant byte,
>>which makes the following test much simpler ( a sign test ).
>
>You are right this one does not always tell the truth. Besides
>it's a lot more effort.

It seems to be about the same, in the loop. Below are the subtract version and the add version on an NS32k:

	movd	0(r0),r1
	subd	r2,r1
	bicd	0(r0),r1
	andd	r3,r1		; r2 & r3 contain constants
	cmpqd	$0,r1		; need this on ns32k

	movd	0(r0),r1
	addd	r2,r1
	xord	0(r0),r1
	andd	r3,r1
	cmpd	r3,r1

I can't remember why I thought it was more efficient. I guess it is if you haven't got 'bic' (and if 'and' doesn't set the Z flag :-). The original poster of this method was describing a RISC which had a 'compare registers and branch if equal' op.
>
>However for tricks like these to work reliably (especially on
>systems with memory protection) you had best get help from
>your local linker and 'malloc'. Otherwise one day a program
>is going to read off the edge of an accessible hunk of memory
>and flush your program straight down the drain. Making sure
>that no data item ends closer than n-1 bytes (if you're reading
>n bytes at a time) from a boundary (between accessible and
>inaccessible memory) fixes this.

Not generally a problem. The whole point of this is that the 32-bit accesses are constrained to 32-bit boundaries to make them fast. Furthermore, in most systems, your access-controlled segments or whatever are rounded up to a size at least as round as the machine word. (The MMU does not usually deal with address lines a0,a1 on a 32-bit machine.) Thus if it is legal to read the first byte in a word, it is legal to read the whole word, and thus the last byte.
-- 
----------------------------------------------------------------------
Greg Smith	University of Toronto	UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
am@cl.cam.ac.uk (Alan Mycroft) (04/27/87)
In article <4605@utcsri.UUCP> greg@utcsri.UUCP (Gregory Smith) writes:
>In article <693@jenny.cl.cam.ac.uk> am@cl.cam.ac.uk (Alan Mycroft) writes:
>> #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
>>Then if e is an expression without side effects (e.g. variable)
>> has_nullbyte_(e)
>>is nonzero iff the value of e has a null byte.
>
>If one of the bytes contains 0x80, then has_nullbyte() will
>say 'yes'. This can be circumvented by a more thorough test
>after this one to see if there really is a null there.

Untrue. It would be true if I had mistakenly used ^ instead of &~.