stone@mars.mpr.ca (Darren Stone) (03/16/91)
Does anyone know if there's a better (!?) way to compare
strings than with strcmp()?

I'm looking for any speed advantage I can get!

Thanx in advance...
	- Darren -
wirzenius@cc.helsinki.fi (Lars Wirzenius) (03/17/91)
In article <1991Mar15.142821@mars.mpr.ca>, stone@mars.mpr.ca (Darren Stone) writes:
> Does anyone know if there's a better (!?) way to compare
> strings than with strcmp()?
> I'm looking for any speed advantage I can get!

For non-portability, the fastest way would probably be to write inline
assembly code using the machine's string compare instruction. Of
course, this instruction doesn't exist on all machines, but if it does,
it's probably quite fast.

Another (hopefully somewhat more portable) way to go is to be clever,
and to avoid string comparisons by modifying the algorithm so they
aren't needed (or are needed less frequently).

After that you may be able to improve on the comparison method. One
way is to avoid calling strcmp if the first characters differ:

	#define STRCMP(s,t) (*(s) == *(t) ? strcmp((s), (t)) : *(s) - *(t))

(or something along those lines; this is untested). If your strings
are mostly going to differ in the first few bytes, you could
pre-compute a special value based on those bytes, and compare it first:

	#define BITS_IN_CHAR 8

	struct string {
		char *str;
		long value;
	};

	void compute_value(struct string *s)
	{
		int i;

		s->value = 0;
		for (i = 0; i < sizeof(long) && s->str[i] != '\0'; ++i) {
			s->value <<= BITS_IN_CHAR;
			s->value |= (unsigned char) s->str[i];
		}
		if (i < sizeof(long))
			s->value <<= BITS_IN_CHAR * (sizeof(long) - i);
	}

(The above could be optimized, but you get the idea. Note the cast to
unsigned char: without it, a character with the high bit set would
sign-extend and corrupt the value on machines where plain char is
signed.)

If the pre-computed values differ, you know the strings differ in the
first sizeof(long) bytes, so there's no need to call strcmp.
Furthermore, if you are careful to arrange the first character as the
most significant byte of the value (and so on), you can even find out
which string is greater (according to the byte values; no locale
collation sequence here). Of course, if you need the locale collation
sequence, you have to use strcmp.
--
Lars Wirzenius    wirzenius@cc.helsinki.fi
gwyn@smoke.brl.mil (Doug Gwyn) (03/17/91)
In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
>Does anyone know if there's a better (!?) way to compare
>strings than with strcmp()?
>I'm looking for any speed advantage I can get!

	#define StrEq( a, b )	(*(a) == *(b) && strcmp( a, b ) == 0)	/* UNSAFE */
dave@cs.arizona.edu (Dave Schaumann) (03/17/91)
In article <15486@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
|In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
|>Does anyone know if there's a better (!?) way to compare
|>strings than with strcmp()?
|>I'm looking for any speed advantage I can get!
|
|#define StrEq( a, b ) (*(a) == *(b) && strcmp( a, b ) == 0) /* UNSAFE */

By "unsafe", I assume you mean it will likely crash on a NULL pointer.
I wouldn't be a bit surprised if strcmp() chokes on a NULL pointer,
too.

But the main point I want to make is that this really buys you little
unless you are working under two conditions:

	- strcmp() is not inlined
	- function calls are expensive

I think you'll have to agree that any professional C compiler worth the
name would almost certainly inline *at least* the str???() functions,
given an environment where function calls are expensive.

Also, if your comparisons tend to succeed (compare equal) a lot, this
will actually be slower.
--
Dave Schaumann      | dave@cs.arizona.edu | Short .sig's rule!
gvr@cs.brown.edu (George V. Reilly) (03/17/91)
In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
%
% Does anyone know if there's a better (!?) way to compare
% strings than with strcmp()?
%
% I'm looking for any speed advantage I can get!
If you're comparing strings for equality, you could use the
following macro:
#define STREQ(a,b) (*(a) == *(b) && (*(a) == '\0' || strcmp((a)+1,(b)+1) == 0))
which only calls strcmp() if the two strings agree in the first
character and if they're not zero-length strings. Thus, for most
string comparisons, strcmp will not be called. Unless the
implementors of your libraries really botched it, strcmp()
should already be extremely efficient.
You should be able to work out STRLT, STRLE, STRNE, STRGE, and STRGT
(if you need them) based on the above example.
________________
George V. Reilly `Toad of Toad Hall' gvr@cs.brown.edu +1 (401) 863-7684
uunet!brunix!gvr gvr@browncs.bitnet Box 1910, Brown U, Prov, RI 02912
gwyn@smoke.brl.mil (Doug Gwyn) (03/17/91)
In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
>|#define StrEq( a, b ) (*(a) == *(b) && strcmp( a, b ) == 0) /* UNSAFE */
>By "unsafe", I assume you mean it will likely crash on a NULL pointer.

No, I mean that it is an "unsafe function-like macro", meaning that its
arguments may be evaluated other than once each, unlike the situation
for a genuine function invocation. This is something that the user of
the macro needs to be aware of. I take it for granted that a null
pointer should not be provided as an argument to a function expecting
an object pointer, unless it is specifically advertised as doing
something useful in that case.

>Also, if you tend to compare equal successfully a lot, this will actually
>be slower.

In most applications, comparing unequal would be far more frequent.
This optimization is a win in almost all situations, even when the
compiler generates in-line code for strcmp(). (Note, however, that one
release of Gould's UTX-32 C compiler generated bad code if an argument
to StrEq() was a string literal; the base registers got used up
unnecessarily. I think this was fixed in a later release.)

If you happen to know that there is at least one character before the
terminator in each string, you can further optimize:

	#define StrEq1( a, b )	(*(a) == *(b) && strcmp( (a) + 1, (b) + 1 ) == 0)	/* UNSAFE */

However, I prefer the first form, which works wherever strcmp() does,
so long as the multiple evaluation of its arguments is not an issue.
mark@urz.unibas.ch (03/18/91)
In article <1991Mar15.142821@mars.mpr.ca>, stone@mars.mpr.ca (Darren Stone) writes:
>
> Does anyone know if there's a better (!?) way to compare
> strings than with strcmp()?
>

Have you already looked at what code an optimizing compiler produces
for strcmp()? It may be very good inline code (no subroutine call) if
the compiler is clever.

G.Mark
drh@duke.cs.duke.edu (D. Richard Hipp) (03/18/91)
In article <15486@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
>>Does anyone know if there's a better (!?) way to compare
>>strings than with strcmp()?
>>I'm looking for any speed advantage I can get!
>
>#define StrEq( a, b ) (*(a) == *(b) && strcmp( a, b ) == 0) /* UNSAFE */

StrEq does not duplicate the semantics of strcmp. Recall that strcmp
returns either negative, zero, or positive depending on whether "a" is
less than, equal to, or greater than "b". StrEq, on the other hand,
returns only true or false depending on whether or not "a" is equal to
"b". This will make a big difference if you are using StrEq to sort,
or to do a binary search.
consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) (03/19/91)
Somebody said part of the problem with string comparisons is that if
most of the strings are equivalent, a lot of processor time will be
spent. Wouldn't it be good, then, to include in the macro something to
compare the actual POINTERS? If the pointers are the same then the two
strings have no CHOICE but to be equivalent. This would really cut
down the time under certain situations.

-Rob Konigsberg
gwyn@smoke.brl.mil (Doug Gwyn) (03/19/91)
In article <669305557@juliet.cs.duke.edu> drh@duke.cs.duke.edu (D. Richard Hipp) writes:
>StrEq does not duplicate the semantics of strcmp.

Nor was it meant to.
peter@ficc.ferranti.com (Peter da Silva) (03/19/91)
In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
> I think you'll have to agree that any professional C compiler worth the name
> would almost certainly inline *at least* the str???() functions; given an
> environment where function calls are expensive.

By "professional" I assume you mean "optimised for running dhrystone",
right?
--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
                 'U`  "Have you hugged your wolf today?"
gwyn@smoke.brl.mil (Doug Gwyn) (03/19/91)
In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
>Wouldn't it be good then, to include in the macro, something to compare
>the actual POINTERS?  If the pointers are the same then the two strings
>have no CHOICE but to be equivalent.

Nice try, but in the vast majority of applications the strings being
compared, even in cases where they match, are contained in different
storage locations. E.g.

	if ( StrEq( buffer, "EOF" ) ) ...
barton@cbnewsk.att.com (jim.m.barton) (03/19/91)
In article <15486@smoke.brl.mil>, gwyn@smoke.brl.mil (Doug Gwyn) writes:
> In article <1991Mar15.142821@mars.mpr.ca> stone@mars.mpr.ca (Darren Stone) writes:
> >Does anyone know if there's a better (!?) way to compare
> >strings than with strcmp()?
> >I'm looking for any speed advantage I can get!
>
> #define StrEq( a, b ) (*(a) == *(b) && strcmp( a, b ) == 0) /* UNSAFE */

It is doubtful that there is one *general* way of optimizing strcmp()
for all machines. Most strcmp() implementations are written in machine
language and have already gone through some serious optimization
analysis. Also, since comparing NUL-terminated character strings is
such a common operation, many (non-RISC) machines today have some
direct support for comparing strings in their instruction sets. This
means that a strcmp() can be done in only a few instructions on many
machines, rather than with a character-by-character comparison loop.

Although there is probably no *general* way of optimizing strcmp(), you
might be able to do optimizations specific to your application by
taking advantage of the type and expected frequency of the data your
application uses. The above macro, StrEq(), MIGHT be faster if most of
the strings being compared tend to differ in their first character.
When two strings differ in their initial character, StrEq() will save
having to do a strcmp() call. Note, however, that when a pair of
strings do NOT differ in their first character, a strcmp() must be done
anyway, along with the added overhead (albeit small) of the extra
character comparison and the logical AND operation. StrEq() might
actually be SLOWER than a simple strcmp(), depending on your
application's data and your machine (the overhead of doing the extra
operations versus the execution time of strcmp()). IMHO, I doubt that
StrEq() will achieve any significant optimization in MOST cases; it may
be either slower, or faster by an amount too small to measure.

A more fruitful approach might be to examine your data more closely to
see if some of the strcmp()'s could be avoided. For example, if many
of the strings being compared are actually drawn from a restricted set
of values rather than being arbitrary strings, you might be able to
represent them internally as enumerated values (where integer
comparisons could be done) rather than as strings. Also, for strings
that have fixed or known lengths, you could do a memcmp() rather than a
strcmp(), although any speedup there would certainly be
machine-dependent.

Jim Barton
rsalz@bbn.com (Rich Salz) (03/19/91)
In <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
|	Wouldn't it be good then, to include in the macro,
|something to compare the actual POINTERS?  If the pointers are the
|same then the two strings have no CHOICE but to be equivalent.  This
|would really cut down the time under certain situations.

If you're careful about the circumstances under which you use it, you
might want to do something like this:

	#define EQ(p, q) \
		((p) == (q) || ((p)[0] == (q)[0] && strcmp((p), (q)) == 0))

I did this in Coda ("File distribution alternative to rdist",
comp.sources.unix Volume 21), which builds lots of structures with
shared strings. I got a slight performance gain (1 or 2%, I think).

It's the difference between (eq ...) and (equal ...), right-p?
	/r$
--
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.
wirzenius@cc.helsinki.fi (Lars Wirzenius) (03/19/91)
In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu>, consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
> Somebody said part of the problem with string comparisons is that if
> most of the strings are equivalent, there will be a lot of processor
> time.  Wouldn't it be good then, to include in the macro, something
> to compare the actual POINTERS?  If the pointers are the same then
> the two strings have no CHOICE but to be equivalent.  This would
> really cut down the time under certain situations.

No: two strings may be equivalent even if they are stored in different
places (this applies to string constants too, if they appear several
times). For example:

	char *s = "hello";
	char *t = "hello";

s and t may or may not point to the same character; you can't depend on
either behaviour. Another example:

	char s[10] = "hello";
	char t[10];
	strcpy(t, s);

Now the contents of s and t (up to and including the terminating '\0',
but no further, of course) are the same, but the addresses of the two
variables are different.

Of course, if two pointers to char are equal, the strings will be the
same, but this isn't the only case, nor do I think it is all that
common.
--
Lars Wirzenius    wirzenius@cc.helsinki.fi
steve@taumet.com (Stephen Clamage) (03/19/91)
peter@ficc.ferranti.com (Peter da Silva) writes:
>In article <1193@caslon.cs.arizona.edu> dave@cs.arizona.edu (Dave Schaumann) writes:
>> I think you'll have to agree that any professional C compiler worth the name
>> would almost certainly inline *at least* the str???() functions; given an
>> environment where function calls are expensive.

>By "professional" I assume you mean "optimised for running dhrystone", right?

I wouldn't assume so. Some of the string functions, notably strcpy and
strlen, have shorter bodies than calling sequences. A compiler writer
would choose to emit these functions inline because the code is both
smaller and faster.
--
Steve Clamage, TauMetric Corp, steve@taumet.com
torek@elf.ee.lbl.gov (Chris Torek) (03/20/91)
In article <15510@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>... in the vast majority of applications the strings being
>compared, even in cases where they match, are contained in different
>storage locations.  E.g.
>	if ( StrEq( buffer, "EOF" ) ) ...

Of course, you can build yourself a `string pool' system, in which each
distinct string appears exactly once, and then two strings match iff
their pointers match . . . but this merely offloads the `compare for
truly equal' into the string pool system. If you need to do many
compares, this is probably worthwhile anyway.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) (03/20/91)
In article <15510@smoke.brl.mil>, gwyn@smoke.brl.mil (Doug Gwyn) writes:
|> In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
|> >Wouldn't it be good then, to include in the macro, something to compare
|> >the actual POINTERS?  If the pointers are the same then the two strings
|> >have no CHOICE but to be equivalent.
|>
|> Nice try, but in the vast majority of applications the strings being
|> compared, even in cases where they match, are contained in different
|> storage locations.  E.g.
|>	if ( StrEq( buffer, "EOF" ) ) ...

Yeah, I figured. I just thought I'd give it a try.

-Rob Konigsberg
scs@adam.mit.edu (Steve Summit) (03/20/91)
In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>A more fruitful approach might be to examine your data more closely to
>see if some of the strcmp()'s could be avoided.  For example, if many
>of the strings being compared are actually part of a restricted set of
>values rather than arbitrary strings, you might be able to represent them
>internally as enumerated values (where integer comparisons could be done)...

Indeed. When it's important to do so, reducing the number of explicit
string comparisons is not hard, even when the strings are neither
restricted nor known in advance. Hashing is a very powerful and
surprisingly easy technique to use.

I was just writing a diff-like program the other day. By representing
strings not with individual char *'s, but rather with instances of this
structure:

	struct chunk
		{
		char *c_ptr;
		unsigned int c_hash;
		};

and by keeping the c_hash field filled in with a hash value derived
from the string pointed to by c_ptr, I could compare strings for
equality with this macro:

	#define Chunkeq(cp1, cp2) ((cp1)->c_hash == (cp2)->c_hash && \
				strcmp((cp1)->c_ptr, (cp2)->c_ptr) == 0)

When you have a lot of strings to compare, the time spent computing
hash values, and the extra space to store them, can be well worth it.

You can also write great little symbol table implementations using
hashing, and the hash values don't even have to take up extra space
(they're implicit in the bucket or slot indices).

(These techniques are closely related to the "string pool" modules
mentioned by Doug and Chris. In fact, a string pool would almost
certainly be written using hashing for string insertion.)

See any algorithms book for much more on hashing; it's got nothing
particularly to do with C. (Please don't ask "what's the best string
hash function?", which is an FAQ, though not on the list yet.)

					Steve Summit
					scs@adam.mit.edu
marc@arnor.UUCP (Marc Auslander) (03/20/91)
A correct compiler cannot inline strcmp unless it is given extralingual
permission to do so, for example in a pragma, or by defining some
reserved symbol, or by some other means. This is because it is legal
to combine a program which uses strcmp with another which defines it to
be anything at all!

For example, the RISC System/6000 xlc compiler inlines __strcmp and
__strcpy. (In ANSI C, it's OK for the compiler to do funny things with
symbols which start with "__".) Then, string.h contains defines of
strcmp and strcpy to the __ versions, which take effect if __STR__ is
defined. From the user's point of view, you define __STR__ if you want
inline string functions. One user approach is to add -D__STR__ to the
CFLAGS in his makefile.

(If you're going to the trouble of inlining, it's a good idea to do it
in a way which lets you get the advantage in benchmarks!)
--
Marc Auslander    <marc@ibm.com>
flint@gistdev.gist.com (Flint Pellett) (03/21/91)
The original poster apparently is only interested in an equal/not-equal
result from strcmp(), not the before/equal/after result that strcmp()
actually returns. I'd like to expand the question a bit: I'm curious
whether anyone has a method that actually speeds up strcmp() in the
situation where you have to get the lexicographic ordering information
back. (Such as when strcmp() is used as the basis for comparison in a
bsearch().)
--
Flint Pellett, Global Information Systems Technology, Inc.
1800 Woodfield Drive, Savoy, IL  61874     (217) 352-1165
uunet!gistdev!flint or flint@gistdev.gist.com
henry@zoo.toronto.edu (Henry Spencer) (03/21/91)
In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>... The above macro, StrEq(), MIGHT be faster if most
>of the strings being compared tend to differ in their first character...
>IMHO, I doubt that StrEq() will achieve any significant optimization for
>MOST cases...

Sorry, there is considerable experimental evidence that it does. The
fact is, most string comparisons fail on the very first character, and
even a compiler-inlined strcmp() (not all that common in the Unix world
yet, thanks to our geriatric compilers) is generally a more expensive
way of discovering this.
--
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu  utzoo!henry
henry@zoo.toronto.edu (Henry Spencer) (03/21/91)
In article <1991Mar16.211154.5568@cc.helsinki.fi> wirzenius@cc.helsinki.fi (Lars Wirzenius) writes:
>For non-portability, the fastest way would probably be to write inline
>assembly code using the machine's string compare instruction...

Actually, not necessarily so. Even on machines which have a
string-compare instruction, that instruction is often relatively slow.
It can be quicker to do it yourself.

There are also some fiendish tricks that can be played to considerably
beat the performance of the obvious compare loop, and of any
string-compare instruction which basically just implements that loop.
However, you'll have to wait a bit longer to hear about them, since the
summer Usenix conference rejected my paper...
--
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu  utzoo!henry
gwyn@smoke.brl.mil (Doug Gwyn) (03/21/91)
In article <11111@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
-Of course, you can build yourself a `string pool' system, in which each
-distinct string appears exactly once, and then two strings match iff their
-pointers match . . . but this merely offloads the `compare for truly
-equal' into the string pool system.
-If you need to do many compares, this is probably worthwhile anyway.
Gee, great minds think alike!
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)
In article <1991Mar20.174452.4246@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar19.034802.24340@cbnewsk.att.com> barton@cbnewsk.att.com (jim.m.barton) writes:
>>... The above macro, StrEq(), MIGHT be faster if most
>>of the strings being compared tend to differ in their first character...
>>IMHO, I doubt that StrEq() will achieve any significant optimization for
>>MOST cases...
>Sorry, there is considerable experimental evidence that it does.  The fact
>is, most string comparisons fail on the very first character, and even a

An easy way to verify that two strings typically differ in their first
character is to run the following (dumb) sorting program. For the
first 10k words out of /usr/dict/words, scrambled up in essentially
random order, it says 89% of the comparisons failed on their first
character.

This is not much of a test, granted, and there _will_ be some cases
(e.g. sorting nearly-sorted lists of words) where comparisons _won't_
be decided by the first characters. However, even in these latter
cases the overhead of an extra `compare' versus the overhead of the
strcmp loop will be small; please feel free to verify this
experimentally.

(As a back-of-the-envelope calculation, we can say that if it costs S
instructions for the strcmp code and C instructions for the compare,
then on average we will perform C + .11 S instructions. We can then
ask ``how large would C have to be to make it worth using only S?''
We find break-even at C = .89 S. Therefore, provided the compare
sequence (including jumps and register loads) on your machine costs
less than 89% of a (typical) call or inline of strcmp, USE THE
COMPARE.)

As a slight side note, there are certain contexts where string
comparisons can be made MUCH more efficient than a
character-by-character compare. One example is the well-known `string
search' algorithms that essentially compare _rare_ characters within
each string (e.g. if one string contains a `z', first search for `z' in
the other string, and if found then match the rest) and then adjust
pointers by an appropriate amount (rather than just incrementing them)
depending on the outcome of each character comparison. I remember
reading an article in Software Practice and Experience that compared
such a statistical method with string-matching INSTRUCTIONS (I _think_
it was on an IBM 370, so that dates me, doesn't it?) and the algorithm
was found to be SUBSTANTIALLY faster (like an order of magnitude,
unless I'm mistaken) than the hardware string compare.

-kym

	/*
	** Dumb sort program that figures out how many strings
	** (mis)match on their first character; it says 11% match
	** for /usr/dict/words.
	*/

	#include <stdio.h>
	#include <stdlib.h>	/* for malloc() and rand() */
	#include <string.h>

	#define MAXWORD 10000

	char *word[MAXWORD];
	int nmatch1 = 0;
	int ncmp = 0;

	main(){
		init();
		scramble();
		sort();
		printf("fraction match first char=%g\n",
			(double)nmatch1/ncmp);
	}

	init(){
		int i;
		char buf[100];

		for(i = 0; i < MAXWORD; i++){
			gets(buf);
			word[i] = strcpy(malloc(strlen(buf)+1), buf);
		}
	}

	scramble(){
		int i;

		for(i = 0; i < MAXWORD; i++){
			int j = rand()%MAXWORD;
			char *tmp = word[i];

			word[i] = word[j];
			word[j] = tmp;
		}
	}

	sort(){	/* selection sort, counting comparisons via cmp() */
		int i, j, k;
		char *tmp;

		for(i = 0; i < MAXWORD; i++){
			j = i;
			tmp = word[j];
			for(k = j+1; k < MAXWORD; k++)
				if(cmp(word[k], tmp) < 0)
					j = k, tmp = word[j];
			word[j] = word[i];
			word[i] = tmp;
		}
	}

	cmp(s1, s2)
	char *s1, *s2;
	{
		++ncmp;
		if(*s1 == *s2)
			++nmatch1;
		return strcmp(s1, s2);
	}
psmith@iies.ecn.purdue.edu (Paul F Smith) (03/21/91)
In article <1991Mar20.020957.9180@athena.mit.edu> scs@adam.mit.edu (Steve Summit) writes:
>	struct chunk
>		{
>		char *c_ptr;
>		unsigned int c_hash;
>		};
> ...
>You can also write great little symbol table implementations
>using hashing, and the hash values don't even have to take up
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>extra space (they're implicit in the bucket or slot indices).
 ^^^^^^^^^^^!!!

OK, I won't ask about a string hash function :-), but could you please
elaborate on how your symbol table works? I just implemented a hash
table that keeps the hash value around (like your struct above), which
I then use in scanning the list at the given bucket to minimize
strcmp()s. But I can't see how you can get rid of it and still reduce
the number of strcmp()s. Please describe. Thanks!
--
------------------------------------------------------------------
Paul F. Smith - ADPC - Purdue University     psmith@ecn.purdue.edu
  <someday I'll think of something profound to put here, maybe>
scs@adam.mit.edu (Steve Summit) (03/21/91)
In article <1991Mar20.020957.9180@athena.mit.edu>, I wrote:
>You can also write great little symbol table implementations
>using hashing, and the hash values don't even have to take up
>extra space (they're implicit in the bucket or slot indices).

In article <1991Mar20.213615.11223@noose.ecn.purdue.edu> psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
>...could you
>please elaborate on how your symbol table works.  I just implemented a
>hash table that keeps the hash value around (like your struct above),
>which I then use in scanning the list at the given bucket to minimize
>strcmp()s.  But, I can't see how you can get rid of it and still reduce
>the number of strcmp()s.  Please describe, Thanks!

Well, I didn't say the number of strcmp's was reduced to 1. Hashing
has done its work (i.e. eliminated a lot of strcmp's and associated
searching) when it has computed the right bucket to look in; there
aren't supposed to be large numbers of collisions to scan over. I just
call strcmp for each entry in the bucket. The code is tiny, so I'll
show it:

	struct symtabent
		{
		char *st_name;
		struct type *st_type;
		somethingorother st_value;
		struct symtabent *st_next;
		};

	struct symtab
		{
		int st_hashsize;
		struct symtabent *st_hashtab[1];	/* dynamically extended */
			/* ("unwarranted chumminess with the compiler") */
		};

	stinsert(step, stp)
	struct symtabent *step;
	struct symtab *stp;
	{
		unsigned int h = hash(step->st_name) % stp->st_hashsize;

		/* simpleminded; doesn't check for duplicates */
		step->st_next = stp->st_hashtab[h];
		stp->st_hashtab[h] = step;
	}

	struct symtabent *
	stlookup(name, stp)
	char *name;
	struct symtab *stp;
	{
		unsigned int h = hash(name) % stp->st_hashsize;
		register struct symtabent *step;

		for(step = stp->st_hashtab[h]; step != NULL; step = step->st_next)
			{
			if(strcmp(step->st_name, name) == 0)
				return step;
			}

		return NULL;
	}

					Steve Summit
					scs@adam.mit.edu
gwyn@smoke.brl.mil (Doug Gwyn) (03/21/91)
In article <MARC.91Mar20080239@marc.watson.ibm.com> marc@arnor.UUCP (Marc Auslander) writes:
>A correct compiler cannot inline strcmp unless it is given
>extralingual permission to do so - for example in a pragma or by
>defining some reserved symbol or by some other means.  This is because
>it is legal to combine a program which uses strcmp with another which
>defines it to be anything at all!

Conforming implementations ARE allowed to in-line the standard library
functions. They must also provide external functions for most of them,
in case some application decides it needs a pointer to one of those
functions, etc. It is true that a strictly conforming program is
allowed to #define strcmp, under certain circumstances, but that is not
an issue. It is the use of "strcmp" as an identifier with external
linkage that an in-lining compiler must key on.

It is also possible for <string.h> to #define strcmp __strcmp and have
the compiler in-line __strcmp instead of strcmp. This may make it
slightly easier for the implementation to support access to the
external-function definition of strcmp mentioned in my first paragraph.
sef@kithrup.COM (Sean Eric Fagan) (03/21/91)
In article <15537@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>It is also possible for <string.h> to #define strcmp __strcmp and have
>the compiler in-line __strcmp instead of strcmp.  This may make it
>slightly easier for the implementation to support access to the
>external-function definition of strcmp mentioned in my first paragraph.

ANSI C still has the requirement that #undef strcmp (after including
<string.h>) give the user the version from the library, doesn't it? (A
footnote in one of the sections; I don't have my copy with me, so I
can't give explicit details.) In any event, that footnote seemed to
indicate pretty well how the committee thought intrinsics should be
handled...
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
rh@smds.UUCP (Richard Harter) (03/21/91)
In article <11111@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes: > In article <15510@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes: > >... in the vast majority of applications the strings being > >compared, even in cases where they match, are contained in different > >storage locations. E.g. > > if ( StrEq( buffer, "EOF" ) ) ... > Of course, you can build yourself a `string pool' system, in which each > distinct string appears exactly once, and then two strings match iff their > pointers match . . . but this merely offloads the `compare for truly > equal' into the string pool system. Just so. In one of our programs where there was a great deal of manipulation of strings we did a string "object" package (C not C++). For each unique string under management there is a struct which looks something like struct string_object { char *c; /* The string itself */ int n; /* Its length */ int rc; /* Reference count */ int sz; /* Allocated space for string */ struct string_object *link; /* Data access */ }; The pool used a hash table with short buckets (the link field). In the case in question there are many more references to existing strings (and multiple instances) then there are references to new strings so using pointers into the pool is a big win. The only object primitives are add_string and delete_string. Delete_string decrements a reference count and does the appropriate when the count goes to zero. Add_string has to do the compare, which is more expensive than a simple compare because it has to compute the hash index (which only needs to be very quick and dirty). If the string exists the reference count is upped; otherwise a new object is created (actually taken off a free list.) The size field is there because the strings are copied into the struct. This is more overhead, but it is worth it in this case, because the strings in question are often substrings. 
The upshot is that there is very little allocation/deallocation, which is
much more expensive than string compares.  It would be interesting to hear
how other people have handled this sort of problem.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.
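[A minimal sketch of the kind of pool Richard describes.  The struct and
the primitive names follow his post; the hash function, bucket count, and
memory handling are my own guesses, and his sz field and free list are
omitted for brevity:]

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKET 256

struct string_object {
    char *c;                      /* the string itself (owned copy) */
    int n;                        /* its length */
    int rc;                       /* reference count */
    struct string_object *link;   /* hash-bucket chain */
};

static struct string_object *pool_bucket[NBUCKET];

/* quick-and-dirty hash, as the article suggests */
static unsigned pool_hash(const char *s, int n)
{
    unsigned h = 0;
    while (n-- > 0)
        h = h * 31 + (unsigned char)*s++;
    return h % NBUCKET;
}

/* add_string: return the unique pool object for s, bumping its count */
struct string_object *add_string(const char *s)
{
    int n = (int)strlen(s);
    unsigned h = pool_hash(s, n);
    struct string_object *p;

    for (p = pool_bucket[h]; p != NULL; p = p->link)
        if (p->n == n && memcmp(p->c, s, (size_t)n) == 0) {
            p->rc++;              /* existing string: pointer identity */
            return p;
        }
    p = malloc(sizeof *p);
    p->c = malloc((size_t)n + 1);
    memcpy(p->c, s, (size_t)n + 1);
    p->n = n;
    p->rc = 1;
    p->link = pool_bucket[h];
    pool_bucket[h] = p;
    return p;
}

/* delete_string: drop a reference; free when the count hits zero */
void delete_string(struct string_object *p)
{
    struct string_object **pp;

    if (--p->rc > 0)
        return;
    /* unlink from its bucket, then free (a free list would go here) */
    pp = &pool_bucket[pool_hash(p->c, p->n)];
    while (*pp != p)
        pp = &(*pp)->link;
    *pp = p->link;
    free(p->c);
    free(p);
}
```

With this, two add_string("EOF") calls return the same pointer, so an
equality test elsewhere in the program is a single pointer compare.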
psmith@iies.ecn.purdue.edu (Paul F Smith) (03/22/91)
In article <1991Mar20.235613.20480@athena.mit.edu> scs@adam.mit.edu writes:
>In article <1991Mar20.213615.11223@noose.ecn.purdue.edu> psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
>>...could you
>>please elaborate on how your symbol table works.  I just implemented a
>>hash table that keeps the hash value around (like your struct above),
>>which I then use in scanning the list at the given bucket to minimize
>>strcmp()s.  But, I can't see how you can get rid of it and still reduce
>>the number of strcmp()s.  Please describe, Thanks!
>
>Well, I didn't say the number of strcmp's was reduced to 1.
>Hashing has done its work (i.e. eliminated a lot of strcmp's and
>associated searching) when it has computed the right bucket to
>look in; there aren't supposed to be large numbers of collisions
>to scan over.

Ok.  You were just talking about the normal effect of a hash table.  I
used a hash table, but for each entry I keep the full hash value to try
and keep the # of strcmp()s close to 1.  Like this...

>for(step = stp->st_hashtab[h]; step != NULL; step = step->st_next)
>	{
>	if(strcmp(step->st_name, name) == 0)

Instead I do: (compute "hash(name)" once at the beginning)

	if((step->hashval == hash(name)) && strcmp(step->st_name, name) == 0)

>		return step;
>	}

I thought you had achieved this property without keeping the hash value
with the symbol entry.  (Wouldn't that be nice!)  I think this reduction
in strcmp()s is worth the space.  Does anyone know of a reason that it
isn't?  (other than not having enough space :-)  Thanks for the reply!
-- 
------------------------------------------------------------------
Paul F. Smith - ADPC - Purdue University psmith@ecn.purdue.edu
<someday I'll think of something profound to put here, maybe>
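[Paul's cached-hash lookup, put together as a self-contained sketch.  The
names (sym_lookup, sym_hash, struct symtab_entry) and the hash function
are mine, not from his actual code; the point is that strcmp only runs
when the full stored hashes already agree:]

```c
#include <string.h>

struct symtab_entry {
    const char *st_name;
    unsigned st_hashval;          /* full hash, cached at insert time */
    struct symtab_entry *st_next; /* bucket chain */
};

/* any full-width string hash will do; this one is hypothetical */
unsigned sym_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* look up name: hash(name) is computed once, and the cached full
 * hash filters out almost all non-matches before any strcmp */
struct symtab_entry *sym_lookup(struct symtab_entry **tab,
                                unsigned nbuckets, const char *name)
{
    unsigned h = sym_hash(name);
    struct symtab_entry *step;

    for (step = tab[h % nbuckets]; step != NULL; step = step->st_next)
        if (step->st_hashval == h &&
            strcmp(step->st_name, name) == 0)
            return step;
    return NULL;
}
```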
willcr@bud.sos.ivy.isc.com (Will Crowder) (03/22/91)
In article <1991Mar20.213615.11223@noose.ecn.purdue.edu>, psmith@iies.ecn.purdue.edu (Paul F Smith) writes:
|> >In article barton@cbnewsk.att.com (jim.m.barton) writes:
|> >
|> >	struct chunk
|> >	{
|> >		char *c_ptr;
|> >		unsigned int c_hash;
|> >	};
|> >
|> ...
|> >
|> >You can also write great little symbol table implementations
|> >using hashing, and the hash values don't even have to take up
|>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> >extra space (they're implicit in the bucket or slot indices).
|>  ^^^^^^^^^^^!!!
|> >
|> >	Steve Summit
|> >	scs@adam.mit.edu
|>
|> OK, I won't ask about a string hash function :-), but could you
|> please elaborate on how your symbol table works.  I just implemented a
|> hash table that keeps the hash value around (like your struct above),
|> which I then use in scanning the list at the given bucket to minimize
|> strcmp()s.  But, I can't see how you can get rid of it and still reduce
|> the number of strcmp()s.  Please describe, Thanks!

(Since we're talking about one of my favorite data structures, hashing
with open chaining, I'll take a shot at this one.)

Hashing with open chaining is a method of saving information such that it
can be looked up quickly.  The basic idea is to deal with hash collisions
by maintaining a list (usually a linked list) of data with identical hash
values.  Each hash "bucket" then points to the list of data having the
same hash value.  To look up a key, the key is hashed and the chain having
that hash value is scanned for the data.  Since all data in a given chain
have the same hash value, there is no need to save the hash value
anywhere.

One optimization might be to further hash the data in each chain,
obviously with a different hashing algorithm, and then rehash the key and
scan the chain for data having the same new hash value.  In this case, you
would have to save the new hash value along with the data.
You could also have chains on top of chains, etc., etc., but a simple
single-level algorithm with one set of buckets is usually efficient
enough.

|> ------------------------------------------------------------------
|> Paul F. Smith - ADPC - Purdue University psmith@ecn.purdue.edu
|> <someday I'll think of something profound to put here, maybe>

Hope this helps,

Will
--------------------------------------------------------------------------------
Will Crowder, MTS         | "I belong to no organized political party.
(willcr@ivy.isc.com)      |  I'm a democrat."
INTERACTIVE Systems Corp. |                -- Will Rogers
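[Will's open-chaining description, sketched.  The names (chain_insert,
chain_find, bucket_of) are mine; the point is that no hash value is
stored anywhere, because it is implicit in which chain a node lives on:]

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 128

struct node {
    char *key;
    struct node *next;
};

static struct node *chain[NBUCKETS];

static unsigned bucket_of(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 37 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* insert: the hash is used only to pick a chain, then discarded */
void chain_insert(const char *key)
{
    struct node *n = malloc(sizeof *n);
    unsigned b = bucket_of(key);

    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->next = chain[b];
    chain[b] = n;
}

/* look up: rehash the key, scan only that one short chain */
struct node *chain_find(const char *key)
{
    struct node *n;

    for (n = chain[bucket_of(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return n;
    return NULL;
}
```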
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/22/91)
In article <1151@gistdev.gist.com> flint@gistdev.gist.com (Flint Pellett) writes:
> I'm curious if
> anyone has a method that actually speeds up strcmp() in the situation
> where you have to get the lexicographic ordering information back.

Unrolling the loop produces good results on many machines.  You can also
zoom through the strings comparing words, then compare characters if
necessary (i.e., when your byte order is backwards) to produce a result.
It works best if you know the string lengths beforehand and have each
string starting on a word boundary.  Piercarlo can explain a lot more.

---Dan
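[Dan's word-at-a-time idea might look something like this sketch.  The
function name and the known-length interface are my own choices; as he
says, it assumes both strings start on a word (here unsigned long)
boundary, and the bytewise fallback makes the ordering correct regardless
of byte order:]

```c
#include <stddef.h>

/* Compare whole unsigned longs; on the first mismatching word (or for
 * the unaligned tail) fall back to bytes, so the result is a correct
 * lexicographic ordering even on a little-endian machine.  Both
 * pointers are assumed to be unsigned long aligned. */
int word_strcmp(const char *s, size_t slen, const char *t, size_t tlen)
{
    size_t n = slen < tlen ? slen : tlen;     /* bytes safely comparable */
    const unsigned long *ws = (const unsigned long *)s;
    const unsigned long *wt = (const unsigned long *)t;
    size_t words = n / sizeof(unsigned long);
    size_t i;

    for (i = 0; i < words; i++)
        if (ws[i] != wt[i])
            break;                /* mismatch: resolve it bytewise */

    /* byte by byte from the first differing word (or the tail) */
    for (i *= sizeof(unsigned long); i < n; i++)
        if (s[i] != t[i])
            return (unsigned char)s[i] - (unsigned char)t[i];

    /* equal up to the shorter length: the shorter string sorts first */
    return slen < tlen ? -1 : (slen > tlen ? 1 : 0);
}
```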
henry@zoo.toronto.edu (Henry Spencer) (03/22/91)
In article <MARC.91Mar20080239@marc.watson.ibm.com> marc@arnor.UUCP (Marc Auslander) writes:
>A correct compiler cannot inline strcmp unless it is given
>extralingual permission to do so - for example in a pragma or by
>defining some reserved symbol or by some other means.  This is because
>it is legal to combine a program which uses strcmp with another which
>defines it to be anything at all!

In ANSI C it is *not* legal for a program to redefine the standard
library functions (not as external identifiers anyway), so the inlining
is, subject to some care about fine points, legitimate.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu utzoo!henry
torek@elf.ee.lbl.gov (Chris Torek) (03/22/91)
[This is a correction to a previous article, which should have been
stamped out before you read it, because of the Supersedes header, but I
suspect there are news sites out there that do not implement Supersedes.
The correction is just an `='-for-`-' typo caused by runaway fingers.
Thanks to David Barnett for catching it early.]

In article <1151@gistdev.gist.com> flint@gistdev.gist.com (Flint Pellett) writes:
>... I'm curious if anyone has a method that actually speeds up strcmp()
>in the situation where you have to get the lexicographic ordering
>information back.

For a particular application (my own personal version of Emacs,
actually), I found that comparing the first two characters `inline' gave
the best results.

	/*
	 * Find an entry in a table.  Create should be true iff we should create
	 * a new entry if the name is not in the table.  The names of any new
	 * entries are saved with savestr().
	 */
	struct tentry *
	FindEntry(t, name, create)
		struct table *t;
		register char *name;
		int create;
	{
		register struct tentry **ents, *te;
		register int i, h, m, l;

		if (name == NULL)
			return (NULL);
		if (t->t_sorted <= 0)
			SortEntries(t);
		ents = t->t_ents;

		/* binary search */
		h = t->t_size - 1;
		l = 0;
		while (h >= l) {
			te = ents[m = (h + l) >> 1];
			/* first two characters done inline for speed */
			if ((i = te->te_name[0] - name[0]) == 0 && name[0] != 0) {
				if ((i = te->te_name[1] - name[1]) == 0)
					i = strcmp(te->te_name + 1, name + 1);
			}
			if (i > 0)
				h = m - 1;
			else if (i < 0)
				l = m + 1;
			else	/* found */
				return (te);
		}
		if (create) {
			if ((name = savestr(name)) != NULL &&
			    (te = insert(t, name, &ents[l])) != NULL)
				return (te);
			if (name != NULL)
				free(name);
			error("out of memory allocating entry in %s", t->t_name);
		}
		return (NULL);
	}

SortEntries uses a clever trick: it does an insertion sort if the table is
`almost' sorted, otherwise it does a quicksort.
Much of this speed-hacking is due to *measured* results showing that my Emacs spent most of its `startup' time building tables and reading the current directory. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
xwang@gmuvax2.gmu.edu (Xiang-Min Wang) (03/24/91)
In article <1991Mar18.174207.7377@bingvaxu.cc.binghamton.edu> consp06@bingsuns.cc.binghamton.edu (Robert Konigsberg) writes:
>Somebody said part of the problem with string comparisons is that if
>most of the strings are equivalent, there will be a lot of
>processor time.  Wouldn't it be good then, to include in the macro,
>something to compare the actual POINTERS?  If the pointers are the
>same then the two strings have no CHOICE but to be equivalent.  This
>would really cut down the time under certain situations.
>
>               -Rob Konigsberg

This solution can only be used in a limited set of situations, and it
could run into the aliasing problem.

xwang
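[Rob's shortcut fits in a one-line macro.  STREQ is a hypothetical name;
note it only pays off when identical pointers are actually common, and it
says nothing about distinct copies of equal strings:]

```c
#include <string.h>

/* If the two pointers are identical the strings cannot differ, so
 * strcmp is skipped entirely; otherwise fall through to strcmp. */
#define STREQ(a, b) ((a) == (b) || strcmp((a), (b)) == 0)
```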
browns@iccgcc.decnet.ab.com (Stan Brown) (03/26/91)
In article <MARC.91Mar20080239@marc.watson.ibm.com>, marc@arnor.UUCP (Marc Auslander) writes:
> A correct compiler cannot inline strcmp unless it is given
> extralingual permission to do so - for example in a pragma or by
> defining some reserved symbol or by some other means.  This is because
> it is legal to combine a program which uses strcmp with another which
> defines it to be anything at all!

Assuming that by 'correct compiler' you mean one that conforms to the
ANSI standard, I have to disagree with your statement.  Sec 4.1.6 of the
standard specifically allows compilers to inline the standard library
functions.

My opinions are mine: I don't speak for any other person or company.
email (until 91/4/30): browns@iccgcc.decnet.ab.com
Stan Brown, Oak Road Systems, Cleveland, Ohio, USA  +1 216 371 0043